# best-practices
j
Hi all, We’re running a meltano EL process between a MySQL database and BigQuery and I’d like to ask for some help on which approach would be best to “force” a refresh of a single table/stream, as it seems like it’s not something that is supported right out of the box (unless I’m missing something or recent developments have changed this). More info in 🧵 (apologies for the walls of text, I just wanted to provide as much context as possible)
Current setup
• meltano 2.19.0
• tap-mysql (transferwise variant)
• target-bigquery (z3z1ma variant)
• Syncing ~80 MySQL tables hourly with the LOG_BASED replication method, with deduplication and upsert into BQ
• Orchestrated with Airflow

For ease of use by other teams, we’ve introduced a “full-refresh” job alongside the hourly job that runs the same `meltano run` command but with the `--full-refresh` parameter appended (similar to this suggestion). We don’t make any additional state changes, so the state_id is the same for both jobs. One of the downsides is that a full refresh of the MySQL database currently takes multiple hours (there might be some performance gains to be had to get the runtime down, but we haven’t looked into those yet), which means it’s not something we can do on a regular basis during development. This is one of the reasons we would like to be able to full-refresh a single table. This, and other arguments, were nicely described in Option for full extraction (#2125) · Issues · Meltano and followed up in Option for extracting only a specific stream (#2155) · Issues · Meltano by @douwe_maan and @adam_roderick, which led me to believe this use case is supported. However, looking at the documentation and testing it myself, it seems like the `--select` parameter is not supported by the `meltano run` command, only by `meltano el(t)`. All of this led us down the rabbit hole of various approaches, and we’d like some guidance as to which we should investigate further.
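For reference, the two jobs roughly boil down to the following (a simplified sketch; the actual commands are wrapped by our Airflow DAGs):
```
# hourly incremental job (LOG_BASED replication, reuses the stored state)
meltano run tap-mysql target-bigquery

# on-demand "full-refresh" job: same command and state_id, with the extra flag
meltano run tap-mysql target-bigquery --full-refresh
```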
1. meltano el(t)
As mentioned above, the `meltano elt` command seems to have this functionality, although in our limited testing `elt` seems to always create a new state. Perhaps we should look more into this command and its differences with `meltano run`?
2. full-refresh with select(_filter)
We did find a way to select specific tables by setting environment variables, as mentioned in this part of the docs, e.g.
```
TAP_MYSQL__SELECT_FILTER='["table_1", "table_2"]' meltano run tap-mysql target-bigquery --full-refresh
```
Perhaps this is also how the “extracting a specific stream” functionality from that issue has evolved to work with other commands? The downside we found with this approach is when combining it with the `--full-refresh` flag: the full refresh will negate the previous state (which in this example could include bookmarks on table_3, table_4, etc.) and, at the end of the run, write a new state containing only table_1 and table_2. Our subsequent hourly non-refresh runs would then think the other tables need a full sync, as no bookmarks were found. The current workarounds we found were:

--no-state-update
```
TAP_MYSQL__SELECT_FILTER='["table_1", "table_2"]' meltano run tap-mysql target-bigquery --full-refresh --no-state-update
```
With this additional flag, the state will not be overwritten and bookmarks of tables not in the selection won’t be lost, but the bookmark of the fully refreshed selected tables might no longer be “correct” compared to the actual data. In our specific use case, where we upsert data, this isn’t really an issue, but it doesn’t seem like a robust approach.

state move/merge
To avoid losing bookmarks, we could also move/rename the old state, full-refresh the selected tables, have it write a new state, and then merge both states to yield a fully correct state of the data. This sounds like it could work but does require some orchestration, as sketched below.
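A hypothetical sketch of that orchestration, assuming a `dev` environment and the run-generated state ID `dev:tap-mysql-to-target-bigquery` (the exact `meltano state` arguments and confirmation prompts should be double-checked against `meltano state --help`):
```
# 1. back up the current, complete state under a separate ID
meltano state copy dev:tap-mysql-to-target-bigquery dev:tap-mysql-to-target-bigquery-backup

# 2. full-refresh only the selected tables; this overwrites the main state
#    with bookmarks for table_1 and table_2 only
TAP_MYSQL__SELECT_FILTER='["table_1", "table_2"]' meltano run tap-mysql target-bigquery --full-refresh

# 3. merge the partial state on top of the backup, then restore it as the main state
meltano state get dev:tap-mysql-to-target-bigquery > partial_state.json
meltano state merge --input-file partial_state.json dev:tap-mysql-to-target-bigquery-backup
meltano state move dev:tap-mysql-to-target-bigquery-backup dev:tap-mysql-to-target-bigquery
```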
3. meta_data replication method
Another approach could be to use another set of environment variables to modify the replication method of a specific table/run. We think this is what is being suggested here, e.g.
```
TAP_MYSQL__METADATA='{"table_1": {"replication_method": "FULL_TABLE"}}'
# or
TAP_MYSQL__METADATA_TABLE_1_REPLICATION_METHOD=FULL_TABLE
```
We believe this would cause a `meltano run` to fully extract the source table. In combination with the above-mentioned `select_filter`, one could “fully refresh” specific tables, although we have yet to test it out. We’re also not sure what would happen with the state/bookmarks when mixing replication methods.
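Untested on our side, but combining the two would presumably look something like this (table name is just an example):
```
# force a FULL_TABLE extraction of table_1 only, leaving the other streams untouched
TAP_MYSQL__METADATA_TABLE_1_REPLICATION_METHOD=FULL_TABLE \
TAP_MYSQL__SELECT_FILTER='["table_1"]' \
meltano run tap-mysql target-bigquery
```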
4. modifying state to trigger refresh
Another approach, suggested by @taylor, was to modify the state, removing the bookmarks of those specific tables; the next run would then refresh those tables. This seems a bit more “hack-ish” as it would involve parsing the state JSON and removing a specific key. I don’t think there is any built-in functionality to support this specific operation?
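The closest we can think of is something along these lines (state ID, JSON structure, and bookmark key are illustrative; the actual bookmark keys depend on how tap-mysql names its streams, so the dumped JSON should be inspected first):
```
# dump the current state, drop the bookmark of a single stream, and write it back
meltano state get dev:tap-mysql-to-target-bigquery > state.json
jq 'del(.singer_state.bookmarks["my_db-table_1"])' state.json > state_edited.json
meltano state set --force --input-file state_edited.json dev:tap-mysql-to-target-bigquery
```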
We haven’t explored any of these options further, and before diving any deeper into them we’d like some guidance as to which approach would be best, or any documentation that would help us forward, as things might have changed in the meantime.
FYI, some other references I found while digging through Slack:
• https://meltano.slack.com/archives/CMN8HELB0/p1679923692282169?thread_ts=1679794047.751059&cid=CMN8HELB0
• https://meltano.slack.com/archives/C013EKWA2Q1/p1649635963918609
• https://meltano.slack.com/archives/C013EKWA2Q1/p1644520624602999
• https://meltano.slack.com/archives/C01TCRBBJD7/p1637230720215400
d
@john_cabuena `meltano elt` with `--select` and `--full-refresh` can still do this, but you’ll need to explicitly provide a `--state-id` — were you doing that?
@edgar_ramirez_mondragon Can you please comment on how this same behavior could be achieved with `meltano run`, if at all?
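i.e. roughly something like this (state ID is just an example; point it at the same state ID your run job already uses):
```
meltano elt tap-mysql target-bigquery \
  --select table_1 \
  --full-refresh \
  --state-id dev:tap-mysql-to-target-bigquery
```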
j
We haven’t had the time to try the `--state-id` yet. Looking into some of the docs, it seems like `elt` might be better suited for this use case, as we’re usually only syncing 1 tap/target and any additional steps would probably be managed by Airflow. Thanks for the feedback @douwe_maan! Still curious to hear what the best approach is for those cases where `elt` is not an option.
e
I don't think this can be accomplished with `meltano run` without manually merging states at the end of the run to update only the desired streams, perhaps using `meltano state merge`. The root cause is that the context during a `run` execution is missing the select-filter flag, and so the state is marked as _complete_: https://github.com/meltano/meltano/blob/d9a9809614c18e31029eb7acb272e0578a101842/src/meltano/core/plugin/singer/target.py#L165-L166
u
@edgar_ramirez_mondragon Ah, do you think that’s worth fixing when the SELECT_FILTER env var is used?
e
@douwe_maan That, or a CLI flag like `--merge-state`/`--overwrite-state` that tells Meltano what to do with state after `meltano run`.
d
@edgar_ramirez_mondragon Can you file an issue for that please?
e