john_cabuena
08/16/2023, 10:32 AMjohn_cabuena
08/16/2023, 10:33 AMtransferwise variant)
• target-bigquery (z3z1ma variant)
• Syncing ~80 MySQL tables hourly LOG_BASED replication method with deduplication and upsert into BQ
• Orchestrated with Airflow
For ease of use by other teams we’ve introduced a “full-refresh” job alongside the hourly job that runs the same meltano run command but with the --full-refresh parameter appended. Similar to this suggestion). We don’t do any additional state changes so the state_id is the same for both jobs.
One of the downsides is that a full refresh of the MySQL database takes multiple hours atm (there might be some performance gains here to get the runtime down, but we haven’t looked into those yet) which means it’s not something we can do on a regular basis during development. This is one of the reasons why we would like to be able to full-refresh a single table. This, and other arguments, were nicely described in Option for full extraction (#2125) · Issues · Meltano and followed up in Option for extracting only a specific stream (#2155) · Issues · Meltano by @douwe_maan and @adam_roderick which lead me believe this use case is supported.
However, when looking at the documentation and testing myself it seems like the --select parameter is not supported by the meltano run command and only by meltano el(t).
All of this lead us down the rabbithole on various approaches and we’d like some guidance as to which we should investigate further.john_cabuena
08/16/2023, 10:33 AMmeltano elt command seems to have this functionality although in our limited testing, elt seems to always create a new state. Perhaps we should look more into this command and it’s differences with meltano run?john_cabuena
08/16/2023, 10:34 AMTAP_MYSQL__SELECT_FILTER='["table_1", "table_2"]' meltano run tap-mysql target-bigquery --full-refresh
Perhaps this is also the way this “extracting specific stream issue” functionality has evolved to work with other commands?
The downside we found with this approach is when combining this feature with the --full-refresh command. The full-refresh will negate the previous state (which in this example could include bookmarks on table_3, table_4, etc) and at the end of the run, write the new state with only table_1 and table 2. Our subsequent hourly non-refresh runs would then think the other tables need a full sync as no bookmarks were found.
The current workarounds we found were:
--no-update-state
TAP_MYSQL__SELECT_FILTER='["table_1", "table_2"]' meltano run tap-mysql target-bigquery --full-refresh --no-update-state
With this additional flag, the state will not be overwritten and bookmarks of tables not in the selection won’t be lost, but the bookmark of the “fully refreshed selected tables” might no longer be “correct” compared to the actual data.
In our specific use case where we upsert data this isn’t really an issue, but it doesn’t seem like a robust approach.
state move/merge
To not lose bookmarks we could also move/rename the old state, full refresh the selected tables, have it write a new state and then merging both states to yield a fully correct state of the data. This sounds like it could work but does require some orchestration.john_cabuena
08/16/2023, 10:35 AMTAP_MYSQL__METADATA='{"table_1": {"replication_method": "FULL_TABLE"}}'
# or
TAP_MYSQL__METADATA_TABLE_1_REPLICATION_KEY=FULL_TABLE
We believe this would cause a meltano run to fully extract the source table. In combination with the above-mentioned select_filter one could “fully refresh” specific tables although we have yet to test it out. We’re also not sure what would happen with the state/bookmarks when mixing replication methods.john_cabuena
08/16/2023, 10:35 AMjohn_cabuena
08/16/2023, 10:37 AMdouwe_maan
08/16/2023, 10:05 PMmeltano elt with --select and --full-refresh can still do this, but you’ll need to explicitly provide a --state-id — were you doing that?
@edgar_ramirez_mondragon Can you please comment on how this same behavior could be achieved with meltano run, if at all?john_cabuena
08/16/2023, 10:27 PM--state-id yet. Looking into some of the docs it seems like elt might be better suited for this use case as we’re usually only syncing 1 tap/target and any additional steps would probably be managed by airflow. Thanks for the feedback @douwe_maan!
Still curious to hear what the best approach is for those cases where elt is not an optionedgar_ramirez_mondragon
08/16/2023, 11:31 PMmeltano run without manually merging states at the end of the run to update only the desired streams, perhaps using meltano state merge.
The root cause is that the context during a run execution is missing the select-filter flag and so the state is marked as _complete_: https://github.com/meltano/meltano/blob/d9a9809614c18e31029eb7acb272e0578a101842/src/meltano/core/plugin/singer/target.py#L165-L166user
08/16/2023, 11:32 PMedgar_ramirez_mondragon
08/16/2023, 11:39 PM--merge-state/--overwrite-state that tells Meltano what to do with state after meltano run.douwe_maan
08/16/2023, 11:40 PMedgar_ramirez_mondragon
08/17/2023, 12:00 AM