# best-practices
j
Hi all, We’re running a meltano EL process between a MySQL database and BigQuery and I’d like to ask for some help on which approach would be best to “force” a refresh of a single table/stream, as it seems like it’s not something that is supported right out of the box (unless I’m missing something or recent developments have changed this). More info in 🧵 (apologies for the walls of text, I just wanted to provide as much context as possible)
Current setup
• meltano 2.19.0
• tap-mysql (transferwise variant)
• target-bigquery (z3z1ma variant)
• Syncing ~80 MySQL tables hourly with the LOG_BASED replication method, with deduplication and upsert into BQ
• Orchestrated with Airflow

For ease of use by other teams, we’ve introduced a “full-refresh” job alongside the hourly job that runs the same `meltano run` command but with the `--full-refresh` parameter appended (similar to this suggestion). We don’t make any additional state changes, so the state_id is the same for both jobs. One of the downsides is that a full refresh of the MySQL database currently takes multiple hours (there might be some performance gains to be had to get the runtime down, but we haven’t looked into those yet), which means it’s not something we can do on a regular basis during development. This is one of the reasons we would like to be able to full-refresh a single table. This, and other arguments, were nicely described in Option for full extraction (#2125) · Issues · Meltano and followed up in Option for extracting only a specific stream (#2155) · Issues · Meltano by @douwe_maan and @adam_roderick, which led me to believe this use case is supported. However, looking at the documentation and testing it myself, it seems like the `--select` parameter is not supported by the `meltano run` command, only by `meltano el(t)`. All of this led us down the rabbit hole of various approaches, and we’d like some guidance as to which we should investigate further.
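For reference, the two jobs roughly boil down to the following (a simplified sketch; the actual commands are wrapped by our Airflow DAGs):
```
# hourly incremental job (LOG_BASED replication, reuses the stored state)
meltano run tap-mysql target-bigquery

# on-demand "full-refresh" job: same command and state_id, with the extra flag
meltano run tap-mysql target-bigquery --full-refresh
```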
1. meltano el(t)
As mentioned above, the `meltano elt` command seems to have this functionality, although in our limited testing `elt` seems to always create a new state. Perhaps we should look more into this command and its differences with `meltano run`?
2. full-refresh with select(_filter)
We did find a way to select specific tables by setting environment variables, as mentioned in this part of the docs, e.g.
```
TAP_MYSQL__SELECT_FILTER='["table_1", "table_2"]' meltano run tap-mysql target-bigquery --full-refresh
```
Perhaps this is also how the “extracting a specific stream” functionality from that issue has evolved to work with other commands? The downside we found with this approach is when combining it with the `--full-refresh` flag: the full refresh will negate the previous state (which in this example could include bookmarks on table_3, table_4, etc.) and, at the end of the run, write a new state containing only table_1 and table_2. Our subsequent hourly non-refresh runs would then think the other tables need a full sync, as no bookmarks were found. The current workarounds we found were:

--no-state-update
```
TAP_MYSQL__SELECT_FILTER='["table_1", "table_2"]' meltano run tap-mysql target-bigquery --full-refresh --no-state-update
```
With this additional flag, the state will not be overwritten and bookmarks of tables not in the selection won’t be lost, but the bookmark of the fully refreshed selected tables might no longer be “correct” compared to the actual data. In our specific use case, where we upsert data, this isn’t really an issue, but it doesn’t seem like a robust approach.

state move/merge
To avoid losing bookmarks, we could also move/rename the old state, full-refresh the selected tables, have it write a new state, and then merge both states to yield a fully correct state of the data. This sounds like it could work but does require some orchestration, as sketched below.
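A hypothetical sketch of that orchestration, assuming a `dev` environment and the run-generated state ID `dev:tap-mysql-to-target-bigquery` (the exact `meltano state` arguments and confirmation prompts should be double-checked against `meltano state --help`):
```
# 1. back up the current, complete state under a separate ID
meltano state copy dev:tap-mysql-to-target-bigquery dev:tap-mysql-to-target-bigquery-backup

# 2. full-refresh only the selected tables; this overwrites the main state
#    with bookmarks for table_1 and table_2 only
TAP_MYSQL__SELECT_FILTER='["table_1", "table_2"]' meltano run tap-mysql target-bigquery --full-refresh

# 3. merge the partial state on top of the backup, then restore it as the main state
meltano state get dev:tap-mysql-to-target-bigquery > partial_state.json
meltano state merge --input-file partial_state.json dev:tap-mysql-to-target-bigquery-backup
meltano state move dev:tap-mysql-to-target-bigquery-backup dev:tap-mysql-to-target-bigquery
```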
3. meta_data replication method
Another approach could be to use another set of environment variables to modify the replication method of a specific table/run. We think this is what is being suggested here, e.g.
```
TAP_MYSQL__METADATA='{"table_1": {"replication_method": "FULL_TABLE"}}'
# or
TAP_MYSQL__METADATA_TABLE_1_REPLICATION_METHOD=FULL_TABLE
```
We believe this would cause a `meltano run` to fully extract the source table. In combination with the above-mentioned `select_filter`, one could “fully refresh” specific tables, although we have yet to test it out. We’re also not sure what would happen with the state/bookmarks when mixing replication methods.
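Untested on our side, but combining the two would presumably look something like this (table name is just an example):
```
# force a FULL_TABLE extraction of table_1 only, leaving the other streams untouched
TAP_MYSQL__METADATA_TABLE_1_REPLICATION_METHOD=FULL_TABLE \
TAP_MYSQL__SELECT_FILTER='["table_1"]' \
meltano run tap-mysql target-bigquery
```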
4. modifying state to trigger refresh
Another approach, suggested by @taylor, was to modify the state, removing the bookmarks of those specific tables; the next run would then refresh those tables. This seems a bit more “hack-ish” as it would involve parsing the state JSON and removing a specific key. I don’t think there is any built-in functionality to support this specific operation?
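The closest we can think of is something along these lines (state ID, JSON structure, and bookmark key are illustrative; the actual bookmark keys depend on how tap-mysql names its streams, so the dumped JSON should be inspected first):
```
# dump the current state, drop the bookmark of a single stream, and write it back
meltano state get dev:tap-mysql-to-target-bigquery > state.json
jq 'del(.singer_state.bookmarks["my_db-table_1"])' state.json > state_edited.json
meltano state set --force --input-file state_edited.json dev:tap-mysql-to-target-bigquery
```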
We haven’t explored any of these options further, and before diving any deeper into them we’d like some guidance as to which approach would be best, or any documentation that would help us forward, as things might have changed in the meantime.
FYI, some other references I found while digging through Slack:
• https://meltano.slack.com/archives/CMN8HELB0/p1679923692282169?thread_ts=1679794047.751059&cid=CMN8HELB0
• https://meltano.slack.com/archives/C013EKWA2Q1/p1649635963918609
• https://meltano.slack.com/archives/C013EKWA2Q1/p1644520624602999
• https://meltano.slack.com/archives/C01TCRBBJD7/p1637230720215400
d
@john_cabuena `meltano elt` with `--select` and `--full-refresh` can still do this, but you’ll need to explicitly provide a `--state-id` — were you doing that?
@edgar_ramirez_mondragon Can you please comment on how this same behavior could be achieved with `meltano run`, if at all?
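i.e. roughly something like this (state ID is just an example; point it at the same state ID your run job already uses):
```
meltano elt tap-mysql target-bigquery \
  --select table_1 \
  --full-refresh \
  --state-id dev:tap-mysql-to-target-bigquery
```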
j
We haven’t had the time to try the `--state-id` yet. Looking into some of the docs, it seems like `elt` might be better suited for this use case, as we’re usually only syncing 1 tap/target and any additional steps would probably be managed by Airflow. Thanks for the feedback @douwe_maan! Still curious to hear what the best approach is for those cases where `elt` is not an option.
e
I don't think this can be accomplished with `meltano run` without manually merging states at the end of the run to update only the desired streams, perhaps using `meltano state merge`. The root cause is that the context during a `run` execution is missing the select-filter flag, and so the state is marked as _complete_: https://github.com/meltano/meltano/blob/d9a9809614c18e31029eb7acb272e0578a101842/src/meltano/core/plugin/singer/target.py#L165-L166
u
@edgar_ramirez_mondragon Ah, do you think that’s worth fixing when the SELECT_FILTER env var is used?
e
@douwe_maan That, or a CLI flag like `--merge-state`/`--overwrite-state` that tells Meltano what to do with state after `meltano run`.
d
@edgar_ramirez_mondragon Can you file an issue for that please?
e