# troubleshooting
v
Hello everyone! I’m running into a pretty strange issue and I’m wondering if anyone has encountered something similar. I’m running Meltano in Airflow, with two of my DAGs using the same extractor to read data from a MySQL database using log-based replication. Both pipelines run from a Celery image in a container. I’m using two DAGs because the selected streams are different, so I’m ingesting different tables in each DAG. These pipelines are near real-time and run every 5 minutes.
Here’s where it gets weird: after a successful log-based run, the next run does a full load instead of a log-based one. It’s like Meltano loses the state file. The issue is intermittent with no clear pattern, but it happens often enough to be a problem. Does anyone have any idea what might be causing this? The workaround for the moment is to recover the last complete state from the runs table and update the state table with it, but it’s happening more and more. Here are the logs from when it starts doing a full load:
time=2025-01-17 05:13:20 name=tap_mysql level=INFO message=LOG_BASED stream prod-table_1 will resume its historical sync cmd_type=elb consumer=False job_name=prod:prod-to-snowflake:prod_nrt_extract_load name=prod producer=True run_id=8bf705b8-0e81-45d5-a71a-77bf7261754e stdio=stderr string_id=clockwork
time=2025-01-17 05:13:20 name=tap_mysql level=INFO message=LOG_BASED stream prod-table_2 requires full historical sync cmd_type=elb consumer=False job_name=prod:prod-to-snowflake:prod_nrt_extract_load name=prod producer=True run_id=8bf705b8-0e81-45d5-a71a-
Meltano: 3.5.4, Python: 3.9, tap: pipelinewise-tap-mysql, Meltano backend: Postgres
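For reference, the interim workaround mentioned above (copying the last complete state payload from the runs table back into the state table) can be sketched roughly as below. This is only a sketch: the table and column names (runs.payload, runs.state, runs.ended_at, state.completed_state, state.state_id) are assumptions about Meltano’s Postgres system database and should be verified against your own backend, and MELTANO_DATABASE_URI is assumed to hold the same URI Meltano uses. Meltano’s own meltano state set command is probably the safer way to do the same thing.

```python
# Sketch only: restore the last complete state payload from the runs table
# into the state table so the next run resumes log-based replication.
# Table/column names below are assumptions about the Meltano system database.
import json
import os

import psycopg2

STATE_ID = "prod:prod-to-snowflake:prod_nrt_extract_load"  # job/state name from the logs

conn = psycopg2.connect(os.environ["MELTANO_DATABASE_URI"])  # assumed Postgres backend URI
with conn, conn.cursor() as cur:
    # Payload of the most recent successful run for this job.
    cur.execute(
        """
        SELECT payload FROM runs
        WHERE job_name = %s AND state = 'SUCCESS' AND payload IS NOT NULL
        ORDER BY ended_at DESC
        LIMIT 1
        """,
        (STATE_ID,),
    )
    row = cur.fetchone()
    if row is None:
        raise SystemExit("no successful run with a state payload found")

    payload = row[0] if isinstance(row[0], dict) else json.loads(row[0])

    # Overwrite the completed state for this state_id with that payload.
    cur.execute(
        "UPDATE state SET completed_state = %s, updated_at = now() WHERE state_id = %s",
        (json.dumps(payload), STATE_ID),
    )
```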
e
How are you passing the Postgres URI to Meltano?
v
Hey there @Edgar Ramírez (Arch.dev), the Postgres URI is being passed through an env variable. Meltano gets it from AWS Secrets Manager.
Posting here as this might help someone: I was able to identify the cause of this weird behavior. The core issue was that we were running two DAGs with the same extractor. It turns out Meltano caches the state file in a folder inside the run directory, and since we were running both pipelines on the same worker, they shared the same ephemeral storage. The state file is created at a path following this structure:
.meltano/run/{extractor_name}/state.json
Since both DAGs used the same extractor, when both pipelines started at the same time the state file in the run folder would get overwritten. This caused Meltano to perform a full sync for any tables it couldn’t find in the state file. The behavior became evident when looking at the logs with debug mode on:
--state', '/usr/local/airflow/.meltano/run/my_extractor/state.json'
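To make the collision concrete: the cached path depends only on the project root and the extractor name, so two DAGs running the same extractor on the same worker resolve to the exact same file. The helper below is a hypothetical reconstruction for illustration, not Meltano’s actual code:

```python
from pathlib import Path


# Hypothetical reconstruction of how the cached state path is derived: it only
# depends on the project root and the extractor name, so concurrent runs of
# the same extractor on one worker read and write the same file.
def run_dir_state_path(project_root: str, extractor_name: str) -> Path:
    return Path(project_root) / ".meltano" / "run" / extractor_name / "state.json"


dag_1 = run_dir_state_path("/usr/local/airflow", "my_extractor")
dag_2 = run_dir_state_path("/usr/local/airflow", "my_extractor")
print(dag_1 == dag_2)  # True -> the two pipelines overwrite each other's state.json
```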
The simplest solution was to create a separate extractor for each DAG; that way the state files are never overwritten.
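For anyone wanting to copy this, one way to get a separate extractor per DAG without duplicating config is an inherited plugin per table set (inherit_from in meltano.yml), so each DAG runs its own named extractor and gets its own .meltano/run/<extractor>/state.json. A rough Airflow-side sketch, assuming two inherited extractors and a Snowflake loader (the names tap-mysql--tables-a, tap-mysql--tables-b and target-snowflake are illustrative, not from this thread):

```python
# Sketch: one inherited extractor per DAG, so the cached state files never collide.
# DAG ids, schedule, paths, and plugin names are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def meltano_dag(dag_id: str, extractor: str) -> DAG:
    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2025, 1, 1),
        schedule="*/5 * * * *",  # near real-time, every 5 minutes
        catchup=False,
    )
    BashOperator(
        task_id="extract_load",
        bash_command=f"cd /usr/local/airflow && meltano run {extractor} target-snowflake",
        dag=dag,
    )
    return dag


# Each DAG uses its own extractor name, so state lands in its own run folder.
dag_tables_a = meltano_dag("prod_nrt_tables_a", "tap-mysql--tables-a")
dag_tables_b = meltano_dag("prod_nrt_tables_b", "tap-mysql--tables-b")
```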
e
Ah, so this is similar to https://github.com/meltano/meltano/issues/8763, but for the state files.
I'm pressed to fix it sooner rather than later
v
@Edgar Ramírez (Arch.dev) would it be possible to append something to the state file name? e.g.:
.meltano/run/{extractor_name}/state_{state_id}.json
?
e
Maybe, I'd have to try it out. See https://github.com/meltano/meltano/pull/8794 for an approach I tried (and failed) with catalog files.