# troubleshooting
p
Heads-up that I think I’ve found a fairly serious bug in the `tap-mysql` that’s bundled with Meltano: https://github.com/transferwise/pipelinewise-tap-mysql/issues/91. This was causing data loss for me in my ELT pipeline. The summary is that their `FULL_TABLE` replication mode doesn’t actually do a full table extraction if you have an auto-incrementing PK, and instead defaults to an incremental extraction in that case. I’d appreciate more eyes on this one to double-check my logic, but if I’m right I think this is quite severe.
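For reference, this is roughly the Singer catalog metadata that is supposed to request a full re-extraction on every run (a minimal sketch; the database, table, and key names are placeholders):

```python
# Minimal sketch of catalog metadata for a stream selected with FULL_TABLE
# replication. "replication-method" is the standard Singer metadata key;
# the database/table details below are placeholders.
catalog_stream_metadata = [
    {
        "breadcrumb": [],
        "metadata": {
            "selected": True,
            "replication-method": "FULL_TABLE",
            "database-name": "mydb",          # placeholder
            "table-key-properties": ["id"],   # auto-incrementing PK
        },
    }
]
```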
As I’m digging further, I’m wondering if this is another Meltano state issue. It seems Meltano might not be honoring the tap’s `clear_bookmark` call, so these intermediate bits of state are persisting between runs. Is this possible? (I’m on Meltano v1.69, FWIW.) I think the tap is not strictly conforming to the spec: regardless of the state that gets passed in, it should do a full sync on each run. But I think it’s being provoked by being passed initial state from a previous run that it isn’t expecting.
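For reference, this is the kind of bookmark handling I mean (a minimal sketch using singer-python’s bookmark helpers; the stream name and exact bookmark keys are my assumptions about what the tap stashes):

```python
# Sketch only: the stream name and bookmark key names are assumptions.
from singer import bookmarks

state = {
    "bookmarks": {
        "mydb-orders": {
            "max_pk_values": {"id": 99000},    # resumption-only bookkeeping
            "last_pk_fetched": {"id": 41823},  # resumption-only bookkeeping
        }
    }
}

# Once the full sync finishes, the tap clears the intermediate keys...
state = bookmarks.clear_bookmark(state, "mydb-orders", "max_pk_values")
state = bookmarks.clear_bookmark(state, "mydb-orders", "last_pk_fetched")

# ...so the final STATE it emits should no longer contain them. The question
# is whether the STATE Meltano persists actually reflects that.
print(state)  # {'bookmarks': {'mydb-orders': {}}}
```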
a
Do I understand correctly that the tap is internally calling a `clear_bookmark()` method of sorts, intending to clear some part of the STATE, and Meltano is apparently merging with a prior STATE in a way that the `clear` operation is not having the intended effect?
I believe Meltano does attempt to merge STATES for failed/canceled executions but I do not think this should affect successfully completed executions. Is it possible for the cases you are seeing that one or more of the jobs are failing and/or being canceled?
Cc @ken_payne who did some research for us on this topic.
f
Wow, for sure if you set the mode to full table it should do full table. It looks to me like the tap is using the catalog information from the database itself and checking to see whether it has an auto-incrementing key, and if so it uses incremental even though this is the full-table utility function. It’s trying to be too clever, IMO. The relevant lines in that file are 31 through 58.
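To make that concrete, here’s a rough paraphrase of the behaviour (a simplified sketch, not the tap’s actual code; the names are illustrative):

```python
# Rough paraphrase of the behaviour described above: when the catalog says
# the table has a single auto-incrementing PK, the "full table" sync becomes
# a resumable, effectively incremental query.
def build_full_table_query(table_name, columns, auto_inc_pk, bookmark):
    """auto_inc_pk: PK column name from the catalog, or None.
    bookmark: per-stream bookmark dict from STATE (may be empty)."""
    select = f"SELECT {', '.join(columns)} FROM {table_name}"
    if auto_inc_pk and bookmark.get("last_pk_fetched"):
        last = bookmark["last_pk_fetched"][auto_inc_pk]
        # Resumes from the previous run's bookmark instead of re-reading
        # the whole table (this is the surprising part).
        return f"{select} WHERE {auto_inc_pk} > {last} ORDER BY {auto_inc_pk}"
    return select
```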
a
It sounds like you are probably spot on with the root-cause analysis. I read the issue you logged for them and it seems pretty well described. I’m sure that if you had time to contribute a PR for them, it would be greatly appreciated.
p
Yeah, as I dig a bit more, it seems that Meltano may not be honoring the `clear_bookmark`, at least not as the tap is expecting to see things. I’m not certain here, as looking at the logs from my pipeline it does look like the keys are deleted from the state that is being generated later in the run. But maybe if Meltano is doing a merge, rather than a replace, of the state dict, this would explain what I’m seeing? Regardless of whether Meltano is leaving stray state, I think the tap is definitely non-conformant with this behavior.
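To make the merge-versus-replace point concrete, here’s a small hypothetical illustration (key names are placeholders): a clear is expressed by a key being absent from the new STATE, and a merge can’t tell “absent because cleared” apart from “absent because unchanged”.

```python
# Hypothetical illustration of why a merge (rather than a replace) of STATE
# payloads would keep "cleared" keys around. Key names are placeholders.
previous_state = {
    "bookmarks": {"mydb-orders": {"last_pk_fetched": {"id": 41823}}}
}
# What the tap emits after a successful full sync (bookmark cleared):
new_state = {"bookmarks": {"mydb-orders": {}}}

def deep_merge(old, new):
    """Naive recursive merge: keys absent from `new` survive from `old`."""
    merged = dict(old)
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

print(deep_merge(previous_state, new_state))
# -> {'bookmarks': {'mydb-orders': {'last_pk_fetched': {'id': 41823}}}}
# The cleared key reappears, because clearing is expressed by absence,
# which a merge cannot see. A plain replace would drop it, as the tap
# intended:
print(new_state)
```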
a
As a mitigation, does it seem like this might be worked around by surgically altering the state as a one-time fix?
p
Yep that’s the fix I applied last night to resolve this.
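Roughly, the surgery amounts to something like this (a sketch only; it assumes the offending intermediate keys are the pk-resumption bookmarks and that the state has been pulled out into a JSON file first; adjust for however your state is actually stored, and back it up before touching it):

```python
# One-time cleanup sketch. The key names below are assumptions; the file
# paths are placeholders.
import json

with open("state.json") as f:
    state = json.load(f)

INTERMEDIATE_KEYS = {"last_pk_fetched", "max_pk_values"}  # assumed names

for stream, bookmark in state.get("bookmarks", {}).items():
    for key in INTERMEDIATE_KEYS & bookmark.keys():
        bookmark.pop(key)
        print(f"Removed {key} from {stream}")

with open("state.cleaned.json", "w") as f:
    json.dump(state, f, indent=2)
```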
a
Okay, great. Glad to hear.
p
And I suppose that also validates the hypothesis
a
👍
p
Even though the logs show a “cleared bookmark” being emitted, the state in the DB after the run still has these “intermediate keys”.
a
And does this happen even if the job runs to completion without error? (As noted above, my expectation would be that `merge()` is called by Meltano only on failure, but it would be good to confirm that.)
p
Yes, a successful job run is not clearing the intermediate state.
Actually, let me check; to be more precise, a successful run does not remove the intermediate state. Let me see if it’s being added on every run, or just when there’s a failure.
It does look like these intermediate keys are being added to the `job.payload` even on a successful run. I don’t think I’ve had a pipeline failure since last night (30-minute interval) and yet they are in the latest state.
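For anyone wanting to check their own project, something along these lines works for eyeballing the persisted payload (a sketch that assumes the default SQLite system database and these table/column names; both may differ by Meltano version and backend):

```python
# Sketch only: assumes .meltano/meltano.db (SQLite), a `job` table with a
# JSON `payload` column, and the Singer state nested under "singer_state".
import json
import sqlite3

conn = sqlite3.connect(".meltano/meltano.db")
row = conn.execute(
    "SELECT payload FROM job "
    "WHERE payload IS NOT NULL AND payload != '' "
    "ORDER BY ended_at DESC LIMIT 1"
).fetchone()

if row:
    payload = json.loads(row[0])
    state = payload.get("singer_state", payload)
    for stream, bookmark in state.get("bookmarks", {}).items():
        # Any pk-resumption keys still present here survived the run.
        print(stream, sorted(bookmark.keys()))
```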
Let me know if it would be helpful to open an issue in GitLab to track the investigation on this side, even if it’s not a serious issue. In another state issue I was told that the state machinery was updated in 1.79 and that I should try that version; I’ve not completed validation testing on that release, so I haven’t tried upgrading yet.
Raised https://gitlab.com/meltano/meltano/-/issues/3284 to track this discussion.
@aaronsteers et al., I have put together a fix in `pipelinewise-tap-mysql` that makes the FULL_TABLE sync less smart and more conformant, so it always does a full table scan and no longer stashes a replication key in the state to do an incremental sync if you re-run it. I’d appreciate more eyes on this since it might break workflows that Meltano users are implicitly depending on: https://github.com/paultiplady/pipelinewise-tap-mysql/pull/4

Basically, the twist is that the LOG_BASED replication mode does a FULL_TABLE sync under the covers for its first sync (when it can’t see evidence in the state that the historical data has been fully synced previously). That FULL_TABLE path was also depending on this “INCREMENTAL with PK replication key if you resume the extract” logic. I could see an argument for keeping the resumable-extract logic for the LOG_BASED initial extract, but since this logic was commonized with FULL_TABLE, and LOG_BASED isn’t even in the Singer spec, I just opted to take the most direct route and rip out the resumability in both the FULL_TABLE and LOG_BASED extracts to solve my problem.

Before I spend time getting the tests green to upstream this fix, I’d appreciate anyone here chiming in if they think that modifying the LOG_BASED initial full-table extract to abandon the state-based resumability will cause them problems. Worst case, this could make extraction take longer for LOG_BASED if you hit an error in the middle of the initial full-table extraction, since it will start from the beginning when you re-extract. Perhaps that’s what would be expected, though? I’m not sure. It would be more effort to preserve this logic for LOG_BASED, and I’m not sure I have much appetite for that work given that I just use FULL_TABLE and the LOG_BASED unit tests are pretty gnarly. But if there’s a simple solution I’m open to considering it.
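For anyone skimming, the shape of the change is roughly this (a simplified sketch of the direction, not the actual diff):

```python
# Simplified sketch: always stream the whole table and never write the
# pk-resumption bookmarks, so re-running FULL_TABLE (or the initial
# LOG_BASED backfill) restarts from the beginning rather than silently
# resuming incrementally. `write_record` stands in for Singer RECORD
# emission; `connection` is any DB-API connection.
def sync_full_table(connection, table_name, columns, write_record):
    query = f"SELECT {', '.join(columns)} FROM {table_name}"
    with connection.cursor() as cursor:
        cursor.execute(query)
        for row in cursor:
            write_record(table_name, dict(zip(columns, row)))
    # Deliberately no max_pk_values / last_pk_fetched bookmarks here.
```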
Actually, please see https://github.com/transferwise/pipelinewise-tap-mysql/pull/96 where I opened a WIP PR with my current work. Probably best to have any discussion over there.