# troubleshooting
p
Heads-up that I think I’ve found a fairly serious bug in the `tap-mysql` that’s bundled with Meltano: https://github.com/transferwise/pipelinewise-tap-mysql/issues/91. This was causing data loss for me in my ELT pipeline. The summary is that their `FULL_TABLE` replication mode doesn’t actually do a full table extraction if you have an auto-incrementing PK, and instead defaults to an incremental extraction in that case. I’d appreciate more eyes on this one to double-check my logic, but if I’m right I think this is quite severe.
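For reference, this is roughly the Singer catalog metadata that is supposed to request a full re-extraction on every run (a minimal sketch; the database, table, and key names are placeholders):

```python
# Minimal sketch of catalog metadata for a stream selected with FULL_TABLE
# replication. "replication-method" is the standard Singer metadata key;
# the database/table details below are placeholders.
catalog_stream_metadata = [
    {
        "breadcrumb": [],
        "metadata": {
            "selected": True,
            "replication-method": "FULL_TABLE",
            "database-name": "mydb",          # placeholder
            "table-key-properties": ["id"],   # auto-incrementing PK
        },
    }
]
```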
As I’m digging further, I’m wondering if this is another Meltano state issue. It seems Meltano might not be honoring the tap’s `clear_bookmark` call, so these intermediate bits of state are persisting between runs. Is this possible? (I’m on Meltano v1.69, FWIW.) I think the tap is not strictly conforming to the spec: regardless of the state that gets passed in, it should do a full sync on each run. But I think it’s being provoked by being passed initial state from a previous run that it isn’t expecting.
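For reference, this is the kind of bookmark handling I mean (a minimal sketch using singer-python’s bookmark helpers; the stream name and exact bookmark keys are my assumptions about what the tap stashes):

```python
# Sketch only: the stream name and bookmark key names are assumptions.
from singer import bookmarks

state = {
    "bookmarks": {
        "mydb-orders": {
            "max_pk_values": {"id": 99000},    # resumption-only bookkeeping
            "last_pk_fetched": {"id": 41823},  # resumption-only bookkeeping
        }
    }
}

# Once the full sync finishes, the tap clears the intermediate keys...
state = bookmarks.clear_bookmark(state, "mydb-orders", "max_pk_values")
state = bookmarks.clear_bookmark(state, "mydb-orders", "last_pk_fetched")

# ...so the final STATE it emits should no longer contain them. The question
# is whether the STATE Meltano persists actually reflects that.
print(state)  # {'bookmarks': {'mydb-orders': {}}}
```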
a
Do I understand correctly that the tap is internally calling a `clear_bookmark()` method of sorts, intending to clear some part of the STATE, and Meltano is apparently merging with a prior STATE in a way that the `clear` operation is not having the intended effect?
I believe Meltano does attempt to merge STATES for failed/canceled executions but I do not think this should affect successfully completed executions. Is it possible for the cases you are seeing that one or more of the jobs are failing and/or being canceled?
Cc @ken_payne who did some research for us on this topic.
f
Wow, for sure if you set the mode to full table it should do full table. It looks to me like the tap is using the catalog information from the database itself and checking to see whether it has an auto-incrementing key, and if so it uses incremental even though this is the full-table utility function. It’s trying to be too clever, IMO. The relevant lines in that file are 31 through 58.
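To make that concrete, here’s a rough paraphrase of the behaviour (a simplified sketch, not the tap’s actual code; the names are illustrative):

```python
# Rough paraphrase of the behaviour described above: when the catalog says
# the table has a single auto-incrementing PK, the "full table" sync becomes
# a resumable, effectively incremental query.
def build_full_table_query(table_name, columns, auto_inc_pk, bookmark):
    """auto_inc_pk: PK column name from the catalog, or None.
    bookmark: per-stream bookmark dict from STATE (may be empty)."""
    select = f"SELECT {', '.join(columns)} FROM {table_name}"
    if auto_inc_pk and bookmark.get("last_pk_fetched"):
        last = bookmark["last_pk_fetched"][auto_inc_pk]
        # Resumes from the previous run's bookmark instead of re-reading
        # the whole table (this is the surprising part).
        return f"{select} WHERE {auto_inc_pk} > {last} ORDER BY {auto_inc_pk}"
    return select
```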
a
It sounds like you are probably spot on with the root-cause analysis. I read the issue you logged for them and it seems pretty well described. I’m sure that if you had time to contribute a PR for them, it would be greatly appreciated.
p
Yeah, as I dig a bit more, it seems that Meltano may not be honoring the `clear_bookmark`, at least not as the tap is expecting to see things. I’m not certain here, as looking at the logs from my pipeline it does look like the keys are deleted from the state that is being generated later in the run. But maybe if Meltano is doing a merge, rather than a replace, of the state dict, this would explain what I’m seeing? Regardless of whether Meltano is leaving stray state, I think the tap is definitely non-conformant with this behavior.
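To make the merge-versus-replace point concrete, here’s a small hypothetical illustration (key names are placeholders): a clear is expressed by a key being absent from the new STATE, and a merge can’t tell “absent because cleared” apart from “absent because unchanged”.

```python
# Hypothetical illustration of why a merge (rather than a replace) of STATE
# payloads would keep "cleared" keys around. Key names are placeholders.
previous_state = {
    "bookmarks": {"mydb-orders": {"last_pk_fetched": {"id": 41823}}}
}
# What the tap emits after a successful full sync (bookmark cleared):
new_state = {"bookmarks": {"mydb-orders": {}}}

def deep_merge(old, new):
    """Naive recursive merge: keys absent from `new` survive from `old`."""
    merged = dict(old)
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

print(deep_merge(previous_state, new_state))
# -> {'bookmarks': {'mydb-orders': {'last_pk_fetched': {'id': 41823}}}}
# The cleared key reappears, because clearing is expressed by absence,
# which a merge cannot see. A plain replace would drop it, as the tap
# intended:
print(new_state)
```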
a
As a mitigation, does it seem like this might be worked around by surgically altering the state as a one-time fix?
p
Yep that’s the fix I applied last night to resolve this.
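Roughly, the surgery amounts to something like this (a sketch only; it assumes the offending intermediate keys are the pk-resumption bookmarks and that the state has been pulled out into a JSON file first; adjust for however your state is actually stored, and back it up before touching it):

```python
# One-time cleanup sketch. The key names below are assumptions; the file
# paths are placeholders.
import json

with open("state.json") as f:
    state = json.load(f)

INTERMEDIATE_KEYS = {"last_pk_fetched", "max_pk_values"}  # assumed names

for stream, bookmark in state.get("bookmarks", {}).items():
    for key in INTERMEDIATE_KEYS & bookmark.keys():
        bookmark.pop(key)
        print(f"Removed {key} from {stream}")

with open("state.cleaned.json", "w") as f:
    json.dump(state, f, indent=2)
```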
a
Okay, great. Glad to hear.
p
And I suppose that also validates the hypothesis
a
👍
p
Even though the logs show a “cleared bookmark” being emitted, the state in the DB after the run still has these “intermediate keys”.
a
And does this happen even if the job runs to completion without error? (As noted above, my expectation would be that `merge()` is called by Meltano only on failure, but it would be good to confirm that.)
p
Yes, a successful job run is not clearing the intermediate state.
Actually, let me check; to be more precise, a successful run does not remove the intermediate state. Let me see if it’s being added on every run, or just when there’s a failure.
It does look like these intermediate keys are being added to the `job.payload` even on a successful run. I don’t think I’ve had a pipeline failure since last night (30-minute interval) and yet they are in the latest state.
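For anyone wanting to check their own project, something along these lines works for eyeballing the persisted payload (a sketch that assumes the default SQLite system database and these table/column names; both may differ by Meltano version and backend):

```python
# Sketch only: assumes .meltano/meltano.db (SQLite), a `job` table with a
# JSON `payload` column, and the Singer state nested under "singer_state".
import json
import sqlite3

conn = sqlite3.connect(".meltano/meltano.db")
row = conn.execute(
    "SELECT payload FROM job "
    "WHERE payload IS NOT NULL AND payload != '' "
    "ORDER BY ended_at DESC LIMIT 1"
).fetchone()

if row:
    payload = json.loads(row[0])
    state = payload.get("singer_state", payload)
    for stream, bookmark in state.get("bookmarks", {}).items():
        # Any pk-resumption keys still present here survived the run.
        print(stream, sorted(bookmark.keys()))
```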
Let me know if it would be helpful to open an issue in GitLab to track the investigation on this side, even if it’s not a serious issue. In another state issue I was told that the state machinery was updated in 1.79 and that I should try that version; I’ve not completed validation testing on that release, so I haven’t tried upgrading yet.
Raised https://gitlab.com/meltano/meltano/-/issues/3284 to track this discussion.
@aaronsteers et al., I have put together a fix in `pipelinewise-tap-mysql` that makes the FULL_TABLE sync less smart and more conformant, so it always does a full table scan and no longer stashes a replication key in the state to do an incremental sync if you re-run it. I’d appreciate more eyes on this since it might break workflows that Meltano users are implicitly depending on: https://github.com/paultiplady/pipelinewise-tap-mysql/pull/4

Basically, the twist is that the LOG_BASED replication mode does a FULL_TABLE sync under the covers for its first sync (when it can’t see evidence in the state that the historical data has been fully synced previously). That FULL_TABLE path was also depending on this “INCREMENTAL with PK replication key if you resume the extract” logic. I could see an argument for keeping the resumable-extract logic for the LOG_BASED initial extract, but since this logic was commonized with FULL_TABLE, and LOG_BASED isn’t even in the Singer spec, I just opted to take the most direct route and rip out the resumability in both the FULL_TABLE and LOG_BASED extracts to solve my problem.

Before I spend time getting the tests green to upstream this fix, I’d appreciate anyone here chiming in if they think that modifying the LOG_BASED initial full-table extract to abandon the state-based resumability will cause them problems. Worst case, this could make extraction take longer for LOG_BASED if you hit an error in the middle of the initial full-table extraction, since it will start from the beginning when you re-extract. Perhaps that’s what would be expected, though? I’m not sure. It would be more effort to preserve this logic for LOG_BASED, and I’m not sure I have much appetite for that work given that I just use FULL_TABLE and the LOG_BASED unit tests are pretty gnarly. But if there’s a simple solution I’m open to considering it.
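For anyone skimming, the shape of the change is roughly this (a simplified sketch of the direction, not the actual diff):

```python
# Simplified sketch: always stream the whole table and never write the
# pk-resumption bookmarks, so re-running FULL_TABLE (or the initial
# LOG_BASED backfill) restarts from the beginning rather than silently
# resuming incrementally. `write_record` stands in for Singer RECORD
# emission; `connection` is any DB-API connection.
def sync_full_table(connection, table_name, columns, write_record):
    query = f"SELECT {', '.join(columns)} FROM {table_name}"
    with connection.cursor() as cursor:
        cursor.execute(query)
        for row in cursor:
            write_record(table_name, dict(zip(columns, row)))
    # Deliberately no max_pk_values / last_pk_fetched bookmarks here.
```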
Actually, please see https://github.com/transferwise/pipelinewise-tap-mysql/pull/96 where I opened a WIP PR with my current work. Probably best to have any discussion over there.