paul_tiplady
02/18/2022, 5:07 PMtap-mysql
that’s bundled with Meltano: https://github.com/transferwise/pipelinewise-tap-mysql/issues/91
This was causing data-loss for me in my ELT pipeline.
The summary is that their FULL_TABLE
replication mode doesn’t actually do a full table extraction if you have an auto-incrementing PK, and instead defaults to an incremental extraction in that case.
I’d appreciate more eyes on this one to double-check my logic, but if I’m right I think this is quite severe.paul_tiplady
02/18/2022, 5:55 PMclear_bookmark
call and so these intermediate bits of state are persisting between runs — is this possible? (I’m on Meltano v1.69 FWIW).
I think the tap is not strictly conforming to the spec, as regardless of the state that gets passed in, it should do a full sync on each run. But I think it’s being provoked by passing in initial state from a previous run that it’s not expecting.aaronsteers
02/18/2022, 6:03 PMclear_bookmark()
method of sorts, intending to clear some part of the STATE, and Meltano is apparently merging with a prior STATE in a way that the clear
operation is not having the intended affect?aaronsteers
02/18/2022, 6:05 PMaaronsteers
02/18/2022, 6:06 PMfred_reimer
02/18/2022, 6:06 PMaaronsteers
02/18/2022, 6:09 PMpaul_tiplady
02/18/2022, 6:09 PMclear_bookmark
— at least as the tap is expecting to see things. I’m not certain here, as looking at the logs from my pipeline it does look like the keys are deleted from the state that is being generated later in the run. But maybe if Meltano is doing a merge, rather than a replace, of the state dict, this would explain what I’m seeing?
Regardless of whether Meltano is leaving stray state, I think the tap is definitely non-conformant with this behavior.aaronsteers
02/18/2022, 6:10 PMpaul_tiplady
02/18/2022, 6:10 PMaaronsteers
02/18/2022, 6:10 PMpaul_tiplady
02/18/2022, 6:10 PMaaronsteers
02/18/2022, 6:11 PMpaul_tiplady
02/18/2022, 6:11 PMaaronsteers
02/18/2022, 6:12 PMmerge()
is called by Meltano only on failure, but would be good to confirm that.)paul_tiplady
02/18/2022, 6:15 PMpaul_tiplady
02/18/2022, 6:16 PMpaul_tiplady
02/18/2022, 6:20 PMjob.payload
even on a successful run. I don’t think I’ve had a pipeline failure since last night (30-minute interval) and yet they are in the latest state.paul_tiplady
02/18/2022, 6:26 PMpaul_tiplady
02/24/2022, 1:26 AMpaul_tiplady
03/01/2022, 5:17 PMpipelinewise-tap-mysql
that makes the FULL_TABLE sync less smart, and more conformant, so it always does a full table scan and no longer stashes a replication key in the state to do an incremental sync if you re-run it. I’d appreciate more eyes on this since it might break workflows that Meltano users are implicitly depending on:
https://github.com/paultiplady/pipelinewise-tap-mysql/pull/4
Basically, the twist is that the LOG_BASED replication mode does a FULL_TABLE sync under the covers for its first sync (when it can’t see evidence in the state that the historical data has been fully sync’d previously). That FULL_TABLE was also depending on this “INCREMENTAL with pk replication key if you resume the extract” logic as well. I could see an argument for keeping the resumable extract logic for the LOG_BASED initial extract — but since this logic was commonized with FULL_TABLE, and LOG_BASED isn’t even in the Singer spec, I just opted to take the most direct route and rip out the resumability in both FULL_TABLE and LOG_BASED extract to solve my problem.
Before I spend time getting the tests green to upstream this fix, I’d appreciate anyone here chiming in if they think that modifying the LOG_BASED initial full-table extract to abandon the state-based resumability will cause them problems. Worst-case, this could make extraction take longer for LOG_BASED if you hit an error in the middle of the initial full-table extraction since it will start at the beginning when you re-extract. Perhaps that’s what would be expected though? I’m not sure. It would be more effort to preserve this logic for LOG_BASED and I’m not sure I have a lot of appetite for that work given I just use FULL_TABLE, and the LOG_BASED unit tests are pretty gnarly. But if there’s a simple solution I’m open to considering it.paul_tiplady
03/01/2022, 7:42 PM