matt_cooley
12/16/2020, 5:50 AMtap-mysql
and target-postgres
(both pipelinewise variants) with replication-method=LOG_BASED
I have the following experience:
• First time it copies only the select:
columns (my understanding is this run is equivalent to replication-method=FULL_TABLE
)
• There’s a change to the upstream data (i.e. tap-mysql
)
• The next pipeline run copies every column (i.e. SELECT *
)
I’m pretty interested in the LOG_BASED
replication of explicitly selected columns. Not sure if this is a bug or something on my end, but I’ve been able to replicate a bunch of times on my end.
P.S. I’ve also been playing with FULL_TABLE
replication. It doesn’t recognize upstream deletes. Is that expected behavior? I didn’t think it was.
P.P.S. I’m guessing it’s gonna be Douwe responding. Thanks a bunch. Your responsiveness has been fantastic. Thank you.douwe_maan
12/16/2020, 4:18 PMThanks a bunch. Your responsiveness has been fantastic. Thank you.@matt_cooley Thanks for the kind words 😊
I’m pretty interested in theThis sounds like a bug in https://github.com/transferwise/pipelinewise-tap-mysql, or at least something to investigate further with their team.replication of explicitly selected columns. Not sure if this is a bug or something on my end, but I’ve been able to replicate a bunch of times on my end.LOG_BASED
property_is_selected
is referenced once here, in the log-based replication logic: https://github.com/transferwise/pipelinewise-tap-mysql/blob/master/tap_mysql/sync_strategies/binlog.py#L407 as well as here, in the generic discovery logic: https://github.com/transferwise/pipelinewise-tap-mysql/blob/5a9e043b2581d01229970f8b9debf3452a877515/tap_mysql/discover_utils.py#L268, and from a quick perusal of the source that seems like it would be sufficient, but the implementation is quite complex so I'm probably missing something.
It may help to enable debug mode (meltano --log-level=debug elt ...
) and see if you're getting different SCHEMA
messages on the second (log-based) run then you were on the first (full table) run, or if the new columns are only included in the RECORD
messages. That may help identify where the column selection is not being applied correctly.
P.S. I’ve also been playing withWith full table replication, the tap would not output records for deleted rows (naturally), but it'd still be up to the target to delete records that were synced previously and are missing from the latest batch of records. Some taps outputreplication. It doesn’t recognize upstream deletes. Is that expected behavior? I didn’t think it was.FULL_TABLE
ACTIVATE_VERSION
messages when running in FULL_TABLE mode, that the target can use to delete all records that were contained in the old version but not the new one (https://meltano.slack.com/archives/CFG3C3C66/p1601148878022700?thread_ts=1600699540.015800&cid=CFG3C3C66). datamill-co's target-postgres implements this: https://github.com/datamill-co/target-postgres/blob/383196ae873ea146f61f248e6c34dd1340d08d1f/target_postgres/postgres.py#L311, but it looks like pipelinewise's recognizes the messages but doesn't actually handle it correctly: https://github.com/transferwise/pipelinewise-target-postgres/blob/94e9c61300fc46f11988e52b29193dfea69d816e/target_postgres/__init__.py#L224-L230
With log-based replication, it looks like the tap outputs _SDC_DELETED_AT
columns (https://github.com/transferwise/pipelinewise-tap-mysql/blob/5a9e043b2581d01229970f8b9debf3452a877515/tap_mysql/sync_strategies/binlog.py#L324) that the target knows to delete when the hard_delete
setting is enabled: https://meltano.com/plugins/loaders/postgres--transferwise.html#hard-deletematt_cooley
12/16/2020, 4:36 PMLOG_BASED
replication.
• Transferwise’s FULL_TABLE
replication doesn’t seem to handle upstream deletes becuse it’s not handling the ACTIVATE_VERSIONS
messages.
I figured FULL_TABLE
was doing a truncation. Either way we want the ability for soft deletes in certain circumstances so a FULL_TABLE
wouldn’t always work for us.douwe_maan
12/16/2020, 4:44 PMFULL_TABLE
is a property on the tap, which cannot itself "do truncation". That needs to be done by the target, which doesn't actually know the tap's replication method and typically just upserts. Taps can use ACTIVATE_VERSION
to communicate to targets that everything up to that point should be discarded. In this case, the tap may be implementing FULL_TABLE and ACTIVATE_VERSION correctly, but the target doesn't handle ACTIVATE_VERSION
as it should (understandably, because it's very poorly documented).matt_cooley
12/16/2020, 5:05 PMFULL_TABLE
tells the TAP to scan the full table and send it to the TARGET to upsert. ACTIVATE_VERSIONS
does the truncation (or some sort of diff) to delete rows that were deleted upstream. That makes sense. Thanksdouwe_maan
12/16/2020, 5:39 PMmatt_cooley
12/29/2020, 12:29 AMLOG_BASED
replication method. It picks up new columns added upstream but doesn’t allow for explicitly selected columns. They rerun the `discover_catalog` each time instead of using looking at the catalog that was initially created. Just an FYI in case you were curiousdouwe_maan
12/30/2020, 9:07 PMinclusion: available
, not inclusion: automatic
, which makes Meltano believe that selected:true/false
will be respected.
We couldn't just fix that discovered claim though, because at that point the tap doesn't know yet which replication method is going to be used. So to fix the bug, I think the tap should start respecting the selected fields in the provided catalog here, at least as opt-in behavior when a setting is flipped.
Are you planning to file an issue for this in the tap repo?matt_cooley
12/30/2020, 11:39 PMmysql-tap
and it appears to be expected behavior on their part.douwe_maan
12/30/2020, 11:42 PMmatt_cooley
12/30/2020, 11:46 PMdouwe_maan
12/30/2020, 11:46 PMdouwe_maan
12/30/2020, 11:47 PMmatt_cooley
12/30/2020, 11:49 PMmatt_cooley
01/04/2021, 11:22 PMdouwe_maan
01/04/2021, 11:39 PMselected:true/false
, not select:true/false
😉matt_cooley
01/04/2021, 11:40 PMmatt_cooley
01/15/2021, 4:38 PMdouwe_maan
01/15/2021, 5:06 PMmatt_cooley
01/15/2021, 5:25 PM