matt_cooley
12/16/2020, 5:50 AMtap-mysql and target-postgres (both pipelinewise variants) with replication-method=LOG_BASED I have the following experience:
• First time it copies only the select: columns (my understanding is this run is equivalent to replication-method=FULL_TABLE)
• There’s a change to the upstream data (i.e. tap-mysql)
• The next pipeline run copies every column (i.e. SELECT *)
I’m pretty interested in the LOG_BASED replication of explicitly selected columns. Not sure if this is a bug or something on my end, but I’ve been able to replicate a bunch of times on my end.
P.S. I’ve also been playing with FULL_TABLE replication. It doesn’t recognize upstream deletes. Is that expected behavior? I didn’t think it was.
P.P.S. I’m guessing it’s gonna be Douwe responding. Thanks a bunch. Your responsiveness has been fantastic. Thank you.douwe_maan
12/16/2020, 4:18 PMThanks a bunch. Your responsiveness has been fantastic. Thank you.@matt_cooley Thanks for the kind words 😊
I’m pretty interested in theThis sounds like a bug in https://github.com/transferwise/pipelinewise-tap-mysql, or at least something to investigate further with their team.replication of explicitly selected columns. Not sure if this is a bug or something on my end, but I’ve been able to replicate a bunch of times on my end.LOG_BASED
property_is_selected is referenced once here, in the log-based replication logic: https://github.com/transferwise/pipelinewise-tap-mysql/blob/master/tap_mysql/sync_strategies/binlog.py#L407 as well as here, in the generic discovery logic: https://github.com/transferwise/pipelinewise-tap-mysql/blob/5a9e043b2581d01229970f8b9debf3452a877515/tap_mysql/discover_utils.py#L268, and from a quick perusal of the source that seems like it would be sufficient, but the implementation is quite complex so I'm probably missing something.
It may help to enable debug mode (meltano --log-level=debug elt ... ) and see if you're getting different SCHEMA messages on the second (log-based) run then you were on the first (full table) run, or if the new columns are only included in the RECORD messages. That may help identify where the column selection is not being applied correctly.
P.S. I’ve also been playing withWith full table replication, the tap would not output records for deleted rows (naturally), but it'd still be up to the target to delete records that were synced previously and are missing from the latest batch of records. Some taps outputreplication. It doesn’t recognize upstream deletes. Is that expected behavior? I didn’t think it was.FULL_TABLE
ACTIVATE_VERSION messages when running in FULL_TABLE mode, that the target can use to delete all records that were contained in the old version but not the new one (https://meltano.slack.com/archives/CFG3C3C66/p1601148878022700?thread_ts=1600699540.015800&cid=CFG3C3C66). datamill-co's target-postgres implements this: https://github.com/datamill-co/target-postgres/blob/383196ae873ea146f61f248e6c34dd1340d08d1f/target_postgres/postgres.py#L311, but it looks like pipelinewise's recognizes the messages but doesn't actually handle it correctly: https://github.com/transferwise/pipelinewise-target-postgres/blob/94e9c61300fc46f11988e52b29193dfea69d816e/target_postgres/__init__.py#L224-L230
With log-based replication, it looks like the tap outputs _SDC_DELETED_AT columns (https://github.com/transferwise/pipelinewise-tap-mysql/blob/5a9e043b2581d01229970f8b9debf3452a877515/tap_mysql/sync_strategies/binlog.py#L324) that the target knows to delete when the hard_delete setting is enabled: https://meltano.com/plugins/loaders/postgres--transferwise.html#hard-deletematt_cooley
12/16/2020, 4:36 PMLOG_BASED replication.
• Transferwise’s FULL_TABLE replication doesn’t seem to handle upstream deletes becuse it’s not handling the ACTIVATE_VERSIONS messages.
I figured FULL_TABLE was doing a truncation. Either way we want the ability for soft deletes in certain circumstances so a FULL_TABLE wouldn’t always work for us.douwe_maan
12/16/2020, 4:44 PMFULL_TABLE is a property on the tap, which cannot itself "do truncation". That needs to be done by the target, which doesn't actually know the tap's replication method and typically just upserts. Taps can use ACTIVATE_VERSION to communicate to targets that everything up to that point should be discarded. In this case, the tap may be implementing FULL_TABLE and ACTIVATE_VERSION correctly, but the target doesn't handle ACTIVATE_VERSION as it should (understandably, because it's very poorly documented).matt_cooley
12/16/2020, 5:05 PMFULL_TABLE tells the TAP to scan the full table and send it to the TARGET to upsert. ACTIVATE_VERSIONS does the truncation (or some sort of diff) to delete rows that were deleted upstream. That makes sense. Thanksdouwe_maan
12/16/2020, 5:39 PMmatt_cooley
12/29/2020, 12:29 AMLOG_BASED replication method. It picks up new columns added upstream but doesn’t allow for explicitly selected columns. They rerun the `discover_catalog` each time instead of using looking at the catalog that was initially created. Just an FYI in case you were curiousdouwe_maan
12/30/2020, 9:07 PMinclusion: available, not inclusion: automatic , which makes Meltano believe that selected:true/false will be respected.
We couldn't just fix that discovered claim though, because at that point the tap doesn't know yet which replication method is going to be used. So to fix the bug, I think the tap should start respecting the selected fields in the provided catalog here, at least as opt-in behavior when a setting is flipped.
Are you planning to file an issue for this in the tap repo?matt_cooley
12/30/2020, 11:39 PMmysql-tap and it appears to be expected behavior on their part.douwe_maan
12/30/2020, 11:42 PMmatt_cooley
12/30/2020, 11:46 PMdouwe_maan
12/30/2020, 11:46 PMdouwe_maan
12/30/2020, 11:47 PMmatt_cooley
12/30/2020, 11:49 PMmatt_cooley
01/04/2021, 11:22 PMdouwe_maan
01/04/2021, 11:39 PMselected:true/false , not select:true/false 😉matt_cooley
01/04/2021, 11:40 PMmatt_cooley
01/15/2021, 4:38 PMdouwe_maan
01/15/2021, 5:06 PMmatt_cooley
01/15/2021, 5:25 PM