Hi again When I run `tap mysql` and `target postgres` both p Meltano #plugins-general

Hi again! When I run `tap-mysql` and `target-postg...

matt_cooley

12/16/2020, 5:50 AM

Hi again! When I run

tap-mysql

and

target-postgres

(both pipelinewise variants) with

replication-method=LOG_BASED

I have the following experience: • First time it copies only the

select:

columns (my understanding is this run is equivalent to

replication-method=FULL_TABLE

) • There’s a change to the upstream data (i.e.

tap-mysql

) • The next pipeline run copies every column (i.e.

SELECT *

) I’m pretty interested in the

LOG_BASED

replication of explicitly selected columns. Not sure if this is a bug or something on my end, but I’ve been able to replicate a bunch of times on my end. P.S. I’ve also been playing with

FULL_TABLE

replication. It doesn’t recognize upstream deletes. Is that expected behavior? I didn’t think it was. P.P.S. I’m guessing it’s gonna be Douwe responding. Thanks a bunch. Your responsiveness has been fantastic. Thank you.

douwe_maan

12/16/2020, 4:18 PM

Thanks a bunch. Your responsiveness has been fantastic. Thank you.

@matt_cooley Thanks for the kind words 😊

I’m pretty interested in the
LOG_BASED
replication of explicitly selected columns. Not sure if this is a bug or something on my end, but I’ve been able to replicate a bunch of times on my end.

This sounds like a bug in https://github.com/transferwise/pipelinewise-tap-mysql, or at least something to investigate further with their team.

property_is_selected

is referenced once here, in the log-based replication logic: https://github.com/transferwise/pipelinewise-tap-mysql/blob/master/tap_mysql/sync_strategies/binlog.py#L407 as well as here, in the generic discovery logic: https://github.com/transferwise/pipelinewise-tap-mysql/blob/5a9e043b2581d01229970f8b9debf3452a877515/tap_mysql/discover_utils.py#L268, and from a quick perusal of the source that seems like it would be sufficient, but the implementation is quite complex so I'm probably missing something. It may help to enable debug mode (

meltano --log-level=debug elt ...

) and see if you're getting different

SCHEMA

messages on the second (log-based) run then you were on the first (full table) run, or if the new columns are only included in the

RECORD

messages. That may help identify where the column selection is not being applied correctly.

P.S. I’ve also been playing with
FULL_TABLE
replication. It doesn’t recognize upstream deletes. Is that expected behavior? I didn’t think it was.

With full table replication, the tap would not output records for deleted rows (naturally), but it'd still be up to the target to delete records that were synced previously and are missing from the latest batch of records. Some taps output

ACTIVATE_VERSION

messages when running in FULL_TABLE mode, that the target can use to delete all records that were contained in the old version but not the new one (https://meltano.slack.com/archives/CFG3C3C66/p1601148878022700?thread_ts=1600699540.015800&cid=CFG3C3C66). datamill-co's target-postgres implements this: https://github.com/datamill-co/target-postgres/blob/383196ae873ea146f61f248e6c34dd1340d08d1f/target_postgres/postgres.py#L311, but it looks like pipelinewise's recognizes the messages but doesn't actually handle it correctly: https://github.com/transferwise/pipelinewise-target-postgres/blob/94e9c61300fc46f11988e52b29193dfea69d816e/target_postgres/__init__.py#L224-L230 With log-based replication, it looks like the tap outputs

_SDC_DELETED_AT

columns (https://github.com/transferwise/pipelinewise-tap-mysql/blob/5a9e043b2581d01229970f8b9debf3452a877515/tap_mysql/sync_strategies/binlog.py#L324) that the target knows to delete when the

hard_delete

setting is enabled: https://meltano.com/plugins/loaders/postgres--transferwise.html#hard-delete

matt_cooley

12/16/2020, 4:36 PM

Interesting. So basically: • There’s likely a bug with Transferwise’s

LOG_BASED

replication. • Transferwise’s

FULL_TABLE

replication doesn’t seem to handle upstream deletes becuse it’s not handling the

ACTIVATE_VERSIONS

messages. I figured

FULL_TABLE

was doing a truncation. Either way we want the ability for soft deletes in certain circumstances so a

FULL_TABLE

wouldn’t always work for us.

douwe_maan

12/16/2020, 4:44 PM

@matt_cooley Correct on point 1. As for point 2, it's important to realize the different roles of the tap and target here:

FULL_TABLE

is a property on the tap, which cannot itself "do truncation". That needs to be done by the target, which doesn't actually know the tap's replication method and typically just upserts. Taps can use

ACTIVATE_VERSION

to communicate to targets that everything up to that point should be discarded. In this case, the tap may be implementing FULL_TABLE and ACTIVATE_VERSION correctly, but the target doesn't handle

ACTIVATE_VERSION

as it should (understandably, because it's very poorly documented).

matt_cooley

12/16/2020, 5:05 PM

Thanks, that makes sense. Got it. So

FULL_TABLE

tells the TAP to scan the full table and send it to the TARGET to upsert.

ACTIVATE_VERSIONS

does the truncation (or some sort of diff) to delete rows that were deleted upstream. That makes sense. Thanks

douwe_maan

12/16/2020, 5:39 PM

Exactly!

matt_cooley

12/29/2020, 12:29 AM

@douwe_maan Hi! I’ve done some more digging into this issue and it appears that piplinewise wrote a custom version of the

LOG_BASED

replication method. It picks up new columns added upstream but doesn’t allow for explicitly selected columns. They rerun the `discover_catalog` each time instead of using looking at the catalog that was initially created. Just an FYI in case you were curious

douwe_maan

12/30/2020, 9:07 PM

@matt_cooley Interesting, so field selection isn't respected when using log-based replication, which seems to be implemented that way intentionally, but looks to Meltano like a bug since the discovered catalog claims those properties as

inclusion: available

, not

inclusion: automatic

, which makes Meltano believe that

selected:true/false

will be respected. We couldn't just fix that discovered claim though, because at that point the tap doesn't know yet which replication method is going to be used. So to fix the bug, I think the tap should start respecting the selected fields in the provided catalog here, at least as opt-in behavior when a setting is flipped. Are you planning to file an issue for this in the tap repo?

matt_cooley

12/30/2020, 11:39 PM

I wasn’t planning on it since I found the same basic functionality in the Singer

mysql-tap

and it appears to be expected behavior on their part.

douwe_maan

12/30/2020, 11:42 PM

Ah, OK. I take it this behavior is acceptable for your use case, then? 🙂

matt_cooley

12/30/2020, 11:46 PM

Oh. Definitely not. That’s why we’re not using that tap. Fair point though. I’ll put it on my backlog to report it as an issue. I agree they should at least have a flag for it.

douwe_maan

12/30/2020, 11:46 PM

😄

douwe_maan

12/30/2020, 11:47 PM

Sounds good, thanks!

matt_cooley

12/30/2020, 11:49 PM

👍

matt_cooley

01/04/2021, 11:22 PM

Here’s the issue in case you’re curious https://github.com/transferwise/pipelinewise-tap-mysql/issues/50