Hi again! When I run `tap-mysql` and `target-postg...
# plugins-general
m
Hi again! When I run
tap-mysql
and
target-postgres
(both pipelinewise variants) with
replication-method=LOG_BASED
I have the following experience: • First time it copies only the
select:
columns (my understanding is this run is equivalent to
replication-method=FULL_TABLE
) • There’s a change to the upstream data (i.e.
tap-mysql
) • The next pipeline run copies every column (i.e.
SELECT *
) I’m pretty interested in the
LOG_BASED
replication of explicitly selected columns. Not sure if this is a bug or something on my end, but I’ve been able to replicate a bunch of times on my end. P.S. I’ve also been playing with
FULL_TABLE
replication. It doesn’t recognize upstream deletes. Is that expected behavior? I didn’t think it was. P.P.S. I’m guessing it’s gonna be Douwe responding. Thanks a bunch. Your responsiveness has been fantastic. Thank you.
d
Thanks a bunch. Your responsiveness has been fantastic. Thank you.
@matt_cooley Thanks for the kind words 😊
I’m pretty interested in the 
LOG_BASED
  replication of explicitly selected columns. Not sure if this is a bug or something on my end, but I’ve been able to replicate a bunch of times on my end.
This sounds like a bug in https://github.com/transferwise/pipelinewise-tap-mysql, or at least something to investigate further with their team.
property_is_selected
is referenced once here, in the log-based replication logic: https://github.com/transferwise/pipelinewise-tap-mysql/blob/master/tap_mysql/sync_strategies/binlog.py#L407 as well as here, in the generic discovery logic: https://github.com/transferwise/pipelinewise-tap-mysql/blob/5a9e043b2581d01229970f8b9debf3452a877515/tap_mysql/discover_utils.py#L268, and from a quick perusal of the source that seems like it would be sufficient, but the implementation is quite complex so I'm probably missing something. It may help to enable debug mode (
meltano --log-level=debug elt ...
) and see if you're getting different
SCHEMA
messages on the second (log-based) run then you were on the first (full table) run, or if the new columns are only included in the
RECORD
messages. That may help identify where the column selection is not being applied correctly.
P.S. I’ve also been playing with 
FULL_TABLE
 replication. It doesn’t recognize upstream deletes. Is that expected behavior? I didn’t think it was.
With full table replication, the tap would not output records for deleted rows (naturally), but it'd still be up to the target to delete records that were synced previously and are missing from the latest batch of records. Some taps output
ACTIVATE_VERSION
messages when running in FULL_TABLE mode, that the target can use to delete all records that were contained in the old version but not the new one (https://meltano.slack.com/archives/CFG3C3C66/p1601148878022700?thread_ts=1600699540.015800&cid=CFG3C3C66). datamill-co's target-postgres implements this: https://github.com/datamill-co/target-postgres/blob/383196ae873ea146f61f248e6c34dd1340d08d1f/target_postgres/postgres.py#L311, but it looks like pipelinewise's recognizes the messages but doesn't actually handle it correctly: https://github.com/transferwise/pipelinewise-target-postgres/blob/94e9c61300fc46f11988e52b29193dfea69d816e/target_postgres/__init__.py#L224-L230 With log-based replication, it looks like the tap outputs
_SDC_DELETED_AT
columns (https://github.com/transferwise/pipelinewise-tap-mysql/blob/5a9e043b2581d01229970f8b9debf3452a877515/tap_mysql/sync_strategies/binlog.py#L324) that the target knows to delete when the
hard_delete
setting is enabled: https://meltano.com/plugins/loaders/postgres--transferwise.html#hard-delete
m
Interesting. So basically: • There’s likely a bug with Transferwise’s
LOG_BASED
replication. • Transferwise’s
FULL_TABLE
replication doesn’t seem to handle upstream deletes becuse it’s not handling the
ACTIVATE_VERSIONS
messages. I figured
FULL_TABLE
was doing a truncation. Either way we want the ability for soft deletes in certain circumstances so a
FULL_TABLE
wouldn’t always work for us.
d
@matt_cooley Correct on point 1. As for point 2, it's important to realize the different roles of the tap and target here:
FULL_TABLE
is a property on the tap, which cannot itself "do truncation". That needs to be done by the target, which doesn't actually know the tap's replication method and typically just upserts. Taps can use
ACTIVATE_VERSION
to communicate to targets that everything up to that point should be discarded. In this case, the tap may be implementing FULL_TABLE and ACTIVATE_VERSION correctly, but the target doesn't handle
ACTIVATE_VERSION
as it should (understandably, because it's very poorly documented).
m
Thanks, that makes sense. Got it. So
FULL_TABLE
tells the TAP to scan the full table and send it to the TARGET to upsert.
ACTIVATE_VERSIONS
does the truncation (or some sort of diff) to delete rows that were deleted upstream. That makes sense. Thanks
d
Exactly!
m
@douwe_maan Hi! I’ve done some more digging into this issue and it appears that piplinewise wrote a custom version of the
LOG_BASED
replication method. It picks up new columns added upstream but doesn’t allow for explicitly selected columns. They rerun the `discover_catalog` each time instead of using looking at the catalog that was initially created. Just an FYI in case you were curious
d
@matt_cooley Interesting, so field selection isn't respected when using log-based replication, which seems to be implemented that way intentionally, but looks to Meltano like a bug since the discovered catalog claims those properties as
inclusion: available
, not
inclusion: automatic
, which makes Meltano believe that
selected:true/false
will be respected. We couldn't just fix that discovered claim though, because at that point the tap doesn't know yet which replication method is going to be used. So to fix the bug, I think the tap should start respecting the selected fields in the provided catalog here, at least as opt-in behavior when a setting is flipped. Are you planning to file an issue for this in the tap repo?
m
I wasn’t planning on it since I found the same basic functionality in the Singer
mysql-tap
and it appears to be expected behavior on their part.
d
Ah, OK. I take it this behavior is acceptable for your use case, then? 🙂
m
Oh. Definitely not. That’s why we’re not using that tap. Fair point though. I’ll put it on my backlog to report it as an issue. I agree they should at least have a flag for it.
d
😄
Sounds good, thanks!
m
👍
Here’s the issue in case you’re curious https://github.com/transferwise/pipelinewise-tap-mysql/issues/50
d
@matt_cooley Great! Note that it's
selected:true/false
, not
select:true/false
😉
m
Nice! I’ll see if I can make an edit 😳
I got a response! Not sure I entirely understand what they’re getting at. https://github.com/transferwise/pipelinewise-tap-mysql/issues/50#issuecomment-760093876
d
m
Thank you!