# troubleshooting
nick_hamlin:
I ran into some unexpected weirdness this morning with hard deletes that I’m wondering if others here might have insights into. The pipeline in question is using `tap-mysql` and `target-redshift` (both pipelinewise variants). Occasionally, records are hard deleted from the underlying mysql tables, and I’d thought that running `meltano elt` with the `--full-refresh` flag (or in any situation where `FULL_TABLE` replication is used) would cause the corresponding redshift table to be blown away and recreated with the hard-deleted records no longer there. However, it looks like `target-redshift` is instead just running `UPDATE` statements under the hood, which is causing the deleted records to stick around. Any suggestions on how to fix this? (I’ll add more details in a thread to prevent this already long message from getting longer.)
Pipelinewise target-redshift has an option for `add_metadata_columns`, which I’ve turned on. In theory, it should set a value in the `_sdc_deleted_at` metadata column in situations where a hard delete has occurred (so that if one has also turned on the `hard_delete` setting, it can use the values in that column to figure out which records to remove). I’d thought that was likely the answer, but it doesn’t look like values are getting set in that metadata column anywhere
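A quick way to sanity-check that, for anyone curious (the schema and table names here are placeholders, not our real ones):

```sql
-- Placeholder schema/table: substitute any replicated table.
-- If add_metadata_columns were working the way I expected, hard-deleted
-- source rows would show a non-null _sdc_deleted_at timestamp here.
SELECT COUNT(*) AS flagged_deletes
FROM analytics.orders
WHERE _sdc_deleted_at IS NOT NULL;
```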
Presumably, if those values were showing up the way the docs lead me to believe they should be, then I wouldn’t have to worry about whether the hard deletes were happening in redshift or not, because I could use the presence of data in `_sdc_deleted_at` to filter out any deleted records in my staging layer in dbt easily enough
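e.g. something like this in a staging model (source and table names are just illustrative):

```sql
-- models/staging/stg_orders.sql (illustrative dbt model)
-- Keep only rows that have not been flagged as hard-deleted upstream.
SELECT *
FROM {{ source('mysql', 'orders') }}
WHERE _sdc_deleted_at IS NULL
```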
edward_smith:
Hi Nick! My understanding is that `_sdc_deleted_at` is only populated when using the `LOG_BASED` replication method.
nick_hamlin:
ah! well that explains that then, thanks!
Given that, is there a way to get `FULL_TABLE` replication to actually lead to “full refreshes” of a table, such that the table in the target mirrors the state of the table in the source (without hard deleted rows left lingering around)?
edward_smith:
When you do `FULL_TABLE` replication, you see it using `UPDATE` statements and the deleted rows continue to exist in the target?
Also, do you know if it is using fast sync?
nick_hamlin:
Yep, that’s exactly right - I can see via the meltano logs how many inserts/updates it’s doing, and rows that are hard deleted in the source table continue to appear in the target table. I can tell from the other metadata columns when those rows were originally loaded (but of course, I can’t tell when they were hard deleted, since they’re literally gone from the underlying table)
it is NOT using fast sync
(though that’s something I’ve been meaning to look into)
aaronsteers:
@nick_hamlin - Do you have the option to filter out old records in the next step of your pipeline, for instance using dbt? In the past, I've used `_sdc_batched_at` (or similar) to infer which records have not been updated, and therefore which have been deleted on the upstream source.
The merge/update behavior is due to having key-properties defined. You can suppress the key-properties definition in the tap to get append-only behavior, but you'd still end up inferring the deleted records using something like `_sdc_batched_at`.
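A rough sketch of that inference in dbt - names are illustrative, and this assumes each full-table sync restamps every surviving row with a fresh `_sdc_batched_at` (if one sync lands in multiple batches, you'd want to compare against the sync window instead):

```sql
-- Illustrative dbt model: treat rows absent from the most recent
-- FULL_TABLE sync as hard-deleted upstream, and drop them.
WITH latest AS (
    SELECT MAX(_sdc_batched_at) AS max_batched_at
    FROM {{ source('mysql', 'orders') }}
)
SELECT o.*
FROM {{ source('mysql', 'orders') }} AS o
CROSS JOIN latest
WHERE o._sdc_batched_at >= latest.max_batched_at
```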
nick_hamlin:
Yep, that was my backup plan if there wasn’t a better option - we already purge soft-deleted rows in the staging layer in dbt anyway
aaronsteers:
For actual deletion of records, you'd need the unofficial ACTIVATE_VERSION extension. I don't know if the tap/target combination you mentioned supports that feature.
nick_hamlin:
Interesting - is key-properties a tap config thing? that’s not something that’s been on my radar
aaronsteers:
Yes, each stream in the tap can have a primary key defined (called `key_properties`), and targets will almost always use that as a merge/update key. You can also suppress that default behavior in the catalog in order to basically “trick” your target into appending records instead of merging them.
I put an issue in the queue to document this, but it hasn't floated to a high priority as of yet. It's logged under the category of "non-destructive loads", since in that case I wanted to have a full history retained of every past record version received from the source: Documentation proposal: how to perform "non-destructive" data loads (#2524) · Issues · Meltano / Meltano · GitLab
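Downstream of an append-only load like that, you'd typically dedup in dbt to recover the current state; roughly like this (the primary key and source names are assumptions):

```sql
-- Illustrative dbt pattern for an append-only ("non-destructive") load:
-- keep only the most recent version of each record per primary key.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id          -- assumed primary key
            ORDER BY _sdc_batched_at DESC
        ) AS rn
    FROM {{ source('mysql', 'orders') }}
)
SELECT *
FROM ranked
WHERE rn = 1
```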
nick_hamlin:
got it, that makes sense - thanks!
Do you have any sense of whether or not the opposite is an option? e.g. “I only want fully destructive reloads”? (so to speak)
I’m assuming the answer is no, but figured I’d confirm
aaronsteers:
Yes, and that's where ACTIVATE_VERSION comes in - that (again, unofficial) extension intentionally purges all records which have a version number less than the version number being activated, with the net result of fully destroying the old/stale data.
Looks like the pipelinewise tap-mysql does support the activate version message. I can see their target-redshift tolerates it, but it doesn't look like the purge is actually implemented there in the redshift target.
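Conceptually, the purge an ACTIVATE_VERSION-aware target performs amounts to something like this (purely illustrative - the `_sdc_table_version` column name follows the Meltano SDK convention, and as noted, the pipelinewise redshift target doesn't actually do this):

```sql
-- Purely illustrative: the effect of activating a new table version.
-- Rows loaded under older table versions are physically removed.
DELETE FROM analytics.orders
WHERE _sdc_table_version < 3;  -- 3 = the version number just activated
```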
nick_hamlin:
fantastic, thanks @aaronsteers and @edward_smith for your quick help on this!