# troubleshooting
sean_glynn
Has anyone experienced duplications when using FULL_TABLE replication on a tap-postgres -> target-bigquery pipeline? Scenario:
• T1: A record x gets loaded from Postgres -> BigQuery
• T2: Record x gets deleted on the source Postgres DB
• T3: target-bigquery does not pick up on the source change and keeps the deleted record in BigQuery
Does anyone have any recommendations on how to handle this?
thomas
Is it possible to capture deletes when using FULL_TABLE replication? On each run the tap is going to pull all the rows from Postgres and push them to BQ... which is of course not going to include anything about the deleted row (because, well, it no longer exists!)
I would think you'd need to truncate the table in BQ before each run if you wanted to make sure it contains only the latest rows.
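For reference, a minimal sketch of that truncate-first workaround; the project, dataset, and table names below are hypothetical placeholders:

```sql
-- Hedged sketch: empty the destination table before each FULL_TABLE run so
-- the load leaves no stale rows behind. BigQuery supports TRUNCATE TABLE as
-- standard-SQL DML; replace the placeholder names with your own.
TRUNCATE TABLE `my-project.analytics.my_table`;
```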
sean_glynn
I'm not sure how exactly we can capture deletes, as the upstream PG instance is run by another team who just perform hard deletes. We are considering moving to LOG_BASED replication and using the soft-delete column `_sdc_deleted_at` to ignore deleted records.
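For anyone following along, switching the replication method is done via the `metadata` extra in `meltano.yml`; a minimal sketch (the stream pattern is a placeholder, and LOG_BASED also requires logical replication to be enabled on the source Postgres):

```yaml
# meltano.yml (sketch) — apply LOG_BASED replication to the selected streams.
# Note: tap-postgres LOG_BASED needs wal_level=logical and a replication slot
# on the source database, which the upstream team would have to enable.
plugins:
  extractors:
    - name: tap-postgres
      metadata:
        "*":                     # or a specific stream, e.g. "public-my_table"
          replication-method: LOG_BASED
```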
Ah yes, I think that would make sense. Do you know if there's a Meltano-esque way to truncate the destination table prior to loading it?
thomas
The targets I've worked with (PG, Redshift, Snowflake) all have an option to hard-delete rows (which itself is actually based on the `_sdc_deleted_at` column...). I haven't used the BQ target, but odds are it has that too. So if you can switch to LOG_BASED mode, you can have the target remove the deleted records for you if you want.
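A hedged sketch of what that option looks like on the PipelineWise-style targets; whether your target-bigquery variant exposes the same `hard_delete` flag is worth verifying against its settings:

```yaml
# meltano.yml (sketch) — hard_delete as exposed by PipelineWise-style targets.
# Verify the exact setting name for the target-bigquery variant you use.
plugins:
  loaders:
    - name: target-bigquery
      config:
        # When true, rows that arrive with _sdc_deleted_at set (e.g. from a
        # LOG_BASED tap) are physically deleted in the warehouse rather than
        # kept as soft-deleted rows.
        hard_delete: true
```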
sean_glynn
Got it, thanks Thomas. We will push for LOG_BASED replication mode; hopefully the upstream DB will allow for it 🤞
thomas
I don't know if there's a Meltano-ish way to have the destination table dropped before the pipeline is run... that's a great question, and you can't be the first to need that. Anyone have suggestions?
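In the meantime, one unofficial approach is to wrap the pipeline in a small script; a sketch assuming the `bq` CLI is installed and using placeholder table names:

```bash
#!/usr/bin/env bash
# Hedged sketch: Meltano has no built-in "truncate destination first" hook,
# so empty the table out-of-band, then run the pipeline as usual.
set -euo pipefail

bq query --use_legacy_sql=false \
  'TRUNCATE TABLE `my-project.analytics.my_table`'

meltano run tap-postgres target-bigquery
```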
pat_nadolny
@sean_glynn there is a mechanism called `ACTIVATE_VERSION` that's sometimes supported by taps/targets to solve this problem; you can read more here: https://hub.meltano.com/singer/docs#activate-version. Another option might be to use staging logic in dbt to select only the newest replicated data based on the metadata fields. Would that work for you?
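A minimal sketch of the dbt staging approach, assuming PipelineWise-style metadata columns (`_sdc_extracted_at`, `_sdc_deleted_at`) and a hypothetical `id` primary key; your target's column names may differ:

```sql
-- models/staging/stg_my_table.sql (sketch)
-- Keep only the most recently replicated row per key, and drop rows the
-- tap has marked as deleted. Source/column names are placeholders.
with ranked as (
    select
        *,
        row_number() over (
            partition by id
            order by _sdc_extracted_at desc
        ) as row_num
    from {{ source('postgres', 'my_table') }}
)

select *
from ranked
where row_num = 1
  and _sdc_deleted_at is null
```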
sean_glynn
Interesting, thank you @pat_nadolny. Yes, we're going to add some filters on the dbt staging models to ignore the deleted records for now. Do you by any chance know the best way to test this `ACTIVATE_VERSION` feature? Just browsing through the tap/target code, I can see they have some form of support for activate_version messages. Is there a configuration change I can make at the Meltano level to extract/load rows as `activate_version` messages?
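One hedged way to check, before wiring anything up: run the tap standalone and inspect its Singer output for `ACTIVATE_VERSION` messages, since `meltano invoke` prints the raw message stream to stdout:

```bash
# Hedged sketch: meltano invoke runs the tap with its configured settings;
# grep the Singer message stream for the ACTIVATE_VERSION message type.
meltano invoke tap-postgres | grep 'ACTIVATE_VERSION' | head -n 5
```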