# troubleshooting
sean_glynn
Has anyone experienced duplications when using FULL_TABLE replication on a tap-postgres -> target-bigquery pipeline? Scenario:
• T1: A record x gets loaded from Postgres -> BigQuery
• T2: Record x gets deleted on the source Postgres DB
• T3: target-bigquery does not pick up on the source change and keeps the deleted record in BigQuery
Does anyone have any recommendations on how to handle this?
thomas
Is it possible to capture deletes when using FULL_TABLE replication? On each run the tap is going to pull all the rows from Postgres and push them to BQ... which is of course not going to include anything about the deleted row (because, well, it no longer exists!)
I would think you'd need to truncate the table in BQ before each run if you wanted to make sure it contains only the latest rows.
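For reference, a minimal sketch of that truncate-first workaround; the project, dataset, and table names below are hypothetical placeholders:

```sql
-- Hedged sketch: empty the destination table before each FULL_TABLE run so
-- the load leaves no stale rows behind. BigQuery supports TRUNCATE TABLE as
-- standard-SQL DML; replace the placeholder names with your own.
TRUNCATE TABLE `my-project.analytics.my_table`;
```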
sean_glynn
I'm not sure how exactly we can capture deletes, as the upstream PG instance is run by another team who just perform hard deletes. We are considering moving to LOG_BASED replication and using the soft-delete column `_sdc_deleted_at` to ignore deleted records.
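For anyone following along, switching the replication method is done via the `metadata` extra in `meltano.yml`; a minimal sketch (the stream pattern is a placeholder, and LOG_BASED also requires logical replication to be enabled on the source Postgres):

```yaml
# meltano.yml (sketch) — apply LOG_BASED replication to the selected streams.
# Note: tap-postgres LOG_BASED needs wal_level=logical and a replication slot
# on the source database, which the upstream team would have to enable.
plugins:
  extractors:
    - name: tap-postgres
      metadata:
        "*":                     # or a specific stream, e.g. "public-my_table"
          replication-method: LOG_BASED
```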
Ah yes, I think that would make sense. Do you know if there's a Meltano-esque way to truncate the destination table prior to loading it?
thomas
The targets I've worked with (PG, Redshift, Snowflake) all have an option to hard-delete rows (which itself is actually based on the `_sdc_deleted_at` column...). I haven't used the BQ target, but odds are it has that too. So if you can switch to LOG_BASED mode, you can have the target remove the deleted records for you if you want.
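A hedged sketch of what that option looks like on the PipelineWise-style targets; whether your target-bigquery variant exposes the same `hard_delete` flag is worth verifying against its settings:

```yaml
# meltano.yml (sketch) — hard_delete as exposed by PipelineWise-style targets.
# Verify the exact setting name for the target-bigquery variant you use.
plugins:
  loaders:
    - name: target-bigquery
      config:
        # When true, rows that arrive with _sdc_deleted_at set (e.g. from a
        # LOG_BASED tap) are physically deleted in the warehouse rather than
        # kept as soft-deleted rows.
        hard_delete: true
```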
sean_glynn
Got it, thanks Thomas. We will push for LOG_BASED replication mode; hopefully the upstream DB will allow for it 🤞
thomas
I don't know if there's a Meltano-ish way to have the destination table dropped before the pipeline is run... that's a great question, and you can't be the first to need that. Anyone have suggestions?
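In the meantime, one unofficial approach is to wrap the pipeline in a small script; a sketch assuming the `bq` CLI is installed and using placeholder table names:

```bash
#!/usr/bin/env bash
# Hedged sketch: Meltano has no built-in "truncate destination first" hook,
# so empty the table out-of-band, then run the pipeline as usual.
set -euo pipefail

bq query --use_legacy_sql=false \
  'TRUNCATE TABLE `my-project.analytics.my_table`'

meltano run tap-postgres target-bigquery
```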
pat_nadolny
@sean_glynn there is a mechanism called `ACTIVATE_VERSION` that's sometimes supported by taps/targets to solve this problem; you can read more here: https://hub.meltano.com/singer/docs#activate-version. Another option might be to use staging logic in dbt to select only the newest replicated data based on the metadata fields. Would that work for you?
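A minimal sketch of the dbt staging approach, assuming PipelineWise-style metadata columns (`_sdc_extracted_at`, `_sdc_deleted_at`) and a hypothetical `id` primary key; your target's column names may differ:

```sql
-- models/staging/stg_my_table.sql (sketch)
-- Keep only the most recently replicated row per key, and drop rows the
-- tap has marked as deleted. Source/column names are placeholders.
with ranked as (
    select
        *,
        row_number() over (
            partition by id
            order by _sdc_extracted_at desc
        ) as row_num
    from {{ source('postgres', 'my_table') }}
)

select *
from ranked
where row_num = 1
  and _sdc_deleted_at is null
```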
sean_glynn
Interesting, thank you @pat_nadolny. Yes, we're going to add some filters on the dbt staging models to ignore the deleted records for now. Do you by any chance know the best way to test this `ACTIVATE_VERSION` feature? Just browsing through the tap/target code, I can see they have some form of support for activate_version messages. Is there a configuration change I can make at the Meltano level to extract/load rows as `activate_version` messages?
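One hedged way to check, before wiring anything up: run the tap standalone and inspect its Singer output for `ACTIVATE_VERSION` messages, since `meltano invoke` prints the raw message stream to stdout:

```bash
# Hedged sketch: meltano invoke runs the tap with its configured settings;
# grep the Singer message stream for the ACTIVATE_VERSION message type.
meltano invoke tap-postgres | grep 'ACTIVATE_VERSION' | head -n 5
```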