# troubleshooting
nick_hamlin:
I ran into some unexpected weirdness this morning with hard deletes that I’m wondering if others here might have insights into. The pipeline in question is using `tap-mysql` and `target-redshift` (both pipelinewise variants). Occasionally, records are hard deleted from the underlying mysql tables, and I’d thought that running `meltano elt` with the `--full-refresh` flag (or in any situation where `FULL_TABLE` replication is used) would cause the corresponding redshift table to be blown away and recreated with the hard-deleted records no longer there. However, it looks like `target-redshift` is instead just running `UPDATE` statements under the hood, which is causing the deleted records to stick around. Any suggestions on how to fix this? (I’ll add more details in a thread to prevent this already long message from getting longer.)
Pipelinewise target-redshift has an option for `add_metadata_columns`, which I’ve turned on. In theory, it should set a value in the `_sdc_deleted_at` metadata column in situations where a hard delete has occurred (so that if one has also turned on the `hard_delete` setting, it can use the values in that column to figure out which records to remove). I’d thought that was likely the answer, but it doesn’t look like values are getting set in that metadata column anywhere
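A quick way to sanity-check that, for anyone curious (the schema and table names here are placeholders, not our real ones):

```sql
-- Placeholder schema/table: substitute any replicated table.
-- If add_metadata_columns were working the way I expected, hard-deleted
-- source rows would show a non-null _sdc_deleted_at timestamp here.
SELECT COUNT(*) AS flagged_deletes
FROM analytics.orders
WHERE _sdc_deleted_at IS NOT NULL;
```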
Presumably, if those values were showing up the way the docs lead me to believe they should be, then I wouldn’t have to worry about whether the hard deletes were happening in redshift or not, because I could use the presence of data in `_sdc_deleted_at` to filter out any deleted records in my staging layer in dbt easily enough
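e.g. something like this in a staging model (source and table names are just illustrative):

```sql
-- models/staging/stg_orders.sql (illustrative dbt model)
-- Keep only rows that have not been flagged as hard-deleted upstream.
SELECT *
FROM {{ source('mysql', 'orders') }}
WHERE _sdc_deleted_at IS NULL
```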
edward_smith:
Hi Nick! My understanding is that `_sdc_deleted_at` is only populated when using the `LOG_BASED` replication method.
nick_hamlin:
ah! well that explains that then, thanks!
Given that, is there a way to get `FULL_TABLE` replication to actually lead to “full refreshes” of a table, such that the table in the target mirrors the state of the table in the source (without hard deleted rows left lingering around)?
edward_smith:
When you do `FULL_TABLE` replication, you see it using `UPDATE` statements and the deleted rows continue to exist in the target?
Also, do you know if it is using fast sync?
nick_hamlin:
Yep, that’s exactly right - I can see via the meltano logs how many inserts/updates it’s doing, and rows that are hard deleted in the source table continue to appear in the target table. I can tell from the other metadata columns when those rows were originally loaded (but of course, I can’t tell when they were hard deleted, since they’re literally gone from the underlying table)
it is NOT using fast sync
(though that’s something I’ve been meaning to look into)
aaronsteers:
@nick_hamlin - Do you have the option to filter out old records in the next step of your pipeline, for instance using dbt? In the past, I've used `_sdc_batched_at` (or similar) to infer which records have not been updated, and therefore which have been deleted on the upstream source.
The merge/update behavior is due to having key-properties defined. You can suppress the key-properties definition in the tap to get append-only behavior, but you'd still end up inferring the deleted records using something like `_sdc_batched_at`.
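A rough sketch of that inference in dbt - names are illustrative, and this assumes each full-table sync restamps every surviving row with a fresh `_sdc_batched_at` (if one sync lands in multiple batches, you'd want to compare against the sync window instead):

```sql
-- Illustrative dbt model: treat rows absent from the most recent
-- FULL_TABLE sync as hard-deleted upstream, and drop them.
WITH latest AS (
    SELECT MAX(_sdc_batched_at) AS max_batched_at
    FROM {{ source('mysql', 'orders') }}
)
SELECT o.*
FROM {{ source('mysql', 'orders') }} AS o
CROSS JOIN latest
WHERE o._sdc_batched_at >= latest.max_batched_at
```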
nick_hamlin:
Yep, that was my backup plan if there wasn’t a better option - we already purge soft-deleted rows in the staging layer in dbt anyway
aaronsteers:
For actual deletion of records, you'd need the unofficial ACTIVATE_VERSION extension. I don't know if the tap/target combination you mentioned supports that feature.
nick_hamlin:
Interesting - is key-properties a tap config thing? that’s not something that’s been on my radar
aaronsteers:
Yes, each stream in the tap can have a primary key defined (called `key_properties`), and targets will almost always use that as a merge/update key. You can also suppress that default behavior in the catalog in order to basically “trick” your target into appending records instead of merging them.
I put an issue in the queue to document this, but it hasn't floated to a high priority as of yet. It's logged under the category of "non-destructive loads", since in that case I wanted to have a full history retained of every past record version received from the source: Documentation proposal: how to perform "non-destructive" data loads (#2524) · Issues · Meltano / Meltano · GitLab
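Downstream of an append-only load like that, you'd typically dedup in dbt to recover the current state; roughly like this (the primary key and source names are assumptions):

```sql
-- Illustrative dbt pattern for an append-only ("non-destructive") load:
-- keep only the most recent version of each record per primary key.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id          -- assumed primary key
            ORDER BY _sdc_batched_at DESC
        ) AS rn
    FROM {{ source('mysql', 'orders') }}
)
SELECT *
FROM ranked
WHERE rn = 1
```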
nick_hamlin:
got it, that makes sense - thanks!
Do you have any sense of whether or not the opposite is an option? e.g. “I only want fully destructive reloads”? (so to speak)
I’m assuming the answer is no, but figured I’d confirm
aaronsteers:
Yes, and that's where ACTIVATE_VERSION comes in - that (again, unofficial) extension intentionally purges all records which have a version number less than the version number being activated, with the net result of fully destroying the old/stale data.
Looks like the pipelinewise tap-mysql does support the activate version message. I can see their target-redshift tolerates it, but it doesn't look like the purge is actually implemented there in the redshift target.
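Conceptually, the purge an ACTIVATE_VERSION-aware target performs amounts to something like this (purely illustrative - the `_sdc_table_version` column name follows the Meltano SDK convention, and as noted, the pipelinewise redshift target doesn't actually do this):

```sql
-- Purely illustrative: the effect of activating a new table version.
-- Rows loaded under older table versions are physically removed.
DELETE FROM analytics.orders
WHERE _sdc_table_version < 3;  -- 3 = the version number just activated
```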
nick_hamlin:
fantastic, thanks @aaronsteers and @edward_smith for your quick help on this!