# best-practices
e
Hello Best Practices, question for `target-parquet`: when I was on `target-postgresql`, duplicate data was being de-duped automatically.. but with the Parquet target, not at all. If I want to "fix" this, is it about contributing to the target with a de-dupe flag.. or is it better / more proper practice to fix my tap? Thanks. For reference: https://github.com/estrategiahq/target-parquet
it feels like this is a TAP problem and I need to rectify it there
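If the fix does land in the tap, one common pattern is to drop duplicate records before they are ever emitted, keyed on whatever uniquely identifies a row. Below is a minimal, generic sketch in Python; the key fields are placeholders for illustration, not taken from the actual tap:

```python
from typing import Iterable, Iterator, Tuple

def dedupe_records(
    records: Iterable[dict],
    key_fields: Tuple[str, ...] = ("symbol", "extracted_at_utc"),  # hypothetical keys
) -> Iterator[dict]:
    """Yield each record at most once, keyed on the given fields,
    so the downstream target never sees duplicates in the first place."""
    seen = set()
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            yield record
```

Note that this holds every key in memory for the duration of the sync, which is fine for modest weekly snapshots but worth reconsidering for very large streams.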
a
Hi, @emcp! This is a fun and important question and it comes up often with file-based targets like CSV, JSONL, and Parquet. I think I might log this as a GitHub and/or StackOverflow discussion. But to kick off the discussion, I'll put my two cents here: First, to clarify, is your desired behavior that the Parquet target be loaded with zero duplicate record histories when running an INCREMENTAL sync, or just that - when running a FULL_TABLE sync - you don't want multiple copies of the entire dataset?
I should also ask, before getting too deep into the question, have you ruled out using dbt for deduping in your transformation layer? Most of the good solutions to deduping will eventually require some transformation logic - so I'm curious if you are using dbt for subsequent transformations or if you have another requirement.
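For context, deduping in the transformation layer usually amounts to a "keep the latest version of each key" step. A rough sketch with pandas, where the `id` and `extracted_at` column names and file paths are assumptions rather than details from this pipeline:

```python
import pandas as pd

# Load the raw, possibly-duplicated extract (path is illustrative).
df = pd.read_parquet("raw/records.parquet")

# Keep only the most recently extracted version of each record.
deduped = (
    df.sort_values("extracted_at")
      .drop_duplicates(subset=["id"], keep="last")
)

deduped.to_parquet("clean/records.parquet", index=False)
```

The same logic is what a dbt model would express in SQL with a window function over the key.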
e
I am pushing to Parquet and then utilizing Trino.. I tried looking for dbt-based solutions but.. since I wrote the tap, I think it's important to push the data in the best, most straightforward manner according to what I would expect.. there is no incremental option, just a full-on sync (I use the GUI so.. select options are ignored).. dbt seems to require more code at the moment to implement with Trino / Parquet on S3
I will lean toward.. writing the tap core over again.. and deduping from there
desired behavior is.. I query the 3rd party API.. and there's zero value in duplicates.. if a user queries `A` and `AA`.. and gets back an entry that matches both queries... it's not useful or necessary to know that.. all that matters is the response
reference: https://interactivebrokers.github.io/tws-api/introduction.html . The tap is a Python client which calls a C++ Thrift server.. I will probably de-dupe at the Thrift server level
I'll give a full tour Wednesday in Stockholm.. including the latest and greatest Meltano 2.8.x.. excited to share a bit of how much headache this has all saved me as a structured data n00b
a
Lots here I want to dive into more... but one more clarification question: do you care about deleted prior records that no longer exist? And/or prior versions of records that changed? I think 'no' to the latter question but less sure of the first.
e
Records deleted actually REALLY matter.. but the vendor does not provide that data.. hence it is up to me to keep a constant record of data coming out of the vendor
I query on a weekly basis for the latest fresh data.. and have to sort it out myself (did a company go bankrupt and delist? did it simply just change tickers? i.e. GOOGL becoming ABC.. etc...)
so yes I absolutely care about prior data... the way I fix it is I append the datetime stamp at which the data was retrieved
so every row.. I have the timestamp in UTC when it was pulled
there should never be a duplicate of those.. hence the thought that I have a bug somewhere in my C++ (not surprising, and I am thinking of trying out a Rust Thrift server going forward)
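As a side note, stamping each record at retrieval time is a one-liner in the Python layer of the tap; a tiny sketch, where the `extracted_at_utc` field name is just an illustration:

```python
from datetime import datetime, timezone

def stamp_record(record: dict) -> dict:
    """Attach the UTC retrieval timestamp to a record before it is emitted."""
    record["extracted_at_utc"] = datetime.now(timezone.utc).isoformat()
    return record

# e.g. stamp_record({"symbol": "GOOGL"})
# -> {"symbol": "GOOGL", "extracted_at_utc": "<ISO-8601 UTC timestamp>"}
```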
a
Thanks for confirming. So, because you can only pull as a full table sync and because you do need to observe what was present in prior runs but missing in the current, the "full table append" and "full table upsert" do seem like your best choices. And that leads us back to your initial ask about the Parquet target not supporting upserts/merges based on primary key... Certainly the Parquet target could have a merge upsert feature added, but in a file based system, it's a little tricky to update the existing files. What the target would likely need to do is to create a new dataset that contains the merge upsert result and then swap the new and old files. The file swap itself may still create a blip of "downtime" but for some scenarios that could be acceptable. (Perhaps there are creative ways of resolving - like uploading the new dataset on top of the old, but it's just a different problem than using a database with table locks and transactions.)
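To make the "merge upsert, then swap" idea concrete, here is a rough sketch of what such a target step could look like. It assumes pandas, an `id` primary key, and an `extracted_at` column; none of this comes from the actual target-parquet code:

```python
import os
import pandas as pd

def merge_upsert_parquet(new_records: pd.DataFrame, path: str, key: str = "id") -> None:
    """Merge new records into an existing Parquet file, keep the latest
    version of each key, then atomically swap the result into place."""
    if os.path.exists(path):
        combined = pd.concat([pd.read_parquet(path), new_records], ignore_index=True)
    else:
        combined = new_records

    merged = (
        combined.sort_values("extracted_at")
                .drop_duplicates(subset=[key], keep="last")
    )

    # Write to a temporary file first, then swap. os.replace is atomic on the
    # same local filesystem, which keeps the "downtime" blip as small as possible.
    tmp_path = path + ".tmp"
    merged.to_parquet(tmp_path, index=False)
    os.replace(tmp_path, path)
```

On an object store like S3, the final step would instead be an overwrite of the object key, which is where the brief blip mentioned above comes from.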
If you can get incremental data from the source, you'd have a smaller dataset, but you may still end up with two types of duplicates: • Duplicates from older/newer versions of the same record. • Duplicates from "ties" on the bookmark key, because most taps use "greater than or equal to" logic to ensure no records are ever missed.
Is this helpful at all? I'm curious if any of those routes sounds more/less attractive for your use case.
e
I think the wrinkle here is.. once I have the parquet file.. I utilize a microservice to upload it to an S3 bucket
and the target will have no knowledge or connection to any of that prior data.. it's merely a dumb pipe to dump whatever the latest full table sync gives me from the vendor
and the data is always unique due to appending whatever the datetime stamp was at the time of triggering
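For illustration, the upload half of that microservice can be a single boto3 call; the bucket name and key layout here are assumptions, not the real setup:

```python
from datetime import datetime, timezone

import boto3

def upload_snapshot(local_path: str, bucket: str = "raw-vendor-data") -> str:
    """Upload a snapshot Parquet file under a timestamped key so successive
    full-table syncs land side by side instead of overwriting each other."""
    key = f"snapshots/{datetime.now(timezone.utc):%Y-%m-%dT%H%M%SZ}/records.parquet"
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```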
this source is... basically Interactive Brokers.. they won't give old data.. there's probably some dataset you can buy for that in their catalog.. but I am just sort of learning the ropes and downloading the data as it comes (snapshot).. and appending it to a downstream RAW S3 bucket.. from there.. I plan to use dbt, Airflow + Python, etc... to create my analytics tables for subsequent queries on the same vendor. You need to know what the valid ticker symbols are at any given moment.. in order to then go and query the news
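As one example of that downstream step, deriving the set of currently valid tickers from the raw snapshots is a small transformation. A sketch assuming pandas and hypothetical `symbol` / `extracted_at_utc` columns (on S3 the path would be an s3:// URI with the appropriate filesystem dependency installed):

```python
import pandas as pd

# Read the accumulated raw snapshots (local path shown for simplicity).
raw = pd.read_parquet("raw/snapshots/")

# The tickers present in the most recent weekly pull are the ones treated as
# currently valid for the follow-up news queries.
latest_pull = raw["extracted_at_utc"].max()
valid_tickers = sorted(raw.loc[raw["extracted_at_utc"] == latest_pull, "symbol"].unique())

print(valid_tickers)
```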