# random
i
So when I create a custom tap, define my primary keys for a table, and load data into a target such as Snowflake, does Meltano use those primary keys to prevent duplicate insertion in the upsert/merge logic that runs as part of the Snowflake COPY step executed on a pipeline run? For example, say the target already has 50 of 52 weeks of data and I run a full refresh: will it only insert those last 2 weeks, since the other batches were basically just duplicates? (Assuming I set the primary keys correctly and the first 50 weeks of data are exactly the same.)
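(As a minimal sketch of the two steps most Snowflake targets perform, with assumed names TMP_BATCH and MY_STAGE: COPY itself only bulk-loads staged files and has no merge logic; any primary-key deduplication happens in a separate statement that runs afterwards.)
-- step 1: hypothetical staging load; COPY just appends rows, no dedup
COPY INTO TMP_BATCH
FROM @MY_STAGE
FILE_FORMAT = (TYPE = CSV);
-- step 2: a follow-up merge keyed on the declared primary keys
-- (see the MERGE sketch further down the thread)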
a
I think Meltano doesn't know what data is already in Snowflake; it just issues the upsert command for all the data extracted from the tap (regardless of whether it is already in your target), and then Snowflake handles the rest. But I'm not sure that answers your question...
i
I know Meltano doesn't know what data is in the target and only really stores the replication key for the pipeline, but I'm wondering if the primary key column names for a table are just passed into the Snowflake statement being executed. Something like:
INSERT INTO target_table
SELECT * FROM batch_table b
WHERE NOT EXISTS (
  SELECT 1 FROM target_table t WHERE t.primary_key = b.primary_key
);
Because that seems to be the case for one of my pipelines: the source row count is around 800 rows (internal users), and each run only inserts a few new rows and ignores the duplicates. I'm pretty sure that's what happens, but I just want to make sure.
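(For comparison, a hedged sketch of the MERGE a Snowflake target would typically issue instead of a NOT EXISTS insert, reusing your table names and assuming a key column ID and a data column WEEK: the declared key properties become the ON condition, so matched rows are updated in place rather than re-inserted.)
MERGE INTO target_table t
USING batch_table b
  ON t.ID = b.ID
WHEN MATCHED THEN
  UPDATE SET t.WEEK = b.WEEK
WHEN NOT MATCHED THEN
  INSERT (ID, WEEK) VALUES (b.ID, b.WEEK);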
a
Could it be that you are using a state bookmark in Meltano, so only the latest changed rows are actually being extracted from the tap? Do you see the same behaviour if you do
meltano run tap target --full-refresh
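(If you want to confirm whether a bookmark exists before re-running, the meltano state subcommands can show and reset it; the state ID below is illustrative.)
meltano state list
meltano state clear dev:tap-mytap-to-target-snowflake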
i
I ran a full refresh of my tap and it looks like Snowflake accounted for the duplicates (it didn't insert any). I had no state set, since I was running the job fresh in Docker locally.