# random
i
So when I create a custom tap, define my primary keys for a table, and load data into a target such as Snowflake, does Meltano use those primary keys to prevent duplicate insertion in the upsert/merge logic that runs as part of the Snowflake COPY step executed on a pipeline run? For example, say the target already has 50 of 52 weeks of data and I run a full refresh: will it only insert those last 2 weeks, since the other batches were basically just duplicates? (Assuming I set the primary keys correctly and the first 50 weeks of data are exactly the same.)
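(As a minimal sketch of the two steps most Snowflake targets perform, with assumed names TMP_BATCH and MY_STAGE: COPY itself only bulk-loads staged files and has no merge logic; any primary-key deduplication happens in a separate statement that runs afterwards.)
-- step 1: hypothetical staging load; COPY just appends rows, no dedup
COPY INTO TMP_BATCH
FROM @MY_STAGE
FILE_FORMAT = (TYPE = CSV);
-- step 2: a follow-up merge keyed on the declared primary keys
-- (see the MERGE sketch further down the thread)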
a
I think Meltano doesn't know what data is already in Snowflake; it just issues the upsert command for all the data extracted from the tap (regardless of whether it is already in your target), and then Snowflake handles the rest. But I'm not sure that answers your question...
i
I know Meltano doesn't know what data is in the target and only really stores the replication key for the pipeline, but I'm wondering if the primary key column names for a table are just passed into the Snowflake statement being executed. Something like:
INSERT INTO target_table
SELECT * FROM batch_table b
WHERE NOT EXISTS (
  SELECT 1 FROM target_table t WHERE t.primary_key = b.primary_key
);
Because that seems to be the case for one of my pipelines: the source row count is around 800 rows (internal users), and each run only inserts a few new rows and ignores the duplicates. I'm pretty sure that's what happens, but I just want to make sure.
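(For comparison, a hedged sketch of the MERGE a Snowflake target would typically issue instead of a NOT EXISTS insert, reusing your table names and assuming a key column ID and a data column WEEK: the declared key properties become the ON condition, so matched rows are updated in place rather than re-inserted.)
MERGE INTO target_table t
USING batch_table b
  ON t.ID = b.ID
WHEN MATCHED THEN
  UPDATE SET t.WEEK = b.WEEK
WHEN NOT MATCHED THEN
  INSERT (ID, WEEK) VALUES (b.ID, b.WEEK);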
a
Could it be that you are using a state bookmark in Meltano, so only the latest changed rows are actually being extracted from the tap? Do you see the same behaviour if you do
meltano run tap target --full-refresh
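(If you want to confirm whether a bookmark exists before re-running, the meltano state subcommands can show and reset it; the state ID below is illustrative.)
meltano state list
meltano state clear dev:tap-mytap-to-target-snowflake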
i
I ran a full refresh of my tap and it looks like Snowflake accounted for the duplicates (it didn't insert any). I had no state set, since I was running the job fresh in Docker locally.