Hihi, I'm just getting started and have a very sim...
# getting-started
w
Hihi, I'm just getting started and have a very simple question. I have a CSV tap and a jsonl loader just to test, but it's not working like I'd expect. I've configured the CSV tap to identify the key/id column, but each time I run the pipeline it re-adds every row to the output creating duplicates. I thought that it would store which keys had been extracted in a state file and not load them again; I also thought that doing a --full-refresh would clear the jsonl output and re-load all the keys, but again it just appends them. Am I missing something? How do I get non-duplicating extract-loads?
a
Hi, @will_gittoes! And welcome! 1. Regarding duplicates. The Singer Spec promises that all records will land at least once, but not necessarily exactly once. For streams with a primary key, most database targets will dedupe by using a merge upsert operation. (I can provide more information on this and additional mitigations, but the biggest win would be from using a DB target instead of the JSONL file based one.) 2. Regarding clearing the old results after completing a full table sync, the feature that delivers that capability is called "activate version" - meaning activation of the new table "version" by removing records from prior syncs. Almost all common database targets have this feature but the JSONL target does not (to my knowledge at least). I hope this info is helpful. Do you have a specific destination database in mind?
w
Hi AJ, thanks so much for replying! My actual destinations will be a mix of various data warehouses (redshift, snowflake) which should be fine, and google sheets which is more problematic, since that also doesn't seem to de-dupe.
I'm also keen on having reports outputted in CSV and other flat file formats, which ideally will also be key-aware. The best case is if the state file from the input will ensure only new records from the tap are processed (tap-dependant, I understand why that might work for postgres tap but not CSV 😅) and then are appended once to the flat file (being deduped by key). From your description of "activate version", it seems like there might be some sort of capability model? Is there a way to know which targets will let me de-dupe? And does the SDK (Singer or Meltano) allow building this in myself?
t
Can a CSV tap store state?? I guess I can imagine replication-mode: INCREMENTAL but surely not a distinct list of keys.
Also be aware that --full-refresh doesn't clear the data in any RDBMS target I'm aware of... that has always required manually dropping the table. It also doesn't play particularly nicely with pipelines with many tables - you can't --full-refresh a single table, unfortunately.
All of which is to say, what you're asking for is possible, but it requires some TLC. 😉