craig_astill
07/21/2023, 1:18 PMtap-spreadsheets-anywhere
and loading into Redshift (target-redshift
). We're happy with our Meltano/DBT pipeline, but now the Data Analysts want to make edits to the CSV's. eg. keep headings but correct some data for consistency/mistakes.
Trying to find in the docs the neatest way to re-export these edited files. From past-experiments where the creation time was changed on a file, the whole file was re-extracted = duplicate data.
At the moment my options are:
• Dump target schemas and re-export (either by deleting state or using the `--full-refresh`/`--force` flags)/transform.
• Discover a more graceful solution in Meltano to discover changed rows and override.
Thoughts?user
07/21/2023, 3:28 PMFrom past-experiments where the creation time was changed on a file, the whole file was re-extracted = duplicate data.I havent used tap-spreadsheets-anywhere for this use case specifically so cant say for sure but this sounds like expected behavior. The state likely tracks the modified timestamp of the file so if you modify the file then the whole thing gets reloaded, I doubt theres a mechanism to only load rows that are modified. Usually people will have a staging deduplication step in their warehouse using dbt to keep only the latest copy of records from a particular file. Also targets usually have insert/update setting so if you have it in update mode with a PK then the result wouldn't create duplicates in your destination table, although you would still have spent the time processing the unchanged rows. Does that make sense? Feel free to share more details if I misunderstood
craig_astill
07/21/2023, 4:01 PMpeter_s
07/21/2023, 4:43 PMpipelinewise-target-snowflake
) — and probably any target — gives each row a slightly different value for _sdc_batched_at
(the loading timestamp) so if you do allow multiple versions of a file to be loaded, then it’s not straightforward to implement the de-duplication step within the warehouse, and anyway it would be better to de-duplicate based on file creation time than on loading time, . Best would be if tap-spreadsheets-anywhere
would implement an option to add a column containing the file’s creation / modification time.user
07/21/2023, 6:45 PMgives each row a slightly different value for@peter_s I created https://github.com/meltano/sdk/issues/1787 to add support or a static ID that would solve this challenge. I've also run into this where the batched at timestamp is slightly different so you have to use inexact grouping(the loading timestamp) so if you do allow multiple versions of a file to be loaded, then it’s not straightforward to implement the de-duplication step within the warehouse_sdc_batched_at