# getting-started
c
We're playing around with some CSVs in S3, extracting them with `tap-spreadsheets-anywhere` and loading into Redshift (`target-redshift`). We're happy with our Meltano/dbt pipeline, but now the Data Analysts want to make edits to the CSVs, e.g. keep the headings but correct some data for consistency/mistakes. I'm trying to find the neatest way in the docs to re-export these edited files. From past experiments where the creation time was changed on a file, the whole file was re-extracted = duplicate data. At the moment my options are:
• Dump the target schemas and re-export/transform (either by deleting state or using the `--full-refresh`/`--force` flags).
• Find a more graceful solution in Meltano that detects changed rows and overrides them.
Thoughts?
u
> From past experiments where the creation time was changed on a file, the whole file was re-extracted = duplicate data.
I haven't used `tap-spreadsheets-anywhere` for this use case specifically, so I can't say for sure, but this sounds like expected behavior. The state likely tracks the modified timestamp of the file, so if you modify the file the whole thing gets reloaded; I doubt there's a mechanism to only load the rows that changed. Usually people add a staging deduplication step in their warehouse using dbt to keep only the latest copy of records from a particular file. Also, targets usually have an insert/update setting, so if you run in update mode with a primary key the result wouldn't create duplicates in your destination table, although you'd still spend time processing the unchanged rows. Does that make sense? Feel free to share more details if I misunderstood.
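A minimal dbt staging sketch of that deduplication step. The source name, the `row_id` business key, and the `_smart_source_file` column are assumptions; `_sdc_batched_at` is the load timestamp most Singer targets add. Substitute whatever your tap and target actually emit:

```sql
-- models/staging/stg_s3_rows_deduped.sql
-- Keep only the most recently loaded copy of each row.

with ranked as (

    select
        *,
        row_number() over (
            partition by row_id            -- business key identifying a row
            order by _sdc_batched_at desc  -- latest load wins
        ) as _rn
    from {{ source('s3_csvs', 'raw_rows') }}

)

select *
from ranked
where _rn = 1   -- drop or exclude _rn downstream if you don't want it in the model
```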
c
Yeah, it makes sense. I'll float the dbt idea by the more data-heavy team members to check their opinions, vs. just living with the nuclear churn until the source files are stabilised. Thanks for the sanity check around the expected behaviour!
p
I use the PK solution, mostly in case the same version of a file needs to be reloaded for any reason (e.g. state was lost, or it's a huge file / huge set of files that failed loading part way through), but this does generally also prevent having multiple versions loaded. A couple of caveats:
• If the new version of the file has fewer rows than the old version, then the old version won't get completely overwritten. I detect this case with a dbt generic test which checks whether a file's last row is also its most recently loaded.
• My target (`pipelinewise-target-snowflake`), and probably any target, gives each row a slightly different value for `_sdc_batched_at` (the loading timestamp), so if you do allow multiple versions of a file to be loaded, it's not straightforward to implement the de-duplication step within the warehouse; and anyway it would be better to de-duplicate based on file creation time than on loading time. Best would be if `tap-spreadsheets-anywhere` implemented an option to add a column containing the file's creation/modification time.
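A hedged sketch of a generic test along the lines described above (the "file shrank" check): for each source file, the row with the highest line number should also carry the file's latest load timestamp. The column names (`_smart_source_file`, `_smart_source_lineno`) are assumptions, and the one-minute tolerance exists only because of the `_sdc_batched_at` jitter mentioned in the same message:

```sql
-- tests/generic/test_last_row_is_latest_load.sql
{% test last_row_is_latest_load(model) %}

with per_file as (

    select
        _smart_source_file,
        max(_smart_source_lineno) as last_lineno,      -- last row ever loaded for this file
        max(_sdc_batched_at)      as latest_batched_at -- most recent load of this file
    from {{ model }}
    group by _smart_source_file

)

-- Failures: the file's last row was not part of its latest load, i.e. a newer,
-- shorter version of the file has left a stale tail behind.
select f._smart_source_file, f.last_lineno, f.latest_batched_at, m._sdc_batched_at
from per_file as f
join {{ model }} as m
  on m._smart_source_file   = f._smart_source_file
 and m._smart_source_lineno = f.last_lineno
-- tolerance because rows from a single load get slightly different timestamps
where m._sdc_batched_at < dateadd(minute, -1, f.latest_batched_at)

{% endtest %}
```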
u
> gives each row a slightly different value for `_sdc_batched_at` (the loading timestamp) so if you do allow multiple versions of a file to be loaded, then it's not straightforward to implement the de-duplication step within the warehouse
@peter_s I created https://github.com/meltano/sdk/issues/1787 to add support for a static ID that would solve this challenge. I've also run into this where the `_sdc_batched_at` timestamp is slightly different per row, so you have to use inexact grouping.
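For reference, one rough way to do that inexact grouping in a staging model: bucket `_sdc_batched_at` (here to the minute) so all rows from one physical load collapse into a single batch, then keep only the latest batch per file. The bucket size, the source name, and the `_smart_source_file` column are assumptions; `qualify` is Snowflake (and newer Redshift) syntax:

```sql
-- Treat rows whose load timestamps fall in the same minute as one load_batch,
-- then keep only the newest batch for each source file.
with batched as (

    select
        *,
        date_trunc('minute', _sdc_batched_at) as load_batch
    from {{ source('s3_csvs', 'raw_rows') }}

)

select *
from batched
qualify load_batch = max(load_batch) over (partition by _smart_source_file)
```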