# getting-started
a
Hi all, looking at getting into meltano, but have a couple areas where I don't understand how meltano manages state for incremental. Say I have a google analytics instance to connect to, and I want to export jsonl to local just to keep things simple. I want to do an incremental load, and run the sync every hour to update some dashboards.
• Does the data from today get appended to an already existing json file?
• When the 10am run for today kicks in, how does it know to replace all the current data for today in the json file?
e
Hi Andy,
> Does the data from today get appended to an already existing json file?
That’s the default behavior. But target-jsonl also supports creating a new, timestamped file: https://hub.meltano.com/loaders/target-jsonl#do_timestamp_file-setting
> When the 10am run for today kicks in, how does it know to replace all the current data for today in the json file?
target-jsonl will not update existing data, only append or create a new file. Other loaders, like target-postgres, do upsert data based on the primary keys.
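For illustration, the upsert a key-aware loader performs is conceptually something like the statement below. The table and column names are made up for the example, and the exact SQL each target generates differs; this is just the idea, not what any particular target emits verbatim:
```sql
-- Hypothetical sketch: incoming records land in a staging table and are
-- upserted into the destination on the stream's primary key, so re-synced
-- rows overwrite instead of duplicating. Names here are placeholders.
INSERT INTO analytics.sessions (id, session_date, pageviews, extracted_at)
SELECT id, session_date, pageviews, extracted_at
FROM analytics.sessions_staging
ON CONFLICT (id) DO UPDATE
SET session_date = EXCLUDED.session_date,
    pageviews    = EXCLUDED.pageviews,
    extracted_at = EXCLUDED.extracted_at;
```
target-jsonl has no notion of primary keys, which is why any dedup has to happen in whatever reads the files afterwards.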
a
Thanks, I'm beginning to understand what meltano does and doesn't do, particularly around the 'at least once' principle and how that works with the saved state. Also, if I'm using a blob storage target, then my query engine needs to handle the deduplication. Maybe jsonl wasn't the most typical target to start off with.
e
Yep, you got it
a
Is there a good example of using a query engine over json data in S3/Azure to handle the deduping? Feels like it would be a lot of boilerplate.
e
Not sure if there are any examples out there of deduping directly over object storage, but dbt has a macro that can be used once the data is in the dwh. We used to be on Athena internally, so maybe @pat_nadolny has some ideas.
p
It depends on some settings too: the target-snowflake I'm using does a merge, so no duplicates get added, although you can opt for append-only by not requiring a PK https://hub.meltano.com/loaders/target-snowflake#primary_key_required-setting. I think that wasn't the case for Athena, so we used row-number logic based on the newest record to arrive to dedup (roughly the sketch below). I've kept that logic, but I don't think I need it anymore.
That dbt macro is cool though, didn't know that existed
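For reference, the row-number dedup is just something like this in a staging model. The table, key, and timestamp column names are placeholders, not our actual schema:
```sql
-- Dedup sketch: keep only the newest copy of each record per primary key.
-- raw.ga_sessions, id, and extracted_at are placeholder names.
with ranked as (

    select
        *,
        row_number() over (
            partition by id
            order by extracted_at desc
        ) as row_num
    from raw.ga_sessions

)

select *
from ranked
where row_num = 1
```
In an actual dbt staging model the from clause would be a {{ source(...) }} or {{ ref(...) }}, and with a merge-capable loader you can often skip this step entirely.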
a
Thanks @pat_nadolny, so you just handle it in your dbt staging model, makes sense.