# getting-started
a
Hi all, looking at getting into meltano, but have a couple areas where I don't understand how meltano manages state for incremental. Say I have a google analytics instance to connect to, and I want to export jsonl to local just to keep things simple. I want to do an incremental load, and run the sync every hour to update some dashboards.
• Does the data from today get appended to an already existing json file?
• When the 10am run for today kicks in, how does it know to replace all the current data for today in the json file?
e
Hi Andy,
> Does the data from today get appended to an already existing json file?
That’s the default behavior. But target-jsonl also supports creating a new, timestamped file: https://hub.meltano.com/loaders/target-jsonl#do_timestamp_file-setting
> When the 10am run for today kicks in, how does it know to replace all the current data for today in the json file?
target-jsonl will not update existing data, only append or create a new file. Other loaders, like target-postgres, do upsert data based on the primary keys.
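For illustration, the upsert a key-aware loader performs is conceptually something like the statement below. The table and column names are made up for the example, and the exact SQL each target generates differs; this is just the idea, not what any particular target emits verbatim:
```sql
-- Hypothetical sketch: incoming records land in a staging table and are
-- upserted into the destination on the stream's primary key, so re-synced
-- rows overwrite instead of duplicating. Names here are placeholders.
INSERT INTO analytics.sessions (id, session_date, pageviews, extracted_at)
SELECT id, session_date, pageviews, extracted_at
FROM analytics.sessions_staging
ON CONFLICT (id) DO UPDATE
SET session_date = EXCLUDED.session_date,
    pageviews    = EXCLUDED.pageviews,
    extracted_at = EXCLUDED.extracted_at;
```
target-jsonl has no notion of primary keys, which is why any dedup has to happen in whatever reads the files afterwards.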
a
Thanks, I'm beginning to understand what meltano does and doesn't do, particularly around the 'at least once' principle and how that works with the saved state. Also, if I'm using a blob storage target, then my query engine needs to handle the deduplication. Maybe jsonl wasn't the most typical target to start off with.
e
Yep, you got it
a
Is there a good example of using a query engine over json data in S3/Azure to handle the deduping? Feels like it would be a lot of boilerplate.
e
Not sure if there are any examples out there of deduping directly over object storage, but dbt has a macro that can be used once the data is in the dwh. We used to be on Athena internally, so maybe @pat_nadolny has some ideas.
p
It depends on some settings too: the target-snowflake I'm using does a merge, so no duplicates get added, although you can opt for append-only by not requiring a PK https://hub.meltano.com/loaders/target-snowflake#primary_key_required-setting. I think that wasn't the case for Athena, so we used row-number logic based on the newest record to arrive to dedup (roughly the sketch below). I've kept that logic, but I don't think I need it anymore.
That dbt macro is cool though, didn't know that existed
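For reference, the row-number dedup is just something like this in a staging model. The table, key, and timestamp column names are placeholders, not our actual schema:
```sql
-- Dedup sketch: keep only the newest copy of each record per primary key.
-- raw.ga_sessions, id, and extracted_at are placeholder names.
with ranked as (

    select
        *,
        row_number() over (
            partition by id
            order by extracted_at desc
        ) as row_num
    from raw.ga_sessions

)

select *
from ranked
where row_num = 1
```
In an actual dbt staging model the from clause would be a {{ source(...) }} or {{ ref(...) }}, and with a merge-capable loader you can often skip this step entirely.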
a
Thanks @pat_nadolny, so you just handle it in your dbt staging model, makes sense.