# best-practices
c
Hey all, what's the current best practice for identifying created rows from a specific run? For example, I've got Airflow running every 4 hours and 'FULL_TABLE'-ing into my warehouse. I want to create a view that only shows users the most recent run.
p
Most targets let you add metadata columns about the sync, e.g. https://hub.meltano.com/loaders/target-snowflake#add-metadata-columns (see also the Stitch docs on the _sdc columns: https://www.stitchdata.com/docs/replication/loading/understanding-destination-schemas#sdc-columns). Then you can filter for the most recent timestamp.
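e.g. in your meltano.yml, something like this (just a sketch, assuming the pipelinewise variant of target-snowflake that the hub page documents):

plugins:
  loaders:
    - name: target-snowflake
      config:
        # adds _sdc_extracted_at, _sdc_batched_at, _sdc_deleted_at to every table
        add_metadata_columns: true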
c
Huh, I've been adding the metadata columns in my meltano.yml. Only _sdc_extracted_at, _sdc_batched_at, and _sdc_deleted_at have shown up so far. From the Stitch docs: "These columns are applicable to all tables and integration types. Unless noted, every column in this list will be present in every integration table created by Stitch."
p
I think those should be enough for what you're trying to do! If you filter for the max _sdc_extracted_at or _sdc_batched_at, does that give you your subset?
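i.e. something like this (just a sketch; mytable is a placeholder):

-- keep only rows whose batch timestamp equals the latest one
SELECT *
FROM mytable
WHERE _sdc_batched_at = (SELECT MAX(_sdc_batched_at) FROM mytable);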
c
Not quite. For example, I just ran the example job and it took 20 min, so I've got a range of timestamps that together make up the rows from that job.
p
ah yeah, that doesn't solve the problem then 🤔. It probably depends on the data, but what if you took the max _sdc_batched_at per record? That would give you the most up-to-date copy of each one, but there's a chance some are still from a previous sync if they were hard deleted 🤷
c
my thoughts exactly! at the moment I just make sure the job fires at the top of the hour and takes less than an hour...
so I can do a max(date_trunc(hour,_SDC_BATCHED_AT))
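i.e. the view ends up as roughly this (a sketch; mytable is a placeholder):

-- all rows whose batch hour matches the most recent batch hour
SELECT *
FROM mytable
WHERE DATE_TRUNC('hour', _sdc_batched_at) = (
    SELECT MAX(DATE_TRUNC('hour', _sdc_batched_at)) FROM mytable
);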
p
that seems like a reasonable way to do it! As long as you'd find out when a job runs longer than expected, e.g. via an alert of some sort.
a
I've done the same as ☝️ with date-level logic, on a source getting one load per day.
You can calculate a recency rank with something like this:
ROW_NUMBER() OVER (
    PARTITION BY mytable.unique_id
    ORDER BY _sdc_batched_at DESC
) AS recency_rank  -- filter for recency_rank = 1 to remove all dupes
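If you're on Snowflake you can also filter on it directly with QUALIFY instead of wrapping it in a subquery (sketch; mytable and unique_id are placeholders):

-- dedupe to the latest copy of each row
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY mytable.unique_id
    ORDER BY _sdc_batched_at DESC
) = 1;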
p
One other thought I had: you could use a mapper to add a new column to your stream with a static sync ID. Maybe you set a SYNC_ID environment variable based on the current timestamp before each run; then every record in that batch gets the same sync ID, vs _sdc_batched_at, which might be close but isn't identical across records in the same sync.
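Something like this in meltano.yml (just a sketch, assuming your loader is Singer SDK-based and supports stream maps; users is a hypothetical stream name):

plugins:
  loaders:
    - name: target-snowflake
      config:
        stream_maps:
          users:                          # hypothetical stream name
            _sync_id: config['sync_id']   # same value stamped on every record in the run
        stream_map_config:
          sync_id: $SYNC_ID               # set SYNC_ID to a timestamp before each run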