# best-practices
c
Hey all, what's the current best practice for identifying created rows from a specific run? For example, I've got Airflow running every 4 hours and 'FULL_TABLE'-ing into my warehouse. I want to create a view that only shows users the most recent run.
p
Most targets let you add metadata columns about the sync, e.g. https://hub.meltano.com/loaders/target-snowflake#add-metadata-columns (see also the Stitch docs on the _sdc columns: https://www.stitchdata.com/docs/replication/loading/understanding-destination-schemas#sdc-columns). Then you can filter for the most recent timestamp.
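e.g. in your meltano.yml, something like this (just a sketch, assuming the pipelinewise variant of target-snowflake that the hub page documents):

plugins:
  loaders:
    - name: target-snowflake
      config:
        # adds _sdc_extracted_at, _sdc_batched_at, _sdc_deleted_at to every table
        add_metadata_columns: true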
c
Huh, I've been adding the metadata columns in my meltano.yml. Only _sdc_extracted_at, _sdc_batched_at, and _sdc_deleted_at have shown up so far. From the Stitch docs: "These columns are applicable to all tables and integration types. Unless noted, every column in this list will be present in every integration table created by Stitch."
p
I think those should be enough for what you're trying to do! If you filter for the max _sdc_extracted_at or _sdc_batched_at, does that give you your subset?
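i.e. something like this (just a sketch; mytable is a placeholder):

-- keep only rows whose batch timestamp equals the latest one
SELECT *
FROM mytable
WHERE _sdc_batched_at = (SELECT MAX(_sdc_batched_at) FROM mytable);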
c
Not quite. For example, I just ran the example job and it took 20 min, so I've got a range of timestamps that together make up the rows from that job.
p
ah yeah, that doesn't solve the problem then 🤔. It probably depends on the data, but what if you took the max _sdc_batched_at per record? That would give you the most up-to-date copy of each one, but there's a chance some are still from a previous sync if they were hard deleted 🤷
c
my thoughts exactly! at the moment I just make sure the job fires at the top of the hour and takes less than an hour...
so I can do a max(date_trunc(hour,_SDC_BATCHED_AT))
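i.e. the view ends up as roughly this (a sketch; mytable is a placeholder):

-- all rows whose batch hour matches the most recent batch hour
SELECT *
FROM mytable
WHERE DATE_TRUNC('hour', _sdc_batched_at) = (
    SELECT MAX(DATE_TRUNC('hour', _sdc_batched_at)) FROM mytable
);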
p
that seems like a reasonable way to do it! As long as you'd find out when a job runs longer than expected, e.g. via an alert of some sort.
a
I've done the same as ☝️ with date-level logic, on a source getting one load per day.
You can calculate a recency rank with something like this:
ROW_NUMBER() OVER (
    PARTITION BY mytable.unique_id
    ORDER BY _sdc_batched_at DESC
) AS recency_rank  -- filter for recency_rank = 1 to remove all dupes
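If you're on Snowflake you can also filter on it directly with QUALIFY instead of wrapping it in a subquery (sketch; mytable and unique_id are placeholders):

-- dedupe to the latest copy of each row
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY mytable.unique_id
    ORDER BY _sdc_batched_at DESC
) = 1;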
p
One other thought I had: you could use a mapper to add a new column to your stream with a static sync ID. Maybe you set a SYNC_ID environment variable based on the current timestamp before each run; then every record in that batch gets the same sync ID, vs _sdc_batched_at, which might be close but isn't identical across records in the same sync.
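Something like this in meltano.yml (just a sketch, assuming your loader is Singer SDK-based and supports stream maps; users is a hypothetical stream name):

plugins:
  loaders:
    - name: target-snowflake
      config:
        stream_maps:
          users:                          # hypothetical stream name
            _sync_id: config['sync_id']   # same value stamped on every record in the run
        stream_map_config:
          sync_id: $SYNC_ID               # set SYNC_ID to a timestamp before each run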