# best-practices
p
Hi guys, here at Quantile we are using Meltano to build a Data Warehouse (DWH) for a client. Since one of our sources allegedly makes use of hard deletes (i.e. when a row is deleted in the UI, it is deleted in the DB), we would like to prevent these records from showing up in the dashboards. Therefore, we are running a `full-refresh` every day. To give a bit more context about my question, let me rephrase this. Imagine the source returns 10 rows today; it might return 9 rows tomorrow. The `full-refresh` option only excludes using the `state`, so it will return all (9) rows again tomorrow. The "problem" we have is that the "deleted" row still exists in the database (in our case Snowflake). The full refresh obviously doesn't do anything with the existing "deleted" row. We want to make sure this "deleted" row is not included in the BI dashboards, so we are thinking of deleting the "old" data and replacing it with the "new" data. I would love to handle this with Meltano, but I can't find any sources on how to do it. I am curious what best practices you guys would advise to handle this situation?
s
Just for some context: I've just collected some practices and would first of all recommend not to do that. https://meltano.com/blog/5-helpful-extract-load-practices-for-high-quality-raw-data/#2_Deduplicate_data_at_a_level_beyond_the_raw_level. Usually the better option is to use time stamps and filter in your staging layer.
So in your example, if I understand it correctly, you would simply do the full refresh, include the metadata, and then in your staging model filter for the latest date (always, on all models). If you're using dbt, then a macro would probably do that job for you. @pat_nadolny (?)
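A minimal sketch of what that staging-layer filter could look like, assuming the loader adds an `_sdc_batched_at` metadata column (the source, table, and column names here are made up, not an official Meltano pattern):

```sql
-- Hypothetical dbt staging model, e.g. models/staging/stg_orders.sql.
-- Assumes the loader tags every row with an _sdc_batched_at load timestamp.
with source as (

    select * from {{ source('tap_example', 'orders') }}

),

latest_batch as (

    select max(_sdc_batched_at) as max_batched_at from source

)

select source.*
from source
inner join latest_batch
    on source._sdc_batched_at = latest_batch.max_batched_at
```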
And I think @visch (I might be wrong) talked about the second option: always doing full refreshes. That would mean doing full refresh extracts, then emptying your table (or loading into a temp table to switch over after everything works out fine), and then loading the data.
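A rough sketch of that second option on Snowflake, loading into a scratch table and swapping it in only once the full refresh succeeds (table names are made up, and this isn't tied to any particular Meltano setting):

```sql
-- Hypothetical "load then swap" pattern; raw.orders_incoming is a scratch table.
create or replace table raw.orders_incoming like raw.orders;

-- ...point the loader at raw.orders_incoming and run the full-refresh extract...

-- Swap only after the load finished cleanly, then drop the old data.
alter table raw.orders swap with raw.orders_incoming;
drop table if exists raw.orders_incoming;
```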
p
@Sven Balnojan thank you for the fast response. I will quietly wait for the responses of the tagged people/others 🙂 Also, thanks for the best practices article, just read through it. Some useful tips for us!
v
The "right" way to do this is to look at ACTIVATE VERSION singer messages. If you want the full detail on what this does checkout https://github.com/meltano/meltano/issues/2463 note lots of things don't support this so you will have to do some work to get this working. Regarding snowflake costs and things it may be worth it, but I don't understand enough about snowflake costs to know For my use cases I do a lot of full table syncs to a localpostgres DB, so what I do is just drop the tables every run so something like
meltano run dbt:drop_source_tables tap-name target-name
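That `drop_source_tables` piece would be a custom dbt operation, not something Meltano ships; a macro along these lines could back it (a sketch with an assumed hard-coded `raw` schema, not the actual implementation):

```sql
-- Hypothetical macro behind a "drop_source_tables" command, runnable with
-- `dbt run-operation drop_source_tables`; schema name and logic are assumptions.
{% macro drop_source_tables(schema_name='raw') %}
    {% if execute %}
        {% set tables = run_query(
            "select table_name from information_schema.tables where table_schema = upper('" ~ schema_name ~ "')"
        ) %}
        {% for row in tables.rows %}
            {% do run_query('drop table if exists ' ~ schema_name ~ '.' ~ row[0]) %}
        {% endfor %}
    {% endif %}
{% endmacro %}
```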
p
Thank you @visch for the advice and tip! 😄
p
Here's some docs on "activate version": https://hub.meltano.com/singer/docs#activate-version. I think the dbt approach suggested is a good one though; using the load timestamp allows you to filter for only the most recent full table sync. Like Derek's example, you can also consider a source data cleanup process with a retention policy (e.g. 7, 14, or 30 days) depending on your use case, so your source tables don't grow too large with redundant sets of data.
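That cleanup can be as small as a scheduled delete keyed on the load timestamp; a sketch with a 14-day window (the table, column, and window are all assumptions):

```sql
-- Hypothetical retention cleanup: keep only the last 14 days of raw loads.
delete from raw.orders
where _sdc_batched_at < dateadd(day, -14, current_timestamp());
```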
j
Yeah, I've always just added timestamps and done my own "start date" and "end date" in my staging layer post-ingestion. If storage is cheap, just hold entire copies of tables and build a view on top; if it's more expensive you can do a MERGE. I've used both strategies effectively.
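For the MERGE route on Snowflake, something like this is the shape of it (table and column names are made up; and since Snowflake's MERGE has no "not matched by source" branch, hard deletes need a separate DELETE):

```sql
-- Hypothetical merge of the latest full extract into a current-state table.
merge into analytics.orders_current as tgt
using (
    select *
    from raw.orders
    where _sdc_batched_at = (select max(_sdc_batched_at) from raw.orders)
) as src
on tgt.id = src.id
when matched then update set
    status = src.status,
    updated_at = src.updated_at
when not matched then insert (id, status, updated_at)
    values (src.id, src.status, src.updated_at);

-- Rows absent from the latest extract were hard-deleted at the source.
-- Assumes id is a non-null primary key.
delete from analytics.orders_current
where id not in (
    select id
    from raw.orders
    where _sdc_batched_at = (select max(_sdc_batched_at) from raw.orders)
);
```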
And +1 to thinking about retention if you keep full copies
p
Thank you all for the replies, we will think about this! I think that working with ACTIVATE VERSION would be the best practice, but since it is not implemented everywhere, I tend to go for the dbt filtering in the staging layer.
t
Can you use LOG_BASED replication? That simplifies this quite a bit...
p
Nope, the data does not contain logs. Therefore, we need to do a full replication each time
t
Just to be clear, LOG_BASED replication doesn't depend on logs in the data, it depends on the transaction logs produced by the database. Whether that's an option depends on both the database and the configuration though.
p
It is not supported by the database; it doesn't have transaction logs.
t
Bummer.
n
Adding my vote to the “remove the records early in the DBT staging layer” solution, which is what we do for this situation
j
In general I find it bad practice to assume that you don't get dupes in staging. Nearly every time I've made that assumption I've had to handle it downstream anyway (after something breaking, or worse, bad data getting to prod).
s
@jacob_matson I think we agree with you, per the article linked above 😄 But "staging" is a terrible word... So our (Meltano) recommendation: 1. Have duplicates land in your landing area (for raw raw data) 2. Filter them at your staging level (the level after the raw raw landing area, as referred to by dbt). So yes, the assumption should always be that there will be duplicates! Because there will. Always. Even if you think you have a primary key...
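A staging-level dedup in that spirit can be tiny; a sketch assuming Snowflake's QUALIFY and an `_sdc_batched_at` load timestamp (the key and names are made up):

```sql
-- Hypothetical dedup: keep the most recently loaded copy of each primary key.
select *
from {{ source('tap_example', 'orders') }}
qualify row_number() over (
    partition by id
    order by _sdc_batched_at desc
) = 1
```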
j
sorry, at a former co we called the layer that held the "raw raw" data the "persistent staging area". force of habit lol
n
“raw raw data” has great `dataset_final_FINAL_v2_USE_THIS.xlsx` vibes
p
same, I've heard "staging" used for the raw "source" landing area and also for the cleaned-up/deduped area 🙄
j
we actually had a pretty nice way of talking about different levels of clean, which I like to use: psa (persistent staging) -> prep -> conformed -> secured