# best-practices
p
Hi guys, here at Quantile we are using Meltano to build a Data Warehouse (DWH) for a client. Since one of our sources allegedly makes use of hard deletes (i.e. when a row is deleted in the UI, it is deleted in the DB), we would like to prevent these records from showing up in the dashboards. Therefore, we are running a `full-refresh` every day. To give a bit more context about my question, let me rephrase this. Imagine the source returns 10 rows today; it might return 9 rows tomorrow. The `full-refresh` option only excludes using the `state`, so it will return all (9) rows again tomorrow. The "problem" we have is that the "deleted" row still exists in the database (in our case Snowflake). The full refresh obviously doesn't do anything with the existing "deleted" row. We want to make sure this "deleted" row is not included in the BI dashboards, so we are thinking of deleting the "old" data and replacing it with the "new" data. I would love to handle this with Meltano, but I can't find any sources on how to do it. I am curious what best practices you guys would advise to handle this situation?
s
Just for some context: I've just collected some practices and would first of all recommend not to do that. https://meltano.com/blog/5-helpful-extract-load-practices-for-high-quality-raw-data/#2_Deduplicate_data_at_a_level_beyond_the_raw_level. Usually the better option is to use time stamps and filter in your staging layer.
So in your example, if I understand it correctly, you would simply do the full refresh, include the metadata, and then in your staging model filter for the latest date (always, on all models). If you're using dbt, then a macro would probably do that job for you. @pat_nadolny (?)
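A minimal sketch of what that staging-layer filter could look like, assuming the loader adds an `_sdc_batched_at` metadata column (the source, table, and column names here are made up, not an official Meltano pattern):

```sql
-- Hypothetical dbt staging model, e.g. models/staging/stg_orders.sql.
-- Assumes the loader tags every row with an _sdc_batched_at load timestamp.
with source as (

    select * from {{ source('tap_example', 'orders') }}

),

latest_batch as (

    select max(_sdc_batched_at) as max_batched_at from source

)

select source.*
from source
inner join latest_batch
    on source._sdc_batched_at = latest_batch.max_batched_at
```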
And I think @visch (I might be wrong) talked about the second option: always doing full refreshes. That would mean doing full refresh extracts, then emptying your table (or loading into a temp table to switch over after everything works out fine), and then loading the data.
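A rough sketch of that second option on Snowflake, loading into a scratch table and swapping it in only once the full refresh succeeds (table names are made up, and this isn't tied to any particular Meltano setting):

```sql
-- Hypothetical "load then swap" pattern; raw.orders_incoming is a scratch table.
create or replace table raw.orders_incoming like raw.orders;

-- ...point the loader at raw.orders_incoming and run the full-refresh extract...

-- Swap only after the load finished cleanly, then drop the old data.
alter table raw.orders swap with raw.orders_incoming;
drop table if exists raw.orders_incoming;
```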
p
@Sven Balnojan thank you for the fast response. I will quietly wait for the responses of the tagged people/others 🙂 Also, thanks for the best practices article, just read through it. Some useful tips for us!
v
The "right" way to do this is to look at ACTIVATE VERSION singer messages. If you want the full detail on what this does checkout https://github.com/meltano/meltano/issues/2463 note lots of things don't support this so you will have to do some work to get this working. Regarding snowflake costs and things it may be worth it, but I don't understand enough about snowflake costs to know For my use cases I do a lot of full table syncs to a localpostgres DB, so what I do is just drop the tables every run so something like
meltano run dbt:drop_source_tables tap-name target-name
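That `drop_source_tables` piece would be a custom dbt operation, not something Meltano ships; a macro along these lines could back it (a sketch with an assumed hard-coded `raw` schema, not the actual implementation):

```sql
-- Hypothetical macro behind a "drop_source_tables" command, runnable with
-- `dbt run-operation drop_source_tables`; schema name and logic are assumptions.
{% macro drop_source_tables(schema_name='raw') %}
    {% if execute %}
        {% set tables = run_query(
            "select table_name from information_schema.tables where table_schema = upper('" ~ schema_name ~ "')"
        ) %}
        {% for row in tables.rows %}
            {% do run_query('drop table if exists ' ~ schema_name ~ '.' ~ row[0]) %}
        {% endfor %}
    {% endif %}
{% endmacro %}
```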
p
Thank you @visch for the advice and tip! 😄
p
Here's some docs on "activate version": https://hub.meltano.com/singer/docs#activate-version. I think the dbt approach suggested is a good one though; using the load timestamp allows you to filter for only the most recent full table sync. Like Derek's example, you can also consider a source data cleanup process with a retention policy (e.g. 7, 14, or 30 days) depending on your use case, so your source tables don't grow too large with redundant sets of data.
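That cleanup can be as small as a scheduled delete keyed on the load timestamp; a sketch with a 14-day window (the table, column, and window are all assumptions):

```sql
-- Hypothetical retention cleanup: keep only the last 14 days of raw loads.
delete from raw.orders
where _sdc_batched_at < dateadd(day, -14, current_timestamp());
```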
j
Yeah, I've always just added timestamps and done my own "start date" and "end date" in my staging layer post-ingestion. If storage is cheap, just hold entire copies of tables and build a view on top; if it's more expensive you can do a MERGE. I've used both strategies effectively.
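For the MERGE route on Snowflake, something like this is the shape of it (table and column names are made up; and since Snowflake's MERGE has no "not matched by source" branch, hard deletes need a separate DELETE):

```sql
-- Hypothetical merge of the latest full extract into a current-state table.
merge into analytics.orders_current as tgt
using (
    select *
    from raw.orders
    where _sdc_batched_at = (select max(_sdc_batched_at) from raw.orders)
) as src
on tgt.id = src.id
when matched then update set
    status = src.status,
    updated_at = src.updated_at
when not matched then insert (id, status, updated_at)
    values (src.id, src.status, src.updated_at);

-- Rows absent from the latest extract were hard-deleted at the source.
-- Assumes id is a non-null primary key.
delete from analytics.orders_current
where id not in (
    select id
    from raw.orders
    where _sdc_batched_at = (select max(_sdc_batched_at) from raw.orders)
);
```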
And +1 to thinking about retention if you keep full copies
p
Thank you all for the replies, we will think about this! I think that working with ACTIVATE VERSION would be the best practice, but since it is not implemented everywhere, I tend to go for the dbt filtering in the staging layer.
t
Can you use LOG_BASED replication? That simplifies this quite a bit...
p
Nope, the data does not contain logs. Therefore, we need to do a full replication each time
t
Just to be clear, LOG_BASED replication doesn't depend on logs in the data, it depends on the transaction logs produced by the database. Whether that's an option depends on both the database and the configuration though.
p
It is not supported by the database; it doesn't have transaction logs.
t
Bummer.
n
Adding my vote to the “remove the records early in the DBT staging layer” solution, which is what we do for this situation
j
In general I find it bad practice to assume that you don't get dupes in staging. Nearly every time I've made that assumption I've had to handle it downstream anyway (after something breaking, or worse, bad data getting to prod).
s
@jacob_matson I think we agree with you, per the article linked above 😄 But "staging" is a terrible word... So our (Meltano) recommendation: 1. Have duplicates land in your landing area (for raw raw data) 2. Filter them at your staging level (the level after the raw raw landing area, as referred to by dbt). So yes, the assumption should always be that there will be duplicates! Because there will. Always. Even if you think you have a primary key...
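A staging-level dedup in that spirit can be tiny; a sketch assuming Snowflake's QUALIFY and an `_sdc_batched_at` load timestamp (the key and names are made up):

```sql
-- Hypothetical dedup: keep the most recently loaded copy of each primary key.
select *
from {{ source('tap_example', 'orders') }}
qualify row_number() over (
    partition by id
    order by _sdc_batched_at desc
) = 1
```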
j
sorry, at a former co we called the layer that held the "raw raw" data the "persistent staging area". force of habit lol
n
“raw raw data” has great `dataset_final_FINAL_v2_USE_THIS.xlsx` vibes
p
same, I've heard "staging" used for the raw "source" landing area and also for the cleaned-up/deduped area 🙄
j
we actually had a pretty nice way of talking about different levels of clean, which I like to use: psa (persistent staging) -> prep -> conformed -> secured