# best-practices
u
Howdy Meltano community! :melty-bouncy: We care about your knowledge & opinions. What are your EL (and possibly T) best practices? Let's brainstorm!
s
@taylor thanks for the inspiration for this one! Feel free to brainstorm 🙂
v
Yes to both 🤷
With DBs, if you care about space, turn on compression
I tend to do: delete the raw data table, import to the raw data table. My stage env pulls from raw, so stage is immutable and I don't deduplicate raw data. My use cases aren't the normal analytics use cases though, so 🤷
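(Not @visch's actual setup, just a rough sketch of that pattern with invented table names:)

```sql
-- Rough sketch, names invented: full-refresh the raw table on every run...
drop table if exists raw.orders;
create table raw.orders as
select * from landing.orders_extract;  -- whatever the loader just landed

-- ...and stage only ever reads from raw, never writes to it,
-- so raw stays undeduplicated and stage stays effectively immutable.
create or replace view stage.orders as
select * from raw.orders;
```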
s
@visch btw. you can add options to the brainstorm.
@pat_nadolny (once you're back) @aaronsteers @edgar_ramirez_mondragon @rune_andersen @Henning Holgersen, you're welcome to add more brainstorming ideas here - just use the "Add option" button!
Btw. @visch I used to do just that in a company-internal context. It has some powerful pros, including the simplicity & beauty of doing it that way - it also helps with GDPR and these kinds of regulations, in my experience 🙂 (And some downsides: due to size and load on the source database, for instance, we had to switch a few imports to incremental and only do a weekly full refresh.)
Alright, I'll drop a few more I have in mind:
1. Data sparsity - ingest just what you need, not more. (Counterintuitive?)
2. Always add a unique identifier to the ingested rows (one identifier for the run!). Timestamps work well, as do UUIDs.
3. Try to organize imported data by larger domains.
And one final one from me: 4. Have assumptions on raw data and test them (on the raw ingested data), but don't let the ingestion fail if the assumptions fail! (Raw data is as it is, your assumptions might be wrong)
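A minimal sketch of what point 4 could look like as a dbt singular test that warns instead of failing - the source name and the assumption itself are made up for illustration:

```sql
-- tests/assert_payments_amount_not_negative.sql (hypothetical example)
-- severity: warn records the failing rows but never blocks the run,
-- in case the assumption (not the raw data) turns out to be wrong.
{{ config(severity = 'warn') }}

select *
from {{ source('stripe', 'payments') }}  -- assumed source name
where amount < 0
```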
a
> 4. Have assumptions on raw data and test them (on the raw ingested data), but don't let the ingestion fail if the assumptions fail! (Raw data is as it is, your assumptions might be wrong)
@Sven Balnojan @visch I think this is an important question. Personally I think it makes more sense in most cases (barring lots of noise / huge quantity) to load all data and express the business logic that would've been a test as essentially a dead-letter-queue model during staging. Then you can test for any records in that model with just a `select *`, yet you expressively declared business logic that says you don't want that stuff mucking up your downstream in the immediate term, but you DO want to act on it. You also don't want to take down downstream reporting (probably). What do you think?
So a `stg_stripe__payments` and a `stg_stripe__payments_quarantine` or some such
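A minimal sketch of that split in dbt, assuming a Stripe payments source and a made-up amount/status rule as the stand-in business logic:

```sql
-- models/staging/stripe/stg_stripe__payments.sql (sketch; rules invented)
-- Only rows that satisfy the declared assumptions flow downstream.
select *
from {{ source('stripe', 'payments') }}
where amount is not null
  and status in ('succeeded', 'pending', 'failed')
```

```sql
-- models/staging/stripe/stg_stripe__payments_quarantine.sql (sketch)
-- The complement: anything breaking the assumptions lands here,
-- visible and actionable, without failing the run.
select *
from {{ source('stripe', 'payments') }}
where amount is null
   or coalesce(status, '') not in ('succeeded', 'pending', 'failed')
```

Then a simple non-empty check (the `select *`) against the quarantine model tells you there's something to act on.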
s
Haha I love the table naming and I like the approach. Especially the raw stage and the quarantine stage afterwards. So I have the option to either go to the source team and clear things up (in case bad data truly came through and will be cleaned up at the source level) OR to adapt my business logic. Your approach is very actionable. The alternative way would be to just use the plain tests and results and push the results into a CI system, and halt the transformation if the assumption tests break.
v
Really depends on what the end goal is for that @alexander_butler. Generally yes, raw data should flow into raw and not be altered (unless there's actually something wrong with the raw data itself; in that case reload all your data). The quarantine thing can make sense in some cases, but it all depends on what you're after. For a lot of the things I do the stg data needs to have every record inside of it, so a quarantine isn't an option. For me a failure is much, much better than not having all the data in stage.
p
I think both are valid approaches, so I second what Derek said: it totally depends on your needs, trading off timeliness against accuracy/completeness. For a solid amount of use cases, incomplete data in the consumption model/fact tables is worse than delayed data. Like if 1/4 of your sales records fail and go into quarantine but the pipeline continues to completion, then consumers of that data will either suspect issues and might lose trust in the data, or they could take action on inaccurate data. But on the other hand there's valid use cases where having partial data available as soon as possible is better than no data.
a
Ideally dashboards have some sort of indicator that upstream quarantine queues are non-zero, with a nice warning sign or yellow light. But then we get more and more into esoteric idealism 😄 And even then I agree, it's the proverbial "It depends" answer that reigns supreme.

But I do think data quality and transformation integrity are two separate categories of tests. The latter is subject to unit tests and proofs of correctness, which are always valid to run and halt the pipeline on failure (and theoretically should be caught early by CI). The former makes no sense to run in CI and is always more of a slog, since it involves processes and communication with stakeholders/system owners; and if you test something you know will fail, like fields the team is bad at populating in the source system, it defeats the purpose. You need to discuss with the key stakeholders their preference on data quality flags, which floats between one extreme of halting everything and the polar opposite of `case when` statements that just patch it in staging but push the original to quarantine.

There could be a dbt test of some sort which looks at a model's upstreams, appends `_quarantine`, checks if the relation exists, then passes or fails based on the business specification when the result is non-zero.
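To make that last idea concrete, a sketch of what such a generic test might look like - nothing official, the macro name is invented and it checks a sibling `<model>_quarantine` relation rather than walking the upstreams:

```sql
-- macros/test_quarantine_is_empty.sql (hypothetical sketch)
{% test quarantine_is_empty(model) %}

    {# Look for a sibling <model>_quarantine relation in the same schema. #}
    {%- set quarantine = adapter.get_relation(
            database=model.database,
            schema=model.schema,
            identifier=model.identifier ~ '_quarantine') -%}

    {%- if quarantine is none -%}
        {# No quarantine relation exists: return no rows, so the test passes. #}
        select 1 as quarantined where 1 = 0
    {%- else -%}
        {# Every quarantined row is reported as a failure (or warning, per severity). #}
        select * from {{ quarantine }}
    {%- endif -%}

{% endtest %}
```

Pass/fail vs. warn on a non-zero result would then just be the test's `severity` config, per the business specification.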