# best-practices
u
Howdy Meltano community! :melty-bouncy: We care about your knowledge & opinions. What are your EL (and possibly T) best practices? Let's brainstorm!
s
@taylor thanks for the inspiration for this one! Feel free to brainstorm 🙂
v
Yes to both 🤷
With DBs, if you care about space, turn on compression
I tend to do: delete the raw data table, import to the raw data table. My stage env pulls from raw, so stage is immutable and I don't deduplicate raw data. My use cases aren't the normal analytics use cases though, so 🤷
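(Not @visch's actual setup, just a rough sketch of that pattern with invented table names:)

```sql
-- Rough sketch, names invented: full-refresh the raw table on every run...
drop table if exists raw.orders;
create table raw.orders as
select * from landing.orders_extract;  -- whatever the loader just landed

-- ...and stage only ever reads from raw, never writes to it,
-- so raw stays undeduplicated and stage stays effectively immutable.
create or replace view stage.orders as
select * from raw.orders;
```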
s
@visch btw. you can add options to the brainstorm.
@pat_nadolny (once you're back) @aaronsteers @edgar_ramirez_mondragon @rune_andersen @Henning Holgersen, you're welcome to add more brainstorming ideas here - just use the "Add option" button!
Btw. @visch I used to do just that in a company-internal context. It has some powerful pros, including the simplicity & beauty of doing it that way - it also helps with GDPR and these kinds of regulations, in my experience 🙂 (And some downsides: due to size and load on the source database, for instance, we had to switch a few imports to incremental and only do a weekly full refresh.)
Alright, I'll drop a few more I have in mind:
1. Data sparsity - ingest just what you need, not more. (Counterintuitive?)
2. Always add a unique identifier to the ingested rows (one identifier for the run!). Timestamps work well, as do UUIDs.
3. Try to organize imported data by larger domains.
And one final one from me: 4. Have assumptions on raw data and test them (on the raw ingested data), but don't let the ingestion fail if the assumptions fail! (Raw data is as it is, your assumptions might be wrong)
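A minimal sketch of what point 4 could look like as a dbt singular test that warns instead of failing - the source name and the assumption itself are made up for illustration:

```sql
-- tests/assert_payments_amount_not_negative.sql (hypothetical example)
-- severity: warn records the failing rows but never blocks the run,
-- in case the assumption (not the raw data) turns out to be wrong.
{{ config(severity = 'warn') }}

select *
from {{ source('stripe', 'payments') }}  -- assumed source name
where amount < 0
```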
a
> 4. Have assumptions on raw data and test them (on the raw ingested data), but don't let the ingestion fail if the assumptions fail! (Raw data is as it is, your assumptions might be wrong)
@Sven Balnojan @visch I think this is an important question. Personally I think it makes more sense in most cases (barring lots of noise / huge quantity) to load all data and express the business logic that would've been a test as essentially a dead-letter-queue model during staging. Then you can test for any records in that model with just a `select *`, yet you expressively declared business logic that says you don't want that stuff mucking up your downstream in the immediate term, but you DO want to act on it. You also don't want to take down downstream reporting (probably). What do you think?
So a `stg_stripe__payments` and a `stg_stripe__payments_quarantine` or some such
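A minimal sketch of that split in dbt, assuming a Stripe payments source and a made-up amount/status rule as the stand-in business logic:

```sql
-- models/staging/stripe/stg_stripe__payments.sql (sketch; rules invented)
-- Only rows that satisfy the declared assumptions flow downstream.
select *
from {{ source('stripe', 'payments') }}
where amount is not null
  and status in ('succeeded', 'pending', 'failed')
```

```sql
-- models/staging/stripe/stg_stripe__payments_quarantine.sql (sketch)
-- The complement: anything breaking the assumptions lands here,
-- visible and actionable, without failing the run.
select *
from {{ source('stripe', 'payments') }}
where amount is null
   or coalesce(status, '') not in ('succeeded', 'pending', 'failed')
```

Then a simple non-empty check (the `select *`) against the quarantine model tells you there's something to act on.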
s
Haha I love the table naming and I like the approach. Especially the raw stage and the quarantine stage afterwards. So I have the option to either go to the source team and clear things up (in case bad data truly came through and will be cleaned up at the source level) OR to adapt my business logic. Your approach is very actionable. The alternative way would be to just use the plain tests and results and push the results into a CI system, and halt the transformation if the assumption tests break.
v
Really depends on what the end goal is for that @alexander_butler. Generally yes, raw data should flow into raw and not be altered (unless there's actually something wrong with the raw data itself; in that case reload all your data). The quarantine thing can make sense in some cases, but it all depends on what you're after. For a lot of the things I do the stg data needs to have every record inside of it, so a quarantine isn't an option. For me a failure is much, much better than not having all the data in stage.
p
I think both are valid approaches, so I second what Derek said: it totally depends on your needs, trading off timeliness against accuracy/completeness. For a solid amount of use cases, incomplete data in the consumption model/fact tables is worse than delayed data. Like if 1/4 of your sales records fail and go into quarantine but the pipeline continues to completion, then consumers of that data will either suspect issues and might lose trust in the data, or they could take action on inaccurate data. But on the other hand there's valid use cases where having partial data available as soon as possible is better than no data.
a
Ideally dashboards have some sort of indicator that upstream quarantine queues are non-zero, with a nice warning sign or yellow light. But then we get more and more into esoteric idealism 😄 And even then I agree, it's the proverbial "It depends" answer that reigns supreme.

But I do think data quality and transformation integrity are two separate categories of tests. The latter is subject to unit tests and proofs of correctness, which are always valid to run and halt the pipeline on failure (and theoretically should be caught early by CI). The former makes no sense to run in CI and is always more of a slog, since it involves processes and communication with stakeholders/system owners; and if you test something you know will fail, like fields the team is bad at populating in the source system, it defeats the purpose. You need to discuss with the key stakeholders their preference on data quality flags, which floats between one extreme of halting everything and the polar opposite of `case when` statements that just patch it in staging but push the original to quarantine.

There could be a dbt test of some sort which looks at a model's upstreams, appends `_quarantine`, checks if the relation exists, then passes or fails based on the business specification when the result is non-zero.
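To make that last idea concrete, a sketch of what such a generic test might look like - nothing official, the macro name is invented and it checks a sibling `<model>_quarantine` relation rather than walking the upstreams:

```sql
-- macros/test_quarantine_is_empty.sql (hypothetical sketch)
{% test quarantine_is_empty(model) %}

    {# Look for a sibling <model>_quarantine relation in the same schema. #}
    {%- set quarantine = adapter.get_relation(
            database=model.database,
            schema=model.schema,
            identifier=model.identifier ~ '_quarantine') -%}

    {%- if quarantine is none -%}
        {# No quarantine relation exists: return no rows, so the test passes. #}
        select 1 as quarantined where 1 = 0
    {%- else -%}
        {# Every quarantined row is reported as a failure (or warning, per severity). #}
        select * from {{ quarantine }}
    {%- endif -%}

{% endtest %}
```

Pass/fail vs. warn on a non-zero result would then just be the test's `severity` config, per the business specification.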