Does anyone have any best practices around running...
# best-practices
j
Does anyone have any best practices around running dbt in production? We’re starting to see that it’s becoming difficult to manage dependencies by running specific models for individual pipelines, and we’re exploring our options. One option, even according to dbt themselves, is to start with a simple `dbt run` on a schedule and exclude heavier models later on as you identify them. What have others done?
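For context, a minimal sketch of that “start simple, exclude the heavy stuff later” approach, assuming cron as the scheduler, a made-up project path, and a hypothetical `heavy` tag applied to the expensive models once you’ve identified them:
```
# Hourly: full project run, skipping anything tagged "heavy" (hypothetical tag)
0 * * * *  cd /srv/analytics && dbt run --exclude tag:heavy

# Nightly: catch the heavy models up on their own
0 2 * * *  cd /srv/analytics && dbt run --select tag:heavy
```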
t
That was the process we went through at GitLab. We started with one big DAG and then began peeling things off as the run became too long. Towards the end of my time there we landed on a pattern where we’d couple the extract and load with the building of dbt source tables at a defined interval (daily or every 6 hours), and then downstream DAGs could pull from those source tables and be guaranteed to have data that’s only X hours stale, depending on requirements. I’d also suggest working backwards from what you’re trying to achieve on the BI / decision side and seeing if you can carve out some of the dependencies in the DAG.
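Roughly something like this, as a sketch of that split (the extract/load wrapper script, the `staging` directory, and the `google_analytics` source area are all illustrative names):
```
# Every 6 hours: extract + load, then rebuild just the source/staging models
./run_extract_load.sh && dbt run --select staging

# Downstream jobs: rebuild everything downstream of a given source area,
# excluding the staging models that were already refreshed above
dbt run --select staging.google_analytics+ --exclude staging
```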
j
OK, thanks, we’ll check that out!
p
I’m really interested in this topic and was just talking with Taylor about it, because I don’t think there’s an easy answer. As I thought through this, it feels like there are a few patterns I always hear, and they tend to progress in this order:
1. Simply run the full DAG - it seems like most everyone grows out of this relatively quickly.
2. Full DAG except for staging models, which are decoupled and run with selection criteria. Sometimes source data is better refreshed more frequently, or alternatively can’t be refreshed as fast as the rest (e.g. a weekly FTP drop feeding a model that’s also fed by a source updated hourly).
3. Manually defined DAG subsets using selection rules:
◦ Defined explicitly using select operators, e.g. `--select mart_x.model_y` or `--select mart_x.*`
◦ Source based, using graph operators, e.g. `--select stage.google_analytics+`
◦ Result based, using graph operators, e.g. `--select +mart_x.model_y`
4. The above strategy, but introducing tags to make even more precise selections (see the sketch below).
I had almost the same progression as Taylor already mentioned. I’ll note that once you decouple staging/source models, you’ll need freshness tests with monitoring/alerting to know when things are stale. Once you get to step 3 you’ve broken up the DAG and are trying to piece it back together, which is good for scaling but essentially forfeits a major benefit of dbt: that it manages your DAG for you. You have to keep track of making sure everything has been selected at some point, and that there are no timing issues between decoupled dependencies.
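A quick sketch of what steps 3-4 and the freshness checks might look like on the CLI. The tag names (`hourly`, `daily`) and the `marts.finance` path are hypothetical, and `dbt source freshness` assumes your sources are configured with a `loaded_at_field` and `freshness:` thresholds:
```
# Tag-based subsets (step 4)
dbt run --select tag:hourly
dbt run --select marts.finance,tag:daily   # intersection: both criteria must match

# Freshness check for the decoupled sources; a non-zero exit on stale
# sources can drive your monitoring/alerting
dbt source freshness
```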
t
Yeah - you’ve basically outlined the full path, Pat 😄 B/c from there you get into very specific stakeholder requirements. You could have users trigger an end-to-end refresh of data from source to dashboard, but who picks up the compute cost if that’s a heavy run? Do users know that by triggering a resync they could be spending $100 in Snowflake compute? Should they care? It’s basically a miniature version of the centralization / decentralization (bundle/unbundle) conversation, and there’s no “right” answer.