# best-practices
s
Hey everyone! I was wondering how you guys manage your ELT schedule using dbt & Meltano? Currently we have multiple EL jobs every hour, and these then update the dbt models affected by that data (e.g. EL HubSpot data into your warehouse, then update all models dependent on HubSpot). I've recently started worrying about concurrency issues, as well as duplicated work, so I was curious how you manage your dbt jobs once the data is loaded:
• Per stream?
• All at once?
• Another option?
v
The best way with dbt is to run all of the models at the same time, maybe once a day or maybe every hour. One project I have going really, really wanted them to run independently, at a stream level, so that's what we do now. Basically you need to run with Docker so you don't hit any file-locking issues with Meltano catalogs. On the dbt side we used locks (https://stackoverflow.com/questions/70383118/using-dbt-and-meltano-how-can-i-prevent-multiple-dbt-job-runs-from-conflicting): basically, wait until the last dbt job is done. You can definitely do a lock per stream as well, but this worked for us. I highly recommend staying away from this though: on top of the extra work, you're pushing a bunch of stuff into your orchestration layer, and now you have to manage that somehow, specifically freshness tests. "How do I make sure we actually ran all of the tables in the last day?" becomes harder than just running `dbt run`.
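For illustration, a minimal sketch of the "wait until the last dbt job is done" idea, assuming a POSIX system and dbt invoked via subprocess; this is the general pattern, not necessarily the exact setup from the Stack Overflow thread:

```python
# Minimal sketch: serialize concurrent `dbt run` invocations with a file lock.
# Assumes POSIX (fcntl) and dbt on PATH; the lock path is illustrative.
import fcntl
import subprocess

LOCK_PATH = "/tmp/dbt_run.lock"  # hypothetical lock file location

def locked_dbt_run(args=("run",)):
    with open(LOCK_PATH, "w") as lock_file:
        # Blocks here until the previous dbt job releases the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            subprocess.run(["dbt", *args], check=True)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

if __name__ == "__main__":
    locked_dbt_run()
```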
t
Damn, I wish I'd found that a couple of months ago lol. I built a Meltano "utility" to do the locking. Derek's solution is better for dbt, I think, but mine could be used for non-dbt things. Hopefully at some point this just gets built into the EDK and can be done automatically, though.
v
I like your idea @thomas_briggs. I'm not sure if the OLAP DBs have locking mechanics (I'm a noob!)
t
There are a couple of different ways to do it (get a DB lock) with MySQL, for better or worse. I think the same is true of Oracle, but it's been a long time since I've worked with it. (Sadly, I liked Oracle.) Anyway, having Meltano enforce the lock is nice because it doesn't depend on the DB or the utility being run, but it does clutter up the pipeline. 😕 And I have no idea how it plays with orchestrators, Docker images, etc. I'm doing everything with shell scripts and cron, so I can probably get away with some things others can't. 😛
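For the curious, a sketch of the MySQL-side approach mentioned above, using the built-in GET_LOCK()/RELEASE_LOCK() named locks. The PyMySQL client and connection details are assumptions; note the lock is tied to the session, so the connection has to stay open while the work runs:

```python
# Sketch: hold a MySQL named lock around `dbt run`.
# Assumes PyMySQL (pip install pymysql); host/user/database are placeholders.
import subprocess
import pymysql

conn = pymysql.connect(host="localhost", user="etl", password="...", database="analytics")
try:
    with conn.cursor() as cur:
        # Wait up to 600 seconds for any other holder to finish.
        cur.execute("SELECT GET_LOCK('dbt_run', 600)")
        (acquired,) = cur.fetchone()
        if not acquired:
            raise RuntimeError("Timed out waiting for the dbt_run lock")
        try:
            subprocess.run(["dbt", "run"], check=True)
        finally:
            cur.execute("SELECT RELEASE_LOCK('dbt_run')")
finally:
    conn.close()
```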
v
I find it super interesting that we both hit the same issue and solved it differently. We're probably not the only ones!
t
Agreed!
s
Thanks guys, both awesome ideas!
p
We have dbt staging models for each source that we run as part of the EL pipeline. Since these depend on just that one source there’s no worry of concurrency. All other dbt models are run later as part of a single job. In our case it’s kicked off by a sensor condition that checks if all the EL processes have run but you could get away with a fixed schedule in most cases.
Shameless plug for my talk from Coalesce last year lol https://www.youtube.com/live/eAeIFFY5818?feature=share
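A rough sketch of the split described above, assuming staging models live under models/staging/<source>/ so dbt's path-based selection applies; the source names are made up:

```python
# Sketch: run one source's staging models right after its EL job, then run
# everything else as a single downstream job once all sources have landed.
# Assumes models/staging/<source>/ layout; source names are illustrative.
import subprocess

def run_staging(source: str) -> None:
    # Safe to run concurrently with other sources: this selection only
    # touches models that depend on that one source.
    subprocess.run(["dbt", "run", "--select", f"staging.{source}"], check=True)

def run_downstream() -> None:
    # Everything except staging, kicked off after all EL jobs complete.
    subprocess.run(["dbt", "run", "--exclude", "staging"], check=True)

if __name__ == "__main__":
    run_staging("hubspot")  # e.g. at the end of the HubSpot EL pipeline
    run_downstream()        # e.g. on a sensor or fixed schedule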
v
@prratek_ramchandani you used the word sensor, so I'm guessing you're using Dagster, and then probably running each job in a separate container, which bypasses the locking problem as well! The concurrency problem I was solving was running the same dbt project in the same directory at the same time. I haven't hit concurrency problems at the database level 🤷, maybe that's what @Stéphane Burwash was referring to! If so, then that video is probably what you're after!
p
Yeah, you’re spot on. And I was definitely thinking about database concurrency problems, primarily accidental concurrent writes to the same table.
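A rough sketch of what a Dagster sensor gating the downstream dbt job could look like; the marker-file convention, source list, and job import are assumptions for illustration, not the actual setup described above:

```python
# Sketch: fire the downstream dbt job once every EL pipeline has dropped a
# "done" marker file. Marker convention and names are hypothetical.
import os
from dagster import RunRequest, SkipReason, sensor

from my_project.jobs import dbt_downstream_job  # hypothetical job definition

SOURCES = ["hubspot", "stripe", "salesforce"]  # hypothetical source list
MARKER_DIR = "/tmp/el_markers"

@sensor(job=dbt_downstream_job)
def all_el_done_sensor(context):
    markers = [os.path.join(MARKER_DIR, f"{s}.done") for s in SOURCES]
    if not all(os.path.exists(m) for m in markers):
        return SkipReason("Still waiting on one or more EL jobs")
    # Dedupe on the newest marker's mtime so each batch triggers one run.
    run_key = str(max(os.path.getmtime(m) for m in markers))
    return RunRequest(run_key=run_key)
```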
s
@prratek_ramchandani that's a great way to operate, thanks!