# best-practices
s
Hey everyone! I was wondering how you guys manage your ELT schedule using dbt & Meltano? Currently we have multiple EL jobs every hour, and these then update the dbt models affected by that data (e.g. EL HubSpot data into your warehouse, then update all models dependent on HubSpot). I've recently started worrying about concurrency issues, as well as duplicated work, so I was curious how you manage your dbt jobs once the data is loaded:
• Per stream?
• All at once?
• Another option?
v
The best way with dbt is to run all of the models at the same time, maybe once a day or maybe every hour. One project I have going really, really wanted them to run independently, at a stream level, so that's what we do now. Basically you need to run with Docker so you don't hit any file-locking issues with Meltano catalogs. On the dbt side we used locks (https://stackoverflow.com/questions/70383118/using-dbt-and-meltano-how-can-i-prevent-multiple-dbt-job-runs-from-conflicting): basically, wait until the last dbt job is done. You can definitely do a lock per stream as well, but this worked for us. I highly recommend staying away from this though: on top of the extra work, you're pushing a bunch of stuff into your orchestration layer, and now you have to manage that somehow, specifically freshness tests. "How do I make sure we actually ran all of the tables in the last day?" becomes harder than just running `dbt run`.
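For illustration, a minimal sketch of the "wait until the last dbt job is done" idea, assuming a POSIX system and dbt invoked via subprocess; this is the general pattern, not necessarily the exact setup from the Stack Overflow thread:

```python
# Minimal sketch: serialize concurrent `dbt run` invocations with a file lock.
# Assumes POSIX (fcntl) and dbt on PATH; the lock path is illustrative.
import fcntl
import subprocess

LOCK_PATH = "/tmp/dbt_run.lock"  # hypothetical lock file location

def locked_dbt_run(args=("run",)):
    with open(LOCK_PATH, "w") as lock_file:
        # Blocks here until the previous dbt job releases the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            subprocess.run(["dbt", *args], check=True)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

if __name__ == "__main__":
    locked_dbt_run()
```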
t
Damn, I wish I'd found that a couple of months ago lol. I built a Meltano "utility" to do the locking. Derek's solution is better for dbt, I think, but mine could be used for non-dbt things. Hopefully at some point this just gets built into the EDK and can be done automatically, though.
v
I like your idea @thomas_briggs. I'm not sure if the OLAP DBs have locking mechanics (I'm a noob!)
t
There are a couple of different ways to do it (get a DB lock) with MySQL, for better or worse. I think the same is true of Oracle, but it's been a long time since I've worked with it. (Sadly, I liked Oracle.) Anyway, having Meltano enforce the lock is nice because it doesn't depend on the DB or the utility being run, but it does clutter up the pipeline. 😕 And I have no idea how it plays with orchestrators, Docker images, etc. I'm doing everything with shell scripts and cron, so I can probably get away with some things others can't. 😛
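For the curious, a sketch of the MySQL-side approach mentioned above, using the built-in GET_LOCK()/RELEASE_LOCK() named locks. The PyMySQL client and connection details are assumptions; note the lock is tied to the session, so the connection has to stay open while the work runs:

```python
# Sketch: hold a MySQL named lock around `dbt run`.
# Assumes PyMySQL (pip install pymysql); host/user/database are placeholders.
import subprocess
import pymysql

conn = pymysql.connect(host="localhost", user="etl", password="...", database="analytics")
try:
    with conn.cursor() as cur:
        # Wait up to 600 seconds for any other holder to finish.
        cur.execute("SELECT GET_LOCK('dbt_run', 600)")
        (acquired,) = cur.fetchone()
        if not acquired:
            raise RuntimeError("Timed out waiting for the dbt_run lock")
        try:
            subprocess.run(["dbt", "run"], check=True)
        finally:
            cur.execute("SELECT RELEASE_LOCK('dbt_run')")
finally:
    conn.close()
```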
v
I find it super interesting that we both hit the same issue and solved it differently. We're probably not the only ones!
t
Agreed!
s
Thanks guys, both awesome ideas!
p
We have dbt staging models for each source that we run as part of the EL pipeline. Since these depend on just that one source there’s no worry of concurrency. All other dbt models are run later as part of a single job. In our case it’s kicked off by a sensor condition that checks if all the EL processes have run but you could get away with a fixed schedule in most cases.
Shameless plug for my talk from Coalesce last year lol https://www.youtube.com/live/eAeIFFY5818?feature=share
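A rough sketch of the split described above, assuming staging models live under models/staging/<source>/ so dbt's path-based selection applies; the source names are made up:

```python
# Sketch: run one source's staging models right after its EL job, then run
# everything else as a single downstream job once all sources have landed.
# Assumes models/staging/<source>/ layout; source names are illustrative.
import subprocess

def run_staging(source: str) -> None:
    # Safe to run concurrently with other sources: this selection only
    # touches models that depend on that one source.
    subprocess.run(["dbt", "run", "--select", f"staging.{source}"], check=True)

def run_downstream() -> None:
    # Everything except staging, kicked off after all EL jobs complete.
    subprocess.run(["dbt", "run", "--exclude", "staging"], check=True)

if __name__ == "__main__":
    run_staging("hubspot")  # e.g. at the end of the HubSpot EL pipeline
    run_downstream()        # e.g. on a sensor or fixed schedule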
v
@prratek_ramchandani you used the word sensor, so I'm guessing you're using Dagster, and then probably running each job in a separate container, which bypasses the locking problem as well! The concurrency problem I was solving was running the same dbt project in the same directory at the same time. I haven't hit concurrency problems at the database level 🤷, maybe that's what @Stéphane Burwash was referring to! If so, then that video is probably what you're after!
p
Yeah, you’re spot on. And I was definitely thinking about database concurrency problems, primarily accidental concurrent writes to the same table.
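A rough sketch of what a Dagster sensor gating the downstream dbt job could look like; the marker-file convention, source list, and job import are assumptions for illustration, not the actual setup described above:

```python
# Sketch: fire the downstream dbt job once every EL pipeline has dropped a
# "done" marker file. Marker convention and names are hypothetical.
import os
from dagster import RunRequest, SkipReason, sensor

from my_project.jobs import dbt_downstream_job  # hypothetical job definition

SOURCES = ["hubspot", "stripe", "salesforce"]  # hypothetical source list
MARKER_DIR = "/tmp/el_markers"

@sensor(job=dbt_downstream_job)
def all_el_done_sensor(context):
    markers = [os.path.join(MARKER_DIR, f"{s}.done") for s in SOURCES]
    if not all(os.path.exists(m) for m in markers):
        return SkipReason("Still waiting on one or more EL jobs")
    # Dedupe on the newest marker's mtime so each batch triggers one run.
    run_key = str(max(os.path.getmtime(m) for m in markers))
    return RunRequest(run_key=run_key)
```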
s
@prratek_ramchandani that's a great way to operate, thanks!