What are best practices regarding tools for Meltano orchestration / scheduling?
# best-practices
p
What are best practices regarding tools for Meltano orchestration / scheduling? I see frequent mention of Docker and of Airflow / Meltano Cloud / Dagster / Prefect. Right now I just run my EL jobs directly inside scheduled GitHub Actions, and it seems to work fine (though requires respecting the 6-hour runner timeout). What am I missing?
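For reference, my current setup is roughly the following (the schedule, plugin names, and secret names are placeholders):

```yaml
# Illustrative scheduled EL workflow running meltano via GitHub Actions
name: nightly-el
on:
  schedule:
    - cron: '0 2 * * *'   # daily at 02:00 UTC
  workflow_dispatch: {}    # allow manual runs as well
jobs:
  el:
    runs-on: ubuntu-latest
    timeout-minutes: 350   # keep a margin under the 6-hour job limit
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install meltano
      - run: meltano install
      - run: meltano run tap-example target-example
        env:
          # credentials injected from repository secrets (illustrative name)
          TAP_EXAMPLE_PASSWORD: ${{ secrets.TAP_EXAMPLE_PASSWORD }}
```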
s
Depends on what you need 😉 I'm a big fan of "simple is best". Unless your use case is different from mine, you don't really need a complex DAG architecture when running Meltano (basically the jobs are all mono-tasks wrapping a `meltano run` command; see the sketch after this message). This gives you flexibility in your choice of orchestrator. Here are a few options:
• Meltano Cloud comes with orchestration services out of the box, so you can always use that (it's a paid option though).
• A self-hosted Airflow stack is always good, but comes with its own issues (scalability, and it requires maintenance).
• You could also graduate to a hosted but self-managed solution such as MWAA, or, if your jobs are really large, go fully hosted with something like Astronomer.
Basically, my 2 cents are:
• It's *not* the choice of orchestrator (the brand) that's important, since your DAG structure is fairly basic.
• What's important is:
  ◦ How scalable / fault-tolerant do you want your architecture to be?
  ◦ How much effort/time/money do you want to put into creating and maintaining your orchestration stack?
• And a bit of advice: maintaining an orchestration stack is a pain in the ass 🍑 and can be a huge time sink, so take that into account if you want to build a fully self-managed version.
Hope this helps 😄
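To illustrate the "mono-task" point, a Meltano schedule is essentially just a named interval wrapped around one `meltano run` invocation, roughly like this (job, schedule, and plugin names are made up):

```yaml
# Sketch of meltano.yml entries: one job, one schedule pointing at it
jobs:
- name: daily-el
  tasks:
  - tap-example target-example   # same as `meltano run tap-example target-example`
schedules:
- name: daily-el-schedule
  interval: '@daily'
  job: daily-el
```

Whichever orchestrator you pick only has to trigger that one command on an interval, which is why the operational questions above matter more than the brand.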
p
Thanks for your thoughts on this! I also absolutely believe in simplicity, and I’ve been burned before dealing with self-managed Airflow. GitHub Actions is probably the extreme end of simple: it’s not even an additional system to adopt, since we’re already GitHub users. Which is why I’m wondering why others don’t seem to use it for EL. Sounds like scale is one reason (perhaps because of the 6-hour runner timeout? presumably not because of instance size, since larger runners are now available). Fault tolerance is an issue I hadn’t thought about, and I’m curious what problems tend to occur in that regard. Perhaps automatic retries are needed because jobs fail when an extraction source is down?
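(For what it’s worth, if it does come down to retries, I suppose I could approximate them directly in the workflow step; a crude sketch with placeholder names:)

```yaml
# Naive retry loop around the EL step; could slot into the job sketched above
- name: Run EL with simple retries
  run: |
    for attempt in 1 2 3; do
      meltano run tap-example target-example && break
      if [ "$attempt" -eq 3 ]; then exit 1; fi
      echo "attempt $attempt failed, retrying in 60s"
      sleep 60
    done
```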
s
For me it was more about being able to push changes without them affecting existing jobs. There's also the timeout -> I have some 10-hour jobs, so GitHub Actions sadly wouldn't work for me. At a certain scale, there are also factors such as observability that become important.
p
I see, thanks, makes sense.