# infra-deployment
joao_paulo_amaral
Hi Meltano team šŸ‘‹ ! I'm currently evaluating some ETL solutions and I have some questions about Meltano's architecture:
• Is it possible to use Meltano with an already running Airflow instance without installing it in the same container (and invoking Meltano from there)? If yes, does this architecture require the use of a DockerOperator or KubernetesPodOperator, or is it possible to call a Meltano instance running in another container/pod?
• To run Meltano in a high-availability setup, do we need to share the data files between the running instances (considering a cluster running only Docker) and use a load balancer, or is there a better approach? How do we scale up if we are not allowed to run a docker-in-docker solution?
Thanks in advance!
ken_payne
Hi @joao_paulo_amaral šŸ‘‹ Thanks for checking out Meltano! We wrote a blog post on how we choose to run Meltano here that might be of interest. To answer your questions:
• The Meltano architecture is fairly straightforward; the 3 main components are a database backend (Postgres) for handling state (e.g. bookmarks for incremental replication of Singer taps), a compute environment capable of executing tasks via the `meltano run` CLI, and (optionally) a long-running instance of the Meltano UI web service. As Meltano typically executes as short-lived jobs, and given the UI is optional at this stage, there isn't too much to consider in terms of HA for Meltano itself; as long as your scheduler, state database and workers are available, Meltano will be able to execute.
• A minimal Airflow install would involve installing `meltano` into your Airflow environment (ideally with `pipx`) and calling `meltano run` using the Airflow BashOperator (a rough sketch of this is below). For ourselves, we use the KubernetesPodOperator to launch a pod from a dedicated Meltano image, to avoid handling dependencies between Airflow and Meltano in the same environment. We provide a DAG Generator as part of the Airflow file bundle that can automatically create Airflow DAGs from your Meltano schedules (using the `meltano schedule list --format=json` command).
• Right now there is no easy way to dispatch `meltano run` commands to a remote service/container via an API from Airflow, though this is definitely a paradigm we are thinking about. We'd love to get your thoughts and feedback if this is something you'd like to see built!
Hope that helps šŸ™‚
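For reference, a minimal sketch of that BashOperator approach might look like the snippet below. It assumes `meltano` is installed in the Airflow environment, a project mounted at `/opt/airflow/meltano-project`, and placeholder tap/target names, so adjust all of those to your setup:

```python
# A sketch only: a single-task DAG that shells out to `meltano run`.
# The project path and tap/target names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="meltano_example_el",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    BashOperator(
        task_id="extract_load",
        bash_command=(
            "cd /opt/airflow/meltano-project && "
            "meltano run tap-github target-postgres"
        ),
    )
```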
joao_paulo_amaral
Great! Thank you so much for the answers @ken_payne!
michel
Hi @ken_payne + @joao_paulo_amaral, thanks for the question and the useful response! If I understood you correctly @ken_payne, I would have something like my drawing below (not the best at drawing 🄲):
• Airflow and Meltano each have their own DB.
• There are long-running pods for Airflow and the Meltano UI (UI only).
• The Meltano execution happens in the PodOperators using my custom Meltano image (rough sketch below), and they have access to the DB, which is then visible in the long-running UI since they use the same DB.
Now some questions regarding this:
• Where and when do I automatically generate the Meltano DAGs and give Airflow access to them? Using a PersistentVolume that is accessed by both the Meltano long-running pod (prod) and Airflow?
• Should I still redeploy the Meltano long-running pod on every image change?
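For concreteness, here is a rough sketch of what I imagine one of those PodOperator tasks looking like. The image, namespace, database URI and tap/target names are all placeholders, and the operator's import path depends on your `apache-airflow-providers-cncf-kubernetes` version:

```python
# Sketch: a KubernetesPodOperator task that runs `meltano run` from a custom
# Meltano image, pointing MELTANO_DATABASE_URI at the shared Meltano system DB
# so runs and state are visible to the long-running UI pod.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="meltano_pod_example",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    KubernetesPodOperator(
        task_id="meltano_run",
        name="meltano-run",
        namespace="meltano",  # placeholder namespace
        image="registry.example.com/my-meltano:latest",  # placeholder custom image
        cmds=["meltano"],
        arguments=["run", "tap-github", "target-postgres"],  # placeholder tap/target
        env_vars={
            # shared system DB so state/bookmarks land in one place
            "MELTANO_DATABASE_URI": "postgresql://user:pass@meltano-db:5432/meltano",
        },
        get_logs=True,
        is_delete_operator_pod=True,
    )
```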
Can anyone else validate the drawing? šŸ˜… I'd like an opinion before setting it up, but I want to set it up soon.
ken_payne
Hey Michel! Sorry I missed your message - I had some 🌓 recently and am still catching up.
> Can anyone else validate the drawing?
Looks good! This is how we run Meltano in our own Squared project šŸ™‚
> Where and when do I automatically generate the Meltano DAGs and give Airflow access to them?
There are a couple of ways to do this. The default way, using the DAG Generator shipped with `files-airflow`, is to install `meltano` into your Airflow environment so that the DAG Generator can call `meltano schedule list` to create DAGs from schedules. Another possibility is to save the JSON output of that `list` command into your project during CI/CD. This is the approach we take in the Squared project, using a modified DAG Generator in our `orchestrate/dags` folder, as we find it is faster to read a cached JSON file than to regenerate one on every DAG refresh (Airflow refreshes DAGs every 60s).
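As a rough illustration of that cached-JSON idea (not the actual Squared generator), a generator could look something like the sketch below. It assumes CI/CD writes the output of `meltano schedule list --format=json` to `orchestrate/dags/schedules.json` and that this output is a flat list of schedule objects with `name` and `interval` keys; the exact shape varies by Meltano version, so adjust the parsing accordingly:

```python
# Sketch: build one DAG per cached Meltano schedule instead of shelling out to
# `meltano schedule list` on every Airflow DAG refresh.
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

# JSON written during CI/CD; file name and location are assumptions.
SCHEDULES_FILE = Path(__file__).parent / "schedules.json"

# Assumed to be a flat list of schedule objects; newer Meltano versions nest
# schedules by type, so adapt this to the shape your version actually emits.
schedules = json.loads(SCHEDULES_FILE.read_text())

for schedule in schedules:
    dag_id = f"meltano_{schedule['name']}"
    dag = DAG(
        dag_id,
        schedule_interval=schedule.get("cron_interval") or schedule.get("interval"),
        start_date=datetime(2022, 1, 1),
        catchup=False,
    )
    BashOperator(
        task_id="run_schedule",
        # `meltano schedule run <name>` executes whatever the schedule points at;
        # adjust the project path to wherever your Meltano project is available.
        bash_command=(
            "cd /opt/airflow/meltano-project && "
            f"meltano schedule run {schedule['name']}"
        ),
        dag=dag,
    )
    # Expose the DAG at module level so Airflow's parser discovers it.
    globals()[dag_id] = dag
```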
> Should I still redeploy the Meltano long-running pod on every image change?
We redeploy our entire stack (Airflow, Meltano and Superset) on every image change, for simplicity and repeatability. If you have a larger team or many changes per day (making the deploy time painful), an approach with persistent volumes and a git sidecar (or similar syncing method) might make more sense.
michel
Hi Ken, hope you had some good time off šŸ˜„ Thanks for the full reply, I will try to make something out of it and set up our infrastructure accordingly. If I have some time I will try to document it so others can use it! It would be nice to have something to work from.