# infra-deployment
joao_paulo_amaral
Hi Meltano team šŸ‘‹ ! I'm currently evaluating some ETL solutions and I have some questions about Meltano's architecture:
• Is it possible to use Meltano with an already running Airflow instance without installing it in the same container (and invoking Meltano from there)? If yes, does this architecture require the use of a DockerOperator or KubernetesPodOperator, or is it possible to call a Meltano instance running in another container/pod?
• To run Meltano in a high-availability setup, do we need to share the data files between the running instances (considering a cluster running only Docker) and use a load balancer, or is there a better approach? How do we scale up if we are not allowed to run a docker-in-docker solution?
Thanks in advance!
ken_payne
Hi @joao_paulo_amaral šŸ‘‹ Thanks for checking out Meltano! We wrote a blog post on how we choose to run Meltano here that might be of interest. To answer your questions:
• The Meltano architecture is fairly straightforward; the 3 main components are a database backend (Postgres) for handling state (e.g. bookmarks for incremental replication of Singer taps), a compute environment capable of executing tasks via the `meltano run` CLI, and (optionally) a long-running instance of the Meltano UI web service. As Meltano typically executes as short-lived jobs, and given the UI is optional at this stage, there isn't too much to consider in terms of HA for Meltano itself; as long as your scheduler, state database and workers are available, Meltano will be able to execute.
• A minimal Airflow install would involve installing `meltano` into your Airflow environment (ideally with `pipx`) and calling `meltano run` using the Airflow BashOperator (a rough sketch of this is below). For ourselves, we use the KubernetesPodOperator to launch a pod from a dedicated Meltano image, to avoid handling dependencies between Airflow and Meltano in the same environment. We provide a DAG Generator as part of the Airflow file bundle that can automatically create Airflow DAGs from your Meltano schedules (using the `meltano schedule list --format=json` command).
• Right now there is no easy way to dispatch `meltano run` commands to a remote service/container via an API from Airflow, though this is definitely a paradigm we are thinking about. We'd love to get your thoughts and feedback if this is something you'd like to see built!
Hope that helps šŸ™‚
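For reference, a minimal sketch of that BashOperator approach might look like the snippet below. It assumes `meltano` is installed in the Airflow environment, a project mounted at `/opt/airflow/meltano-project`, and placeholder tap/target names, so adjust all of those to your setup:

```python
# A sketch only: a single-task DAG that shells out to `meltano run`.
# The project path and tap/target names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="meltano_example_el",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    BashOperator(
        task_id="extract_load",
        bash_command=(
            "cd /opt/airflow/meltano-project && "
            "meltano run tap-github target-postgres"
        ),
    )
```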
joao_paulo_amaral
Great! Thank you so much for the answers @ken_payne!
michel
Hi @ken_payne + @joao_paulo_amaral, thanks for the question and the useful response! If I understood you correctly @ken_payne, I would have something like my drawing below (not the best at drawing 🄲):
• Airflow and Meltano each have their own DB.
• There are long-running pods for Airflow and the Meltano UI (UI only).
• The Meltano execution happens in the PodOperators using my custom Meltano image (rough sketch below), and they have access to the DB, which is then visible in the long-running UI since they use the same DB.
Now some questions regarding this:
• Where and when do I automatically generate the Meltano DAGs and give Airflow access to them? Using a PersistentVolume that is accessed by both the Meltano long-running pod (prod) and Airflow?
• Should I still redeploy the Meltano long-running pod on every image change?
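For concreteness, here is a rough sketch of what I imagine one of those PodOperator tasks looking like. The image, namespace, database URI and tap/target names are all placeholders, and the operator's import path depends on your `apache-airflow-providers-cncf-kubernetes` version:

```python
# Sketch: a KubernetesPodOperator task that runs `meltano run` from a custom
# Meltano image, pointing MELTANO_DATABASE_URI at the shared Meltano system DB
# so runs and state are visible to the long-running UI pod.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="meltano_pod_example",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    KubernetesPodOperator(
        task_id="meltano_run",
        name="meltano-run",
        namespace="meltano",  # placeholder namespace
        image="registry.example.com/my-meltano:latest",  # placeholder custom image
        cmds=["meltano"],
        arguments=["run", "tap-github", "target-postgres"],  # placeholder tap/target
        env_vars={
            # shared system DB so state/bookmarks land in one place
            "MELTANO_DATABASE_URI": "postgresql://user:pass@meltano-db:5432/meltano",
        },
        get_logs=True,
        is_delete_operator_pod=True,
    )
```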
Can anyone else validate the drawing? šŸ˜… I'd like an opinion before setting it up, but I want to set it up soon.
ken_payne
Hey Michel! Sorry I missed your message - I had some 🌓 recently and am still catching up.
> Can anyone else validate the drawing?
Looks good! This is how we run Meltano in our own Squared project šŸ™‚
> Where and when do I automatically generate the Meltano DAGs and give Airflow access to them?
There are a couple of ways to do this. The default way, using the DAG Generator shipped with `files-airflow`, is to install `meltano` into your Airflow environment so that the DAG Generator can call `meltano schedule list` to create DAGs from schedules. Another possibility is to save the JSON output of that `list` command into your project during CI/CD. This is the approach we take in the Squared project, using a modified DAG Generator in our `orchestrate/dags` folder, as we find it is faster to read a cached JSON file than to regenerate one on every DAG refresh (Airflow refreshes DAGs every 60s).
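As a rough illustration of that cached-JSON idea (not the actual Squared generator), a generator could look something like the sketch below. It assumes CI/CD writes the output of `meltano schedule list --format=json` to `orchestrate/dags/schedules.json` and that this output is a flat list of schedule objects with `name` and `interval` keys; the exact shape varies by Meltano version, so adjust the parsing accordingly:

```python
# Sketch: build one DAG per cached Meltano schedule instead of shelling out to
# `meltano schedule list` on every Airflow DAG refresh.
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

# JSON written during CI/CD; file name and location are assumptions.
SCHEDULES_FILE = Path(__file__).parent / "schedules.json"

# Assumed to be a flat list of schedule objects; newer Meltano versions nest
# schedules by type, so adapt this to the shape your version actually emits.
schedules = json.loads(SCHEDULES_FILE.read_text())

for schedule in schedules:
    dag_id = f"meltano_{schedule['name']}"
    dag = DAG(
        dag_id,
        schedule_interval=schedule.get("cron_interval") or schedule.get("interval"),
        start_date=datetime(2022, 1, 1),
        catchup=False,
    )
    BashOperator(
        task_id="run_schedule",
        # `meltano schedule run <name>` executes whatever the schedule points at;
        # adjust the project path to wherever your Meltano project is available.
        bash_command=(
            "cd /opt/airflow/meltano-project && "
            f"meltano schedule run {schedule['name']}"
        ),
        dag=dag,
    )
    # Expose the DAG at module level so Airflow's parser discovers it.
    globals()[dag_id] = dag
```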
> Should I still redeploy the Meltano long-running pod on every image change?
We redeploy our entire stack (Airflow, Meltano and Superset) on every image change, for simplicity and repeatability. If you have a larger team or many changes per day (making the deploy time painful), an approach with persistent volumes and a git sidecar (or similar syncing method) might make more sense.
michel
Hi Ken, hope you had some good time off šŸ˜„ Thanks for the full reply, I will try to make something out of it and set up our infrastructure accordingly. If I have some time I will try to document it so others can use it! It would be nice to have something to work from.