# getting-started
c
New Meltano user here, I'm currently setting up a PoC to try and run Meltano on AWS MWAA with DAGs that use the ECSOperator to run my Meltano project as a Docker image. What I don't seem to be understanding is how I generate DAGs as assets that I can drop into my MWAA S3 bucket at development time using `meltano schedule add`. I was expecting this to create files in the `orchestrate/dags` folder. Is this expectation correct? I've tried with both the file-airflow plugin and the full airflow plugin, and in both cases I can see the Python DAG generation code but no additional DAG files are created. I realize this might be a naive question and if so let me know, all the tooling in this space is new to me and this appears to be the only gap I've come across. Thanks in advance.
v
There are some folks here who are way more into Airflow than me and will answer better. The airflow extension https://github.com/meltano/airflow-ext will create a DAG for you that was designed to run with an Airflow instance pointing to the folder that's generated. There's no reason you couldn't run another script that does `aws s3 cp` to copy the DAG to the S3 bucket you're talking about.
I'm sure the extension could be extended to add an S3 sender or something, or someone could make another extension for sending a file path to S3?
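Something along these lines would do the copy step; a rough sketch using boto3 rather than the CLI (the bucket name, key, and local path here are made up):
```python
# Minimal sketch: push a generated DAG file to the MWAA DAGs bucket.
# Bucket, key, and local path are placeholders for illustration.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="orchestrate/dags/meltano.py",  # file produced by the airflow extension
    Bucket="my-mwaa-environment-bucket",     # hypothetical MWAA bucket name
    Key="dags/meltano.py",                   # MWAA reads DAGs from the dags/ prefix
)
```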
For the airflow extension to generate the DAG it's the `initialize` command, so `meltano run airflow:initialize`.
It does use the schedules you have defined to make these; the code is here: https://github.com/meltano/airflow-ext/blob/main/files_airflow_ext/orchestrate/meltano.py
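If you want that as a build step rather than something you run by hand, a rough sketch (the project path is a placeholder, and the command just mirrors the one above):
```python
# Rough build-step sketch: regenerate the DAG file from the project's schedules,
# then show what landed in orchestrate/dags. The project path is hypothetical.
import subprocess
from pathlib import Path

PROJECT_ROOT = Path("/path/to/meltano-project")  # hypothetical

subprocess.run(
    ["meltano", "run", "airflow:initialize"],
    cwd=PROJECT_ROOT,
    check=True,
)

for dag_file in sorted((PROJECT_ROOT / "orchestrate" / "dags").glob("*.py")):
    print("generated:", dag_file)
```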
c
I have a Meltano MWAA deployment in production that has been running for a couple of months. However, the way I set it up is that I have Meltano installed using MWAA’s startup script into a Python virtual environment on the scheduler and worker nodes. (I’ll add a bit more detail. Even though my approach is different, some of this might still work for you.)
The Meltano Airflow integration relies on an `orchestrate/dags/meltano.py` Python program which programmatically generates the DAG definitions upon invocation by relying on your complete meltano.yml (I've broken mine up across several files for ease of maintenance). Basically the Airflow scheduler (I think that is the correct term) periodically executes `meltano.py` and expects a JSON response that describes the DAGs, which it dutifully generates. I have had to heavily modify my `meltano.py` to work with the virtualenv I host Meltano in on MWAA.
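To make that concrete, here's a stripped-down sketch of the kind of thing that generator does (not the real file; the project path, JSON shape, and field names are assumptions and differ between Meltano versions):
```python
# Stripped-down sketch of an orchestrate/dags/meltano.py-style generator.
# It asks Meltano for the schedules as JSON and turns each one into a DAG.
# The JSON shape assumed here ("name"/"interval" keys) varies by Meltano version.
import json
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

PROJECT_ROOT = "/usr/local/meltano-project"  # hypothetical path

result = subprocess.run(
    ["meltano", "schedule", "list", "--format=json"],
    cwd=PROJECT_ROOT,
    capture_output=True,
    text=True,
    check=True,
)
schedules = json.loads(result.stdout)  # assumed: a flat list of schedule dicts

for schedule in schedules:
    dag_id = f"meltano_{schedule['name']}"
    dag = DAG(
        dag_id,
        schedule_interval=schedule["interval"],
        start_date=datetime(2023, 1, 1),
        catchup=False,
    )
    BashOperator(
        task_id="extract_load",
        bash_command=f"cd {PROJECT_ROOT} && meltano schedule run {schedule['name']}",
        dag=dag,
    )
    # Airflow discovers DAGs through module-level globals
    globals()[dag_id] = dag
```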
So you could, for example, invoke `meltano.py`'s `create_dags()` and capture its output on your local machine, based on your meltano.yml, and then make a Python program that has `create_dags()` defined to return the captured output. Then, as long as the meltano.yml on your ECS instances is similar to (hopefully the same as) what you used to generate the JSON output, this Python program can live in your S3 bucket for MWAA's DAGs.
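So the pre-baked DAG file that lands in the MWAA bucket could look roughly like this (the schedules.json file name and its shape are just illustrative):
```python
# Sketch of a "pre-baked" DAG file: no Meltano install needed on MWAA, because
# the schedule metadata was captured at build time into schedules.json
# (the file name and JSON shape are placeholders).
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

# schedules.json is produced by the build pipeline and shipped next to this file
SCHEDULES = json.loads((Path(__file__).parent / "schedules.json").read_text())


def create_dags():
    """Register one DAG per captured schedule at module level."""
    for schedule in SCHEDULES:
        dag_id = f"meltano_{schedule['name']}"
        dag = DAG(
            dag_id,
            schedule_interval=schedule["interval"],
            start_date=datetime(2023, 1, 1),
            catchup=False,
        )
        # BashOperator here only as a stand-in; see the ECS variant below
        BashOperator(
            task_id="extract_load",
            bash_command=f"meltano schedule run {schedule['name']}",
            dag=dag,
        )
        globals()[dag_id] = dag


create_dags()
```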
you’ll likely need to modify the generated local JSON output - for me, the DAGs use BashOperator because I have a functioning Meltano environment on the scheduler and workers - and you’ll need ECSOperator in place of that I imagine
“on your local machine” can be the build system, of course, and modifying the JSON to substitute BashOperator with EcsOperator, etc. can be part of that pipeline
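For what that swap could look like, here's a sketch of the task with an ECS run-task operator in place of the BashOperator inside the generator loop above (the operator class name depends on your apache-airflow-providers-amazon version, older releases call it ECSOperator, newer ones EcsRunTaskOperator; the cluster, task definition, container name, and networking values are invented):
```python
# Sketch: run the Meltano schedule in an ECS task instead of locally.
# Class name depends on the apache-airflow-providers-amazon version.
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

run_meltano = EcsRunTaskOperator(
    task_id="extract_load",
    dag=dag,  # DAG object from the surrounding generator loop
    cluster="meltano-cluster",        # hypothetical ECS cluster
    task_definition="meltano-task",   # hypothetical task definition
    launch_type="FARGATE",
    overrides={
        "containerOverrides": [
            {
                "name": "meltano",  # container name in the task definition
                # schedule comes from the surrounding generator loop
                "command": ["meltano", "schedule", "run", schedule["name"]],
            }
        ]
    },
    network_configuration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-00000000"],
            "securityGroups": ["sg-00000000"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```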
I do not recommend going down my path of using MWAA’s startup.sh script to venv meltano - it means that in order to deploy an updated meltano.yml for new DAG definitions I need to deploy a new startup.sh script which takes… 10 to 30 minutes in MWAA. I could get around some of this by using symlinks, I just haven’t done so yet since things haven’t gotten to the “this sucks so much I need to fix it” stage yet and I have higher priorities. (Airflow and MWAA were new concepts to me when I started on this project in May so mistakes were made!)
c
Thank you both for the responses, it helps clear up my understanding. I was wanting to avoid having Meltano installed in MWAA and only have it in my Docker image launched in ECS. I'm thinking I can possibly generate the JSON at build time for my environments, then adapt the DAG generator to use that JSON from S3 as its source instead of trying to run Meltano in MWAA, and change the operator to use ECS. Will see how I go 😅
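If I go that way, the S3-sourced loading step might look roughly like this (bucket and key are placeholders; the DAG-building loop would then be the same idea as the sketches above, with the ECS operator instead of BashOperator):
```python
# Sketch: read the build-time schedule JSON from S3 instead of a bundled file.
# Bucket and key are placeholders written by the build pipeline.
import json

import boto3

s3 = boto3.client("s3")
response = s3.get_object(
    Bucket="my-meltano-artifacts-bucket",  # hypothetical artifacts bucket
    Key="build/schedules.json",            # produced at build time
)
SCHEDULES = json.loads(response["Body"].read())

# ...then iterate over SCHEDULES and register DAGs, e.g. with EcsRunTaskOperator
```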