# infra-deployment
j
Hi guys, https://meltano.com/docs/production.html#saas-hosting-options I see that there is an option to deploy Meltano on services like astronomer.io and Google Cloud Composer. Has anyone tried to deploy on https://aws.amazon.com/managed-workflows-for-apache-airflow/ ?
f
I have not yet, but I've thought about it. I'm already using Airflow on an EKS cluster with separate jobs in pods kicked off by the KubernetesExecutor. The only change necessary, as far as I can tell, is to generate "real" DAGs as separate files instead of using the DAG generator included with Meltano, which just scans the output of `meltano schedule list --format=json` or something like that and really won't work with a full Airflow install or something like MWAA. Shouldn't be too hard to do though; I'm just finishing up migrating everything to a new cluster, then I'll try that.
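If it's useful, here is a minimal sketch of what that generator could look like: shell out to `meltano schedule list --format=json` and write one standalone DAG file per schedule. The JSON shape (a flat list with `name`/`interval` keys), the project path, and the use of `meltano schedule run` in the generated BashOperator are assumptions to adapt to your Meltano and Airflow versions.

```python
# Hypothetical sketch: turn `meltano schedule list --format=json` output into
# standalone DAG files that a plain Airflow / MWAA install can pick up.
import json
import subprocess
from pathlib import Path

DAG_TEMPLATE = '''\
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator  # airflow.operators.bash_operator on 1.10.x

with DAG(
    dag_id="meltano_{name}",
    schedule_interval="{interval}",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_{name}",
        # Assumes the Meltano project lives at this path on the worker and that
        # your Meltano version supports `meltano schedule run`.
        bash_command="cd /opt/meltano/project && meltano schedule run {name}",
    )
'''

def main(dags_dir: str = "dags") -> None:
    # The exact JSON shape varies by Meltano version; a flat list of objects
    # with "name" and "interval" keys is assumed here.
    raw = subprocess.run(
        ["meltano", "schedule", "list", "--format=json"],
        capture_output=True, check=True, text=True,
    ).stdout
    schedules = json.loads(raw)
    out = Path(dags_dir)
    out.mkdir(parents=True, exist_ok=True)
    for sched in schedules:
        dag_file = out / f"meltano_{sched['name']}.py"
        dag_file.write_text(
            DAG_TEMPLATE.format(name=sched["name"], interval=sched["interval"])
        )

if __name__ == "__main__":
    main()
```

Running something like this wherever meltano.yml gets rendered would keep the generated files in sync with the schedules before they're synced to the Airflow/MWAA DAGs location.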
j
Yeah, we've been trying to install Meltano on MWAA but no success so far. The Docker image available on GitHub to run locally doesn't work properly either, so it's been hard to find a solution. What we think might work is to install Meltano as a custom plugin and run it inside a PythonVirtualenvOperator DAG (we've tried a little bit but no success there either).
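For reference, a hedged sketch of that PythonVirtualenvOperator idea: pip-install Meltano into the operator's throwaway virtualenv and call its console script from there. This is untested on MWAA; the pipeline names and project path are placeholders, and the import path shown is for Airflow 2.x (`airflow.operators.python_operator` on 1.10.x).

```python
# Sketch only: run Meltano inside a virtualenv created by the operator on the
# Airflow/MWAA worker. Placeholder taps, targets, and paths throughout.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def run_meltano():
    # This function executes inside the virtualenv the operator creates,
    # so all imports have to live in the callable itself.
    import os
    import subprocess
    import sys

    # pip installs console scripts next to the venv's python interpreter.
    meltano_bin = os.path.join(os.path.dirname(sys.executable), "meltano")
    subprocess.run(
        [meltano_bin, "elt", "tap-example", "target-example"],  # placeholder pipeline
        cwd="/usr/local/airflow/dags/meltano_project",          # placeholder project path
        check=True,
    )


with DAG(
    dag_id="meltano_virtualenv_example",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    PythonVirtualenvOperator(
        task_id="run_meltano",
        python_callable=run_meltano,
        requirements=["meltano"],      # installed into the temporary venv at run time
        system_site_packages=False,
    )
```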
f
Here's what I'm doing currently. It's not with MWAA, but it may help.

The Airflow scheduler and web server were installed with `meltano install`. The actual meltano.yml file is rendered by HashiCorp Vault, so I can update the Vault secret and have a new file rendered. The DAG generator automatically picks up new schedules from the file. There are no loaders or extractors in this meltano.yml file, only the orchestrator and schedules. A liveness probe can check dates on files to determine whether the web server or scheduler needs a kick.

For the worker pod template, I created an init container that has an env var set from the dag_id label, which Airflow (actually the DAG generator) populates with the DAG ID (the schedule name prefixed with meltano_). This init container simply saves the dag_id in a file. I do this because Vault Agent can't use env vars, but can use the contents of a file. So I also run Vault Agent on the worker pod, and it pulls secrets from a path specific to the DAG ID (job).

I changed the commands run on the worker image to a shell script that pulls from a git repo (named after the job). That repo's meltano.yml has the extractor, loader, orchestrator, and schedule for the one job. The script runs install, creates links to the shared EFS filesystem for logs and output, then runs `meltano invoke $*`.

Like I said, all I need to do is produce actual DAGs that Airflow (MWAA) will call to run the job on a pod (there are AWS docs for doing this), instead of using the Meltano DAG generator, and everything should work. You might be able to use some of this, so I hope it helps.
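Not the exact setup described above, but a rough sketch of what one of those generated per-job DAG files could look like on MWAA, using the KubernetesPodOperator pattern AWS documents for connecting MWAA to EKS. The image, namespace, entrypoint, labels, and kube_config path are all placeholders.

```python
# Sketch of a "real" per-job DAG for MWAA that launches a Meltano worker pod on EKS.
from datetime import datetime

from airflow import DAG
# On older Airflow/provider versions this import path differs
# (e.g. airflow.contrib.operators.kubernetes_pod_operator on 1.10.x).
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="meltano_my_schedule",  # schedule name prefixed with meltano_
    schedule_interval="0 2 * * *",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    KubernetesPodOperator(
        task_id="run_meltano_job",
        name="meltano-my-schedule",
        namespace="meltano",                                  # placeholder namespace
        image="registry.example.com/meltano-worker:latest",   # placeholder worker image
        cmds=["/entrypoint.sh"],     # a wrapper script along the lines described above:
        arguments=["my_schedule"],   # clone the job repo, meltano install, link EFS, run the job
        labels={"dag_id": "meltano_my_schedule"},             # readable by an init container
        # The MWAA-to-EKS docs describe generating a kube_config for the cluster
        # and shipping it alongside the DAGs; this path is a placeholder.
        config_file="/usr/local/airflow/dags/kube_config.yaml",
        in_cluster=False,
        get_logs=True,
    )
```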
r
I would not recommend Composer or MWAA. I would either deploy Airflow yourself to a managed Kubernetes service, or go the Astronomer route.
f
Yea, I looked into MWAA a bit more and it does not look like it is the right choice. Thanks for the feedback!
s
hey @ricky_renner - curious why you don't recommend MWAA? I just started with it and it definitely seems janky, but I am also new to Airflow, which also seems janky.
r
@stephen_bailey my thoughts are that MWAA and GCP's Composer are just attempts by Amazon and Google to cash in on some open source tools without supporting those open source projects, which is just ridiculous. They pitch this "managed Airflow" product, but since they don't really support it very well, it's just as hard as hosting Airflow yourself. I would recommend going the Astronomer route for managed Airflow 1000% of the time.
s
great to know, thanks ricky. that's definitely the vibe i get from it. very little effort put into making it something better