# best-practices
jean_sahlberg
Hello, is there any guide or documentation on deploying Meltano in an AWS environment? Today we are running it on an EC2 instance, but I don’t think this is the right way to do it…
aaronsteers
Hi, @jean_sahlberg. I'll let others chime in with their experiences, but you may also find some tips and tricks over in the #C01E3MUUB3J channel.
Short answer is that there are lots of different ways to deploy as of now: EKS and ECS are two of the most common approaches, I think.
jean_sahlberg
Hi, thank you @aaronsteers! I’ve found this, but it is outdated… We will probably use EKS, but I was wondering if there is a recommendation from Meltano itself on how it should be deployed.
steve_clarke
We are using Meltano in a container environment in AWS. Specifically, we use Fargate for a serverless ECS implementation that runs the Meltano ELT processes on demand.

We have taken the pattern of using MWAA, the AWS managed Airflow offering, as our front end, with a custom-written DAG that picks up a YAML-based schedule file listing all of our jobs. This is a service that is constantly running and will initiate jobs (including Meltano jobs) based on the defined schedule. The schedule file states the tap and target, the schedule, the ECS task definition to use, and so on. It creates a nice pipeline and can call other components like dbt (we use dbt Cloud).

When a Meltano job is initiated by Airflow, it uses the ECSOperator to start a container as an ECS task to run Meltano, and the appropriate Meltano CLI command is passed from Airflow to run in the container. The container is ephemeral (by nature of using Fargate): it runs, piping the data to the target, and is then automatically destroyed once the job is complete. For state we use a serverless version of Postgres (Aurora) to hold the state of each job. Logs are available in CloudWatch and also in the logs of the Airflow task that initiated the job. Whenever a new job is added, the schedule.yml file is deployed to the DAGs directory by our CI/CD pipeline and Airflow automatically picks it up and generates the appropriate DAG entry.

For passing in environment-specific settings we run `chamber` to hydrate the environment variables in the container. For a given path it reads all of the environment variables related to the tap and target before exec-ing the meltano command. The environment variables are stored securely in AWS SSM Parameter Store (including the tables or objects to select). This keeps our meltano.yml file very light, just defining the taps and targets available, with no config, since it all lives in the environment variables.

What I like about this architecture being task-based is that you can kick off one or many containers for each ELT process: it scales horizontally because they are separate tasks. We have pre-created a number of ECS task definitions with different CPU/memory resources and given them t-shirt sizing names (small | medium | large | x-large), so when we call a particular job we can give it just the right resources.

There are many different approaches to deployment; this one made sense for us because we were already using dbt Cloud, preferred a managed service for Airflow, and wanted serverless compute where possible to minimise patching. If there are no jobs running, the only service left running is the Airflow service (the AWS managed offering via MWAA).
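For anyone wanting to see roughly what this pattern looks like in code, here is a minimal sketch of a schedule-driven DAG that launches Meltano as a Fargate task via the ECSOperator. It is not the setup described above, just an illustration under assumptions: the schedule.yml location and field names, the cluster/container/task-definition names, the networking IDs, and the chamber paths are all invented, and the operator's import path and name vary by Airflow Amazon provider version (newer releases call it EcsRunTaskOperator).

```python
# Sketch only: dynamically generate one DAG per job listed in a YAML schedule file
# and run Meltano as an ephemeral Fargate task. Names and IDs are placeholders.
import yaml
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import ECSOperator

# Illustrative schedule file contents (deployed to the DAGs folder by CI/CD):
#
#   jobs:
#     - name: github-to-snowflake
#       tap: tap-github
#       target: target-snowflake
#       schedule: "0 3 * * *"
#       task_definition: meltano-small   # t-shirt sized Fargate task definition
#
with open("/usr/local/airflow/dags/schedule.yml") as f:
    schedule = yaml.safe_load(f)

for job in schedule["jobs"]:
    dag = DAG(
        dag_id=f"meltano_{job['name']}",
        start_date=datetime(2021, 1, 1),
        schedule_interval=job["schedule"],
        catchup=False,
    )

    # Each run launches an ephemeral Fargate task. chamber hydrates the tap/target
    # settings from SSM Parameter Store before exec-ing the Meltano CLI, so
    # meltano.yml itself carries no config.
    ECSOperator(
        task_id="run_meltano",
        dag=dag,
        cluster="meltano-cluster",
        task_definition=job["task_definition"],
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {
                    "name": "meltano",
                    "command": [
                        "chamber", "exec", job["tap"], job["target"], "--",
                        "meltano", "elt", job["tap"], job["target"],
                    ],
                }
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxxxxx"],
                "securityGroups": ["sg-xxxxxxxx"],
                "assignPublicIp": "DISABLED",
            }
        },
        awslogs_group="/ecs/meltano",
    )

    # Register the generated DAG in the module namespace so Airflow discovers it.
    globals()[dag.dag_id] = dag
```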
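And a rough sketch of what the "t-shirt sized" task definitions could look like if registered with boto3. The family names, image, role ARN, and CPU/memory pairs are illustrative assumptions; in practice these would more likely be managed through IaC (CloudFormation, Terraform, CDK) in the CI/CD pipeline.

```python
# Sketch only: pre-create a few Fargate task definitions with different sizes
# so each job can be launched with just the resources it needs.
import boto3

ecs = boto3.client("ecs")

# Hypothetical sizing tiers; cpu/memory must be valid Fargate combinations.
SIZES = {
    "meltano-small": ("512", "1024"),     # 0.5 vCPU / 1 GB
    "meltano-medium": ("1024", "2048"),   # 1 vCPU / 2 GB
    "meltano-large": ("2048", "4096"),    # 2 vCPU / 4 GB
    "meltano-x-large": ("4096", "8192"),  # 4 vCPU / 8 GB
}

for family, (cpu, memory) in SIZES.items():
    ecs.register_task_definition(
        family=family,
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu=cpu,
        memory=memory,
        executionRoleArn="arn:aws:iam::123456789012:role/meltano-task-execution",
        containerDefinitions=[
            {
                "name": "meltano",
                "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/meltano:latest",
                "essential": True,
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-group": "/ecs/meltano",
                        "awslogs-region": "eu-west-1",
                        "awslogs-stream-prefix": family,
                    },
                },
            }
        ],
    )
```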
jean_sahlberg
@steve_clarke thank you so much for the detailed explanation of your AWS environment using Meltano! This is exactly what we are aiming for: a very clean and scalable solution!