# infra-deployment
m
Has anyone deployed their Meltano build to Google Kubernetes Engine (GKE)?
d
@niall_woodward You maybe?
n
EKS over here! I can help you with the pure Kubernetes questions maybe @michael_cooper
m
Thanks! We've been struggling a lot with this, so hopefully there isn't too much difference between EKS and GKE.
1. Do you have just one pod that houses Meltano, or do you have multiple pods running Meltano within your cluster?
2. Are you changing the MELTANO_DATABASE_URI at all, or just letting it use the default SQLite database?
3. Are you pointing your Airflow database to an external database, or are you mounting volumes for Airflow?
n
Sorry for the slow reply on this. Our deployment consists of a Meltano webserver k8s deployment (pod), an Airflow webserver deployment, and an Airflow scheduler deployment. I'm using Airflow's KubernetesPodOperator to create a new pod for each job. I have one database for Meltano and one for Airflow. I'm also using an EFS volume for both Meltano and Airflow logs.
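A minimal sketch of that pod-per-job pattern, assuming Airflow 2.x with the cncf.kubernetes provider: a DAG that launches each Meltano ELT run as its own Kubernetes pod, with MELTANO_DATABASE_URI pointed at an external Postgres rather than the default SQLite file. The image name, namespace, tap/target names, and connection string are placeholders, and exact import paths and parameter names vary between Airflow/provider versions.
```python
# Hypothetical pod-per-job DAG; names and connection strings are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="meltano_elt_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    elt = KubernetesPodOperator(
        task_id="tap_postgres_to_target_snowflake",
        name="meltano-elt",
        namespace="data",  # placeholder namespace
        image="registry.example.com/my-meltano-project:latest",  # your Meltano image
        cmds=["meltano"],
        arguments=["elt", "tap-postgres", "target-snowflake"],
        env_vars={
            # Point Meltano's system database at an external Postgres so run and
            # state metadata survive pod restarts, instead of the default SQLite file.
            "MELTANO_DATABASE_URI": "postgresql://meltano:password@meltano-db:5432/meltano",
        },
        get_logs=True,
        is_delete_operator_pod=True,  # clean up job pods once they finish
    )
```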
e
@niall_woodward How are you handling resource limits for scheduled Meltano ELT jobs in this type of setup? If Airflow simply schedules an extractor/loader job as a pod on a k8s cluster…doesn't it run the risk of exceeding available CPU/memory limits and being terminated mid-job?
d
@ken_payne Is what @eric_simmerman brought up here related to https://gitlab.com/meltano/meltano/-/issues/2364#note_472161190?
k
Thanks Douwe! Not quite - I believe the performance issues we were seeing relate to how Kubernetes does CPU throttling once a pod is allocated and running.

@eric_simmerman Kubernetes is specifically designed to manage the 'physical' resources for you. It depends a bit on exactly how you choose to deploy a cluster, but kube is broadly responsible for only scheduling new jobs according to available resources on the cluster (provided you set resource requests and optional limits for your tasks) and for adding new nodes if the cluster starts to run out of capacity. Kube will refuse to schedule new tasks if there aren't enough remaining resources. This causes the Airflow KubernetesPodOperator to raise an exception, so we use standard Airflow retries to make sure all jobs do eventually get scheduled in cases where the cluster is particularly busy or needs to scale out. This rarely happens in practice - at most our jobs spend a few minutes 'pending' before launching.

We use EKS on AWS, which takes most of the heavy lifting out of deploying, managing, and scaling kube. We also use spot instances, so we are somewhat more likely to fail than otherwise, but again we rely on Airflow retries to re-run failed launch attempts. The Singer framework (in how it handles state and bookmarks) is pretty resilient to mid-job failures too in our experience 👍
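A hedged sketch of that approach: give the job pod explicit resource requests/limits so Kubernetes only schedules it where capacity exists, and lean on Airflow retries when the cluster is busy or a spot node disappears. Values are illustrative, and container_resources requires a reasonably recent cncf.kubernetes provider (older versions took a plain resources dict instead).
```python
# Illustrative task with resource requests/limits and retries; values are examples only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

with DAG(
    dag_id="meltano_elt_with_limits",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    elt = KubernetesPodOperator(
        task_id="tap_postgres_to_target_snowflake",
        name="meltano-elt",
        namespace="data",
        image="registry.example.com/my-meltano-project:latest",
        cmds=["meltano"],
        arguments=["elt", "tap-postgres", "target-snowflake"],
        # Requests reserve capacity for scheduling; limits cap what the container
        # may actually use once running.
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "1Gi"},
            limits={"cpu": "1", "memory": "2Gi"},
        ),
        # If the pod can't be scheduled or launched (busy cluster, node scale-out,
        # spot interruption), let Airflow retry the whole task later.
        retries=3,
        retry_delay=timedelta(minutes=5),
        get_logs=True,
        is_delete_operator_pod=True,
    )
```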