Thanks Douwe! Not quite - I believe the performance issues we were seeing relate to how Kubernetes does CPU throttling once a pod is allocated and running.

@eric_simmerman Kubernetes is specifically designed to manage the ‘physical’ resources for you. It depends a bit on exactly how you choose to deploy a cluster, but kube is broadly responsible only for scheduling new jobs according to available resources on the cluster (provided you set resource requests and optional limits for your tasks) and for adding new nodes if the cluster starts to run out of capacity.

Kube will refuse to schedule new tasks if there aren’t enough remaining resources. This causes the Kubernetes Airflow Operator to raise an exception, so we use standard Airflow retries to make sure all jobs do eventually get scheduled in cases where the cluster is particularly busy or needs to scale out. This rarely happens in practice - at most our jobs spend a few minutes ‘pending’ before launching.

We use EKS on AWS, which takes most of the heavy lifting out of deploying, managing and scaling kube. We also use spot instances, so our jobs are somewhat more likely to fail than they otherwise would be, but again we rely on Airflow retries to re-run failed launch attempts. The Singer framework (in how it handles state and bookmarks) is pretty resilient to mid-job failures too in our experience 👍
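For reference, the requests/limits I mean are the ones on the container spec - something roughly like this (names, image and values here are purely illustrative, not our actual config). The `limits.cpu` value is also what drives the CPU throttling I mentioned above:

```yaml
# Illustrative pod spec fragment only - substitute your own names/values
apiVersion: v1
kind: Pod
metadata:
  name: example-tap-job
spec:
  containers:
    - name: tap-runner
      image: example/tap:latest
      resources:
        requests:          # what the scheduler uses to place the pod
          cpu: "500m"
          memory: "512Mi"
        limits:            # hard caps; CPU usage above this gets throttled
          cpu: "1"
          memory: "1Gi"
```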
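And to be clear, the retry behaviour is just Airflow's standard `retries`/`retry_delay` task arguments - conceptually it's nothing fancier than a loop like this (plain-Python sketch of the pattern, not actual Airflow internals; the function and error names are made up):

```python
import time

def run_with_retries(launch, retries=3, retry_delay=1.0):
    """Keep attempting `launch` until it succeeds or retries are exhausted,
    mirroring how Airflow re-runs a task whose pod couldn't be scheduled."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return launch()
        except RuntimeError as err:  # e.g. 'pod unschedulable: cluster full'
            last_error = err
            if attempt < retries:
                time.sleep(retry_delay)  # give the cluster time to scale out
    raise last_error

# Example: a launch that fails twice (cluster busy) then succeeds
attempts = {"n": 0}
def flaky_launch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("pod unschedulable: insufficient cpu")
    return "scheduled"
```

In practice you never write this yourself - you just pass `retries=` and `retry_delay=` to the operator and let Airflow handle it.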