Is anyone using the k8s executor with AirFlow and ...
# infra-deployment
f
Is anyone using the k8s executor with Airflow and a "git repo" workflow? Meaning, you have a fairly bare container for the worker, you clone the repo down from your git server when the job executes, run a `meltano install`, then kick off the invocation with a script? If so, can you post the Dockerfile for your container and/or the script, or provide some pointers?
k
I have never used the git sidecar approach with Airflow before, preferring to bake Docker images (tagged with a commit id) to 'version' deployments. It isn't perfect, since Airflow itself has to restart on each deployment of the built image, but we always liked being able to build and test Airflow and our DAGs together (especially to catch dependency issues introduced by changes or additions in DAGs) before releasing. I have toyed with draining techniques in the past to smooth over the deployment interruptions but never persisted long enough to reach a stable solution 😅 Would be super interested to hear if you find a way to make the git sidecar work whilst preserving the "known good build" benefits 🙂
t
At GitLab we had a watcher that periodically cloned the repo to get new DAGs: https://gitlab.com/gitlab-data/data-image/-/blob/master/airflow_image/manifests/deployment.yaml#L175 It worked, but I really don't know if it was the best way to do this. It gave us some trouble when trying to upgrade Airflow as well.
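For anyone wanting the general shape of such a watcher without digging through the linked manifest, below is a minimal sketch of that kind of sync loop. The repo URL, branch, DAGs path, and interval are placeholders, not values taken from the GitLab setup.

```python
# Minimal "git watcher" loop: keep a shared DAGs volume in sync with a repo.
# REPO_URL, BRANCH, DAGS_DIR, and the interval are illustrative placeholders.
import subprocess
import time
from pathlib import Path

REPO_URL = "https://gitlab.example.com/data/airflow-dags.git"  # placeholder
BRANCH = "master"
DAGS_DIR = Path("/opt/airflow/dags")  # volume shared with scheduler/webserver
SYNC_INTERVAL_SECONDS = 60


def sync() -> None:
    if (DAGS_DIR / ".git").exists():
        # Discard local drift and fast-forward to the remote branch.
        subprocess.run(["git", "-C", str(DAGS_DIR), "fetch", "origin", BRANCH], check=True)
        subprocess.run(["git", "-C", str(DAGS_DIR), "reset", "--hard", f"origin/{BRANCH}"], check=True)
    else:
        subprocess.run(["git", "clone", "--branch", BRANCH, REPO_URL, str(DAGS_DIR)], check=True)


if __name__ == "__main__":
    while True:
        try:
            sync()
        except subprocess.CalledProcessError as exc:
            print(f"sync failed: {exc}", flush=True)  # keep looping; next tick may succeed
        time.sleep(SYNC_INTERVAL_SECONDS)
```

The kubernetes/git-sync sidecar image does essentially this, if you would rather not maintain the loop yourself.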
f
I have a concept in mind, but don't know enough about Airflow to know whether it would work or not. Basically, I need to know if it is possible to have one Airflow orchestrator serve multiple projects that are kept in separate git repos. That is the end goal: to separate projects into their own git repos so that they are not so tightly coupled together, but have them use a central orchestrator. I suppose the key is in [here](https://meltano.com/docs/orchestration.html#using-an-existing-airflow-installation) somewhere. I think I need a way to do two things: 1) schedule a job "offline" and inject that into Airflow, and 2) create the DAGs for the actual job and share them with Airflow. Unless I'm missing something, all we are specifying is the name, extractor, loader, and interval/schedule for a job (at a minimum). I need each project to add and manage the jobs it knows about itself (in a specific pipeline project), but not overwrite or delete jobs that other projects create/manage. Is that possible?
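To make the "name, extractor, loader, interval" point concrete, here is a minimal sketch of a schedule-driven DAG generator in the spirit of the linked docs. The `meltano schedule list --format=json` call, the shape of its JSON output, and the `meltano elt ... --job_id=...` invocation are assumptions to verify against your Meltano version, as are the DAG defaults.

```python
# Sketch: generate one Airflow DAG per Meltano schedule.
# Assumes this file sits in the Airflow DAGs folder and that `meltano` can be
# run inside PROJECT_ROOT on the worker; verify CLI flags for your version.
import json
import subprocess
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

PROJECT_ROOT = "/project"  # assumption: the Meltano project path on the worker

# Assumption: the command returns a flat list of schedule dicts with
# name/extractor/loader/interval keys; newer Meltano versions nest this output.
result = subprocess.run(
    ["meltano", "schedule", "list", "--format=json"],
    cwd=PROJECT_ROOT,
    capture_output=True,
    text=True,
    check=True,
)

for schedule in json.loads(result.stdout):
    dag_id = f"meltano_{schedule['name']}"
    dag = DAG(
        dag_id,
        schedule_interval=schedule["interval"],  # e.g. "@daily" or a cron string
        start_date=datetime(2021, 1, 1),         # illustrative default
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    )
    BashOperator(
        task_id="extract_load",
        bash_command=(
            f"cd {PROJECT_ROOT} && "
            f"meltano elt {schedule['extractor']} {schedule['loader']} "
            f"--job_id={schedule['name']}"
        ),
        dag=dag,
    )
    # Airflow discovers DAGs via module-level globals in DAGs-folder files.
    globals()[dag_id] = dag
```

With something like this in the DAGs folder, part (1) becomes committing a schedule to each project's meltano.yml (or adding it via the `meltano schedule` CLI); the generator picks it up on the next parse.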
I think I have this mostly working. Now I just need to update /project/orchestrate/dags and create a DAG for each project (instead of pulling them from the single meltano.yml); a rough sketch of that is after this message. I'll probably use a watcher sidecar like GitLab. As far as the "known good build" goes: there is nothing in the DAG that says anything about build versions. Your individual project repos will need to match up with what they name the tap and target, and the schedule, but other than that they can be versioned independently. I don't yet have the ability to select a specific commit/branch/tag, but there is an easy way to add that: I get secrets from Vault, including the GitLab deploy key to be able to pull the repo, before pulling, so I can just include a ref variable there to pull a specific ref. The only thing Airflow is doing is kicking off the task (and checking for success or not). It passes:
```
INFO - Running: ['airflow', 'tasks', 'run', 'meltano_my_job_id', 'extract_load', '2021-11-30T22:25:35.373904+00:00', '--job-id', '807', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/meltano.py', '--cfg-path', '/tmp/tmpuriwn0ko', '--error-file', '/tmp/tmphxdcks1b']
```
That is all that is passed down to the meltano invocation in the container. I just modified it to pull the appropriate repo for the job_id, so things can stay separate and we don't need to reboot the core infrastructure every time a job is added. I just need the watcher to update the dags directory on the Airflow scheduler pod and it should be done.
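A minimal sketch of the shape described above: one generated DAG per job, with the task cloning the owning project repo at a pinned ref, installing plugins, and running the job. The registry dict, repo URL, ref, and tap/target names are all hypothetical, and the Vault deploy-key retrieval is deliberately left out.

```python
# Sketch: one DAG per project-owned job; the task clones that project at a
# pinned ref, installs plugins, and runs the job, so adding or changing a job
# never requires rebuilding the core Airflow image.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical registry of which repo/ref owns which job and how it is scheduled.
PROJECTS = {
    "my_job_id": {
        "repo": "git@gitlab.example.com:data/sales-pipeline.git",  # placeholder
        "ref": "v1.4.2",            # pin a known-good tag per project
        "interval": "@daily",
        "extractor": "tap-example",
        "loader": "target-example",
    },
}

for job_id, cfg in PROJECTS.items():
    dag_id = f"meltano_{job_id}"
    dag = DAG(
        dag_id,
        schedule_interval=cfg["interval"],
        start_date=datetime(2021, 1, 1),
        catchup=False,
    )
    BashOperator(
        task_id="extract_load",
        # The GitLab deploy key would be fetched from Vault before this point; omitted here.
        bash_command=(
            "set -euo pipefail && "
            f"git clone --depth 1 --branch {cfg['ref']} {cfg['repo']} /tmp/{job_id} && "
            f"cd /tmp/{job_id} && "
            "meltano install && "
            f"meltano elt {cfg['extractor']} {cfg['loader']} --job_id={job_id}"
        ),
        dag=dag,
    )
    globals()[dag_id] = dag
```

Because each registry entry pins its own ref, a project can keep the "known good build" property by only bumping its ref after its own CI passes, without the central Airflow image ever being rebuilt.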