# infra-deployment
ken_payne
[thread from Office Hours] This is at least the 3rd mention of 'dynamic pipelines' that I have seen in recent weeks, and I think it warrants documenting as an architecture pattern 🤔 Will open an issue, but for @marc_garcia_sastre and @keegan_mccallum I will describe the pattern to the extent I know about it below 🧵
The high-level use case is a large number of similar pipelines, corresponding to tenants in a multi-tenant environment. As I understand it, the simplest solution from a Meltano perspective is the following:
• A Meltano project containing the typical (per-tenant) pipeline configs. This may be a single extractor and a single loader - e.g. Postgres to Redshift - or a collection of pipelines common to all tenants.
• Common configuration (e.g. a common Postgres username, select criteria, etc.) can be added using the CLI or by editing `meltano.yml` directly.
• Secrets and tenant-specific config are injected using environment variables. (Pro tip - run `meltano config <plugin> list` to get the available config variables and their expected env var names.)
• Custom orchestrator code (similar to the default Airflow DAG Generator) does the work of running `n` instances of `meltano elt`, setting the environment variables for each of the `n` tenants - see the sketch after this list.
• Note: be sure to generate and set a `MELTANO_JOB_ID` per tenant so that `meltano elt` bookmarks are correctly handled.
As @visch points out in Office Hours, this pattern pushes most of the complexity over to the orchestrator, which has to understand what config to inject for which tenant (in addition to scheduling).
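To make the orchestrator step concrete, here is a minimal sketch in plain Python. The tenant names, settings, and plugin names below are all hypothetical - use `meltano config <plugin> list` to find the real env var names for your plugins:

```python
import os
import subprocess

# Hypothetical per-tenant config; in practice this would come from a
# secrets manager or database, not hard-coded values.
TENANTS = {
    "acme": {"TAP_POSTGRES_HOST": "acme.db.internal"},
    "globex": {"TAP_POSTGRES_HOST": "globex.db.internal"},
}

for tenant, settings in TENANTS.items():
    env = {
        **os.environ,  # keep PATH etc. so the meltano CLI resolves
        **settings,
        # A distinct job ID per tenant so elt bookmarks don't collide.
        "MELTANO_JOB_ID": f"postgres-to-redshift-{tenant}",
    }
    subprocess.run(
        ["meltano", "elt", "tap-postgres", "target-redshift"],
        env=env,
        check=True,  # fail loudly if a tenant's pipeline errors
    )
```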
marc_garcia_sastre
@ken_payne this is amazing, just what we were looking for 👍
visch
Pro Tip I forgot all about that one! The config list ❤️ Thanks for diving into the orchestrator secrets thing. I've used and rolled my own orchestrator / secret setups in tools that no one here uses, so I was trying to see if there's some way of doing this that I wasn't aware of. The newest orchestrators with secret managers handle most of what I'm talking about without too much grief and let you stay secure (no debug runs every time to get env variables 😄 - love the poor man's advice though)
ken_payne
You're both welcome, hope it helps! For the orchestrator piece, the simplest approach in Airflow is to iterate over a list of tenants, creating a DAG per pipeline per tenant (our default example iterates over a list of schedules). The per-tenant config can be managed in a number of ways - Airflow has its own secrets backend that may be useful, or config can be retrieved from services like AWS Parameter Store, HashiCorp Vault etc., or even a custom database, by your Airflow DAG generator. Note: Airflow refreshes its 'DAG Bag' quite often (every 5 mins by default 😅), so be a little careful with load on remote secret systems/APIs (especially if they charge per API request!). The refresh interval is configurable.
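A minimal sketch of that generator loop, assuming Airflow 2.x (the tenant list and plugin names are hypothetical; per-tenant secrets would be merged into `env` from whichever backend you choose):

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical tenant list; could be fetched from Parameter Store, Vault,
# etc. - but remember this module is re-parsed on every DAG Bag refresh.
TENANTS = ["acme", "globex"]

for tenant in TENANTS:
    dag_id = f"meltano_elt_{tenant}"
    with DAG(dag_id, start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        BashOperator(
            task_id="postgres_to_redshift",
            bash_command="meltano elt tap-postgres target-redshift",
            env={
                **os.environ,
                "MELTANO_JOB_ID": f"postgres-to-redshift-{tenant}",
                # ...plus this tenant's secrets/config env vars here.
            },
        )
    globals()[dag_id] = dag  # expose each DAG so Airflow discovers it
```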
For the advanced use case, where pipelines are not common to all tenants but instead a subset of pipelines applies depending on tiers/services used/some custom app logic, the same approach can be used with an additional step by the DAG generator (or equivalent) to look up the appropriate pipelines before iterating over them to create DAGs. A 'pipeline' is simply an invocation of `meltano elt <extractor> <loader> ...`, so as long as the full corpus of supported plugins is correctly added to your project, the orchestrator can mix and match them without issue 🙂
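As a sketch of that extra lookup step (the tier mapping below is entirely made up; in practice it might query your app's database or a config service):

```python
# Hypothetical tenant -> pipelines lookup, e.g. resolved from tiers or
# enabled services. Each (extractor, loader) pair must exist in the
# Meltano project for the orchestrator to mix and match them.
TENANT_PIPELINES = {
    "acme": [("tap-postgres", "target-redshift"),
             ("tap-stripe", "target-redshift")],
    "globex": [("tap-postgres", "target-redshift")],  # lower tier
}

for tenant, pipelines in TENANT_PIPELINES.items():
    for extractor, loader in pipelines:
        # Each pair becomes one `meltano elt` invocation (or one DAG/task).
        print(f"meltano elt {extractor} {loader}  # tenant={tenant}")
```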
One final note on Airflow - you can choose to either create a DAG per pipeline or per tenant (with multiple parallel tasks). You would choose the former if per-tenant pipelines need to be executed on different schedules (as schedule is configured at DAG level), or the latter if all pipelines for a given tenant can run on the same schedule (i.e. as subtasks in the same DAG) 😅 Hope that is clear.
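For completeness, a sketch of the second option - one DAG per tenant, with a task per pipeline sharing that tenant's schedule (again assuming Airflow 2.x and a hypothetical `TENANT_PIPELINES` mapping like the one above):

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

TENANT_PIPELINES = {
    "acme": [("tap-postgres", "target-redshift"),
             ("tap-stripe", "target-redshift")],
}

for tenant, pipelines in TENANT_PIPELINES.items():
    dag_id = f"meltano_{tenant}"
    with DAG(dag_id, start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        for extractor, loader in pipelines:
            # Tasks have no dependencies on each other, so Airflow can
            # run them in parallel within the tenant's DAG.
            BashOperator(
                task_id=f"{extractor}_to_{loader}".replace("-", "_"),
                bash_command=f"meltano elt {extractor} {loader}",
                env={**os.environ,
                     "MELTANO_JOB_ID": f"{tenant}-{extractor}-{loader}"},
            )
    globals()[dag_id] = dag
```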
e
Thanks @ken_payne! Very helpful @tom_mcgrail