# infra-deployment
ken_payne
[thread from Office Hours] This is at least the 3rd mention of 'dynamic pipelines' that I have seen in recent weeks, and I think it warrants documenting as an architecture pattern 🤔 Will open an issue, but for @marc_garcia_sastre and @keegan_mccallum I will describe the pattern to the extent I know about it below 🧵
The high-level use case is a large number of similar pipelines, corresponding to tenants in a multi-tenant environment. As I understand it, the simplest solution from a Meltano perspective is the following:
• A Meltano project containing the typical (per-tenant) pipeline configs. This may be a single extractor and a single loader - e.g. Postgres to Redshift - or a collection of pipelines common to all tenants.
• Common configuration (e.g. a common Postgres username, select criteria, etc.) can be added using the CLI or by editing `meltano.yml` directly.
• Secrets and tenant-specific config are injected using environment variables. (Pro tip - run `meltano config <plugin> list` to get the available config variables and their expected env var names.)
• Custom orchestrator code (similar to the default Airflow DAG Generator) does the work of running `n` instances of `meltano elt`, setting the environment variables for each of the `n` tenants - see the sketch after this list.
• Note: be sure to generate and set a `MELTANO_JOB_ID` per tenant so that `meltano elt` bookmarks are correctly handled.
As @visch points out in Office Hours, this pattern pushes most of the complexity over to the orchestrator, which has to understand what config to inject for which tenant (in addition to scheduling).
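To make the orchestrator step concrete, here is a minimal sketch in plain Python. The tenant names, settings, and plugin names below are all hypothetical - use `meltano config <plugin> list` to find the real env var names for your plugins:

```python
import os
import subprocess

# Hypothetical per-tenant config; in practice this would come from a
# secrets manager or database, not hard-coded values.
TENANTS = {
    "acme": {"TAP_POSTGRES_HOST": "acme.db.internal"},
    "globex": {"TAP_POSTGRES_HOST": "globex.db.internal"},
}

for tenant, settings in TENANTS.items():
    env = {
        **os.environ,  # keep PATH etc. so the meltano CLI resolves
        **settings,
        # A distinct job ID per tenant so elt bookmarks don't collide.
        "MELTANO_JOB_ID": f"postgres-to-redshift-{tenant}",
    }
    subprocess.run(
        ["meltano", "elt", "tap-postgres", "target-redshift"],
        env=env,
        check=True,  # fail loudly if a tenant's pipeline errors
    )
```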
marc_garcia_sastre
@ken_payne this is amazing, just what we were looking for 👍
visch
Pro Tip I forgot all about that one! The config list ❤️ Thanks for diving into the orchestrator secrets thing. I've used and rolled my own orchestrator / secret setups in tools that no one here uses, so I was trying to see if there's some way of doing this that I wasn't aware of. The newest orchestrators with secret managers handle most of what I'm talking about without too much grief and let you stay secure (no debug runs every time to get env variables 😄 - love the poor man's advice though)
ken_payne
You're both welcome, hope it helps! For the orchestrator piece, the simplest approach in Airflow is to iterate over a list of tenants, creating a DAG per pipeline per tenant (our default example iterates over a list of schedules). The per-tenant config can be managed in a number of ways - Airflow has its own secrets backend that may be useful, or config can be retrieved from services like AWS Parameter Store, HashiCorp Vault etc., or even a custom database, by your Airflow DAG generator. Note: Airflow refreshes its 'DAG Bag' quite often (every 5 mins by default 😅), so be a little careful with load on remote secret systems/APIs (especially if they charge per API request!). The refresh interval is configurable.
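A minimal sketch of that generator loop, assuming Airflow 2.x (the tenant list and plugin names are hypothetical; per-tenant secrets would be merged into `env` from whichever backend you choose):

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical tenant list; could be fetched from Parameter Store, Vault,
# etc. - but remember this module is re-parsed on every DAG Bag refresh.
TENANTS = ["acme", "globex"]

for tenant in TENANTS:
    dag_id = f"meltano_elt_{tenant}"
    with DAG(dag_id, start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        BashOperator(
            task_id="postgres_to_redshift",
            bash_command="meltano elt tap-postgres target-redshift",
            env={
                **os.environ,
                "MELTANO_JOB_ID": f"postgres-to-redshift-{tenant}",
                # ...plus this tenant's secrets/config env vars here.
            },
        )
    globals()[dag_id] = dag  # expose each DAG so Airflow discovers it
```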
For the advanced use case, where pipelines are not common to all tenants but instead a subset of pipelines applies depending on tiers/services used/some custom app logic, the same approach can be used with an additional step by the DAG generator (or equivalent) to look up the appropriate pipelines before iterating over them to create DAGs. A 'pipeline' is simply an invocation of `meltano elt <extractor> <loader> ...`, so as long as the full corpus of supported plugins is correctly added to your project, the orchestrator can mix and match them without issue 🙂
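As a sketch of that extra lookup step (the tier mapping below is entirely made up; in practice it might query your app's database or a config service):

```python
# Hypothetical tenant -> pipelines lookup, e.g. resolved from tiers or
# enabled services. Each (extractor, loader) pair must exist in the
# Meltano project for the orchestrator to mix and match them.
TENANT_PIPELINES = {
    "acme": [("tap-postgres", "target-redshift"),
             ("tap-stripe", "target-redshift")],
    "globex": [("tap-postgres", "target-redshift")],  # lower tier
}

for tenant, pipelines in TENANT_PIPELINES.items():
    for extractor, loader in pipelines:
        # Each pair becomes one `meltano elt` invocation (or one DAG/task).
        print(f"meltano elt {extractor} {loader}  # tenant={tenant}")
```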
One final note on Airflow - you can choose to either create a DAG per pipeline or per tenant (with multiple parallel tasks). You would choose the former if per-tenant pipelines need to be executed on different schedules (as schedule is configured at DAG level), or the latter if all pipelines for a given tenant can run on the same schedule (i.e. as subtasks in the same DAG) 😅 Hope that is clear.
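For completeness, a sketch of the second option - one DAG per tenant, with a task per pipeline sharing that tenant's schedule (again assuming Airflow 2.x and a hypothetical `TENANT_PIPELINES` mapping like the one above):

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

TENANT_PIPELINES = {
    "acme": [("tap-postgres", "target-redshift"),
             ("tap-stripe", "target-redshift")],
}

for tenant, pipelines in TENANT_PIPELINES.items():
    dag_id = f"meltano_{tenant}"
    with DAG(dag_id, start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        for extractor, loader in pipelines:
            # Tasks have no dependencies on each other, so Airflow can
            # run them in parallel within the tenant's DAG.
            BashOperator(
                task_id=f"{extractor}_to_{loader}".replace("-", "_"),
                bash_command=f"meltano elt {extractor} {loader}",
                env={**os.environ,
                     "MELTANO_JOB_ID": f"{tenant}-{extractor}-{loader}"},
            )
    globals()[dag_id] = dag
```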
e
Thanks @ken_payne! Very helpful @tom_mcgrail