# best-practices
Pär
A q regarding how you guys set up a containerized production setup. We will have a fairly large number of pipelines running (in the range of 100-200). Most of these update daily, some less frequently. All pipelines tap from either MySQL or Postgres databases that are distributed around the world, and all send data to our data warehouse in Google BigQuery. As some of the databases we tap are located in secured on-prem facilities, we envision that the taps (and pipelines) need to run inside those facilities in such cases. Therefore I foresee that we will deploy a number of containers running the pipelines in a distributed fashion. Now, for the questions:
• Is it possible to have distributed pipeline runners all connected to a single database?
• Is there a way to have a read-only UI instance running centrally that shows status and logs of the distributed pipelines?
ken_payne
Hey Pär, sounds like a really interesting use case! Thanks for sharing.

On the meltano side, it is absolutely possible to have distributed pipeline runners all connected to a centralised database, provided they have connectivity. This could be a database with a public IP (not recommended), via an SSH bastion host (and a tunnel on each worker, similar to this example) or by VPN. As you have a mix of cloud and on-prem, you may want to use a combination of connectivity approaches to make use of convenient peering features on the cloud platforms.

It is also possible to run a deployed project in read-only mode. However, I would say at this point that the meltano UI is not yet as fully featured as you might like for comprehensive pipeline logging and diagnostics, and solving connectivity to the meltano db is only part of the challenge. Meltano relies on an 'orchestrator' that actually schedules meltano CLI commands to run. Airflow is currently supported, with Dagster in the works, and both have their own considerations. I am less familiar with Dagster, but Airflow at least assumes a centralised deployment. However, using platforms like celery or kubernetes (and the KubernetesPodOperator) on Airflow, it would be possible to have a single meltano project with a central airflow deployment and scheduler assigning tasks to workers in each distributed environment, provided connectivity was available either to the queue (for distributed celery workers) or to the master nodes of distributed kubernetes clusters (in order to receive tasks from the scheduler). Most of the work to accomplish this would be in a customised 'DAG generator' in your meltano project folder. Details here and an example default DAG generator here.

I think it would be great to open an issue with your requirements - there are one or two already for documenting a 'standard' deployment, but none that I could find that specifically addresses your distributed and mixed cloud/on-prem requirements 🙂
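(Editor's note: a minimal sketch of the bastion-host option mentioned above. The hostnames, user and ports are placeholders, not details from this thread; each distributed worker opens a tunnel and then points its meltano project at the tunnelled system database.)

# open a background tunnel from the worker to the central Postgres via a bastion host
# (bastion.example.com, db.internal and local port 5433 are hypothetical)
ssh -f -N -L 5433:db.internal:5432 meltano@bastion.example.com
# the worker's project then sets database_uri to point at 127.0.0.1:5433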
Pär
Hi @ken_payne 👋 Thanks for the answers - exactly what I wanted to hear. As for the orchestration, we will most likely initially take the approach of solving this with local cron jobs. It is low tech, but would work as a start. For the central meltano db, I've already set up a Google Cloud SQL database for this, which you connect to using the Google Cloud SQL Proxy. This works well and it's straightforward to allow access using a service account for meltano (btw, I run the cloud_sql_proxy in a container next to meltano using docker compose). One gotcha I ran into is that meltano ui requires a local project folder. I have up until now simply used a Dockerfile that sets up the base project (which is shared between all meltano runners as well as the UI). However, the runners make slight changes at runtime (I do a series of meltano select commands when starting the container, before starting the pipeline with meltano elt), so the project directories are not entirely identical. Will this be a problem?
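(Editor's note: a rough sketch of that proxy-sidecar arrangement, expressed as plain docker commands rather than Pär's actual compose file. The image tag, instance connection name, network name, image name and key path are illustrative assumptions.)

# shared network so the runner can reach the proxy by name (all names are hypothetical)
docker network create meltano-net

# Cloud SQL Auth Proxy sidecar, authenticating with a service account key file
docker run -d --name cloud-sql-proxy --network meltano-net \
  -v "$PWD/sa-key.json:/config/sa-key.json:ro" \
  gcr.io/cloudsql-docker/gce-proxy:1.33.2 \
  /cloud_sql_proxy -instances=my-gcp-project:europe-west1:meltano-db=tcp:0.0.0.0:5432 \
  -credential_file=/config/sa-key.json

# meltano runner pointing its system database at the proxy
# ($MELTANO_DB_PASSWORD and my-meltano-image are placeholders)
docker run --rm --network meltano-net \
  -e MELTANO_DATABASE_URI="postgresql://meltano:$MELTANO_DB_PASSWORD@cloud-sql-proxy:5432/meltano" \
  my-meltano-image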
ken_payne
Ah interesting. Are your select commands dynamic? Do they change from one run to the next? My only concern with doing that would be with state management. Some well-known targets/loaders have a bug whereby they do not emit state for unselected streams, even if state is present for those streams in the received state message 🤦‍♂️ This would only be a problem if you were dynamically removing and re-adding streams on a regular basis, and only with some targets. In general, the approach I have used in the past is to 'bake' select criteria into our meltano.yml project file under the select extra key in extractor plugin config, making use of * wildcards to catch streams matching specific criteria (schemas, prefixes etc.). This key is documented here. Each deployment has a commit id used as the docker tag, and changes are pushed through CI/CD with new commit ids. All that runs in production is a meltano run ... command. Would be interested if your use case doesn't fit this model 🙂
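(Editor's note: a minimal sketch of the 'baked in' approach described above, assuming the select extra can be set via the CLI with the _select key so it lands in meltano.yml. The stream patterns are taken from the entrypoint script later in this thread.)

# write select criteria into meltano.yml via the extractor's select extra
meltano config tap-mysql set _select '["*-Event.*", "*-EventDataKey.*"]'

# sanity-check which streams and attributes the criteria actually match
meltano select tap-mysql --list --all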
Edgar
It is also possible to use environment variables to change settings at runtime, like:
TAP_MYSQL__SELECT='["my_stream.*"]' meltano elt tap-mysql target-bigquery
Pär
No, they don't change between runs, except when we occasionally realize that we need to load more tables and/or columns into the warehouse. The reason I decided to put the select statements into the script I run as entrypoint was out of laziness... On each instance we have a large number of tables, each consisting of what feels like a gazillion columns. The instances are similar, but the databases differ ever so slightly. I didn't feel like writing selects that would be generic enough to cover all variants, so I settled on a basic set using wildcards to cover the differences, and then manually tune each tap to optimize. But hey, I'm a total newbie to meltano, so I may well be doing this the wrong way. The way I've implemented it up until now is that I have a Dockerfile that essentially does this:
FROM meltano/meltano:latest
ARG PROJECT_ID

WORKDIR /meltano

# Create the base project and configure the extractor and loader at build time
RUN meltano init $PROJECT_ID && cd $PROJECT_ID && \
    meltano add extractor tap-mysql && \
    meltano config tap-mysql set host cloud-sql-proxy && \
    meltano config tap-mysql set user meltano && \
    meltano config tap-mysql set _metadata "*" replication-method LOG_BASED && \
    meltano add loader target-bigquery && \
    DATASET_ID=$(echo $PROJECT_ID | tr '-' '_') && \
    meltano config target-bigquery set dataset_id $DATASET_ID && \
    meltano config target-bigquery set project_id data-analytics && \
    meltano config target-bigquery set location EU

COPY entrypoint.sh .
RUN chmod u+x entrypoint.sh

# Persist the build arg so the entrypoint can find the project directory at run time
ENV PROJECT_ID=$PROJECT_ID
ENTRYPOINT [ "./entrypoint.sh" ]
And an entrypoint.sh that looks like this:
#!/bin/sh
set -x

if [ -z "$PROJECT_ID" ]; then
    echo "FATAL: Project is not defined, exiting..."
    exit 1
fi

cd "$PROJECT_ID"
if [ $? -ne 0 ]; then
    echo "FATAL: Cannot change into project directory, exiting..."
    exit 1
fi

echo "Starting ETL pipeline $PIPELINE_NAME..."
cd "$PIPELINE_NAME"

# Apply stream selection at run time, then kick off the pipeline
meltano select tap-mysql "*-Event" "*"
meltano select tap-mysql "*-EventDataKey" "*"

meltano elt tap-mysql target-bigquery --job_id="$PIPELINE_NAME"

exec "$@"
#EOF
Again, as a total n00b, I don't know whether this follows best practices. I'm eager to learn and hear your input.
This means that the meltano project is being "reset" for each run, but since the meltano database points to the external Postgres one, state is preserved. Right? 🤔
ken_payne
Yes, so this is effectively configuring your plugins at docker build time and then adding some extra config at run time. There is no reason I can think of not to do that, though it is maybe a bit unusual 🙂 Those cli commands result in changes to your meltano.yml which, provided secrets are injected using env vars (rather than hard coded), is safe to check in to version control. If nothing else, that does mean you can inspect and modify the yml without needing to connect to a running docker image. But as you are mostly adding new streams, you shouldn't have any trouble with state clearing, so all should just work as expected for you 👍
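(Editor's note: to illustrate the env-var secrets injection mentioned above. The env var maps onto tap-mysql's password setting; the secrets mount path is a hypothetical example.)

# inject the database password at container run time instead of hard-coding it in meltano.yml;
# /run/secrets/tap_mysql_password is a hypothetical secrets mount
export TAP_MYSQL_PASSWORD="$(cat /run/secrets/tap_mysql_password)"
meltano elt tap-mysql target-bigquery --job_id="$PIPELINE_NAME"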
Pär
Interesting. Since this is the "unusual" approach, what would a more common way of doing it be? Perhaps:
1. Storing the meltano.yml file in source control
2. Retrieving that as part of the docker build and finishing off the build with a meltano install?
3. Issuing the meltano elt command when starting the container
ken_payne
Yep, that's exactly right 🙂 As Edgar mentioned, settings can be overridden using env vars, which is useful for secrets injection and any other dynamic settings you wish to configure at container run time.
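(Editor's note: a minimal sketch of that more common flow, assuming the project lives in a git repository; the repository URL and paths are placeholders. In a Dockerfile these would typically be COPY/RUN steps.)

# bring the version-controlled project (meltano.yml included) into the image
# and install the declared plugins; the URL is a hypothetical placeholder
git clone https://example.com/acme/meltano-project.git /project
cd /project && meltano install

At container start, the entrypoint then only needs the meltano elt invocation shown earlier, with any secrets supplied as environment variables.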