# best-practices
Pär
A q regarding how you guys set up a containerized production setup. We will have a fairly large number of pipelines running (in the range of 100-200). Most of these update daily, some less frequently. All pipelines tap from either MySQL or Postgres databases that are distributed around the world, and all send data to our data warehouse in Google BigQuery. As some of the databases we tap are located in secured on-prem facilities, we envision that the taps (and pipelines) need to run inside those facilities in such cases. Therefore I foresee that we will deploy a number of containers running the pipelines in a distributed fashion. Now, for the questions:
• Is it possible to have distributed pipeline runners all connected to a single database?
• Is there a way to have a read-only UI instance running centrally that shows status and logs of the distributed pipelines?
ken_payne
Hey Pär, sounds like a really interesting use case! Thanks for sharing.

On the meltano side, it is absolutely possible to have distributed pipeline runners all connected to a centralised database, provided they have connectivity. This could be a database with a public IP (not recommended), via an SSH bastion host (and a tunnel on each worker, similar to this example) or by VPN. As you have a mix of cloud and on-prem, you may want to use a combination of connectivity approaches to make use of convenient peering features on the cloud platforms.

It is also possible to run a deployed project in read-only mode. However, I would say at this point that the meltano UI is not yet as fully featured as you might like for comprehensive pipeline logging and diagnostics, and solving connectivity to the meltano db is only part of the challenge. Meltano relies on an 'orchestrator' that actually schedules meltano CLI commands to run. Airflow is currently supported, with Dagster in the works, and both have their own considerations. I am less familiar with Dagster, but Airflow at least assumes a centralised deployment. However, using platforms like celery or kubernetes (and the KubernetesPodOperator) on Airflow, it would be possible to have a single meltano project with a central airflow deployment and scheduler assigning tasks to workers in each distributed environment, provided connectivity was available either to the queue (for distributed celery workers) or to the master nodes of distributed kubernetes clusters (in order to receive tasks from the scheduler). Most of the work to accomplish this would be in a customised 'DAG generator' in your meltano project folder. Details here and an example default DAG generator here.

I think it would be great to open an issue with your requirements - there are one or two already for documenting a 'standard' deployment, but none that I could find that specifically addresses your distributed and mixed cloud/on-prem requirements 🙂
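(Editor's note: a minimal sketch of the bastion-host option mentioned above. The hostnames, user and ports are placeholders, not details from this thread; each distributed worker opens a tunnel and then points its meltano project at the tunnelled system database.)

# open a background tunnel from the worker to the central Postgres via a bastion host
# (bastion.example.com, db.internal and local port 5433 are hypothetical)
ssh -f -N -L 5433:db.internal:5432 meltano@bastion.example.com
# the worker's project then sets database_uri to point at 127.0.0.1:5433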
Pär
Hi @ken_payne 👋 Thanks for the answers - exactly what I wanted to hear. As for the orchestration, we will most likely initially take the approach of solving this with local cron jobs. It is low tech, but would work as a start. For the central meltano db, I've already set up a Google Cloud SQL database for this, which you connect to using the Google Cloud SQL Proxy. This works well and it's straightforward to allow access using a service account for meltano (btw, I run the cloud_sql_proxy in a container next to meltano using docker compose). One gotcha I ran into is that meltano ui requires a local project folder. I have up until now simply used a Dockerfile that sets up the base project (which is shared between all meltano runners as well as the UI). However, the runners make slight changes at runtime (I do a series of meltano select commands when starting the container, before starting the pipeline with meltano elt), so the project directories are not entirely identical. Will this be a problem?
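(Editor's note: a rough sketch of that proxy-sidecar arrangement, expressed as plain docker commands rather than Pär's actual compose file. The image tag, instance connection name, network name, image name and key path are illustrative assumptions.)

# shared network so the runner can reach the proxy by name (all names are hypothetical)
docker network create meltano-net

# Cloud SQL Auth Proxy sidecar, authenticating with a service account key file
docker run -d --name cloud-sql-proxy --network meltano-net \
  -v "$PWD/sa-key.json:/config/sa-key.json:ro" \
  gcr.io/cloudsql-docker/gce-proxy:1.33.2 \
  /cloud_sql_proxy -instances=my-gcp-project:europe-west1:meltano-db=tcp:0.0.0.0:5432 \
  -credential_file=/config/sa-key.json

# meltano runner pointing its system database at the proxy
# ($MELTANO_DB_PASSWORD and my-meltano-image are placeholders)
docker run --rm --network meltano-net \
  -e MELTANO_DATABASE_URI="postgresql://meltano:$MELTANO_DB_PASSWORD@cloud-sql-proxy:5432/meltano" \
  my-meltano-image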
ken_payne
Ah interesting. Are your select commands dynamic? Do they change from one run to the next? My only concern with doing that would be with state management. Some well-known targets/loaders have a bug whereby they do not emit state for unselected streams, even if state is present for those streams in the received state message 🤦‍♂️ This would only be a problem if you were dynamically removing and re-adding streams on a regular basis, and only with some targets. In general, the approach I have used in the past is to 'bake' select criteria into our meltano.yml project file under the select extra key in extractor plugin config, making use of * wildcards to catch streams matching specific criteria (schemas, prefixes etc.). This key is documented here. Each deployment has a commit id used as the docker tag, and changes are pushed through CI/CD with new commit ids. All that runs in production is a meltano run ... command. Would be interested if your use case doesn't fit this model 🙂
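(Editor's note: a minimal sketch of the 'baked in' approach described above, assuming the select extra can be set via the CLI with the _select key so it lands in meltano.yml. The stream patterns are taken from the entrypoint script later in this thread.)

# write select criteria into meltano.yml via the extractor's select extra
meltano config tap-mysql set _select '["*-Event.*", "*-EventDataKey.*"]'

# sanity-check which streams and attributes the criteria actually match
meltano select tap-mysql --list --all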
Edgar
It is also possible to use environment variables to change settings at runtime, like:
TAP_MYSQL__SELECT='["my_stream.*"]' meltano elt tap-mysql target-bigquery
Pär
No, they don't change between runs, except when we occasionally realize that we need to load more tables and/or columns into the warehouse. The reason I decided to put the select statements into the script I run as entrypoint was out of laziness... On each instance we have a large number of tables, each consisting of what feels like a gazillion columns. The instances are similar, but the databases differ ever so slightly. I didn't feel like writing selects that would be generic enough to cover all variants, so I settled on a basic set using wildcards to cover the differences, and then manually tune each tap to optimize. But hey, I'm a total newbie to meltano, so I may well be doing this the wrong way. The way I've implemented it up until now is that I have a Dockerfile that essentially does this:
FROM meltano/meltano:latest
ARG PROJECT_ID

WORKDIR /meltano

# Create the base project and configure the extractor and loader at build time
RUN meltano init $PROJECT_ID && cd $PROJECT_ID && \
    meltano add extractor tap-mysql && \
    meltano config tap-mysql set host cloud-sql-proxy && \
    meltano config tap-mysql set user meltano && \
    meltano config tap-mysql set _metadata "*" replication-method LOG_BASED && \
    meltano add loader target-bigquery && \
    DATASET_ID=$(echo $PROJECT_ID | tr '-' '_') && \
    meltano config target-bigquery set dataset_id $DATASET_ID && \
    meltano config target-bigquery set project_id data-analytics && \
    meltano config target-bigquery set location EU

COPY entrypoint.sh .
RUN chmod u+x entrypoint.sh

# Persist the build arg so the entrypoint can find the project directory at run time
ENV PROJECT_ID=$PROJECT_ID
ENTRYPOINT [ "./entrypoint.sh" ]
And an entrypoint.sh that looks like this:
#!/bin/sh
set -x

if [ -z "$PROJECT_ID" ]; then
    echo "FATAL: Project is not defined, exiting..."
    exit 1
fi

cd "$PROJECT_ID"
if [ $? -ne 0 ]; then
    echo "FATAL: Cannot change into project directory, exiting..."
    exit 1
fi

echo "Starting ETL pipeline $PIPELINE_NAME..."
cd "$PIPELINE_NAME"

# Apply stream selection at run time, then kick off the pipeline
meltano select tap-mysql "*-Event" "*"
meltano select tap-mysql "*-EventDataKey" "*"

meltano elt tap-mysql target-bigquery --job_id="$PIPELINE_NAME"

exec "$@"
#EOF
Again, as a total n00b, I don't know whether this follows best practices. I'm eager to learn and hear your input.
This means that the meltano project is being "reset" for each run, but since the meltano database points to the external Postgres one, state is preserved. Right? 🤔
ken_payne
Yes, so this is effectively configuring your plugins at docker build time and then adding some extra config at run time. There is no reason I can think of not to do that, though it is maybe a bit unusual 🙂 Those cli commands result in changes to your meltano.yml which, provided secrets are injected using env vars (rather than hard coded), is safe to check in to version control. If nothing else, that does mean you can inspect and modify the yml without needing to connect to a running docker image. But as you are mostly adding new streams, you shouldn't have any trouble with state clearing, so all should just work as expected for you 👍
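(Editor's note: to illustrate the env-var secrets injection mentioned above. The env var maps onto tap-mysql's password setting; the secrets mount path is a hypothetical example.)

# inject the database password at container run time instead of hard-coding it in meltano.yml;
# /run/secrets/tap_mysql_password is a hypothetical secrets mount
export TAP_MYSQL_PASSWORD="$(cat /run/secrets/tap_mysql_password)"
meltano elt tap-mysql target-bigquery --job_id="$PIPELINE_NAME"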
Pär
Interesting. Since this is the "unusual" approach, what would a more common way of doing it be? Perhaps:
1. Storing the meltano.yml file in source control
2. Retrieving that as part of the docker build and finishing off the build with a meltano install?
3. Issuing the meltano elt command when starting the container
ken_payne
Yep, that's exactly right 🙂 As Edgar mentioned, settings can be overridden using env vars, which is useful for secrets injection and any other dynamic settings you wish to configure at container run time.
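(Editor's note: a minimal sketch of that more common flow, assuming the project lives in a git repository; the repository URL and paths are placeholders. In a Dockerfile these would typically be COPY/RUN steps.)

# bring the version-controlled project (meltano.yml included) into the image
# and install the declared plugins; the URL is a hypothetical placeholder
git clone https://example.com/acme/meltano-project.git /project
cd /project && meltano install

At container start, the entrypoint then only needs the meltano elt invocation shown earlier, with any secrets supplied as environment variables.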