# best-practices
b
Hi, we are nascent Meltano+dbt adopters. I wanted to get a general consensus on how people are managing their Meltano codebase.
• Mono-repo or per-pipeline repo for Meltano?
• Embedded dbt codebase per pipeline, or a separate dbt codebase?
• Use the Meltano UI, or an Airflow/Dagster UI?
• What/how do you manage observability of pipelines and data flow/journey?
v
I generally go for mono-repo, embedded dbt. I use the orchestrator UI; for me that can be SQL Agent jobs (eek) or GitHub/GitLab.
• What/how do you manage observability of pipelines and data flow/journey?
Same as the orchestrator. When I need more "observability" I make new dbt models that give me information about the things I care about.
I almost never use the UI. I rely on notifications that I have set up, and then I'll check a UI if needed. 🤷
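(v builds custom dbt models for this; a related, minimal sketch using dbt's built-in source freshness checks instead, assuming a Singer target that writes the `_sdc_batched_at` metadata column. Schema, table, and thresholds are placeholders:)
```yaml
# Hypothetical dbt sources.yml: running `dbt source freshness` turns
# stale loads into warnings/errors, a cheap pipeline-observability check.
version: 2
sources:
  - name: raw
    schema: tap_postgres              # assumed raw schema name
    loaded_at_field: _sdc_batched_at  # Singer metadata column, if your target emits it
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders                  # placeholder table
```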
b
With a mono-repo, would each pipeline have its own parent folder?
or are the pipelines commingled?
v
Commingled; it's not a big deal for me. Let me count
v
I care about making things easy for me and the team. I don't really care about principles; I use principles as guides.
b
👆 yes. I have a 20+ year software engineering background 🙂
v
As much as possible. The beauty of one repo is that your definitions are all in one place, vs. how almost every other tool I've used is set up: they have things all over the place.
b
Right, so then what do you consider the best strategies for naming and folder/file organization, so that the pipelines don't end up a hot mess?
v
When they are hot messes
Not until then
Normally it's triggered when someone new gets added to maintaining the repo; it's a good way for them to learn 🤷
b
so you only have one `meltano.yml` file in your mono-repo?
v
yep
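(For newer adopters, a hypothetical sketch of what one commingled `meltano.yml` can look like; the plugin names echo the CI snippet later in this thread, and everything else is a placeholder:)
```yaml
# Hypothetical single-file Meltano project: every extractor, loader,
# and schedule lives together in one meltano.yml.
version: 1
default_environment: dev
environments:
  - name: dev
plugins:
  extractors:
    - name: tap-toggl
    - name: tap-postgres
  loaders:
    - name: target-postgres
  transformers:
    - name: dbt
jobs:
  - name: toggl-to-warehouse
    tasks:
      - tap-toggl target-postgres dbt:run
schedules:
  - name: daily-toggl
    interval: "@daily"
    job: toggl-to-warehouse
```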
b
what is your deployment…
v
But that works for us; we only have 4 extractors and 2-3 loaders.
Inherited a bunch, but generally that's it; if we had 20 it'd be different.
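(If "inherited" here means Meltano plugin inheritance, it looks roughly like this; the derived name and config key are hypothetical:)
```yaml
# Hypothetical plugin inheritance: the derived extractor reuses the
# base plugin's installation but carries its own name and config.
plugins:
  extractors:
    - name: tap-postgres
    - name: tap-postgres--billing    # placeholder inherited copy
      inherit_from: tap-postgres
      config:
        filter_schemas: [billing]    # assumed tap setting
```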
b
we have 2-3 extractors and 1 loader currently
where/how do you deploy.. and what's your orchestrator?
v
Git repo on GitLab; deploy to a Windows server. The orchestrator is SQL Agent jobs 😉
b
We have a Kubernetes ecosystem; we're deploying each pipeline as a singular Docker image in kube.
v
Good. If you already have a k8s ecosystem you should 100% do that!
b
so if I have the facility to do one pipeline per deployment, do you think a commingled deployment is still better vs. each Meltano pipeline as its own Docker image?
not using any Kubernetes operator..
v
I personally would, as it's easier to maintain one Docker image.
Meltano handles all your dependency management.
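(Concretely: each plugin in `meltano.yml` can pin its own `pip_url`, and `meltano install` builds an isolated virtualenv per plugin, so taps and targets don't collide on Python dependencies. A hypothetical sketch; the version pins are placeholders:)
```yaml
# Hypothetical per-plugin dependency pinning in meltano.yml.
plugins:
  extractors:
    - name: tap-toggl
      pip_url: tap-toggl==0.1.0                      # placeholder pin
  loaders:
    - name: target-postgres
      pip_url: meltanolabs-target-postgres==0.0.9    # placeholder pin
```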
b
each Docker image has an entrypoint that starts Meltano and points to a single Meltano UI location
v
Doesn't mean you have to do that 🙂
b
but currently each Meltano instance starts its own Airflow scheduler
b
but the Airflow backend DB is shared
v
Just giving you options 🤷
```yaml
stages:
- build
- run

#Only triggered via a scheduled run. We pull the latest Docker image to run the job with
#Using the docker image is faster to run as we don't have to install meltano or the tap/target packages
runner:
  image:
    name: $CI_REGISTRY_IMAGE:latest
    entrypoint: [""]
  before_script:
  - cp -Rn /project/. . #Copy meltano project into image
  stage: run
  variables:
    TARGET_POSTGRES_PASSWORD: $TAP_POSTGRES_PASSWORD
    TARGET_POSTGRES_HOST: $TAP_POSTGRES_HOST
    DBT_HOST: $TAP_POSTGRES_HOST
    DBT_PASSWORD: $TAP_POSTGRES_PASSWORD
    POSTGRES_PASSWORD: $TAP_POSTGRES_PASSWORD

  services:
  - name: postgres
  script:
  - "meltano run tap-toggl target-postgres dbt:run tap-postgres target-apprise"
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule" 

# Tags <project>:<sha>
# Tags <project>:<ref> (<branch> or <tag>)
# Tags <project>:latest
# Saves us time by building the Dockerfile once when things change. Runner runs a lot (every 15 minutes or so); docker-build-latest runs infrequently
docker-build-latest:
  stage: build
  image: docker:stable
  variables:
    DOCKER_DRIVER: overlay2
    MELTANO_IMAGE: meltano/meltano
  before_script:
  - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  - docker pull $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_NAME || true
  services: ["docker:dind"]
  script:
  - >
    docker build
    --cache-from $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_NAME
    --tag $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    --tag $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_NAME
    --build-arg MELTANO_IMAGE=$MELTANO_IMAGE
    .
  - docker tag $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA $CI_REGISTRY_IMAGE:latest
  - docker push $CI_REGISTRY_IMAGE:latest
  rules:
  - if: $CI_COMMIT_BRANCH == "main" && $CI_PIPELINE_SOURCE != "schedule"
```
For a k8s deploy I'd look at @ken_payne’s work wherever it is at
He did a bunch of work with Helm on how to deploy a standalone system. I'm just showing some other options.
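(For flavor, a hypothetical Kubernetes CronJob running a project image on a schedule; all names, the schedule, and the secret are placeholders, and it assumes the image's entrypoint is `meltano`:)
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: meltano-toggl-pipeline        # placeholder name
spec:
  schedule: "*/15 * * * *"            # placeholder schedule
  concurrencyPolicy: Forbid           # don't overlap runs of the same pipeline
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: meltano
              image: registry.example.com/meltano-project:latest  # placeholder image
              args: ["run", "tap-toggl", "target-postgres", "dbt:run"]
              envFrom:
                - secretRef:
                    name: meltano-secrets   # placeholder secret with TAP/TARGET creds
```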
b
sure, what is the above code snippet.. I don't exactly recognize the syntax
v
GitLab CI
b
ah okay
so a prebuilt Docker image with all taps/targets pre-installed; just copy in the new layer of codebase, push, and run
v
That's the way I like it 🤷
b
the pre-built image auto-refreshes to `latest`, I am guessing
v
That's these two
```yaml
- docker tag $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA $CI_REGISTRY_IMAGE:latest
- docker push $CI_REGISTRY_IMAGE:latest
```
Note that this Docker setup comes from Meltano's files-docker bundle.
b
Oh thanks, I’ll look into Ken’s work