# infra-deployment
s
I keep reading in different spots that I need to "*setup different dbs for airflow and meltano*" (ex: https://meltano.slack.com/archives/C01TCRBBJD7/p1625505400493200?thread_ts=1625502310.487400&cid=C01TCRBBJD7). I didn't set up our meltano project; could someone explain to me how I would go about doing that? I'm hoping this will fix a lot of my ghost airflow issues.
d
@Stéphane Burwash Did you override the database_uri for Meltano or sqlalchemy_connection_uri (or something like that) for Airflow, in meltano.yml or through the environment? Or are they both using the default SQLite databases inside the .meltano directory?
s
Thanks for the answer @douwe_maan. Locally, we don't seem to do anything special (besides running meltano ui, airflow scheduler, and airflow webserver in 3 different containers). In prod, we have the same setup in ECS (3 task definitions), plus these variables are set (I can go check their values in Secrets Manager if need be):
- AIRFLOW__CORE__EXECUTOR
- AIRFLOW__CORE__SQL_ALCHEMY_CONN
- MELTANO_DATABASE_URI
Is this what you were talking about?
d
Yeah the last 2! Can you share their (sanitized) values?
s
They're different; guess my issue can be found elsewhere
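With separate backends, the two values would typically point at two distinct databases, along these lines (hosts, users, and passwords below are placeholders, not the actual values):

```sh
# Hypothetical, sanitized examples; the key point is two separate databases
MELTANO_DATABASE_URI=postgresql://meltano_user:<password>@my-rds-host:5432/meltano
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow_user:<password>@my-rds-host:5432/airflow
```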
d
What symptoms/errors are you seeing?
s
I think it's just a mixture of different airflow issues that could have many different causes. Currently, my main issue is getting the logs in production for the airflow webserver (since nothing from the actual runs seems to appear in meltano-ui, airflow-webserver, or airflow-scheduler). My latest fix: expose port 8793 in my task definition for the scheduler, but that does not seem to have fixed it.
And I need these logs to debug why my runs are failing
d
Do the airflow webserver and scheduler have shared persistent storage?
s
No, only the webserver
I'm guessing this would be the solution?
d
Yeah, I'd look into where the scheduler stores its execution logs, and whether the webserver has access to that same location or will always find it empty because it's not shared with the scheduler.
But opening up port 8793 looks like another approach, letting the webserver directly fetch the files from the scheduler, so you can investigate that further as well. I don't think the issue is likely to be Meltano-specific, though.
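For reference, the Airflow settings involved look roughly like this (the config section differs between Airflow 1.10.x and 2.x, and the path shown is only the assumed default under Meltano's AIRFLOW_HOME):

```sh
# Where the scheduler writes per-task log files (default: $AIRFLOW_HOME/logs)
AIRFLOW__CORE__BASE_LOG_FOLDER=/project/.meltano/run/airflow/logs      # Airflow 1.10.x
# AIRFLOW__LOGGING__BASE_LOG_FOLDER=/project/.meltano/run/airflow/logs # Airflow 2.x
# 8793 is the default port on which the scheduler/worker serves these files;
# exposing it is what lets the webserver fetch logs remotely instead of reading shared storage.
```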
s
Yeah, I think this is as much linked to my lack of knowledge in all things infrastructure, because on my local machine it shows the logs perfectly while running in 3 separate docker containers.
d
Hmm, do you have a directory mounted into each container locally that you don't have in prod?
s
Our main dockerfile:
```dockerfile
ARG MELTANO_IMAGE=meltano/meltano:latest
FROM $MELTANO_IMAGE

WORKDIR /project

# Install any additional requirements
COPY ./requirements.txt .
RUN pip install -r requirements.txt

# Install all plugins into the `.meltano` directory
COPY ./meltano.yml .
COPY ./extract ./extract
COPY ./load ./load
RUN meltano install --clean

# Pin `discovery.yml` manifest by copying cached version to project root
RUN cp -n .meltano/cache/discovery.yml . 2>/dev/null || :

# Don't allow changes to containerized project files
ENV MELTANO_PROJECT_READONLY 1

# Copy over remaining project files
COPY . .

# # Run command to generate dbt documentation
# RUN meltano invoke dbt docs generate

# Expose default port used by `meltano ui`
EXPOSE 5000

# Create airflow user
# RUN meltano invoke airflow users create \
#     --username potloc \
#     --firstname potloc \
#     --lastname potloc \
#     --role Admin \
#     --email spiderman@superhero.org \
#     --password potloc

ENTRYPOINT ["meltano"]
```
We use this in prod, with an added docker-compose (local only):
```yaml
version: '3.8'

x-meltano-image: &meltano-image
  image: meltano-potloc:dev
  build: .
  volumes:
    - .:/project

services:
  meltano-ui:
    <<: *meltano-image
    command: ui
    env_file:
      - .env
    expose:
      - 5000
    ports:
      - 5000:5000
    restart: unless-stopped

  airflow-scheduler:
    <<: *meltano-image
    command: invoke airflow scheduler
    expose:
      - 8793
    ports:
      - 8793:8793
    restart: unless-stopped

  airflow-webserver:
    <<: *meltano-image
    command: invoke airflow webserver
    expose:
      - 8080
    ports:
      - 8080:8080
    restart: unless-stopped

  dbt-docs:
    <<: *meltano-image
    command: invoke dbt docs serve --port 8081
    expose:
      - 8081
    ports:
      - 8081:8081
    restart: unless-stopped
```
d
Okay so the Docker-compose file gives all the containers access to the same local project directory, which is where airflow also stores its logs.
How do you start those containers in production? They don't need to share the entire project directory, but you could share the subdirectory where airflow stores its stuff
s
We use an AWS CodePipeline to automatically deploy them (please tell me if that doesn't answer your question)
Our webserver and meltano ui are persisted in 2 different RDS databases
All 3 tasks have their separate definition
I think the most intuitive solution is, as you suggested, to put the airflow scheduler in the same persisted storage as the webserver
d
That's definitely easiest. I'm not sure how to configure that with AWS though
But what you want is for your Docker containers to have a shared mounted directory. You can likely Google for that without needing any airflow specific results
s
All 3? Or do we want to keep meltano ui separate from the other 2?
d
Depends on if you also want to be able to see the run logs from Meltano UI. If you want to make this as similar to your local deployment as possible, you can create persistent storage and mount it into each of the containers at “/project/.meltano”, which is where all the supporting files like logs will be stored
s
Awesome, thanks. When we run `meltano run tap target`, we get a stream of all the records being extracted; which service would be generating those logs?
d
Not sure what you mean, the logs are generated by “meltano run” but you already know that 😄
s
Hahaha yes 🤣 I'm just trying to figure out where they're stored in my production architecture in CloudWatch. Because currently the answer seems to be "nowhere"
d
When you say you run `meltano run tap target`, is that managed by some automation or manually in an (ssh) terminal?
Note that it's possible that `meltano run` doesn't store its own logs if you just run it by itself instead of managed by Airflow. I'm not totally sure. `meltano elt` definitely does, though.
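If it helps as a starting point, `meltano elt` writes its run logs inside the project's `.meltano` directory; the exact layout can differ between Meltano versions, but it is usually something like:

```sh
# Assumed default location of elt run logs; substitute your own job and run IDs
ls .meltano/logs/elt/<job_id>/<run_id>/elt.log
```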
s
I mean using airflow scheduler (the default meltano architecture). We currently have 3 tasks (ui, webserver, and scheduler), which all have logs, but none of them are about actual runs (from what I can see). So I'm trying to find where the output of runs would be stored (ex: when my dag fails, where would the "why" be)
d
By default, logs are placed in the `AIRFLOW_HOME` directory. The `AIRFLOW_HOME` folder is `.meltano/orchestrators/airflow` or `.meltano/run/airflow`, if I'm not mistaken.
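One way to check where Airflow is actually writing, and whether the webserver sees the same files, assuming Meltano's default layout and a reasonably recent Airflow:

```sh
# Run these inside the scheduler container, then inside the webserver container, and compare
meltano invoke airflow info               # prints airflow_home and related paths (Airflow 1.10.11+ / 2.x)
ls .meltano/run/airflow/logs              # assumed default task-log location under AIRFLOW_HOME
ls .meltano/orchestrators/airflow         # alternative AIRFLOW_HOME on some Meltano versions
```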
s
Alright, so I could point them to the meltano-ui container or something to pick them up in CloudWatch
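One thing worth noting about a typical ECS setup (an assumption about this one): the awslogs log driver on a container definition only ships that container's stdout/stderr to CloudWatch, so task logs that Airflow writes to files under AIRFLOW_HOME won't appear there on their own. A placeholder container-definition fragment:

```json
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/meltano",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "airflow-scheduler"
    }
  }
}
```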
I'm having issues using the technique of mounting a volume on `project/.meltano`. I'm only trying to use the volume as ephemeral storage (while the container is running), so I'm guessing it's just wiping my `.meltano` post-install, because I'm getting `orchestrator airflow not installed` errors
d
ah yeah, by mounting that entire subdirectory, the plug-in installation files that “meltano install” had put into the container are also gone. So mounting all of “.meltano” is not actually going to work. Can you try “.meltano/run”?
s
We're in business!!!!!
For posterity:
Issue: Getting logs in production using ECS.
To solve:
1. Expose port 8793 for the scheduler.
2. Create an underlying volume in your architecture, mounted at `project/.meltano/run` (or `project/.meltano/airflow` in my case).
This should allow you to view logs in the airflow-webserver UI.
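A minimal docker-compose-style sketch of step 2, with placeholder names (on ECS the equivalent is a task-definition volume, e.g. EFS, mounted into both containers):

```yaml
volumes:
  airflow-run:   # shared storage for Airflow's runtime files and logs

services:
  airflow-scheduler:
    image: meltano-potloc:dev
    command: invoke airflow scheduler
    volumes:
      - airflow-run:/project/.meltano/run

  airflow-webserver:
    image: meltano-potloc:dev
    command: invoke airflow webserver
    volumes:
      - airflow-run:/project/.meltano/run
```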
b
@jo_pearson