# troubleshooting
n
Not sure the best place to put this question, but this seems like a decent option. I’m curious if anyone has suggestions for optimizing the build process for a Meltano docker container so that it’s easier to make iterative changes quickly. Here’s what I mean: this is the standard Meltano dockerfile:
```dockerfile
ARG MELTANO_IMAGE=meltano/meltano:latest
FROM $MELTANO_IMAGE

WORKDIR /project

# Install any additional requirements
COPY ./requirements.txt .
RUN pip install -r requirements.txt

# Copy over Meltano project directory
COPY . .
RUN meltano install

# Don't allow changes to containerized project files
ENV MELTANO_PROJECT_READONLY 1

# Expose default port used by `meltano ui`
EXPOSE 5000

ENTRYPOINT ["meltano"]
```
The `meltano install` line needs `meltano.yml` (and any other yml files, if they’ve been broken out) to know which plugins to install, so the line before it copies all those files into the image. This means that if one makes a change to `meltano.yml`, like changing the `select`ed fields for a particular tap, the image will fully reinstall all the plugins when the container is rebuilt. This can really slow down the iteration process, so I’m wondering if there’s a way to refactor this dockerfile to avoid that. For example, might there be a way to get `meltano install` to run based on a single yml file that gets copied in first, then restructure some other yml files that capture all the field-level configuration so that we can take more mindful advantage of the image layers? Any ideas/suggestions from others who have run into this?
v
fwiw I've hit this as well. I just eat the time loss since it's so nice that things just rebuild and work 🤷 Where I've hit some "show stoppers" is when my project is installing 20+ plugins and a few of the pip HTTP calls retry too much; then I have to rerun `meltano install`, which takes an hour plus, and then a retry on top of that 😮
j
Soooo I’ve been screwing around with this too and have just started passing flags on small subsets of what I want to install with make
i.e. `make bundle-a`
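A minimal sketch of what that Makefile approach could look like. The target names and plugins here are hypothetical, not from this thread; the `meltano install <type> <name>` invocation is the standard CLI form:

```makefile
# Hypothetical Makefile: install only the plugin subsets you need,
# instead of one monolithic `meltano install`.
.PHONY: bundle-a bundle-b all

bundle-a:
	meltano install extractor tap-github
	meltano install loader target-duckdb

bundle-b:
	meltano install utility superset

all: bundle-a bundle-b
```

The dockerfile (or a local shell) can then run `make bundle-a` to install just one slice, skipping the expensive plugins when they aren’t needed.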
n
ah interesting, does that mean instead of doing the `meltano install` type commands explicitly in your dockerfile you’ve moved the installation steps into Makefiles and have the docker file run those as you need them?
j
Lol yes
n
got it, that seems like a viable option
j
I’m really digging it!
Superset install makes me cry so this way I can skip it if needed lol
n
ohh yeah, I bet that gets gnarly quickly in a situation like this
(as a sort of related side note - inspired by your DuckDB post, I recently went down a similar rabbit hole of trying to wrap up meltano/duckdb/metabase via docker compose)
v
Love the simplicity of just `meltano install tap-1 ... tap-N`, simple easy fix, I like it
j
I’m working on 3 different viz options, so I’m sort of splitting options in the Makefile so you can build only the pieces you need
Will share the example when I have it done
n
@visch, that might work too! it’s really only the taps that change frequently, so I could probably copy over the yaml files for all the others first, do a generic `meltano install`, then explicitly install all the taps separately, THEN copy over the yaml files for the taps after they’ve been installed so that they’re added in a downstream layer
only hitch with that plan would be whether or not `meltano install` would know where to find them without the yaml files in place, but I bet I could refactor how those yaml files are organized to deal with that
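A rough Dockerfile sketch of that layering idea (the plugin types and file layout are assumptions, and it runs into exactly the hitch mentioned above: each install step still needs the plugin definitions present, so `meltano.yml` is copied first and the cached layers survive only as long as that file is unchanged):

```dockerfile
ARG MELTANO_IMAGE=meltano/meltano:latest
FROM $MELTANO_IMAGE

WORKDIR /project

# Copy only the yaml needed to resolve plugin definitions, so the
# install layers below cache independently of other project files.
COPY ./meltano.yml .

# Rarely-changing plugins get their own cacheable layers
RUN meltano install loaders
RUN meltano install utilities

# Taps installed explicitly in a downstream layer
RUN meltano install extractors

# The rest of the project (including frequently-edited yaml) comes last
COPY . .
```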
v
Meltano could probably help here, but it doesn't seem simple on that side. Would need a cache system, but maybe there's something magic in `pip` that could be used 🤷
https://pip.pypa.io/en/stable/cli/pip/?highlight=--cache-dir#cmdoption-cache-dir could probably help dramatically as well; not sure how much Meltano does or doesn't do, but I'm pretty certain my installs don't use a global pip cache.
Hard part with docker is you probably don't want to keep the cache around, so you'd want a build image, but if you don't mind the extra bloat it might be a good option 🤷 I haven't dove too deep into this but hope it might help
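One way to get a persistent pip cache without keeping it in the image is BuildKit's cache mounts. This is an untested sketch: it assumes BuildKit is enabled, the container runs as root, and Meltano's plugin installs go through pip's default cache location (`/root/.cache/pip`):

```dockerfile
# syntax=docker/dockerfile:1
ARG MELTANO_IMAGE=meltano/meltano:latest
FROM $MELTANO_IMAGE

WORKDIR /project
COPY . .

# The cache mount persists pip's download cache across builds on the
# build host, without adding it to any image layer.
RUN --mount=type=cache,target=/root/.cache/pip \
    meltano install
```

Unlike the layer-caching tricks above, this doesn't skip reinstalls, but repeated installs would at least avoid re-downloading wheels.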
n
hmm yeah, thanks for those pointers/callouts!
I’ll try some tire kicking on this and will report back with what I find, thanks for your help @jacob_matson and @visch! And for confirming I’m not just missing something straightforward
Circling back with a promised update: I don’t think I’d call this a GOOD idea, but I did confirm that it works. If I create a separate yml file (which I’ve named `meltano_install.yml`) that only lists out the plugins I care about and their corresponding `pip_url`s, I can then do this in my dockerfile:
```dockerfile
COPY ./meltano_install.yml ./meltano.yml
RUN meltano install
COPY . .
```
This allows the `meltano install` command to find everything it needs and to cache the installed results as a layer in the image before the next layer puts the actual `meltano.yml` file (plus others) where they need to be. I can then tinker with those files as much as I want without needing to reinstall anything.
A downside to this is that it’s not especially DRY, since there’s some duplication between `meltano_install.yml` and `meltano.yml`. It also generally feels a little “clever in a bad way”, but it’s a start
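For reference, a stripped-down `meltano_install.yml` along those lines might look like the following (plugin names and `pip_url`s are illustrative, not from the thread; the point is that it carries only what `meltano install` needs, with no `select` or metadata config):

```yaml
# Hypothetical meltano_install.yml: just enough for `meltano install`
# to resolve and install plugins. All select/metadata config stays in
# the real meltano.yml, which is copied in a later layer.
version: 1
project_id: my-project
plugins:
  extractors:
    - name: tap-github
      pip_url: tap-github
  loaders:
    - name: target-duckdb
      pip_url: target-duckdb
```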
j
@nick_hamlin I think you should create an issue for this. I think your instinct that it feels wrong means it's a missing feature.
n
It’s also possible that there’s a way to get multiple yml files configured in such a way that the field selection is handled separately from the core config, which would fix the “wrongness” I’m feeling. I’m going to tinker a bit more with that approach, but will put in an issue if that turns out to be a dead end.
But this POC at least demonstrates that creative yml management is a viable path to speeding builds WAY up
s
Okay, I have had this discussion as well. What I have done is not place any table selects or metadata (like the type of replication) into my meltano.yml file. The reason: what changes mostly is which tables I want to ingest, not adding new taps or targets. To work out which tables to select and what the metadata should be for each table, we use two Meltano environment variables per tap to set the tables we want to select and how to replicate them. These are set at run-time via Chamber, which grabs the values and sets them when running the pipeline. The key environment variable suffixes are `_SELECT` and `_METADATA`. Example:

```shell
export TAP_SYBASE__NMD__SELECT='["dwh_extract-table1.*","dwh_extract-dwh_table2.*","dbo-table3.*","dbo-table4.*","dbo-table5.*"]'
export TAP_SYBASE__NMD__METADATA='{"dwh_extract-dwh-table1": {"is-view":false, "replication-method": "FULL_TABLE", "table-key-properties": ["column_1"]}}'
```

Here is another example of the `_METADATA` environment variable with a default replication method of full table, and overriding the primary key to be empty. This means that target_snowflake will insert the data rather than merge the data.

```shell
export TAP_SYBASE__NMD__METADATA='{"*": {"replication-method": "FULL_TABLE"}, "dwh_extract-dwh-table1": {"replication-method": "FULL_TABLE", "table-key-properties": []}}'
```