# troubleshooting
n
Not sure the best place to put this question, but this seems like a decent option. I’m curious if anyone has suggestions for optimizing the build process for a Meltano docker container so that it’s easier to make iterative changes quickly. Here’s what I mean: this is the standard Meltano dockerfile:
```dockerfile
ARG MELTANO_IMAGE=meltano/meltano:latest
FROM $MELTANO_IMAGE

WORKDIR /project

# Install any additional requirements
COPY ./requirements.txt .
RUN pip install -r requirements.txt

# Copy over Meltano project directory
COPY . .
RUN meltano install

# Don't allow changes to containerized project files
ENV MELTANO_PROJECT_READONLY 1

# Expose default port used by `meltano ui`
EXPOSE 5000

ENTRYPOINT ["meltano"]
```
The `meltano install` line needs `meltano.yml` (and any other yml files, if they’ve been broken out) to know which plugins to install, so the line before it copies all those files into the image. This means that if one makes a change to `meltano.yml`, like changing the `select`ed fields for a particular tap, the image will fully reinstall all the plugins when the container is rebuilt. This can really slow down the iteration process, so I’m wondering if there’s a way to refactor this dockerfile to avoid that. For example, might there be a way to get `meltano install` to run based on a single yml file that gets copied in first, then restructure some other yml files that capture all the field-level configuration so that we can take more mindful advantage of the image layers? Any ideas/suggestions from others who have run into this?
v
fwiw I've hit this as well. I just eat the time loss since it's so nice that things just rebuild and work 🤷 Where I've hit some "show stoppers" is when my project is installing 20+ plugins and a few of the pip HTTP calls retry too much; then I have to rerun `meltano install`, which takes an hour plus, and then a retry on top of that 😮
j
Soooo I’ve been screwing around with this too and have just started passing flags on small subsets of what I want to install with make
i.e. `make bundle-a`
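A minimal sketch of what that Makefile approach could look like. The target names and plugins here are hypothetical, not from this thread; the `meltano install <type> <name>` invocation is the standard CLI form:

```makefile
# Hypothetical Makefile: install only the plugin subsets you need,
# instead of one monolithic `meltano install`.
.PHONY: bundle-a bundle-b all

bundle-a:
	meltano install extractor tap-github
	meltano install loader target-duckdb

bundle-b:
	meltano install utility superset

all: bundle-a bundle-b
```

The dockerfile (or a local shell) can then run `make bundle-a` to install just one slice, skipping the expensive plugins when they aren’t needed.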
n
ah interesting, does that mean instead of doing the `meltano install` type commands explicitly in your dockerfile you’ve moved the installation steps into Makefiles and have the docker file run those as you need them?
j
Lol yes
n
got it, that seems like a viable option
j
I’m really digging it!
Superset install makes me cry so this way I can skip it if needed lol
n
ohh yeah, I bet that gets gnarly quickly in a situation like this
(as a sort of related side note - inspired by your DuckDB post, I recently went down a similar rabbit hole of trying to wrap up meltano/duckdb/metabase via docker compose)
v
Love the simplicity of just `meltano install tap-1 ... tap-N`, simple easy fix, I like it
j
I’m working on 3 different viz options, so I’m sort of splitting options in the Makefile so you can build only the pieces you need
Will share the example when I have it done
n
@visch, that might work too! it’s really only the taps that change frequently, so I could probably copy over the yaml files for all the others first, do a generic `meltano install`, then explicitly install all the taps separately, THEN copy over the yaml files for the taps after they’ve been installed so that they’re added in a downstream layer
only hitch with that plan would be whether or not `meltano install` would know where to find them without the yaml files in place, but I bet I could refactor how those yaml files are organized to deal with that
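A rough Dockerfile sketch of that layering idea (the plugin types and file layout are assumptions, and it runs into exactly the hitch mentioned above: each install step still needs the plugin definitions present, so `meltano.yml` is copied first and the cached layers survive only as long as that file is unchanged):

```dockerfile
ARG MELTANO_IMAGE=meltano/meltano:latest
FROM $MELTANO_IMAGE

WORKDIR /project

# Copy only the yaml needed to resolve plugin definitions, so the
# install layers below cache independently of other project files.
COPY ./meltano.yml .

# Rarely-changing plugins get their own cacheable layers
RUN meltano install loaders
RUN meltano install utilities

# Taps installed explicitly in a downstream layer
RUN meltano install extractors

# The rest of the project (including frequently-edited yaml) comes last
COPY . .
```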
v
Meltano could probably help here, but it doesn't seem simple on that side. Would need a cache system, but maybe there's something magic in `pip` that could be used 🤷
https://pip.pypa.io/en/stable/cli/pip/?highlight=--cache-dir#cmdoption-cache-dir could probably help dramatically as well; not sure how much Meltano does or doesn't do, but I'm pretty certain my installs don't use a global pip cache.
Hard part with docker is you probably don't want to keep the cache around, so you'd want a build image, but if you don't mind the extra bloat it might be a good option 🤷 I haven't dove too deep into this but hope it might help
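One way to get a persistent pip cache without keeping it in the image is BuildKit's cache mounts. This is an untested sketch: it assumes BuildKit is enabled, the container runs as root, and Meltano's plugin installs go through pip's default cache location (`/root/.cache/pip`):

```dockerfile
# syntax=docker/dockerfile:1
ARG MELTANO_IMAGE=meltano/meltano:latest
FROM $MELTANO_IMAGE

WORKDIR /project
COPY . .

# The cache mount persists pip's download cache across builds on the
# build host, without adding it to any image layer.
RUN --mount=type=cache,target=/root/.cache/pip \
    meltano install
```

Unlike the layer-caching tricks above, this doesn't skip reinstalls, but repeated installs would at least avoid re-downloading wheels.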
n
hmm yeah, thanks for those pointers/callouts!
I’ll try some tire kicking on this and will report back with what I find, thanks for your help @jacob_matson and @visch! And for confirming I’m not just missing something straightforward
Circling back with a promised update: I don’t think I’d call this a GOOD idea, but I did confirm that it works. If I create a separate yml file (which I’ve named `meltano_install.yml`) that only lists out the plugins I care about and their corresponding `pip_url`s, I can then do this in my dockerfile:
```dockerfile
COPY ./meltano_install.yml ./meltano.yml
RUN meltano install
COPY . .
```
This allows the `meltano install` command to find everything it needs and to cache the installed results as a layer in the image before the next layer puts the actual `meltano.yml` file (plus others) where they need to be. I can then tinker with those files as much as I want without needing to reinstall anything.
A downside to this is that it’s not especially DRY, since there’s some duplication between `meltano_install.yml` and `meltano.yml`. It also generally feels a little “clever in a bad way”, but it’s a start
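For reference, a stripped-down `meltano_install.yml` along those lines might look like the following (plugin names and `pip_url`s are illustrative, not from the thread; the point is that it carries only what `meltano install` needs, with no `select` or metadata config):

```yaml
# Hypothetical meltano_install.yml: just enough for `meltano install`
# to resolve and install plugins. All select/metadata config stays in
# the real meltano.yml, which is copied in a later layer.
version: 1
project_id: my-project
plugins:
  extractors:
    - name: tap-github
      pip_url: tap-github
  loaders:
    - name: target-duckdb
      pip_url: target-duckdb
```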
j
@nick_hamlin I think you should create an issue for this. I think your instinct that it feels wrong means it's a missing feature.
n
It’s also possible that there’s a way to get multiple yml files configured in such a way that the field selection is handled separately from the core config, which would fix the “wrongness” I’m feeling. I’m going to tinker a bit more with that approach, but will put in an issue if that turns out to be a dead end.
But this POC at least demonstrates that creative yml management is a viable path to speeding builds WAY up
s
Okay, I have had this discussion as well. What I have done is not place any table selects or metadata (like the type of replication) into my meltano.yml file. The reason: what changes mostly is which tables I want to ingest, not adding new taps or targets. To work out which tables to select and what the metadata should be for each table, we use two Meltano environment variables per tap to set the tables we want to select and how to replicate them. These are set at run-time via Chamber, which grabs the values and sets them when running the pipeline. The key environment variable suffixes are `_SELECT` and `_METADATA`. Example:

```shell
export TAP_SYBASE__NMD__SELECT='["dwh_extract-table1.*","dwh_extract-dwh_table2.*","dbo-table3.*","dbo-table4.*","dbo-table5.*"]'
export TAP_SYBASE__NMD__METADATA='{"dwh_extract-dwh-table1": {"is-view":false, "replication-method": "FULL_TABLE", "table-key-properties": ["column_1"]}}'
```

Here is another example of the `_METADATA` environment variable with a default replication method of full table, and overriding the primary key to be empty. This means that target_snowflake will insert the data rather than merge the data.

```shell
export TAP_SYBASE__NMD__METADATA='{"*": {"replication-method": "FULL_TABLE"}, "dwh_extract-dwh-table1": {"replication-method": "FULL_TABLE", "table-key-properties": []}}'
```