# troubleshooting
h
Hi, I have been thinking about project setups, and one idea relies on plugin inheritance and `include_paths` to keep things tidy (or possibly just distribute the messiness). This might not be possible, and it might be a really bad idea, but the basic idea is:
`meltano.yaml`: Declares the taps and targets themselves, but no config. Imports yamls in the `extractors/` and `loaders/` folders.
└ `extractors/server1.meltano.yaml`: Extractors that inherit from the ones declared in meltano.yaml and add some configuration like server name, password, etc.
└ `extractors/server1/data1.meltano.yaml`: Extractor that inherits from the ones declared in the parent extractors folder and specifies exactly which datasets are extracted.
└ `loaders/loaders.meltano.yaml`: Same as for extractors…
So basically, conceptually, there are three levels of config files:
• level 1: meltano.yaml basically declares the different taps, no config or anything, just a base definition
• level 2: Taps with configs for specific servers/services, useful when you have several instances of snowflake, postgres etc. that you load from.
• level 3: Taps that inherit from a specific service and specify the datasets/endpoints/etc. to be loaded. This allows us to load different schemas/datasets from the same source on different cadences.
Only the “level 3” taps would actually be invoked. I haven’t gotten this to work yet, I don’t know if it is even possible, and I might be really, really overthinking it. Are there easier ways to go about this? Although I’m using docker/prefect for the loads, I want to stick with a single meltano project if possible.
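A minimal sketch of the three levels described above, written as three YAML documents, one per file. It assumes `tap-postgres` as the base plugin; the child plugin names, the host, the user, and the `select` pattern are hypothetical placeholders, and the file names use the `.yml` extension of Meltano's root `meltano.yml`. Both `include_paths` globs are listed in the root file here, which is an assumption about the layout rather than something stated in the question.

```yaml
# meltano.yml (level 1): base declarations only, no config
version: 1
include_paths:
  - ./extractors/*.meltano.yml      # level 2 files
  - ./extractors/*/*.meltano.yml    # level 3 files
plugins:
  extractors:
    - name: tap-postgres            # assumed base plugin
---
# extractors/server1.meltano.yml (level 2): settings for one server
plugins:
  extractors:
    - name: tap-postgres--server1   # hypothetical name
      inherit_from: tap-postgres
      config:
        host: server1.example.com   # hypothetical host
        user: meltano               # hypothetical user; keep secrets out of the file
---
# extractors/server1/data1.meltano.yml (level 3): what to extract
plugins:
  extractors:
    - name: tap-postgres--server1--data1   # hypothetical name
      inherit_from: tap-postgres--server1
      select:
        - public-orders.*           # hypothetical stream selection
```

With a layout like this, only the level 3 plugin would be named in a run, and each level picks up the settings of its parent through `inherit_from`.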
p
@Henning Holgersen this sounds reasonable to me! Inheritance and `include_paths` let you organize things in whatever way your team prefers. The inheritance feature only installs the plugin once (previously each child got its own venv), so install time and disk space aren't an issue when spreading the config out like this. I've done what you're describing on a smaller scale, and I end up leaving a few configs at each level: some settings are generic enough to apply to all children, and can still be overridden in a child if needed, which avoids re-configuring the same value over and over. Although sometimes, for readability, it's nicer to have the configs all set in the lowest-level child.
a
@Henning Holgersen - I might suggest a slightly different approach but very similar:
/meltano.yml
/environments/dev.meltano.yml
/environments/staging.meltano.yml
/environments/prod.meltano.yml
/extractors/<domain1>.meltano.yml
/extractors/<domain2>.meltano.yml
/extractors/<domain3>.meltano.yml
/loaders/*.meltano.yml
...
Rather than breaking up extractors into tiers or levels of inheritance, I would try to keep them grouped together by topic area and/or by internal team or function names. For example, if you have several instances of `tap-slack` for connecting to different Slack sites, putting them all in `.../extractors/slack.meltano.yml` will make debugging a lot easier when they inherit from each other, and if you need to change all of them, you can do so in one place. As the team and the repo grow, you can also group subfolders according to which team controls/maintains which files. So hubspot and google analytics might be nested in a `marketing` subfolder, for instance, and then you can use CODEOWNERS to govern who is allowed to approve changes in each subfolder.
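As a rough illustration of the single-file-per-domain idea, here is what a `slack.meltano.yml` could look like with one base plugin and two inheriting instances. The instance names and dates are hypothetical, and `start_date` is assumed to be a setting of the `tap-slack` variant in use.

```yaml
# extractors/slack.meltano.yml (hypothetical contents)
plugins:
  extractors:
    # Base definition: shared defaults for every Slack workspace
    - name: tap-slack
      config:
        start_date: "2022-01-01"    # assumed setting name, hypothetical value

    # One child per Slack site; each inherits the base config
    - name: tap-slack--community    # hypothetical name
      inherit_from: tap-slack

    - name: tap-slack--internal     # hypothetical name
      inherit_from: tap-slack
      config:
        start_date: "2023-01-01"    # overrides the inherited default
```

Keeping the parent and its children side by side means a change to the shared defaults and a check of which children override them happen in the same file.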
And here's our squared project that @pat_nadolny mentions, in case it helps.
h
Thanks for the input, the squared project was my primary inspiration. I am currently using a two-level setup similar to squared, which works well, but won’t scale as well as I’d like. Adding a third level didn’t seem to work (tap not found), which is part of why I asked. If there is no obvious reason double inheritance wouldn’t work, I’ll give it a few more tries. The domain organization looks interesting, I will ponder how that would look in our project.
c
I think the principle that applies here is "separation of concerns". So the question really is: what are your team's concerns? In my example, the team is small (2-3 people) and our concerns are:
- We have LOTS of different data sources
So what I ended up with for now is:
1. Almost nothing in meltano.yml with regard to plugins, apart from dbt and common testing plugins like `target-jsonl`
2. One file per data source in the `extract/` folder. Each file specifies everything: base plugin, inherited plugin (when needed), and common config (we hardly have to differentiate between dev/prod in any of the sources)
3. In the rare case that a "base" plugin might need to be inherited by multiple data sources (which hasn't happened yet), the base plugin will simply get its own file in the `extract/` folder.
Point three highlights the other important principle we follow: Refactor early and often
h
We have a similar team organization @christoph: my small (3-4 person) team is responsible for most of the E/L, from a lot of very different sources. I’m sure we will change our minds on the organizing principles as the sources grow, but I’m happy to hear the general approach doesn’t sound absurd. I was able to get it to work. It seems the trouble wasn’t double inheritance, but layered use of `include_paths`. Refactor early and often is good advice, I just hope we can keep it up as the project matures.
c
And one of the Agile Manifesto Principles is always handy to keep around:
Simplicity--the art of maximizing the amount of work not done--is essential.