# best-practices
e
Does anyone have best practice recommendations for managing their `meltano.yml` file? Specifically, ours is very verbose and contains a lot of duplication. For instance, we have an inline stream map on every table to add a source database identifier. If we try to define that with a `*` to apply to all tables, it errors because it doesn't respect the contents of `select`. This means that for each extractor the same table is identified in the `stream_maps`, `select`, and `metadata` blocks, so we're constantly jumping around the YAML to make updates, and it's very easy for devs to miss one of the three places that needs updating.
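For illustration, a minimal sketch of the kind of per-table duplication being described, with hypothetical names (`tap-postgres`, `public-orders`, source identifier `'acme'`); the exact metadata keys depend on the tap variant:

```yaml
plugins:
  extractors:
    - name: tap-postgres
      select:                          # every column listed explicitly
        - public-orders.id
        - public-orders.status
        - public-orders.updated_at
      metadata:                        # same table named again for key/replication settings
        public-orders:
          table-key-properties: [customer_id, id]
          replication-method: INCREMENTAL
          replication-key: updated_at
      config:
        stream_maps:                   # and named a third time to stamp the source identifier
          public-orders:
            customer_id: "'acme'"      # string literal evaluated by the stream map
```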
v
Could you share your config, maybe with at least 5-6 of the tables in it? If they really are the same for each table, then glob syntax should do the job; if they aren't, it's pretty tough, since each one needs something different. The next question is how many you have: 10, 100, 1000? Without seeing it, it's hard to give suggestions. Most of the time on my side I can set the defaults to cover what we're after 90%+ of the time so nobody has to remember, and for the other 10% we just override. For the stubborn stuff we wrote a very specific SOP for adding/removing tables, and that tends to work (though we don't have to tweak much of what you're talking about here, and our use cases are probably slightly different!)
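If the tables really are uniform, the glob-plus-wildcard approach described above would look roughly like the sketch below (hypothetical names again; note the original message reports that the `"*"` stream map currently errors, so this is exactly the thing to test):

```yaml
plugins:
  extractors:
    - name: tap-postgres
      select:
        - public-*.*                # glob selection instead of listing every column
      config:
        stream_maps:
          "*":                      # wildcard stream map applied to all selected streams
            customer_id: "'acme'"   # 'acme' is a made-up source database identifier
```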
e
I also posted to #C069CQNHDNF with one of our more specific problems https://meltano.slack.com/archives/C069CQNHDNF/p1752245447969929
I'm about to be on paternity leave, so I won't be around to review the PRs and we have several new devs that are still ramping up on the nuances of meltano
💯 1
👶 1
A substantial issue we run into is that it feels like our database schema gets duplicated into meltano config in multiple places (plus dbt, BigQuery, etc.)
v
can you combine these into one thread?
congrats on the baby, I have one coming any day as well, probably Sunday
e
thanks! and congrats to you. my wife has a scheduled induction so we're pretty sure on the date but anything could happen.
v
Yep, scheduled for one Sunday; she's having contractions today (a few hours ago), so I understand 😅

Stream maps "should" work with the wildcard you're putting in; I'd have to test. For the metadata, Postgres DBs normally already have PKs defined, so you shouldn't have to lay those out (unless they just aren't defined, in which case you do). Do you really need to list each column of the tables as well, or would just selecting `table1.*` work? Both of those would cut your line count pretty quickly.

For the stream maps issue, making a small repeatable example is what I'd go for: spin up a local Postgres DB, e.g.

docker run -e POSTGRES_PASSWORD=postgres -p 5432:5432 -h postgres -v /home/visch/postgres_data:/var/lib/postgresql/data -d docker.io/library/postgres:16

set up 1-2 tables (no metadata, etc.), try to reproduce the problem, and then post the full troubleshooting details separately (there are no logs in your example). I can't dig in right now as I'm busy getting ready for things myself.

The other way is to do a bunch of inheritance, one inherited extractor per table or something, but that's a lot of overhead just to make `meltano.yml` pretty. You could also go this route, https://docs.meltano.com/concepts/project/#multiple-yaml-files, but again that seems like too much to me 🤷, though if you actually have 1k+ tables it might make some sense.
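For reference, the multiple-YAML-files route from that docs link amounts to listing extra config files from the top-level `meltano.yml`, roughly like this (the paths here are just examples; each included file carries its own `plugins:` section):

```yaml
# top-level meltano.yml
include_paths:
  - ./extractors/*.meltano.yml   # e.g. one file per source database
  - ./loaders/*.meltano.yml
```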
e
For the metadata, the issue with relying on Postgres's already-defined PKs is that the `customer_id` column is actually one of our inline stream maps. We have single-tenant databases that get replicated to a single warehouse, and we add the customer/database identifier on the fly. If we rely on Postgres's PKs, we end up clobbering and writing over data from different customers. We list each column individually because we had some issues with wildcards, particularly with schema changes (new columns not getting added to the target). I'll see if I can whip up a reproducible example.
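As an aside, one possible way to avoid repeating the table in both the `metadata` block and the stream map is the Singer SDK's `__key_properties__` stream-map option, if the tap variant in use supports it; a sketch with the same hypothetical names:

```yaml
config:
  stream_maps:
    public-orders:
      customer_id: "'acme'"                  # stamp the source database identifier
      __key_properties__: [customer_id, id]  # declare the composite key in the same place
```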
v
> new columns not getting added to the target
That's a separate issue that shouldn't happen, and you're trading a more maintainable `meltano.yml` for a workaround to it.

> If we rely on Postgres's PKs then we end up clobbering and writing over data from different customers.
Hmm, I'd probably write each customer's data to its own schema in the target (at a minimum; since the goal is to combine them, that makes some sense), then combine them systematically during the transformation step with dbt. Again, all of this depends on what the numbers actually are: stream maps make a lot of sense for a few tables, but if it's 1,000 tables, maybe splitting by customer makes more sense. Just ideas!
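A hedged sketch of that schema-per-customer idea using inherited loaders; `default_target_schema` is the setting name some Postgres/Snowflake loader variants use, so substitute whatever the loader in use actually calls it (BigQuery targets typically use a dataset setting instead):

```yaml
plugins:
  loaders:
    - name: target-postgres--acme
      inherit_from: target-postgres
      config:
        default_target_schema: acme    # each customer lands in its own schema
    - name: target-postgres--globex
      inherit_from: target-postgres
      config:
        default_target_schema: globex
```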
m
I have a similar setup where a tap reads from multiple servers that all have the same schema, so I write each one to a different schema and then, in the destination, do a union all between them, adding the schema name as a column:
select 'server1' as server, * from server1.table union all select 'server2' as server, * from server2.table ...
dbt Jinja definitely helps in cases like that, to build the view that unions a lot of schemas/tables!
e
In hindsight we should have done some things differently in setting up our data replication. We should have loaded our data into BigQuery as JSON then used dbt to explode it out into columns.
💯 1
m
The other benefit of doing it in the destination DB is that you can also form the surrogate key (server + primary key) and transform known foreign keys in that select-and-union statement, so it's handled transparently for the downstream models. 😅
e
At the time we really wanted an easy and simple solution. The added complexity didn't seem to have any advantage back then, but now we're in a tough spot.