# best-practices
e
Does anyone have best practice recommendations for managing their `meltano.yml` file? Specifically, ours is very verbose and contains a lot of duplication. For instance, we have an inline stream map on every table to add a source database identifier. If we try to define that with a `*` to apply to all tables, it errors because it doesn't respect the contents of `select`. This means that for each extractor the same table is identified in the `stream_maps`, `select`, and `metadata` blocks, so we're constantly jumping around the YAML to make updates, and it's very easy for devs to miss one of the three places that needs updating.
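For illustration, a minimal sketch of the kind of per-table duplication being described, with hypothetical names (`tap-postgres`, `public-orders`, source identifier `'acme'`); the exact metadata keys depend on the tap variant:

```yaml
plugins:
  extractors:
    - name: tap-postgres
      select:                          # every column listed explicitly
        - public-orders.id
        - public-orders.status
        - public-orders.updated_at
      metadata:                        # same table named again for key/replication settings
        public-orders:
          table-key-properties: [customer_id, id]
          replication-method: INCREMENTAL
          replication-key: updated_at
      config:
        stream_maps:                   # and named a third time to stamp the source identifier
          public-orders:
            customer_id: "'acme'"      # string literal evaluated by the stream map
```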
v
Could you share your config, maybe with at least 5-6 of the tables in it? If they really are the same for each table, then glob syntax should do the job; if they aren't, it's pretty tough, since each one needs something different. The next question is how many you have: 10, 100, 1000? Without seeing it, it's hard to give suggestions. Most of the time on my side I can set the defaults to cover what we're after 90%+ of the time so nobody has to remember, and for the other 10% we just override. For the stubborn stuff we wrote a very specific SOP for adding/removing tables, and that tends to work (though we don't have to tweak much of what you're talking about here, and our use cases are probably slightly different!)
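If the tables really are uniform, the glob-plus-wildcard approach described above would look roughly like the sketch below (hypothetical names again; note the original message reports that the `"*"` stream map currently errors, so this is exactly the thing to test):

```yaml
plugins:
  extractors:
    - name: tap-postgres
      select:
        - public-*.*                # glob selection instead of listing every column
      config:
        stream_maps:
          "*":                      # wildcard stream map applied to all selected streams
            customer_id: "'acme'"   # 'acme' is a made-up source database identifier
```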
e
I also posted to #C069CQNHDNF with one of our more specific problems https://meltano.slack.com/archives/C069CQNHDNF/p1752245447969929
I'm about to be on paternity leave, so I won't be around to review the PRs and we have several new devs that are still ramping up on the nuances of meltano
💯 1
👶 1
A substantial issue we run into is that it feels like our database schema gets duplicated into meltano config in multiple places (plus dbt, BigQuery, etc.)
v
can you combine these into one thread?
congrats on the baby, I have one coming any day as well, probably Sunday
e
thanks! and congrats to you. my wife has a scheduled induction so we're pretty sure on the date but anything could happen.
v
Yep, scheduled for one Sunday; she's having contractions today (a few hours ago), so I understand 😅

Stream maps "should" work with the wildcard you're putting in; I'd have to test. For the metadata, Postgres DBs normally already have PKs defined, so you shouldn't have to lay those out (unless they just aren't defined, in which case you do). Do you really need to list each column of the tables as well, or would just selecting `table1.*` work? Both of those would cut your line count pretty quickly.

For the stream maps issue, making a small repeatable example is what I'd go for: spin up a local Postgres DB, e.g.

docker run -e POSTGRES_PASSWORD=postgres -p 5432:5432 -h postgres -v /home/visch/postgres_data:/var/lib/postgresql/data -d docker.io/library/postgres:16

set up 1-2 tables (no metadata, etc.), try to reproduce the problem, and then post the full troubleshooting details separately (there are no logs in your example). I can't dig in right now as I'm busy getting ready for things myself.

The other way is to do a bunch of inheritance, one inherited extractor per table or something, but that's a lot of overhead just to make `meltano.yml` pretty. You could also go this route, https://docs.meltano.com/concepts/project/#multiple-yaml-files, but again that seems like too much to me 🤷, though if you actually have 1k+ tables it might make some sense.
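For reference, the multiple-YAML-files route from that docs link amounts to listing extra config files from the top-level `meltano.yml`, roughly like this (the paths here are just examples; each included file carries its own `plugins:` section):

```yaml
# top-level meltano.yml
include_paths:
  - ./extractors/*.meltano.yml   # e.g. one file per source database
  - ./loaders/*.meltano.yml
```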
e
For the metadata, the issue with relying on Postgres's already-defined PKs is that the `customer_id` column is actually one of our inline stream maps. We have single-tenant databases that get replicated to a single warehouse, and we add the customer/database identifier on the fly. If we rely on Postgres's PKs, we end up clobbering and writing over data from different customers. We list each column individually because we had some issues with wildcards, particularly with schema changes (new columns not getting added to the target). I'll see if I can whip up a reproducible example.
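As an aside, one possible way to avoid repeating the table in both the `metadata` block and the stream map is the Singer SDK's `__key_properties__` stream-map option, if the tap variant in use supports it; a sketch with the same hypothetical names:

```yaml
config:
  stream_maps:
    public-orders:
      customer_id: "'acme'"                  # stamp the source database identifier
      __key_properties__: [customer_id, id]  # declare the composite key in the same place
```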
v
> new columns not getting added to the target
That's a separate issue that shouldn't happen, and you're trading a more maintainable `meltano.yml` for a workaround to it.

> If we rely on Postgres's PKs then we end up clobbering and writing over data from different customers.
Hmm, I'd probably write each customer's data to its own schema in the target (at a minimum; since the goal is to combine them, that makes some sense), then combine them systematically during the transformation step with dbt. Again, all of this depends on what the numbers actually are: stream maps make a lot of sense for a few tables, but if it's 1,000 tables, maybe splitting by customer makes more sense. Just ideas!
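A hedged sketch of that schema-per-customer idea using inherited loaders; `default_target_schema` is the setting name some Postgres/Snowflake loader variants use, so substitute whatever the loader in use actually calls it (BigQuery targets typically use a dataset setting instead):

```yaml
plugins:
  loaders:
    - name: target-postgres--acme
      inherit_from: target-postgres
      config:
        default_target_schema: acme    # each customer lands in its own schema
    - name: target-postgres--globex
      inherit_from: target-postgres
      config:
        default_target_schema: globex
```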
m
I have a similar setup where a tap reads from multiple servers that all have the same schema, so I write each one to a different schema and then, in the destination, do a union all between them, adding the schema name as a column:
select 'server1' as server, * from server1.table union all select 'server2' as server, * from server2.table ...
dbt Jinja definitely helps in cases like that, to build the view that unions a lot of schemas/tables!
e
In hindsight we should have done some things differently in setting up our data replication. We should have loaded our data into BigQuery as JSON then used dbt to explode it out into columns.
💯 1
m
The other benefit of doing it in the destination DB is that you can also form the surrogate key (server + primary key) and transform known foreign keys in that select-and-union statement, so it's handled transparently for the downstream models. 😅
e
At the time we really wanted an easy and simple solution. The added complexity didn't seem to have any advantage back then, but now we're in a tough spot.