# best-practices
glenn_de_munck
Hi all. How would you handle the following in Meltano? I have a source database where the same tables exist in multiple schemas. It's important to know which data comes from which schema after extracting.
v
Use the tap that corresponds to that DB. I'd recommend messing with a local Postgres container with the `tap-postgres` setup to get used to it (or just go for a dev/test setup). Yes, schemas are separated; check out the Select feature in Meltano, e.g. the sketch below.
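For example, to grab the same table from specific schemas, something roughly like this in meltano.yml (a sketch; the schema and table names are made up):

```yaml
plugins:
  extractors:
  - name: tap-postgres
    select:
    # Stream names are <schema>-<table>, so the schema is part of the selection.
    - ops-customers.*
    - monitoring-customers.*
```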
thomas_briggs
The stream name produced by the tap will be `<schema>-<table>`, so in a lot of ways this is handled automatically. What the target will do is a different question. The jsonl target uses the stream name as the file name, so it'd be automatic there too. All the DB targets I've used (PG, Redshift, Snowflake, etc.) use only the table name, though. In that case maybe you'd need to rename the stream using a mapper to retain the schema name somehow, e.g. as sketched below. It's an interesting question.
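For example, meltano-map-transform can alias a stream so the schema name survives in the target table name. A rough sketch (the stream name is made up, and I haven't tested this):

```yaml
plugins:
  mappers:
  - name: meltano-map-transform
    variant: meltano
    pip_url: meltano-map-transform
    mappings:
    - name: keep-schema-in-table-name
      config:
        stream_maps:
          finance-orders:              # incoming stream: <schema>-<table>
            __alias__: finance_orders  # rename so the target table keeps the schema
```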
aaronsteers
Hi, @glenn_de_munck. To @thomas_briggs's point above, it will really depend on your target and how you've configured it. Some targets ignore/scrub the `<schema>-` prefix by default (which would create a collision for you) and some also have config options for you to control this behavior. Another option, as @thomas_briggs mentioned, is to use a mapper for aliasing.
glenn_de_munck
@aaronsteers Thank you for the explanation. I'm using Postgres as the target. Are config options available to control this behavior? I'm unable to find them in the documentation...
thomas_briggs
I don't see anything in the pipelinewise-target-postgres code that would handle this. I don't think it would be hard to add, though. That aside, I think I would handle this by putting each source schema into its own schema in the destination. Even if you find some way to get all the tables into one schema, that feels like it'll create a lot of confusion. Practically, what's the difference between all tables having a given prefix and all tables being in a separate schema? Not much, I don't think. šŸ¤·šŸ»ā€ā™‚ļø
aaronsteers
> I think I would handle this by putting each source schema into its own schema in the destination
@thomas_briggs - this is what I did in a past life also. We had a SQL Server DB with two schemas we were selecting from and some table name collisions. So instead of landing them all in one job to the same "tap_financedb" schema (for example), we broke it into two jobs landing into "tap_financedb_ops" and "tap_financedb_monitoring".
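In meltano.yml that looks roughly like this, using plugin inheritance (a sketch; the plugin names and schemas are examples, and I'm assuming a default_target_schema setting on the Postgres target):

```yaml
plugins:
  extractors:
  - name: tap-mssql--ops
    inherit_from: tap-mssql
    select:
    - ops-*.*                  # only streams from the "ops" source schema
  - name: tap-mssql--monitoring
    inherit_from: tap-mssql
    select:
    - monitoring-*.*
  loaders:
  - name: target-postgres--ops
    inherit_from: target-postgres
    config:
      default_target_schema: tap_financedb_ops
  - name: target-postgres--monitoring
    inherit_from: target-postgres
    config:
      default_target_schema: tap_financedb_monitoring
```

Then each pair runs as its own job, e.g. `meltano run tap-mssql--ops target-postgres--ops`.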
glenn_de_munck
I guess this will be the way to go. In my case I will have to create more than 70 schemas. šŸ™ƒ What I did in the past with ETL was to create a loop over all the schemas in the source DB. All tables with the same name were merged into one by adding an extra column containing the schema name. Now I will have to do this in dbt after extraction.
aaronsteers
@glenn_de_munck - This šŸ‘† would make a nice feature in meltano-map-transform. You'd just need (1) a way to append a new column with the (full) stream name and (2) a way to add a column to every stream at once, which I don't think is possible today.
If you end up looping through the schemas anyway, you might be able to pass the schema name to the mapper to seed the column's value. And for a small number of streams, you could manually add the new column for each, but this is clearly a use case where it'd be best to add that to all streams, using a "*" stream qualifier or similar.
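For the manual per-stream version, something like this should work today (untested sketch; the stream and schema names are examples):

```yaml
mappings:
- name: add-source-schema
  config:
    stream_maps:
      ops-customers:
        source_schema: "'ops'"          # quoted literal -> constant column value
      monitoring-customers:
        source_schema: "'monitoring'"
```

Each entry seeds the new source_schema column with the schema that stream came from.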