# getting-started
w
Hi there, I'm new to Meltano and dbt, but I've already enjoyed participating in this community! Quick question about the `transform` step that I can customize with dbt. (I know it's not strictly a Meltano question, but I figure someone here might be able to advise me.) I've already extracted documents from MongoDB and loaded them into my Postgres data warehouse. I'm looking to create a staging model in dbt that flattens the schema of the documents that have been loaded into those tables. That said, each document has an array of `transactions` in it, in addition to document-level information. I'm wondering if anyone knows any best practices for dealing with data of that kind. Should I create a separate staging model for the transactions, or should I keep it all as a single staging model?
c
I typically keep data of that shape in a single staging model and just unnest the array into multiple rows.
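For example, assuming the documents land in Postgres as a table with a jsonb column (the source, table, and column names below are all made up), the staging model could look roughly like this:
```sql
-- models/staging/stg_documents__transactions.sql
-- Sketch only: unnests the transactions array so there is one row per transaction,
-- with document-level fields repeated on every row.
select
    d.id                            as document_id,
    d.data ->> 'status'             as document_status,   -- example document-level field
    t.value ->> 'transaction_id'    as transaction_id,
    (t.value ->> 'amount')::numeric as transaction_amount
from {{ source('mongo', 'documents') }} as d
cross join lateral
    jsonb_array_elements(d.data -> 'transactions') as t(value)
```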
w
So you would un-nest the array in that staging model?
c
Most of the time, yes. You can find lots of opinions about normalised vs. "de-normalised" design online (most notably Derek Asirvadem, aka "Performance DBA": https://stackoverflow.com/a/4731664/6056177). Generally, though, you will get a higher-performing database with a normalised form.
j
Interesting and controversial Stack Overflow post! In my experience, denormalization can bring significantly better read performance if you use columnar, clustered MPP databases like Snowflake, BigQuery, Redshift, or Vertica. JOINs in clustered databases are very expensive (they require data re-shuffling). They can be optimized, e.g. in Vertica by projection design or in Snowflake by cluster keys, but querying a denormalized table can still be much faster. Consider this use case:
```sql
SELECT SUM(fact), attribute FROM fact_table GROUP BY attribute
```
Assume that the cardinality of the attribute is 2 (boolean). Columnar databases can encode/compress such a column extremely efficiently, so the query finishes with sub-second latency even on top of billions of rows. Now imagine you had to run the query like this instead:
```sql
SELECT SUM(fact), dimX.attribute
FROM fact_table JOIN dim1 ..... JOIN dimX
GROUP BY dimX.attribute
```
This will always be much slower in these clustered databases.
w
@christoph @jan_soubusta My issue is that the documents I'm trying to create a staging model for are wildly nested. Each document represents a series of messages between two parties. Each message consists of information about potentially many containers, with potentially many customers responsible for those containers; and each container could have many different items inside, many pertinent dates, etc. I'm finding it difficult to define a single staging model for this, since I feel like I'm just creating a very, very wide table and doing a lot of unpacking. I know that staging is generally supposed to involve only light transformations, but I feel like I might have to put a fair amount of business logic in there to limit what we care about...
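One pattern for data of that shape is to give each level of nesting its own staging model at its own grain, so no single model has to become extremely wide. Below is a rough sketch of one such model, with every table, column, and key name invented for illustration, assuming the raw documents sit in a jsonb column in Postgres:
```sql
-- models/staging/stg_documents__containers.sql
-- Hypothetical sketch: one staging model per level of nesting, each at its own grain.
-- Assumes documents shaped roughly like
--   { "messages": [ { "containers": [ { "container_id": ..., "customer_id": ... } ] } ] }
-- stored in a jsonb column called `data`; all names below are invented.
with messages as (

    select
        d.id          as document_id,
        m.ordinality  as message_index,
        m.value       as message
    from {{ source('mongo', 'documents') }} as d
    cross join lateral
        jsonb_array_elements(d.data -> 'messages') with ordinality as m(value, ordinality)

)

select
    document_id,
    message_index,
    c.value ->> 'container_id' as container_id,
    c.value ->> 'customer_id'  as customer_id
from messages
cross join lateral
    jsonb_array_elements(message -> 'containers') as c(value)
```
Items, dates, and customers could each get a similar model at their own grain, which keeps the staging layer to light unnesting and renaming and leaves the heavier business logic for downstream models.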