Question on the DBT Parquet Trino setup
# best-practices
e
Question on the DBT Parquet Trino setup Background, I started exploring data lake setups via S3 + Trino .. and now I am wanting to come back to test out DBT... the tap is one I wrote myself.. the target used so far outputs parquet files https://hub.meltano.com/loaders/target-parquet/ .. my question is.. if I add on a DBT transform step.. does it simply push out parquet files.. or what is the "proper" way to integrate DBT with Trino in general? I am currently waiting for Python 3.10 support to hit certain parts of the PyPI ecosystem.. so I cannot test this myself unless I spend time rolling back to 3.8
my guess is.. if I start doing DBT steps in that target.. it just pushes out more parquet files.. which I upload as-is..? Then I query in Trino as normal?
a
RE: spending time rolling back to 3.8: you should install pyenv (google it). It makes having as many Python versions as you want, and switching between them, a couple of keystrokes. On the Trino side it's worth reading more into it. Trino uses connectors such as Hive, Iceberg, Hudi https://trino.io/docs/current/connector.html You should probably give their docs a good read-through before digging into dbt and Trino.
e
Hi Alex, I can utilize Python virtual environments by hand of course.. the issue is not my code but the targets and DBT.. In the Meltano GUI there is no clear way yet to "fall back" when a Tap or Target requires an interpreter that is no longer provided
I will continue reading up on TRINO sure.. I currently have it connected to S3 via Hive Metastore successfully
just trying to get a feel now for the part where i want to create views.. or tables based on the data sourced
a
Nice! Pyenv != virtual environment. It handles multiple interpreters.
I feel you. I like Iceberg!
For example, in Iceberg the setting iceberg.file-format will determine if you're spitting out parquets. But you still need a metastore like Thrift.
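A minimal sketch of creating an Iceberg table that writes Parquet through Trino (the iceberg catalog name, schema, bucket, and columns are placeholders, assuming an Iceberg catalog is configured):

CREATE SCHEMA IF NOT EXISTS iceberg.lake WITH (location = 's3a://my-bucket/lake/');

CREATE TABLE IF NOT EXISTS iceberg.lake.events (
    event_time TIMESTAMP(6),
    user_id    BIGINT,
    payload    VARCHAR
)
WITH (format = 'PARQUET');   -- per-table file format; iceberg.file-format sets the catalog-wide default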
e
hmmm nice, I will look into it (it's been on my list alongside Delta Lake)
The Hive Metastore is set.. I am on 3.x alongside Trino 399
I guess if I adopt Iceberg.. I need to figure out how to get the Meltano Transform step to likewise spit out the needed parquet..?
a
Meltano transform = dbt? In that case the Trino setup that dbt connects to will read and create the files just from you creating dbt models, and it will keep the metastore (schema) up to date.
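For a sense of what that looks like, a minimal dbt model sketch (the model and source table names are placeholders; with dbt-trino, dbt compiles this into a CREATE TABLE ... AS that Trino executes, writing the data files and registering the table in the metastore):

-- models/my_model.sql  (hypothetical model name)
{{ config(materialized='table') }}

select
    class,
    avg(sepal_length) as avg_sepal_length,
    count(*) as n
from hive.analytics.source_table   -- placeholder for a table Trino already knows about
group by class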
e
Meltanos "Transform" step in the GUI triggers DBT as one option.. sorry if it's not clear I am very new to DBT and only used it once via postgresql loader
guess the issue is.. the parquet target I use today.. has no connection whatsoever with Trino.. it just spits out files
so if I am going further.. it might require me to write a custom target which connects to Trino and the Hive Metastore.. in addition to the S3 bucket(s)
thanks for the info.. I will read up a bit more on things.. but it sounds like I should just try somehow to see what happens if I enable DBT on this target.. my guess is it will spit out additional .parquet files and then I simply upload.. but if I want fancy stuff like Apache Iceberg or integration with Trino itself it will mean writing a new target
a
You need to add schemas to your metastore if you want to read parquets that weren't created by Trino.
Looks like this generally:

CREATE SCHEMA IF NOT EXISTS hive.iris WITH (location = 's3a://iris/');

CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
    sepal_length DOUBLE,
    sepal_width DOUBLE,
    petal_length DOUBLE,
    petal_width DOUBLE,
    class VARCHAR
)
WITH (
    external_location = 's3a://iris/iris_parquet',
    format = 'PARQUET'
);
e
correct, I am actually doing this already.. but it is not part of the meltano flow yet.. I just upload the files to S3 and later, "downstream" in Trino, set up that table
a
So in dbt you can use a macro or a pre-hook, or dbt run-operation, to automate that during the transform step.
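A rough sketch of the run-operation route (this just wraps the DDL from above in a macro; the file and macro names are made up):

-- macros/create_external_tables.sql  (hypothetical macro name)
{% macro create_external_tables() %}
    {% do run_query("CREATE SCHEMA IF NOT EXISTS hive.iris WITH (location = 's3a://iris/')") %}
    {% set ddl %}
        CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
            sepal_length DOUBLE,
            sepal_width DOUBLE,
            petal_length DOUBLE,
            petal_width DOUBLE,
            class VARCHAR
        )
        WITH (external_location = 's3a://iris/iris_parquet', format = 'PARQUET')
    {% endset %}
    {% do run_query(ddl) %}
{% endmacro %}

Then dbt run-operation create_external_tables executes it against Trino before the models run.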
e
sounds like I will get on that reading.. thanks for letting me bounce some questions out there
smells like.. I would need to write a target that is Trino aware then 🙂
if it doesn't exist already that is..
a
I think that would be a sweet target and a good way to keep learning the systems. You could just do something like this: meltano run tap-xxx target-parquet dbt-trino:create_schemas dbt-trino:run, where create_schemas is a custom command in meltano.yml that calls dbt run-operation
Or place a hook https://docs.getdbt.com/docs/build/hooks-operations in the right spot and dbt run will do it automatically
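For the hook route, a sketch using pre_hook on a model so the DDL runs right before that model builds (the model name is a placeholder; the same two statements from the earlier example are passed as a list):

-- models/iris_summary.sql  (hypothetical model)
{{ config(
    materialized = 'table',
    pre_hook = [
        "CREATE SCHEMA IF NOT EXISTS hive.iris WITH (location = 's3a://iris/')",
        "CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (sepal_length DOUBLE, sepal_width DOUBLE, petal_length DOUBLE, petal_width DOUBLE, class VARCHAR) WITH (external_location = 's3a://iris/iris_parquet', format = 'PARQUET')"
    ]
) }}

select class, count(*) as n
from hive.iris.iris_parquet
group by class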
e
okay I gave something a quick shot.. and it turns out there is currently NO dbt meltano plugin for Trino, Parquet, etc.. the only ones available are
meltano discover transformers
2022-10-15T17:57:43.084882Z [info     ] Environment 'dev' is active
Transformers
dbt-snowflake
dbt-redshift
dbt-postgres
dbt-duckdb
dbt-bigquery
dbt
and the dbt one I believe they are deprecating
I am quite curious now.. if there could just be a simple dbt-parquet
a
No, but for the record you could use dbt-duckdb to do exactly what you want
e
I guess, in order to get that to work I'd need to switch to target-duckdb though..?
a
No, you can use target-parquet; dbt-duckdb can load parquets very easily and intuitively if it knows where to find the output files.
It’s not the same scale as a data lake. But for learning purposes it works. https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html#next-steps
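A minimal sketch of that pattern (the path and model name are placeholders for wherever target-parquet writes its files; DuckDB's read_parquet scans them directly):

-- models/iris_from_parquet.sql  (hypothetical model in a dbt-duckdb project)
{{ config(materialized='table') }}

select *
from read_parquet('output/tap_name/*.parquet')   -- point this at the target-parquet output directory

That keeps the whole loop local while learning, without needing a metastore or S3 in the way.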
e