Question on the DBT Parquet Trino setup
# best-practices
e
Question on the DBT Parquet Trino setup Background, I started exploring data lake setups via S3 + Trino .. and now I am wanting to come back to test out DBT... the tap is one I wrote myself.. the target used so far outputs parquet files https://hub.meltano.com/loaders/target-parquet/ .. my question is.. if I add on a DBT transform step.. does it simply push out parquet files.. or what is the "proper" way to integrate DBT with Trino in general? I am currently waiting for Python 3.10 support to hit certain parts of the PyPI ecosystem.. so I cannot test this myself unless I spend time rolling back to 3.8
my guess is.. if I start doing DBT steps in that target.. it just pushes out more parquet files.. which I upload as-is..? Then I query in Trino as normal?
a
RE: spending time rolling back to 3.8: you should install pyenv (google it). It makes having as many Python versions as you want, and switching between them, a couple of keystrokes. On the Trino side it's worth reading more into it. Trino uses connectors such as Hive, Iceberg, Hudi https://trino.io/docs/current/connector.html You should probably give their docs a good read-through before digging into dbt and Trino.
e
Hi Alex, I can utilize Python virtual environments by hand of course.. the issue is not my code but the targets and DBT.. In the Meltano GUI there is no clear way yet to "fall back" when a Tap or Target requires an interpreter that is no longer provided
I will continue reading up on TRINO sure.. I currently have it connected to S3 via Hive Metastore successfully
just trying to get a feel now for the part where i want to create views.. or tables based on the data sourced
a
Nice! Pyenv != virtual environment. It handles multiple interpreters.
I feel you. I like Iceberg!
For example, in Iceberg the setting iceberg.file-format will determine if you're spitting out parquets. But you still need a metastore like Thrift.
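A minimal sketch of creating an Iceberg table that writes Parquet through Trino (the iceberg catalog name, schema, bucket, and columns are placeholders, assuming an Iceberg catalog is configured):

CREATE SCHEMA IF NOT EXISTS iceberg.lake WITH (location = 's3a://my-bucket/lake/');

CREATE TABLE IF NOT EXISTS iceberg.lake.events (
    event_time TIMESTAMP(6),
    user_id    BIGINT,
    payload    VARCHAR
)
WITH (format = 'PARQUET');   -- per-table file format; iceberg.file-format sets the catalog-wide default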
e
hmmm nice, I will look into it (it's been on my list alongside Delta Lake)
The Hive Metastore is set.. I am on 3.x alongside Trino 399
I guess if I adopt Iceberg.. I need to figure out how to get the Meltano Transform step to likewise spit out the needed parquet..?
a
Meltano transform = dbt? In that case the Trino setup that dbt connects to will read and create the files just from you creating dbt models, and it will keep the metastore (schema) up to date.
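For a sense of what that looks like, a minimal dbt model sketch (the model and source table names are placeholders; with dbt-trino, dbt compiles this into a CREATE TABLE ... AS that Trino executes, writing the data files and registering the table in the metastore):

-- models/my_model.sql  (hypothetical model name)
{{ config(materialized='table') }}

select
    class,
    avg(sepal_length) as avg_sepal_length,
    count(*) as n
from hive.analytics.source_table   -- placeholder for a table Trino already knows about
group by class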
e
Meltanos "Transform" step in the GUI triggers DBT as one option.. sorry if it's not clear I am very new to DBT and only used it once via postgresql loader
guess the issue is.. the parquet target I use today.. has no connection whatsoever with Trino.. it just spits out files
so if I am going further.. it might require me to write a custom target which connects to Trino and the Hive Metastore.. in addition to the S3 bucket(s)
thanks for the info.. I will read up a bit more on things.. but it sounds like I should just try somehow to see what happens if I enable DBT on this target.. my guess is it will spit out additional .parquet files and then I simply upload.. but if I want fancy stuff like Apache Iceberg or integration with Trino itself it will mean writing a new target
a
You need to add schemas to your metastore if you want to read parquets that weren't created by Trino.
Looks like this generally:

CREATE SCHEMA IF NOT EXISTS hive.iris WITH (location = 's3a://iris/');

CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
    sepal_length DOUBLE,
    sepal_width DOUBLE,
    petal_length DOUBLE,
    petal_width DOUBLE,
    class VARCHAR
)
WITH (
    external_location = 's3a://iris/iris_parquet',
    format = 'PARQUET'
);
e
correct, I am actually doing this already.. but it is not part of the meltano flow yet.. I just upload the files to S3 and later, "downstream" in Trino, set up that table
a
So in dbt you can use a macro or a pre-hook, or dbt run-operation, to automate that during the transform step.
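A rough sketch of the run-operation route (this just wraps the DDL from above in a macro; the file and macro names are made up):

-- macros/create_external_tables.sql  (hypothetical macro name)
{% macro create_external_tables() %}
    {% do run_query("CREATE SCHEMA IF NOT EXISTS hive.iris WITH (location = 's3a://iris/')") %}
    {% set ddl %}
        CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
            sepal_length DOUBLE,
            sepal_width DOUBLE,
            petal_length DOUBLE,
            petal_width DOUBLE,
            class VARCHAR
        )
        WITH (external_location = 's3a://iris/iris_parquet', format = 'PARQUET')
    {% endset %}
    {% do run_query(ddl) %}
{% endmacro %}

Then dbt run-operation create_external_tables executes it against Trino before the models run.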
e
sounds like I will get on that reading.. thanks for letting me bounce some questions out there
smells like.. I would need to write a target that is Trino aware then 🙂
if it doesn't exist already that is..
a
I think that would be a sweet target and a good way to keep learning the systems. You could just do something like this: meltano run tap-xxx target-parquet dbt-trino:create_schemas dbt-trino:run, where create_schemas is a custom command in meltano.yml that calls dbt run-operation
Or place a hook https://docs.getdbt.com/docs/build/hooks-operations in the right spot and dbt run will do it automatically
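For the hook route, a sketch using pre_hook on a model so the DDL runs right before that model builds (the model name is a placeholder; the same two statements from the earlier example are passed as a list):

-- models/iris_summary.sql  (hypothetical model)
{{ config(
    materialized = 'table',
    pre_hook = [
        "CREATE SCHEMA IF NOT EXISTS hive.iris WITH (location = 's3a://iris/')",
        "CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (sepal_length DOUBLE, sepal_width DOUBLE, petal_length DOUBLE, petal_width DOUBLE, class VARCHAR) WITH (external_location = 's3a://iris/iris_parquet', format = 'PARQUET')"
    ]
) }}

select class, count(*) as n
from hive.iris.iris_parquet
group by class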
e
okay I gave something a quick shot.. and it turns out there is currently NO dbt meltano plugin for Trino, Parquet, etc.. the only ones available are
meltano discover transformers
2022-10-15T17:57:43.084882Z [info     ] Environment 'dev' is active
Transformers
dbt-snowflake
dbt-redshift
dbt-postgres
dbt-duckdb
dbt-bigquery
dbt
and the dbt one I believe they are deprecating
I am quite curious now.. if there could just be a simple dbt-parquet
a
No, but for the record you could use dbt-duckdb to do exactly what you want
e
I guess, in order to get that to work I'd need to switch to target-duckdb though..?
a
No, you can use target-parquet; dbt-duckdb can load parquets very easily and intuitively if it knows where to find the output files.
It’s not the same scale as a data lake. But for learning purposes it works. https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html#next-steps
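A minimal sketch of that pattern (the path and model name are placeholders for wherever target-parquet writes its files; DuckDB's read_parquet scans them directly):

-- models/iris_from_parquet.sql  (hypothetical model in a dbt-duckdb project)
{{ config(materialized='table') }}

select *
from read_parquet('output/tap_name/*.parquet')   -- point this at the target-parquet output directory

That keeps the whole loop local while learning, without needing a metastore or S3 in the way.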
e