# best-practices
j
Question on dbt: We currently do a lot of complex transformation of our data: conditional string manipulations, dependent computations, etc. If we started using Meltano, we would have to specify all that in SQL, right? Is there any way around this? Python seems like a much better tool for this job
t
dbt is SQL-based, so if you plan to use dbt (either within Meltano or not) you'd be choosing SQL as the primary transformation language. You can use Jinja macros to get some Python-like abilities to template the SQL, but the transformations themselves are all SQL.
e
@jens_dahl_mollerhoj I'm curious what you're using in Python for those transformations? Pandas?
j
It's mainly just a bunch of business logic parsing strings. Imagine we have to parse a string like "spend:223.23EUR, impact: 8232.023". We have hundreds of lines of Python code checking all kinds of business logic to return the few columns we need to set for this particular row
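To make the kind of parsing described above concrete, here is a hypothetical minimal sketch in Python (the function name, field names, and regex are illustrative assumptions, not the actual code from the thread):

```python
import re

def parse_metrics(raw: str) -> dict:
    """Hypothetical sketch: pull spend, currency, and impact out of a
    string like 'spend:223.23EUR, impact: 8232.023'."""
    match = re.match(
        r"spend:(?P<spend>\d+(?:\.\d+)?)(?P<currency>[A-Z]{3}),\s*"
        r"impact:\s*(?P<impact>\d+(?:\.\d+)?)",
        raw,
    )
    if match is None:
        raise ValueError(f"unparseable metrics string: {raw!r}")
    return {
        "spend": float(match["spend"]),
        "currency": match["currency"],
        "impact": float(match["impact"]),
    }

print(parse_metrics("spend:223.23EUR, impact: 8232.023"))
```

In real code the "hundreds of lines of business logic" would live in functions like this, which is exactly what is hard to carry over into SQL-only transformations.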
SQL is fine, it's just not really the kind of language I want to build a parser in.
t
@jens_dahl_mollerhoj where is the data when you're doing these transformations? Is it in flight before getting to a warehouse or object store or are you transforming in place after it's moved?
j
@taylor currently we're transforming in flight (ETL), but we're investigating the methodology proposed by tools such as Meltano (ELT). Even if it were viable for us to do our current transformations in SQL, I wouldn't feel confident that we'd be ready for future requirements if we're limited to SQL for transformations.
t
One potential solution to this would be our stream maps feature, which would allow you to transform data on a per-record basis in keeping with the Singer spec. Aggregates would not be allowed, but specifically for parsing data this could be a good solution for you https://gitlab.com/meltano/meltano/-/issues/2300
j
Thanks @taylor! I must admit I found the description of inline stream maps in those docs a little confusing. But as I understand it, this would allow us to use a custom parser before loading the data. That would get us some of the way, but we would lose the advantage of having the source data in the system for debugging etc., not to mention the ability to have the parsing depend on data already present in the system, and to do aggregations 🙂 There's a lot of enthusiasm for dbt everywhere I look, but it seems to me that giving up the ability to use Python for transformations is a pretty big downside
t
You're right that it would allow you to transform prior to loading. It'll be especially useful for anonymization of data or minor cleanup.
As for giving up Python, I'd argue that SQL is the more powerful language if you're loading your data into a relational database. There are of course certain transformations that are easier in Python (pivots come to mind), but overall SQL is an extremely powerful language, and the built-in benefits of a tool like dbt (lineage, testing, documentation, templating) outweigh the loss of pure Python.
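For what it's worth, even the spend/impact parsing quoted earlier is doable with plain SQL string functions. A minimal sketch, assuming SQLite's dialect (`instr`, `substr`, `trim`; warehouse dialects differ but have equivalents), run here through Python's stdlib `sqlite3` so it can be executed as-is:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.execute(
    "INSERT INTO raw_events VALUES ('spend:223.23EUR, impact: 8232.023')"
)

# Slice out spend and impact with nothing but standard string functions.
row = conn.execute("""
    SELECT
        CAST(substr(payload,
                    instr(payload, 'spend:') + 6,
                    instr(payload, 'EUR') - instr(payload, 'spend:') - 6)
             AS REAL) AS spend,
        CAST(trim(substr(payload, instr(payload, 'impact:') + 7))
             AS REAL) AS impact
    FROM raw_events
""").fetchone()
print(row)
```

It works, though the index arithmetic hints at why building a large parser this way gets painful; in dbt one would typically hide it behind a Jinja macro.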
j
As far as I can tell, the internet certainly seems to agree with that sentiment. I guess I'll have to give it a go to figure out whether I agree. I would argue that relational databases and declarative languages are the powerful ideas here: SQL seems powerful because of them, but SQL itself is a bizarre, COBOL-like language from the 70s 😂
t
I would agree that SQL is powerful because of relational algebra, yes 😄 It's not always the best tool for the job, but I think the renaissance of SQL is happening because people are finding that it's really good at what it does, even with the warts of the implementation language. The added flexibility of a general-purpose programming language like Python, specifically for data transformation, isn't worth the loss of all the beauty of relational algebra and relational structures. But I'm a SQL stan, so I'm a bit biased 😅
k
Two things I would add: certain warehouses are better than others in this regard. Lots of SQL extensions now exist for things like JSON extraction and complex string parsing with tools like regex that aren't included in the ANSI standard. Also, some warehouses (like Redshift) support Python user-defined functions (UDFs) for when you really, really can't do a particular transformation using successive manipulations in SQL 🙂 Probably best used sparingly, though.