# random
So here’s a philosophical question: certain data sources (e.g., Salesforce) return streams with fields that are encoded as `CamelCase`. Presumably those field names are passed through as-is from tap to target, and are likely flattened to `camelcase` depending on the target engine, consequently losing the information necessary to infer word separation. Ignoring double quoting for the time being, the simplest solution in these cases would be to add support for automatic `CamelCase` -> `snake_case` conversion. But whose responsibility should that be? Tap? Target?
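For concreteness, here's a minimal, purely illustrative sketch of the kind of conversion being discussed (the function name is made up and not part of any tap or target):

```python
import re

def naive_snake_case(name: str) -> str:
    """Insert an underscore before every upper-case letter (except the first),
    then lower-case the result - the simplest possible CamelCase -> snake_case rule."""
    return re.sub(r"(?<!^)([A-Z])", r"_\1", name).lower()

print(naive_snake_case("AccountNumber"))  # account_number
print(naive_snake_case("BillingCity"))    # billing_city
print(naive_snake_case("MerchantID"))     # merchant_i_d  <- runs of capitals get mangled
```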
Hi, @andrew_stewart. I love this question and am happy to offer my two cents. As a design philosophy, a past team I was on had a practice of keeping "raw" tables (the output/target of the Singer "EL" pipeline) in whatever casing and escaping was the most natural convergence of the source system and the Target landing space. In our case, with Snowflake as the target, all casing was converted to upper case and CamelCase word breaks were lost. Because there was zero configuration, this was a very stable process, but it wasn't great for conformance or readability.

With the above describing the EL part of the pipeline, the first step in the T part was always to create what we called a "staging" layer in dbt. For this layer, we intentionally "outlawed" any heavy lifting and just set about matching the naming convention, data types, and sometimes dedupe logic, which gave all views/tables a shared language and set of best practices as their foundation. More complicated transformations would then use the staging layer as their first point of access, with our design practices disallowing direct access to the raw data.
I think it's important to keep the EL layer as simple as possible, but in the future I do think it would be interesting to come up with a mapping algorithm in Singer targets which tried to bridge the naming-convention chasm between source and target automatically - for instance, injecting underscores where word breaks are detected. It's probably a lot of work, and importantly we'd want it to be very stable, with a shared codebase across targets.
In practice, I believe this may require a list of known/expected acronyms ("EC2Box", for example, should be ec2_box and not e_c2_box or e_c_2_box) and other "special words" which don't follow patterns detectable by a naive parser. I built something like this in Power Query/M to get snake_case database column names into "Proper Case" reporting-friendly names in Power BI. It was not easy, but it is doable in theory. The rough thing here is that every time you "improve" the logic you end up breaking something - so it almost has to be perfect in v1.
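As a rough sketch of that idea (the acronym list and function name are illustrative only), protecting known acronyms before applying a generic word-break rule might look something like:

```python
import re

# Hypothetical list of "special words"/acronyms that a naive parser can't detect on its own.
KNOWN_ACRONYMS = ["EC2", "ID", "URL", "API"]

def snake_case_with_acronyms(name: str) -> str:
    """CamelCase -> snake_case, treating known acronyms as single words."""
    # Protect known acronyms by title-casing them (EC2 -> Ec2, ID -> Id) so the
    # generic word-break rule below sees each one as a single word.
    # (A plain str.replace can over-match inside other words, which is exactly
    # the "every improvement breaks something" problem described above.)
    for acronym in sorted(KNOWN_ACRONYMS, key=len, reverse=True):
        name = name.replace(acronym, acronym.capitalize())
    # Insert an underscore at each lower/digit -> upper boundary, then lower-case.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

print(snake_case_with_acronyms("EC2Box"))      # ec2_box
print(snake_case_with_acronyms("MerchantID"))  # merchant_id
```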
oh I've encountered those `MerchantID` => `merchant_i_d` edge cases
Completely agree with sticking to E->L->T
and so I suppose the only other real solution here is the double quoting
well, I should say the only real E->L-transit-based solution
maybe
Double quoting gets hard to deal with pretty quickly in some systems. (Especially Snowflake, which is horrible for it imho.) I think that's why most pipelines avoid it. I wonder though if the first layer of the transform pipeline could be at least semi-automated with something like this in a dbt macro: https://github.com/blakeembrey/change-case (see the sketch after this message)
But yes, double quoting is the only thing I can think of that could be stable in the EL layer.
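Not a real dbt macro, but here's a rough Python sketch of the kind of SQL such a macro might generate for a staging model (the table and column names are invented, and it assumes the raw columns landed with their original casing, e.g. via quoted identifiers):

```python
import re

def to_snake_case(name: str) -> str:
    """Same naive word-break rule as above; the acronym caveats still apply."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def staging_select(table: str, columns: list[str]) -> str:
    """Emit a SELECT that renames raw CamelCase columns to snake_case aliases."""
    aliases = ",\n    ".join(f'"{col}" as {to_snake_case(col)}' for col in columns)
    return f"select\n    {aliases}\nfrom {table}"

print(staging_select("raw.salesforce_account", ["AccountNumber", "BillingCity"]))
# select
#     "AccountNumber" as account_number,
#     "BillingCity" as billing_city
# from raw.salesforce_account
```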
In some sense, that loss of information due to case-flattening from source field to target field is itself a form of (unintended) transformation.
I wonder if there could be a role for the SDK in propagating a common mechanism.
Yeah, I could definitely see this being an extension of Stream Maps in the SDK.
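For illustration only (this is not the actual Stream Maps API), a renaming hook at the SDK level could in principle rewrite both SCHEMA and RECORD messages before they reach the target, along these lines:

```python
import re

def to_snake_case(name: str) -> str:
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def rename_fields(message: dict) -> dict:
    """Apply snake_case renaming to a Singer SCHEMA or RECORD message (sketch only)."""
    if message.get("type") == "SCHEMA":
        props = message["schema"].get("properties", {})
        message["schema"]["properties"] = {to_snake_case(k): v for k, v in props.items()}
        message["key_properties"] = [to_snake_case(k) for k in message.get("key_properties", [])]
    elif message.get("type") == "RECORD":
        message["record"] = {to_snake_case(k): v for k, v in message["record"].items()}
    return message

record = {"type": "RECORD", "stream": "accounts",
          "record": {"AccountNumber": "42", "MerchantID": "A1"}}
print(rename_fields(record)["record"])  # {'account_number': '42', 'merchant_id': 'A1'}
```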