# random
So here’s a philosophical question: certain data sources (e.g., Salesforce) return streams with fields that are encoded as `CamelCase`. Presumably those field names are passed through as-is from tap to target, and are likely flattened to `camelcase` depending on the target engine, consequently losing the information necessary to infer word separation. Ignoring double quoting for the time being, the simplest solution in these cases would be to add support for automatic `CamelCase` -> `snake_case` conversion. But whose responsibility should that be? Tap? Target?
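For concreteness, here's a minimal, purely illustrative sketch of the kind of conversion being discussed (the function name is made up and not part of any tap or target):

```python
import re

def naive_snake_case(name: str) -> str:
    """Insert an underscore before every upper-case letter (except the first),
    then lower-case the result - the simplest possible CamelCase -> snake_case rule."""
    return re.sub(r"(?<!^)([A-Z])", r"_\1", name).lower()

print(naive_snake_case("AccountNumber"))  # account_number
print(naive_snake_case("BillingCity"))    # billing_city
print(naive_snake_case("MerchantID"))     # merchant_i_d  <- runs of capitals get mangled
```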
Hi, @andrew_stewart. I love this question and am happy to offer my two cents. As a design philosophy, a past team I was on had a practice of keeping "raw" tables (the output/target of the Singer "EL" pipeline) in whatever casing and escaping was the most natural convergence of the source system and the Target landing space. In our case, with Snowflake as the target, all casing was converted to upper case and CamelCase word breaks were lost. Because there was zero configuration, this was a very stable process, but it wasn't great for conformance or readability.

With the above describing the EL part of the pipeline, the first step in the T part was always to create what we called a "staging" layer in dbt. For this layer, we intentionally "outlawed" any heavy lifting and just set about matching the naming convention, data types, and sometimes dedupe logic, which gave all views/tables a shared language and set of best practices as their foundation. More complicated transformations would then use the staging layer as their first point of access, with our design practices disallowing direct access to the raw data.
I think it's important to keep the EL layer as simple as possible, but in the future I do think it would be interesting to come up with a mapping algorithm in Singer targets which tried to bridge the naming-convention chasm between source and target automatically - for instance, injecting underscores where word breaks are detected. It's probably a lot of work, and importantly we'd want it to be very stable, with a shared codebase across targets.
In practice, I believe this may require a list of known/expected acronyms ("EC2Box", for example, should be ec2_box and not e_c2_box or e_c_2_box) and other "special words" which don't follow patterns detectable by a naive parser. I built something like this in Power Query/M to get snake_case database column names into "Proper Case" reporting-friendly names in Power BI. It was not easy, but it is doable in theory. The rough thing here is that every time you "improve" the logic you end up breaking something - so it almost has to be perfect in v1.
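As a rough sketch of that idea (the acronym list and function name are illustrative only), protecting known acronyms before applying a generic word-break rule might look something like:

```python
import re

# Hypothetical list of "special words"/acronyms that a naive parser can't detect on its own.
KNOWN_ACRONYMS = ["EC2", "ID", "URL", "API"]

def snake_case_with_acronyms(name: str) -> str:
    """CamelCase -> snake_case, treating known acronyms as single words."""
    # Protect known acronyms by title-casing them (EC2 -> Ec2, ID -> Id) so the
    # generic word-break rule below sees each one as a single word.
    # (A plain str.replace can over-match inside other words, which is exactly
    # the "every improvement breaks something" problem described above.)
    for acronym in sorted(KNOWN_ACRONYMS, key=len, reverse=True):
        name = name.replace(acronym, acronym.capitalize())
    # Insert an underscore at each lower/digit -> upper boundary, then lower-case.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

print(snake_case_with_acronyms("EC2Box"))      # ec2_box
print(snake_case_with_acronyms("MerchantID"))  # merchant_id
```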
oh I've encountered those `MerchantID` => `merchant_i_d` edge cases
Completely agree with sticking to E->L->T
and so I suppose the only other real solution here is the double quoting
well, I should say the only real E->L-transit-based solution
maybe
Double quoting gets hard to deal with pretty quickly in some systems. (Especially Snowflake, which is horrible for it imho.) I think that's why most pipelines avoid it. I wonder though if the first layer of the transform pipeline could be at least semi-automated with something like this in a dbt macro: https://github.com/blakeembrey/change-case (see the sketch after this message)
But yes, double quoting is the only thing I can think of that could be stable in the EL layer.
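Not a real dbt macro, but here's a rough Python sketch of the kind of SQL such a macro might generate for a staging model (the table and column names are invented, and it assumes the raw columns landed with their original casing, e.g. via quoted identifiers):

```python
import re

def to_snake_case(name: str) -> str:
    """Same naive word-break rule as above; the acronym caveats still apply."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def staging_select(table: str, columns: list[str]) -> str:
    """Emit a SELECT that renames raw CamelCase columns to snake_case aliases."""
    aliases = ",\n    ".join(f'"{col}" as {to_snake_case(col)}' for col in columns)
    return f"select\n    {aliases}\nfrom {table}"

print(staging_select("raw.salesforce_account", ["AccountNumber", "BillingCity"]))
# select
#     "AccountNumber" as account_number,
#     "BillingCity" as billing_city
# from raw.salesforce_account
```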
In some sense, that loss of information due to case-flattening from source field to target field is itself a form of (unintended) transformation.
I wonder if there could be a role for the SDK in propagating a common mechanism.
Yeah, I could definitely see this being an extension of Stream Maps in the SDK.
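For illustration only (this is not the actual Stream Maps API), a renaming hook at the SDK level could in principle rewrite both SCHEMA and RECORD messages before they reach the target, along these lines:

```python
import re

def to_snake_case(name: str) -> str:
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def rename_fields(message: dict) -> dict:
    """Apply snake_case renaming to a Singer SCHEMA or RECORD message (sketch only)."""
    if message.get("type") == "SCHEMA":
        props = message["schema"].get("properties", {})
        message["schema"]["properties"] = {to_snake_case(k): v for k, v in props.items()}
        message["key_properties"] = [to_snake_case(k) for k in message.get("key_properties", [])]
    elif message.get("type") == "RECORD":
        message["record"] = {to_snake_case(k): v for k, v in message["record"].items()}
    return message

record = {"type": "RECORD", "stream": "accounts",
          "record": {"AccountNumber": "42", "MerchantID": "A1"}}
print(rename_fields(record)["record"])  # {'account_number': '42', 'merchant_id': 'A1'}
```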