# singer-tap-development
I would love to add some transformation to a current tap. Is that best practice, or something a tap should have?
Hi, @albert_m. Can you say more about your use case?
Also, out of curiosity, which tap(s) are you looking at?
Have you seen the stream maps docs yet? This comes out of the box in SDK-based taps and targets, and we're working on an in-between mapper which can perform the same operations inline with non-SDK taps and targets.
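For context, stream maps are configured declaratively, typically in `meltano.yml`. A minimal hypothetical sketch (the tap, stream, and property names here are invented for illustration, not taken from the thread):

```yaml
# Hypothetical meltano.yml fragment showing the stream maps feature.
plugins:
  extractors:
    - name: tap-typeform
      config:
        stream_maps:
          responses:                # stream name (invented)
            token: __NULL__        # drop a property from the stream
            full_name: first_name + ' ' + last_name  # derive a new property
```

Consult the SDK's stream maps documentation for the exact expression syntax supported.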
I was thinking of the AWS Cost Explorer or Typeform taps. They have some data that comes through as strings, but the values are integers from the API. I would like the tap to convert them.
@aaronsteers I'm just trying to understand when is the best time to use transformations and learn how to use transformations with the SDK.
I see. Thanks, @albert_m. The word "transformations" is highly overloaded, so I just wanted to make sure I give you the right answer. As the owner of a tap, there are basically two approaches: model with as light a touch as possible (echoing results from the API with very few changes), or model in whatever way best respects the nature of the data for downstream usage. These two are not mutually exclusive, but they are often in tension. In cases where you want to make some amount of changes, the best and easiest way in the SDK is to simply modify the record in `Stream.post_process()` and make the corresponding changes also in your declared `Stream.schema`.
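A minimal sketch of what that record-level cleanup can look like. In a real tap this logic would live in a `post_process()` override on your `Stream` subclass; it is shown here as a standalone function so the transformation itself is easy to see, and the field names (`amount`, `unit`) are hypothetical:

```python
def post_process(record: dict, context=None) -> dict:
    """Cast numeric fields that the API delivers as strings."""
    if record.get("amount") is not None:
        record["amount"] = int(record["amount"])  # e.g. "42" becomes 42
    return record


# The declared schema must agree with the cast type:
SCHEMA = {
    "type": "object",
    "properties": {
        "amount": {"type": ["integer", "null"]},  # was "string" from the API
        "unit": {"type": ["string", "null"]},
    },
}
```

The key point is that the record change and the schema change travel together: if `post_process()` emits an integer while the schema still says string, downstream targets may reject or mistype the data.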
If users want to transform the stream further, that's where the stream maps feature comes in.
I think most developers start out echoing the results almost directly from the source API, because this gives the fastest final product, it's the easiest to maintain, it doesn't require deep knowledge of the API, and it doesn't require many hard choices. That said, `post_process()` will let you clean up the data, remove redundancies, extract IDs from URLs, cast data types, etc. Another important use is to promote primary key and incremental replication key properties to the top level if they are not there already.
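The two cleanups mentioned above, extracting an ID from a URL and promoting a nested replication key to the top level, can be sketched like this. The field names and URL shape are invented for illustration:

```python
def post_process(record: dict, context=None) -> dict:
    """Illustrative cleanup: derive a primary key and surface a replication key."""
    # Extract a usable primary key from a URL like ".../items/123/"
    if "self_url" in record:
        record["id"] = record["self_url"].rstrip("/").rsplit("/", 1)[-1]

    # Promote a nested timestamp so it can serve as the replication key
    meta = record.get("meta") or {}
    if "updated_at" in meta:
        record["updated_at"] = meta["updated_at"]

    return record
```

As before, any new top-level properties (`id`, `updated_at`) would also need to be declared in the stream's schema.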
Does this help to answer your question?
@aaronsteers thanks for that. I needed to read that and see how the community was doing things. I do take the approach of echoing the API results with little change. I was reading about how Meltano does things with dbt, so I was curious to know whether that is part of tap development or not.
Great! Glad it's helpful info. Regarding dbt, yes, that's what we normally refer to when we talk about transformations or the "transformation layer" of ELT. That step is completely decoupled from the tap itself, and it takes as its input all of the output data produced by one or more taps. The reason these layers are separate is that taps generate streams, with no memory or cache of the entire dataset. This is necessary for them to scale and perform predictably. But once the full data is landed in your target (Snowflake, Redshift, etc.), then any number of transformations can be performed on those datasets. Also, just to be clear, dbt never modifies any of the original data - it only builds new datasets derived from those landed datasets.
Makes sense. Thanks for the answer.