# singer-tap-development
I would love to add some transformation to a current tap. Is that best practice, or something a tap should have?
Hi, @albert_m. Can you say more about your use case?
Also, out of curiosity, which tap(s) are you looking at?
Have you seen the stream maps docs yet? This comes out of the box in SDK-based taps and targets, and we're working on an in-between mapper which can perform the same operations inline with non-SDK taps and targets.
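For context, stream maps are configured declaratively, typically in `meltano.yml`. A minimal hypothetical sketch (the tap, stream, and property names here are invented for illustration, not taken from the thread):

```yaml
# Hypothetical meltano.yml fragment showing the stream maps feature.
plugins:
  extractors:
    - name: tap-typeform
      config:
        stream_maps:
          responses:                # stream name (invented)
            token: __NULL__        # drop a property from the stream
            full_name: first_name + ' ' + last_name  # derive a new property
```

Consult the SDK's stream maps documentation for the exact expression syntax supported.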
I was thinking of the AWS Cost Explorer or Typeform taps. They have some data that comes through as strings, but the values are integers from the API. I would like the tap to convert them.
@aaronsteers I'm just trying to understand when is the best time to use transformations and learn how to use transformations with the SDK.
I see. Thanks, @albert_m. The word "transformations" is highly overloaded, so I just wanted to make sure I give you the right answer. As the owner of a tap, there are basically two approaches: model with as light a touch as possible (echoing results from the API with very few changes), or model in whatever way best respects the nature of the data for downstream usage. These two are not mutually exclusive, but they are often in tension. In cases where you want to make some amount of changes, the best and easiest way in the SDK is to simply modify the record in `Stream.post_process()` and make the corresponding changes also in your declared `Stream.schema`.
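A minimal sketch of what that record-level cleanup can look like. In a real tap this logic would live in a `post_process()` override on your `Stream` subclass; it is shown here as a standalone function so the transformation itself is easy to see, and the field names (`amount`, `unit`) are hypothetical:

```python
def post_process(record: dict, context=None) -> dict:
    """Cast numeric fields that the API delivers as strings."""
    if record.get("amount") is not None:
        record["amount"] = int(record["amount"])  # e.g. "42" becomes 42
    return record


# The declared schema must agree with the cast type:
SCHEMA = {
    "type": "object",
    "properties": {
        "amount": {"type": ["integer", "null"]},  # was "string" from the API
        "unit": {"type": ["string", "null"]},
    },
}
```

The key point is that the record change and the schema change travel together: if `post_process()` emits an integer while the schema still says string, downstream targets may reject or mistype the data.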
If users want to transform the stream further, that's where the stream maps feature comes in.
I think most developers start out echoing the results almost directly from the source API, because this gives the fastest final product, it's the easiest to maintain, it doesn't require deep knowledge of the API, and it doesn't require many hard choices. That said, `post_process()` will let you clean up the data, remove redundancies, extract IDs from URLs, cast data types, etc. Another important use is to promote primary key and incremental replication key properties to the top level if they are not there already.
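The two cleanups mentioned above, extracting an ID from a URL and promoting a nested replication key to the top level, can be sketched like this. The field names and URL shape are invented for illustration:

```python
def post_process(record: dict, context=None) -> dict:
    """Illustrative cleanup: derive a primary key and surface a replication key."""
    # Extract a usable primary key from a URL like ".../items/123/"
    if "self_url" in record:
        record["id"] = record["self_url"].rstrip("/").rsplit("/", 1)[-1]

    # Promote a nested timestamp so it can serve as the replication key
    meta = record.get("meta") or {}
    if "updated_at" in meta:
        record["updated_at"] = meta["updated_at"]

    return record
```

As before, any new top-level properties (`id`, `updated_at`) would also need to be declared in the stream's schema.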
Does this help to answer your question?
@aaronsteers thanks for that. I needed to read that and see how the community was doing things. I do take the approach of echoing the API results with little change. I was reading about how Meltano does things with dbt, so I was curious to know whether that is part of tap development or not.
Great! Glad it's helpful info. Regarding dbt, yes, that's what we normally refer to when we talk about transformations or the "transformation layer" of ELT. That step is completely decoupled from the tap itself, and it takes as its input all of the output data produced by one or more taps. The reason these layers are separate is that taps generate streams, with no memory or cache of the entire dataset. This is necessary for them to scale and perform predictably. But once the full data is landed in your target (Snowflake, Redshift, etc.), then any number of transformations can be performed on those datasets. Also, just to be clear, dbt never modifies any of the original data - it only builds new datasets derived from those landed datasets.
Makes sense. Thanks for the answer.