# troubleshooting
m
I’d like to split the stream of records produced by a tap into individual streams based on a property on the records. I believe this is possible (using `__source__` and/or `__filter__`, not sure exactly how yet). After splitting into individual streams, I’d like to write the records to Snowflake using the MeltanoLabs variant of the target-snowflake loader. But I’d like to write the records to multiple schemas, and I’m not sure if this is possible: the `schema` (and `default_schema`) property gets set as part of the target configuration, and it doesn’t seem like there’s a way to configure it so that records with `source_name=service_a` get written to schema `service_a` while records with `source_name=service_b` get written to schema `service_b`. Is there any way to accomplish this kind of behavior?
Apparently this sort of thing is possible with the `schema_mapping` config property in the transferwise variant 🤔 But we’re already using the MeltanoLabs variant in production and I don’t want to switch.
e
`schema_mapping` isn't yet a built-in SDK setting, but target-snowflake supports creating schemas based on the `<schema>-<table>` stream name pattern. Maybe that's enough for a workaround?
m
Ooh, yeah, thank you - that should work fine
Although this maybe makes the stream splitting the big unknown step then… Can I use an inline stream map to dynamically split a stream? A record in the stream will have a property `namespace`, which is a key-value dict: `"namespace": {"database": "customer_service", "collection": "Customer"}`. I would like to be able to split the single stream into a stream for each database and collection (so this example would be split into a new `customer_service-customer` stream) without having to hardcode a list of all possible databases and collections (as new ones will be created and I want them to be picked up automatically).
I am now thinking that defining a standalone mapper plugin that can set stream_id is maybe the best way to do this
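For instance, the name-derivation logic such a mapper would need could be a small helper along these lines (a hypothetical sketch, not part of the SDK or any existing mapper; the function name and lowercasing convention are assumptions):

```python
def stream_name_for(record: dict) -> str:
    """Derive a `<schema>-<table>`-style stream name from the record's
    `namespace` dict, so new databases and collections are picked up
    automatically without hardcoding a list of them."""
    ns = record["namespace"]
    return f'{ns["database"]}-{ns["collection"].lower()}'
```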
e
Ah, you're right, it'd be nice to automate the renaming based on the actual records. Something like this but in reverse, i.e. splitting instead of merging, so not even `schema_mapping` would help with that.
👍 1
m
If I split a stream (by associating a record with a new stream_id), do I need to do something with schema messages as well? The inline stream maps docs don’t make any mention of having to add a new schema record for the new stream - is there something happening under the hood that allows a target to “know” that the new stream uses the same schema as the original stream?
e
A mapper is expected to yield zero or more messages for each one in the input. That's how you can, for example, alias streams. At the implementation level, zero or more messages are yielded for each message of every type: https://github.com/MeltanoLabs/meltano-map-transform/blob/2d57a57e594e0c987f825a0f04e54e92ffeb16e7/meltano_map_transform/mapper.py#L109-L115
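The "zero or more messages per input" pattern can be sketched in a few lines, here for the aliasing case (a hypothetical illustration, not the meltano-map-transform API; the function name is made up):

```python
from typing import Iterator


def map_message(msg: dict, alias: str) -> Iterator[dict]:
    """Yield zero or more output messages for one input message.
    Here: alias a stream by rewriting `stream` on SCHEMA and RECORD
    messages, passing STATE through untouched and dropping the rest."""
    if msg["type"] in ("SCHEMA", "RECORD"):
        yield {**msg, "stream": alias}
    elif msg["type"] == "STATE":
        yield msg  # state is opaque to the mapper; pass it through
    # any other message type: yield nothing, i.e. drop it
```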
m
Thanks! To confirm I’m understanding that particular block of code correctly: that is yielding a schema message to all known streams each time one is received? If so, that sounds like exactly what I need to do.
e
Hmm, not sure what "all known streams" refers to specifically
m
I am maybe misinterpreting the `for stream_map in self.mapper.stream_maps[stream_id]:` but, if I split one stream into ten, dynamically, it sounds like I should be emitting a schema message to each of the ten new streams
> target-snowflake supports creating schemas based on the `<schema>-<table>` stream name pattern. Maybe that’s enough for a workaround?
Specifically, I am going to try to split one stream into many using this naming pattern for the new streams so that I can make target-snowflake land the data into the tables that we’re already using, rather than into a single giant table containing all events.
e
> but, if I split one stream into ten, dynamically, it sounds like I should be emitting a schema message to each of the ten new streams
that is correct
essentially each new stream should have SCHEMA and RECORD messages. STATE doesn't matter because it's just the future input to the un-split tap, so it can be left untouched.
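Putting that together, the splitting behavior could look something like this stdlib-only sketch over raw Singer message lines (a hypothetical illustration of the idea, not the SDK mapper API; the function name and the `<database>-<collection>` naming are assumptions from the conversation above):

```python
import json
from typing import Iterable, Iterator


def split_stream(lines: Iterable[str]) -> Iterator[str]:
    """Split one Singer stream into many, naming each new stream from the
    record's `namespace` dict and emitting a SCHEMA message the first time
    each new stream name is seen, so every new stream gets both SCHEMA and
    RECORD messages."""
    schemas = {}  # original stream name -> its SCHEMA message
    seen = set()  # new stream names already announced
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "SCHEMA":
            schemas[msg["stream"]] = msg  # hold until the split names are known
        elif msg["type"] == "RECORD":
            ns = msg["record"]["namespace"]
            new_stream = f'{ns["database"]}-{ns["collection"].lower()}'
            if new_stream not in seen:
                seen.add(new_stream)
                # re-emit the original schema under the new stream name
                yield json.dumps({**schemas[msg["stream"]], "stream": new_stream})
            yield json.dumps({**msg, "stream": new_stream})
        else:
            yield line  # STATE etc. pass through untouched
```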
m
Thanks Edgar! FWIW I’ve opened https://github.com/meltano/sdk/issues/2502 for this (for this idea specifically)
👀 1
I’m going to continue to noodle on this. In this case, we’re using a tap built with the SDK (a tap that we wrote/we control) so I’m wondering if maybe it’s easier to push this splitting up into the tap and do dynamic stream names there 🤔
e
Thanks, left a comment on the issue.
> I’m wondering if maybe it’s easier to push this splitting up into the tap and do dynamic stream names there
Yeah, if it's not ever meant to be public or generally useful outside your org, it may well make sense to do it in the tap.
m
Is this an appropriate use case for stream partitioning? https://sdk.meltano.com/en/latest/partitioning.html
Parent-child streams almost seem like they would work, except that I don’t really want the “parent” stream, just the “children”, and I want to determine the “child” streams dynamically 🤔