Singer spec thought Could we add a message when a tap is fin Meltano #singer-tap-development

Singer spec thought: Could we add a message when a...

julian_knight

09/14/2021, 7:11 PM

Singer spec thought: Could we add a message when a tap is finished outputting all the records for a stream? Most targets periodically flush rows when certain size limits are reached. But for streams which are smaller than the buffer limit, they won’t be flushed until the whole pipeline is complete, as the target is still waiting to see if more records are coming. Consider what I think is a common case for taps: a dozen small streams and one really big stream. You want to set the buffer size to be relatively large to get good behavior for the large stream, but the small streams might fall completely under that threshold. If the big stream craps out, e.g. on the first load/a full refresh, all of the smaller streams are lost and will have to start over from the beginning. Sure these are small but if you’re dealing with a rate-limiting API you really want to avoid making duplicate API calls as much as possible. Such a message could be added without breaking compatibility. Targets could optionally choose to flush streams when they see the message. Thoughts?

aaronsteers

09/14/2021, 7:18 PM

I like this idea. Kind of an "end of stream" version of EOF that taps could optionally emit. This topic has come up in regards to target development on the SDK. I think this thread sums it up nicely: https://gitlab.com/meltano/sdk/-/issues/172#note_638059462

aaronsteers

09/14/2021, 7:18 PM

cc @pnadolny

aaronsteers

09/14/2021, 7:20 PM

Since the end of stream is generally followed by a state message, we might also be able to place a "hint" in the state message that a certain stream is fully drained - at least for the time being.

aaronsteers

09/14/2021, 7:21 PM

This 👆 injection within state could be done by convention, without a breaking change anywhere in the spec.

aaronsteers

09/14/2021, 7:22 PM

@julian_knight - If you want to submit an idea to our (newly forming!) Singer Spec Working Group, you can place topics into this issue tracker: https://github.com/MeltanoLabs/Singer-Spec-Working-Group/issues

pnadolny

09/14/2021, 7:35 PM

I think this is a great idea! I was also thinking through ways that we could hint that the stream is done. I had a few use cases where that would be ideal

pnadolny

09/14/2021, 7:40 PM

I'm not sure what the use case is for the

currently_syncing

key but I agree that we could just add an additional key to the state message like

completed_streams

or something that the tap can populate after it knows all records have been sent.

pnadolny

09/14/2021, 7:40 PM

https://github.com/transferwise/pipelinewise-tap-snowflake/blob/4b084bd80196b6bbd2b193c90155dee8dbc77e8f/tap_snowflake/__init__.py#L443

pnadolny

09/14/2021, 7:40 PM

https://github.com/transferwise/pipelinewise-tap-snowflake/blob/4b084bd80196b6bbd2b193c90155dee8dbc77e8f/tap_snowflake/__init__.py#L468

pnadolny

09/14/2021, 7:41 PM

^^ for example - thats the snowflake tap we use that I was thinking about adding a similar completed streams key to.

julian_knight

09/14/2021, 9:16 PM

We are also using the pipelinewise

target-snowflake

, which is what got me thinking about this

aaronsteers

09/14/2021, 9:22 PM

@julian_knight and @pnadolny - To solve the "dangling records" issue, we finally added a max-time-elapsed (or "max record age") in the SDK for targets. We started with I think 30 minutes, and I am considering bumping it up to 5 minutes. It seems like possibly having an extra flush of records every 4 or 5 minutes would have minimal impact on performance - and would make scenarios like AWS Lambda more feasible.

aaronsteers

09/14/2021, 9:25 PM

As I see it, we (1) don't want to flush too often (every 10 records, for instance, would be a nightmare) and (2) we don't want to have excessive liability in terms of age of last-emitted state. Number (2) said another way: we want to consider the impact of the process getting killed at any point and how much would have to be replayed if so.

2 Views

Open in Slack

Previous Next