# singer-tap-development
j
Singer spec thought: Could we add a message when a tap is finished outputting all the records for a stream? Most targets periodically flush rows when certain size limits are reached. But streams that are smaller than the buffer limit won’t be flushed until the whole pipeline is complete, as the target is still waiting to see if more records are coming. Consider what I think is a common case for taps: a dozen small streams and one really big stream. You want to set the buffer size to be relatively large to get good behavior for the large stream, but the small streams might fall completely under that threshold. If the big stream craps out, e.g. on the first load/a full refresh, all of the smaller streams are lost and will have to start over from the beginning. Sure, these are small, but if you’re dealing with a rate-limited API you really want to avoid making duplicate API calls as much as possible. Such a message could be added without breaking compatibility. Targets could optionally choose to flush streams when they see the message. Thoughts?
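A rough sketch of how a target might react to such a message. Everything here is an assumption for illustration: the `END_OF_STREAM` message type is not part of the Singer spec, and `BufferingTarget` with its in-memory `flushed` store is a hypothetical stand-in for a real loader.

```python
import json
from collections import defaultdict

BUFFER_LIMIT = 1000  # hypothetical per-stream row limit


class BufferingTarget:
    """Toy target that buffers rows per stream and flushes them in batches."""

    def __init__(self, buffer_limit=BUFFER_LIMIT):
        self.buffer_limit = buffer_limit
        self.buffers = defaultdict(list)  # rows waiting to be written
        self.flushed = defaultdict(list)  # stands in for the destination

    def flush(self, stream):
        """Move any buffered rows for one stream to the destination."""
        if self.buffers[stream]:
            self.flushed[stream].extend(self.buffers[stream])
            self.buffers[stream] = []

    def handle_line(self, line):
        """Process one line of tap output."""
        message = json.loads(line)
        msg_type = message.get("type")
        if msg_type == "RECORD":
            stream = message["stream"]
            self.buffers[stream].append(message["record"])
            # Usual behavior: flush only once the size limit is reached.
            if len(self.buffers[stream]) >= self.buffer_limit:
                self.flush(stream)
        elif msg_type == "END_OF_STREAM":  # hypothetical message type
            # The tap says no more rows are coming, so flush right away.
            self.flush(message["stream"])
        # Other message types are ignored in this sketch.
```

A target that does not know the message can keep ignoring it, which is why the addition would be optional on both sides.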
a
I like this idea. Kind of an "end of stream" version of EOF that taps could optionally emit. This topic has come up in regard to target development on the SDK. I think this thread sums it up nicely: https://gitlab.com/meltano/sdk/-/issues/172#note_638059462
cc @pnadolny
Since the end of stream is generally followed by a state message, we might also be able to place a "hint" in the state message that a certain stream is fully drained - at least for the time being.
This 👆 injection within state could be done by convention, without a breaking change anywhere in the spec.
@julian_knight - If you want to submit an idea to our (newly forming!) Singer Spec Working Group, you can place topics into this issue tracker: https://github.com/MeltanoLabs/Singer-Spec-Working-Group/issues
p
I think this is a great idea! I was also thinking through ways that we could hint that the stream is done. I had a few use cases where that would be ideal
I'm not sure what the use case is for the `currently_syncing` key, but I agree that we could just add an additional key to the state message like `completed_streams` or something that the tap can populate after it knows all records have been sent.
^^ For example - that's the Snowflake tap we use, and the one I was thinking about adding a similar completed streams key to.
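To make the state-based convention concrete, here is a minimal sketch of the target side. It assumes a hypothetical `completed_streams` list inside the STATE message's `value`; the key name comes from the suggestion above and no tap is confirmed to emit it. The function name `streams_to_flush` is also made up for illustration.

```python
import json


def streams_to_flush(state_line, already_flushed):
    """Return streams newly marked complete by a STATE message.

    `already_flushed` is a set that is updated in place, so each stream
    is reported only once even if later STATE messages repeat it.
    """
    message = json.loads(state_line)
    if message.get("type") != "STATE":
        return []
    # Hypothetical convention: the tap lists fully drained streams here.
    completed = message.get("value", {}).get("completed_streams", [])
    newly_done = [s for s in completed if s not in already_flushed]
    already_flushed.update(newly_done)
    return newly_done
```

A target would call this on each STATE message and flush whatever comes back; taps that never populate the key simply produce an empty list, so nothing breaks.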
j
We are also using the pipelinewise `target-snowflake`, which is what got me thinking about this
a
@julian_knight and @pnadolny - To solve the "dangling records" issue, we finally added a max-time-elapsed (or "max record age") limit in the SDK for targets. We started with, I think, 30 minutes, and I am considering bumping it down to 5 minutes. It seems like having an extra flush of records every 4 or 5 minutes would have minimal impact on performance - and would make scenarios like AWS Lambda more feasible.
As I see it, we (1) don't want to flush too often (every 10 records, for instance, would be a nightmare) and (2) we don't want to have excessive liability in terms of age of last-emitted state. Number (2) said another way: we want to consider the impact of the process getting killed at any point and how much would have to be replayed if so.
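The two constraints above can be sketched as a small flush policy: flush when the batch is big enough, or when the oldest unflushed record has waited too long. This is not the SDK's actual implementation; `FlushPolicy`, its parameter names, and the default values are assumptions chosen for illustration.

```python
import time


class FlushPolicy:
    """Decide when a target should flush, by batch size or record age."""

    def __init__(self, max_records=10000, max_age_seconds=300.0,
                 clock=time.monotonic):
        self.max_records = max_records          # (1) avoid tiny, frequent flushes
        self.max_age_seconds = max_age_seconds  # (2) cap how stale state can get
        self.clock = clock                      # injectable for testing
        self.count = 0
        self.oldest = None  # arrival time of the oldest unflushed record

    def record_added(self):
        """Note that one more record is waiting in the buffer."""
        if self.oldest is None:
            self.oldest = self.clock()
        self.count += 1

    def should_flush(self):
        """True if either the size limit or the age limit has been hit."""
        if self.count == 0:
            return False
        if self.count >= self.max_records:
            return True
        return self.clock() - self.oldest >= self.max_age_seconds

    def flushed(self):
        """Reset after the buffer has been written out."""
        self.count = 0
        self.oldest = None
```

With a 5-minute age limit, the most work that could need replaying after a killed process is roughly 5 minutes of records, regardless of how large the size limit is.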