# singer-tap-development
j
The question I was going to ask at demo day: I’m working with an API that has time-series data but doesn’t support ordering (order is hard-coded to most-recent-first, aka time descending). I could just use `is_sorted = False` and accept that the first run will need to sync all of the data before any state can be saved. However, the API supports time-range filtering, so it’s hypothetically possible (and what I would do with a non-SDK tap) to batch the API into some time window (days, weeks, whatever), run the batches from oldest to newest, and emit state at the end of each batch, which would allow picking back up if the first run is interrupted. This seems like a potential use case for partitions, generating a partition for each time window. A few questions about this though:
• Is partitions really the recommended way to solve this, or is there some other recommended solution?
• Will using partitions save a new state value for each time window? That seems unnecessary, as every time window partition should only be run once; the only state we need is the last partition that was completed.
• Is there a way to use both partitioning and child streams? This API call happens to also be a child stream, and when I specified my own partitions it broke the context from the parent stream. Is there a way to merge these contexts together? Is this a bug? If so, I’d be happy to open an issue.
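For reference, the non-SDK batching approach described above could look roughly like the sketch below. It is only an illustration: `fetch_window` and the `last_completed_window_end` state key are hypothetical names, not part of the Meltano SDK or of any particular API.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterator, List, Tuple


def time_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [window_start, window_end) ranges, oldest first."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


def sync_in_batches(
    fetch_window: Callable[[datetime, datetime], List[dict]],  # hypothetical API call
    start: datetime,
    end: datetime,
    step: timedelta,
    state: Dict[str, datetime],
) -> List[dict]:
    """Fetch one window at a time and record the last completed window in state."""
    # Resume after the last completed window if a previous run was interrupted.
    resume_from = state.get("last_completed_window_end", start)
    records: List[dict] = []
    for window_start, window_end in time_windows(resume_from, end, step):
        # Records inside a window may arrive newest-first; only the windows are ordered.
        records.extend(fetch_window(window_start, window_end))
        # The window is complete, so it is safe to move the bookmark forward.
        state["last_completed_window_end"] = window_end
    return records
```

The only state kept is the end of the last completed window, which matches the point in the second question that each window should run once.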
a
I'll address the points in individual comments here below... 🙂
> Is there a way to use both partitioning and child streams?
This was raised previously in another thread and I don't think we have a proven/recommended solution yet. Happy to discuss further in an issue.
> Will using partitions save a new state value for each time window?
Yes. Currently, the partition context (or parent context) makes up the default `state_partitioning_keys`, but this can be overridden. In theory, you could define a `partitions` property that returns multiple time-based partitions and then save only an aggregated state. Exactly how to do this may take some exploration and trial and error.
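A rough sketch of that idea, written as a plain class so it runs without the SDK installed. The attribute names `partitions` and `state_partitioning_keys` mirror the ones mentioned above; the window fields, the dates, and the `state_partition_context` helper are assumptions made for illustration, and whether an empty key list really yields a single aggregated bookmark in the SDK would need to be verified.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


class WindowedStreamSketch:
    """Stand-in for a stream class; in a real tap these members would sit on an SDK stream."""

    # Assumption: with no keys listed, no context field distinguishes one state entry from another.
    state_partitioning_keys: Optional[List[str]] = []

    start = date(2024, 1, 1)  # illustrative values only
    end = date(2024, 1, 22)
    window_days = 7

    @property
    def partitions(self) -> List[Dict[str, str]]:
        """One context dict per time window, oldest first."""
        windows = []
        cursor = self.start
        while cursor < self.end:
            window_end = min(cursor + timedelta(days=self.window_days), self.end)
            windows.append(
                {"window_start": cursor.isoformat(), "window_end": window_end.isoformat()}
            )
            cursor = window_end
        return windows


def state_partition_context(
    context: Dict[str, str], keys: Optional[List[str]]
) -> Dict[str, str]:
    """Hypothetical helper: keep only the context fields that identify a state entry."""
    if keys is None:
        return context
    return {k: v for k, v in context.items() if k in keys}
```

With the default keys, every window would get its own state entry; with the empty list above, all windows reduce to the same (empty) state context.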
> Is partitions really the recommended way to solve this, or is there some other recommended solution?
The only recommended approach today would be to set `is_sorted = False`. To be clear though, you could still implement incremental stream bookmarks with this method. The only difference is that interruptions are not resumable, and therefore state progress markers are treated pessimistically (for resume purposes) until the "finalize_...()" method is called on the state at the end of the stream.
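To make the "pessimistic" treatment concrete, here is a toy model of it. This is not the SDK's code: the dict layout and the function names are assumptions chosen only to show the idea that progress seen during an unsorted sync does not count for resuming until it is finalized.

```python
from typing import Any, Dict, Optional


def record_progress(state: Dict[str, Any], value: str) -> None:
    """Track the highest replication value seen so far, without trusting it yet."""
    markers = state.setdefault("progress_markers", {})
    current = markers.get("replication_key_value")
    if current is None or value > current:
        markers["replication_key_value"] = value


def finalize(state: Dict[str, Any]) -> None:
    """Promote the tracked progress to the real bookmark once the stream has finished."""
    markers = state.pop("progress_markers", None)
    if markers and "replication_key_value" in markers:
        state["replication_key_value"] = markers["replication_key_value"]


def resume_point(state: Dict[str, Any]) -> Optional[str]:
    """The value a later run would resume from: only the finalized bookmark counts."""
    return state.get("replication_key_value")
```

If the run is interrupted before `finalize` is called, `resume_point` still returns the old bookmark, which is why the first full sync of unsorted data cannot be picked up midway.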
> Is partitions really the recommended way to solve this, or is there some other recommended solution?
Second answer to the same question... 🙂 I wonder if a complex `next_page_token` dict could enumerate the date range partitions that you mention and then also make its own call to invoke the "finalize...()" operation on the state bookmark at the end of each top-level date partition.
I didn't remember the name of the method, but I looked it up: `finalize_state_progress_markers()`.
Looks like that docstring may be a bit confusing or misleading, but in theory, you could explicitly call something like that to finalize the state up to the current point.
This code is not very easy to read, but here's another implementation that actually creates a nested loop of dates and pagination tokens. In that case, the API does not let you retrieve more than 24 hours per URL call, so we had to have an outer loop of dates with an inner loop of continuation tokens. The pagination token is then just a custom dictionary which can hold arbitrary values. Returning `None` instead of a dict ends the loop as usual.
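The nested-loop token described here might be sketched as follows. This is not the linked implementation; the token fields `day` and `continuation`, and the standalone function signature, are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Any, Dict, Optional

Token = Dict[str, Any]


def next_page_token(
    previous: Token, continuation: Optional[str], last_day: date
) -> Optional[Token]:
    """Advance the inner loop (continuation tokens) first, then the outer loop (days)."""
    if continuation is not None:
        # The API reported more pages for the same day: stay on that day.
        return {"day": previous["day"], "continuation": continuation}
    # The current day is exhausted, so move the outer loop forward by one day.
    next_day = previous["day"] + timedelta(days=1)
    if next_day > last_day:
        # Returning None ends pagination.
        return None
    return {"day": next_day, "continuation": None}
```

The point where the function moves to the next day is also where a per-day state finalization could be triggered, as suggested in the earlier reply.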
j
Thanks @aaronsteers! I’ll try this out and maybe open an issue about child streams + partitions
Thanks to @visch for logging this and @edgar_ramirez_mondragon for finding some possible code paths which might be adapted to do a cross-product between `partitions` and parent `context` when both exist.
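As a plain-Python illustration of what such a cross-product could mean (this is not the SDK's current behavior, and `merge_contexts` is a hypothetical helper name):

```python
from typing import Any, Dict, List, Optional

Context = Dict[str, Any]


def merge_contexts(
    parent_context: Optional[Context], partitions: Optional[List[Context]]
) -> List[Context]:
    """Combine one parent context with every partition defined on the child stream."""
    if not partitions:
        # No partitions defined: the parent context (if any) is the only context.
        return [dict(parent_context)] if parent_context is not None else []
    if parent_context is None:
        # No parent: the partitions are used as they are.
        return [dict(partition) for partition in partitions]
    # Both exist: each partition is extended with the parent's fields.
    # On a key collision the partition value wins in this sketch.
    return [{**parent_context, **partition} for partition in partitions]
```

Calling this once per parent record would give the child stream one context per (parent, time window) pair.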
> Is there a way to use both partitioning and child streams?
> This was raised previously in another thread and I don't think we have a proven/recommended solution yet. Happy to discuss further in an issue.
After I wrote this yesterday, the same exact topic came up with @visch, which led to that issue getting created. 🙂