julian_knight
11/17/2021, 7:45 PM
is_sorted = False and accept that the first run will need to sync all of the data before any state can be saved.
However, the API supports time-range filtering, so it’s hypothetically possible (and it’s what I would do with a non-SDK tap) to split the API requests into time windows (days, weeks, whatever), run them in batches from oldest to newest, and emit state at the end of each batch, which would allow picking back up if the first run is interrupted.
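A minimal sketch of that windowed approach in plain Python, outside the SDK. The names `time_windows`, `sync_in_windows`, `fetch_window`, `emit_state`, and the `completed_through` state key are all hypothetical, chosen only for illustration; the point is that the bookmark advances only after a whole window has been read, so records inside a window may arrive in any order.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Window = Tuple[datetime, datetime]


def time_windows(start: datetime, end: datetime, step: timedelta) -> Iterator[Window]:
    """Yield consecutive [window_start, window_end) pairs, oldest first."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)  # clamp the last window to `end`
        yield cursor, window_end
        cursor = window_end


def sync_in_windows(
    fetch_window: Callable[[datetime, datetime], Iterable[dict]],  # hypothetical API call
    emit_state: Callable[[dict], None],  # hypothetical state writer
    start: datetime,
    end: datetime,
    step: timedelta,
    bookmark: Optional[datetime] = None,
) -> List[dict]:
    """Sync one window at a time and save a bookmark after each completed window."""
    records: List[dict] = []
    # On a resumed run, skip every window that finished before the interruption.
    resume_from = max(start, bookmark) if bookmark else start
    for window_start, window_end in time_windows(resume_from, end, step):
        # Records within a window may be unsorted; the bookmark is only
        # written once the whole window has been consumed.
        records.extend(fetch_window(window_start, window_end))
        emit_state({"completed_through": window_end.isoformat()})
    return records
```

An interrupted first run would then restart from the last `completed_through` value instead of from the beginning, at the cost of re-reading at most one window.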
This seems like a potential use-case for partitions, generating a partition for each time window. A few questions about this though:
• Is partitions really the recommended way to solve this, or is there some other recommended solution?
• Will using partitions save a new state value for each time window? That seems unnecessary, as every time window partition should only be run once; the only state we need is the last partition that was completed
• Is there a way to use both partitioning and child streams? This API call happens to also be a child stream, and when I specified my own partitions it broke the context from the parent stream. Is there a way to merge these contexts together? Is this a bug? If so, I’d be happy to open an issue.
aaronsteers
11/17/2021, 9:11 PM
aaronsteers
11/17/2021, 9:11 PM
> Is there a way to use both partitioning and child streams?
This was raised previously in another thread and I don't think we have a proven/recommended solution yet. Happy to discuss further in an issue.
aaronsteers
11/17/2021, 9:16 PM
> Will using partitions save a new state value for each time window?
Yes, currently the partition context (or parent context) makes up the default state_partitioning_keys, but this can be overridden. In theory, you could define the partitions property to give multiple time-based partitions, and then save only an aggregated state. Exactly how to do this, though, may take some exploration and trial and error.
aaronsteers
11/17/2021, 9:17 PM
> Is partitions really the recommended way to solve this, or is there some other recommended solution?
The only recommended approach today would be to set is_sorted = False. To be clear though, you could still implement incremental stream bookmarks with this method. The only difference is that interruptions are not resumable, and therefore state progress markers are treated pessimistically (for resume purposes) until the "finalize_...()" method is called on the state at the end of the stream.
aaronsteers
11/17/2021, 9:20 PM
> Is partitions really the recommended way to solve this, or is there some other recommended solution?
Second answer to the same question... 🙂 I wonder if a complex "next_page_token" dict could enumerate the date range partitions that you mention and then also make its own call to invoke the "finalize...()" operation on the state bookmark at the end of each top-level date partition.
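One way the token idea could look, sketched in plain Python rather than against the SDK. The helpers `first_token`, `advance_token`, and `window_finished`, and the token keys, are hypothetical; how they would be wired into the SDK's pagination hook and the "finalize...()" call is left open, as in the message above.

```python
from datetime import datetime, timedelta
from typing import Optional


def first_token(start: datetime, end: datetime, step: timedelta) -> dict:
    """Token for the first page of the oldest date window."""
    return {"window_start": start, "window_end": min(start + step, end), "page": None}


def advance_token(
    token: dict, next_page: Optional[str], end: datetime, step: timedelta
) -> Optional[dict]:
    """Return the token for the next request, or None once every window is read.

    `next_page` is whatever cursor the API returned for the current window,
    or None when that window has no more pages.
    """
    if next_page is not None:
        # More pages remain in this window: keep the window, move the page.
        return {**token, "page": next_page}
    if token["window_end"] >= end:
        return None  # last window exhausted
    new_start = token["window_end"]
    return {"window_start": new_start, "window_end": min(new_start + step, end), "page": None}


def window_finished(previous: dict, current: Optional[dict]) -> bool:
    """True when moving from `previous` to `current` closes a date window.

    This is the point where the state bookmark could be finalized.
    """
    return current is None or current["window_start"] != previous["window_start"]
```

With this shape, the caller would check `window_finished(previous, current)` after each page and finalize the bookmark only at window boundaries, so an interrupted run loses at most the window in progress.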
aaronsteers
11/17/2021, 9:21 PM
aaronsteers
11/17/2021, 9:22 PM
aaronsteers
11/17/2021, 9:26 PM
julian_knight
11/19/2021, 5:28 PM
aaronsteers
11/19/2021, 5:31 PM
aaronsteers
11/19/2021, 5:32 PM
partitions and parent context when both exist.
aaronsteers
11/19/2021, 5:33 PM
> Is there a way to use both partitioning and child streams?
> This was raised previously in another thread and I don't think we have a proven/recommended solution yet. Happy to discuss further in an issue.
After I wrote this yesterday, the same exact topic came up with @visch, which led to that issue getting created. 🙂