The question I was going to ask in demo day I m working with Meltano #singer-tap-development

julian_knight

11/17/2021, 7:45 PM

The question I was going to ask in demo day: I’m working with an API that has time-series data, but doesn’t support ordering (order is hard-coded to most-recent-first aka time descending). I could just use `is_sorted = False` and accept that the first run will need to sync all of the data before any state can be saved. However the API supports time-range filtering, so it’s hypothetically possible (and what I would do with a non-SDK tap) to batch the API into some time window (days, weeks, whatever), run in batches from oldest to newest, and emit state at the end of each batch, which would allow picking back up if the first run is interrupted. This seems like a potential use-case for partitions, generating a partition for each time window. A few questions about this though:

• Is partitions really the recommended way to solve this, or is there some other recommended solution?
• Will using partitions save a new state value for each time window? That seems unnecessary, as every time window partition should only be run once; the only state we need is the last partition that was completed
• Is there a way to use both partitioning and child streams? This API call happens to also be a child stream, and when I specified my own partitions it broke the context from the parent stream. Is there a way to merge these contexts together? Is this a bug? If so I’d be happy to open an issue
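For reference, the windowed approach described above (oldest-to-newest batches, with state saved at the end of each batch) might look roughly like the following outside the SDK. This is a minimal sketch, not Meltano SDK code: `time_windows`, `sync_in_windows`, `fetch_window`, and the `last_completed_window_end` state key are all hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Window = Tuple[datetime, datetime]


def time_windows(
    start: datetime,
    end: datetime,
    step: timedelta,
    resume_from: Optional[datetime] = None,
) -> Iterator[Window]:
    """Yield half-open [window_start, window_end) ranges, oldest first."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    # Skip windows that a previous (possibly interrupted) run already completed.
    cursor = max(start, resume_from) if resume_from else start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


def sync_in_windows(
    fetch_window: Callable[[datetime, datetime], List[dict]],  # hypothetical API call
    start: datetime,
    end: datetime,
    step: timedelta,
    state: Dict[str, datetime],
) -> List[dict]:
    """Sync one window at a time and record only the last completed window."""
    emitted: List[dict] = []
    resume_from = state.get("last_completed_window_end")
    for window_start, window_end in time_windows(start, end, step, resume_from):
        # The API returns newest-first, so reverse each batch to get ascending order.
        emitted.extend(reversed(fetch_window(window_start, window_end)))
        # Only updated after the whole window succeeded, so a rerun can resume here.
        state["last_completed_window_end"] = window_end
    return emitted
```

If `fetch_window` raises partway through, `state` still holds the end of the last fully synced window, so a second call with the same `state` picks up from there instead of starting over.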

aaronsteers

11/17/2021, 9:11 PM

I'll address the points in individual comments here below... 🙂

aaronsteers

11/17/2021, 9:11 PM

> Is there a way to use both partitioning and child streams?

This was raised previously in another thread and I don't think we have a proven/recommended solution yet. Happy to discuss further in an issue.

aaronsteers

11/17/2021, 9:16 PM

> Will using partitions save a new state value for each time window?

Yes, currently, the partition context (or parent context) makes up the default `state_partitioning_keys`, but this can be overridden. In theory, you could define the `partitions` property to give multiple time-based partitions, and then save only an aggregated state. Exactly how to do this, though, may take some exploration and trial and error.

aaronsteers

11/17/2021, 9:17 PM

> Is partitions really the recommended way to solve this, or is there some other recommended solution?

The only recommended approach today would be to set `is_sorted = False`. To be clear though, you could still implement incremental stream bookmarks with this method. The only difference is that interruptions are not resumable, and therefore state progress markers are treated pessimistically (for resume purposes) until the "finalize_...()" method is called on the state at the end of the stream.
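To illustrate what "treated pessimistically" means here, the following toy model keeps the highest value seen so far in a separate progress-marker slot and only promotes it to a resumable bookmark when finalization is called. It is a simplified sketch under stated assumptions, not the SDK's implementation: the function names and the exact state layout are illustrative only.

```python
from typing import Any, Dict, Optional


def increment_state(
    state: Dict[str, Any], replication_key: str, value: str, is_sorted: bool
) -> None:
    """Record progress for one record (toy model, not the SDK's code)."""
    if is_sorted:
        # Sorted streams can trust every record as a safe resume point.
        state["replication_key"] = replication_key
        state["replication_key_value"] = value
        return
    # Unsorted streams only track a provisional maximum.
    markers = state.setdefault(
        "progress_markers",
        {"replication_key": replication_key, "replication_key_value": value},
    )
    if value > markers["replication_key_value"]:
        markers["replication_key_value"] = value


def finalize_progress_markers(state: Dict[str, Any]) -> None:
    """Promote provisional markers to a real bookmark once the stream completes."""
    markers = state.pop("progress_markers", None)
    if markers:
        state["replication_key"] = markers["replication_key"]
        state["replication_key_value"] = markers["replication_key_value"]


def resumable_bookmark(state: Dict[str, Any]) -> Optional[str]:
    """Return the value a later run could safely resume from, if any."""
    return state.get("replication_key_value")
```

In this model an interrupted unsorted sync leaves `resumable_bookmark(state)` as `None`, which is why the first run has to complete before any progress can be reused.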

aaronsteers

11/17/2021, 9:20 PM

> Is partitions really the recommended way to solve this, or is there some other recommended solution?

Second answer to the same question... 🙂 I wonder if a complex "next_page_token" dict could enumerate the date range partitions that you mention and then also make its own call to invoke the "finalize...()" operation on the state bookmark at the end of each top-level date partition.

aaronsteers

11/17/2021, 9:21 PM

I didn't remember the name of the method, but I looked it up: finalize_state_progress_markers()

aaronsteers

11/17/2021, 9:22 PM

Looks like that docstring may be a bit confusing or misleading, but in theory, you could explicitly call something like that to finalize the state up to the current point.

aaronsteers

11/17/2021, 9:26 PM

This code is not very easy to read, but here's another implementation that actually creates a nested loop of dates and pagination tokens. In that case, the API does not let you retrieve more than 24 hours per url call, so we had to have an outer loop of dates with an inner loop of continuation tokens. The pagination token is then just a custom dictionary which can hold arbitrary values. Returning None instead of a dict ends the loop as usual.
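The linked implementation is not reproduced in this thread, but the idea of a dictionary token driving an outer loop of dates and an inner loop of continuation tokens can be sketched in plain Python. Everything below is an assumption for illustration: `next_token`, `iterate_records`, `fetch_page`, and the `day` / `continuation` keys are hypothetical and not taken from that tap or from the SDK.

```python
from datetime import date, timedelta
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

Token = Dict[str, Any]
Page = Tuple[List[dict], Optional[str]]  # (records, continuation token or None)


def next_token(
    previous: Token, continuation: Optional[str], last_day: date
) -> Optional[Token]:
    """Advance the inner loop first, then the outer date loop."""
    if continuation is not None:
        # More pages remain for the same day.
        return {"day": previous["day"], "continuation": continuation}
    next_day = previous["day"] + timedelta(days=1)
    if next_day > last_day:
        # Returning None instead of a dict ends the loop.
        return None
    return {"day": next_day, "continuation": None}


def iterate_records(
    fetch_page: Callable[[date, Optional[str]], Page],  # hypothetical API call
    first_day: date,
    last_day: date,
) -> Iterator[dict]:
    """Walk every page of every day, oldest day first."""
    token: Optional[Token] = (
        {"day": first_day, "continuation": None} if first_day <= last_day else None
    )
    while token is not None:
        records, continuation = fetch_page(token["day"], token["continuation"])
        yield from records
        token = next_token(token, continuation, last_day)
```

The point where `next_token` moves to the next day is where a per-day bookmark could be finalized, since all pages for the previous day have been consumed by then.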

julian_knight

11/19/2021, 5:28 PM

Thanks @aaronsteers! I’ll try this out and maybe open an issue about child streams + partitions

aaronsteers

11/19/2021, 5:31 PM

Merging parent context with partitions (#273) · Issues · Meltano / Meltano SDK for Singer Taps and Targets · GitLab

aaronsteers

11/19/2021, 5:32 PM

Thanks to @visch for logging this and @edgar_ramirez_mondragon for finding some possible code paths which might be adapted to do a cross-product between `partitions` and parent `context` when both exist.
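As a rough picture of what such a cross-product could mean, the sketch below merges every parent context with every partition into one combined context dict. This only illustrates the idea discussed in the issue; it is not how the SDK behaves, and `merge_contexts` and the example keys are hypothetical.

```python
from itertools import product
from typing import Any, Dict, List

Context = Dict[str, Any]


def merge_contexts(
    parent_contexts: List[Context], partitions: List[Context]
) -> List[Context]:
    """Combine each parent context with each partition (cross-product)."""
    if not parent_contexts:
        return [dict(partition) for partition in partitions]
    if not partitions:
        return [dict(parent) for parent in parent_contexts]
    # On a key collision the partition value wins; that choice is arbitrary here.
    return [
        {**parent, **partition}
        for parent, partition in product(parent_contexts, partitions)
    ]


if __name__ == "__main__":
    parents = [{"account_id": 1}, {"account_id": 2}]
    windows = [{"window_start": "2021-11-01"}, {"window_start": "2021-11-08"}]
    for context in merge_contexts(parents, windows):
        print(context)
```

Two parents and two time windows produce four combined contexts, so state would be tracked per (parent, window) pair unless the state partitioning keys are narrowed.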

aaronsteers

11/19/2021, 5:33 PM

> Is there a way to use both partitioning and child streams?

This was raised previously in another thread and I don't think we have a proven/recommended solution yet. Happy to discuss further in an issue.

After I wrote this yesterday, the same exact topic came up with @visch, which led to that issue getting created. 🙂
