# singer-tap-development
j
We're interested in approaches to backfilling large amounts of data from an API that isn't very performant. The situation is that if I request all historical records from this API endpoint, it takes ~10 minutes to respond, so we want to grab batches of data (say 1-2 days), paginate through each batch, and then grab the next. One idea we had was to override `request_records()` to loop through a time window in chunks, but it feels like there might be an easier pattern that we're not seeing.
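A minimal sketch of the window-chunking part of that idea, independent of the SDK: a generator that splits a date range into fixed-size windows. Inside a hypothetical `request_records()` override, you would loop over these windows and run one paginated request sequence per window (the function name `date_windows` and the window size are illustrative, not part of any API):

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

def date_windows(
    start: datetime, end: datetime, days: int = 1
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    cursor = start
    step = timedelta(days=days)
    while cursor < end:
        # Clamp the final window so it never overshoots the overall end date.
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end

# Example: chunk a four-day range into 2-day windows.
windows = list(date_windows(datetime(2022, 1, 1), datetime(2022, 1, 5), days=2))
for w_start, w_end in windows:
    print(w_start.date(), "->", w_end.date())
```

Each yielded pair would become the `start_date`/`end_date` bounds for one batch of paginated requests.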
cc @avishua_stein
c
This is what I do in this scenario (my case is even worse: the source API doesn't return any data at all if the volume is too high, and it has no way to paginate through response data sets):

1. Implement incremental sync in the tap (using the `replication_key` property on the Meltano SDK `RESTStream` class)
2. Implement `start_date` in the tap
3. Implement a new `end_date` property in the tap

I then loop through however many 1-day full sync runs I need, like so:
```shell
TAP_MYTAP_START_DATE="2022-01-01" TAP_MYTAP_END_DATE="2022-01-02" meltano elt tap-mytap target-jsonl --full-refresh
TAP_MYTAP_START_DATE="2022-01-02" TAP_MYTAP_END_DATE="2022-01-03" meltano elt tap-mytap target-jsonl --full-refresh
```
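Rather than typing those invocations by hand, the daily commands can be generated with a small script. A sketch (the `backfill_commands` helper is hypothetical; it just formats the same env-var-prefixed `meltano elt` calls shown above for each day in the range):

```python
from datetime import date, timedelta

def backfill_commands(
    start: date, end: date, tap: str = "tap-mytap", target: str = "target-jsonl"
) -> list:
    """Emit one full-refresh `meltano elt` command per day in [start, end)."""
    cmds = []
    cursor = start
    while cursor < end:
        nxt = cursor + timedelta(days=1)
        cmds.append(
            f'TAP_MYTAP_START_DATE="{cursor}" TAP_MYTAP_END_DATE="{nxt}" '
            f"meltano elt {tap} {target} --full-refresh"
        )
        cursor = nxt
    return cmds

# Print the commands; they could instead be run via subprocess or a shell script.
for cmd in backfill_commands(date(2022, 1, 1), date(2022, 1, 3)):
    print(cmd)
```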
a
@joshuadevlin - Is this a situation where the API is slow but rate limits are not an issue? In other similar cases, the rate limit is the limiting factor on throughput, but if this API doesn't have a limiting/throttling problem, then yes - a parallelized and/or async request process could help significantly.
j
I don't think rate limits are an issue, but overall performance is (i.e., we're mindful of overall run time).
a
That confirms what I inferred from the context you provided. The approach @christoph provided seems applicable, with the only addition being to implement `end_date` in the tap. Here's an issue where we discussed a parallel REST implementation last August, in which multiple calls would be made without waiting for the prior one to return; then, as results come back, the tap would process them and emit the retrieved records: Support for async and potentially parallel REST calls · Issue #183 · meltano/sdk (github.com) (Note: the discussion is mostly on the GitLab side, so you'll need to click through for the additional context and implications added in that thread.)
As noted in that thread, how well you can parallelize the calls depends on what your pagination token is. But an approach like what @christoph suggested, where you simultaneously loop through date partitions, should work for date-based incremental keys.
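A sketch of that parallel-by-date-partition idea using only the standard library: each partition is fetched on its own thread, and `executor.map` preserves input order, so records still come back in date order even when responses complete out of order. Here `fetch_partition` is a stand-in for a real paginated API call; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def fetch_partition(window):
    """Stand-in for a real API call; a tap would run one paginated request
    sequence per window here."""
    start, _end = window
    return [f"record@{start}"]  # placeholder payload

def partitions(start: date, end: date, days: int = 1):
    """Yield consecutive (start, end) date windows covering [start, end)."""
    cursor = start
    while cursor < end:
        nxt = min(cursor + timedelta(days=days), end)
        yield (cursor, nxt)
        cursor = nxt

# Fire off all date partitions concurrently instead of waiting for each
# response before issuing the next request.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(
        pool.map(fetch_partition, partitions(date(2022, 1, 1), date(2022, 1, 5)))
    )

# Flatten the per-partition batches back into a single record stream.
records = [rec for batch in results for rec in batch]
print(records)
```

This works because date partitions are known up front; a cursor-style pagination token, where each page's token comes from the previous response, cannot be parallelized this way.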