# singer-tap-development
j
I'm creating a tap that connects to an XML API. This API is rate limited and can accept requests of up to 1000 records, 150 requests per minute, and 10,000 requests per day. The stream I'm using to process this data currently inherits from the `Stream` base class. Any recommendations on how to handle this? Is `RESTStream` the place to look?
v
> this

is the XML data? or rate limiting?
j
Rate limiting. My initial thoughts are to add pagination to handle making requests of 1000 records at a time, incrementing the offset if there are more requests to make. To handle the requests per minute/per day, I'm thinking I can use `metrics.http_request_counter` to track how many requests we're making. I think there's a decorator I can write to sleep the method making the requests if we're rate limited. Does this sound accurate to you? Am I missing anything?
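The decorator idea could look something like this minimal sketch. The names (`rate_limited`, `fetch_page`) are hypothetical, it uses a sliding window of call timestamps rather than `metrics.http_request_counter`, and it ignores the daily limit and persistence across tap runs:

```python
import time
from functools import wraps


def rate_limited(max_calls: int, period: float):
    """Sleep just long enough to stay under `max_calls` per `period` seconds."""
    call_times: list[float] = []  # timestamps of recent calls, oldest first

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Drop timestamps that have aged out of the sliding window.
            while call_times and now - call_times[0] >= period:
                call_times.pop(0)
            if len(call_times) >= max_calls:
                # The oldest call leaves the window at call_times[0] + period.
                time.sleep(call_times[0] + period - now)
            call_times.append(time.monotonic())
            return func(*args, **kwargs)
        return wrapper
    return decorator


@rate_limited(max_calls=150, period=60.0)
def fetch_page(offset: int) -> str:
    # Placeholder for the real XML API request.
    return f"page@{offset}"
```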
v
APIs are always interesting is the short answer. I personally would wait to implement rate limiting until I get an error in the tap. Once I get an error, I'd look at what the API gives me back (the headers, body, etc.), then I'd check the API's docs to be sure what I'm seeing aligns with what the docs say. Then I'd decide on the action. Doing it first, I've found, leads you to overengineer.
But once you hit it, there are a bunch of `backoff_*` methods on the `RESTStream` class. See https://sdk.meltano.com/en/latest/code_samples.html#custom-backoff
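Per the linked docs, one of those hooks (`backoff_wait_generator`) returns a plain Python generator of wait times. The generator itself needs nothing from the SDK, so a standalone sketch of the kind of thing you'd return from it (the cap and base delay here are arbitrary choices, not SDK defaults):

```python
def custom_wait_gen():
    """Yield exponentially growing wait times in seconds, capped at 60.

    A stream's backoff hook would return a generator like this one;
    the retry machinery pulls one value per failed attempt.
    """
    delay = 1.0
    while True:
        yield min(delay, 60.0)
        delay *= 2
```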
e
> APIs are always interesting is the short answer.

They are indeed 😅. I've seen a few different implementations of rate limiting. The best, imo, return headers in the response to indicate how many requests you have left and, if you've exceeded the limit, how long until you are good to go again. The helper methods @visch linked to can be used in that case.
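For an API that does expose such headers, the decision logic is small. A sketch, assuming header names like `X-RateLimit-Remaining` and `Retry-After` (your API's names, and the fallback wait of 60 seconds, are assumptions):

```python
def seconds_to_wait(headers: dict[str, str]) -> float:
    """Decide how long to sleep before the next request from
    hypothetical rate-limit response headers."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # still have credit; no need to wait
    # Out of credit: honor Retry-After if the server sent one.
    return float(headers.get("Retry-After", "60"))
```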
l
In https://github.com/MeltanoLabs/tap-github/ we've handled this rate limiting (5000 requests/hour/API token) pretty much like Derek describes it: we keep sending requests, and when we get an error, we look into it to see if it's rate limiting (there's a specific error message we can look for). If so, we switch to a different API token until all of them have run out of credits. But you could just figure out how long to wait and sleep.

Agree with Derek's opinion on holding back with rate limiting. It's full of traps, and until the taps are running fully async, you can probably live without it. In our experience, tap-github does on average 1 req/sec, so if you get anything similar, you'd only need to really worry about the daily limit (even though they can be handled the same way).
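The token-rotation idea boils down to a small loop. This is a sketch, not tap-github's actual code: `RateLimited` and the `request` callable are stand-ins for whatever your HTTP layer raises and does:

```python
class RateLimited(Exception):
    """Stand-in for the API's rate-limit error."""


def request_with_rotation(request, tokens):
    """Try each token in turn; fail only when all are exhausted.

    `request` is a callable taking a token and returning the response;
    `tokens` is the list of API tokens to rotate through.
    """
    for token in tokens:
        try:
            return request(token)
        except RateLimited:
            continue  # this token is out of credit; try the next one
    raise RuntimeError("all API tokens are rate limited")
```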
j
Thanks for the input, everyone. I'm porting over to Meltano to make the experience of adding a new data source more straightforward for my team. This particular data source has the XML API interface described above.

Some additional details: the API is for table data that we're syncing each hour (for now). One request yields up to 1000 records for one table, and there are over 140 tables. Since we're limited to 1000 records per request, we can actually hit the rate limit pretty easily, unfortunately. We don't currently pay for multiple API tokens, either, but that's something I will talk to the owner of the data source about.

Right now we are sending an XML API request that contains auth, the table name to query, the dates to query between, and a batch limit. Since we're limited to 1000 records per request, the batch offset goes up in increments of 1000 (e.g. 0, 1000 | 1000, 1000 | 2000, 1000, etc.). If we get a full 1000 records, it's possible that there are more records to get, so we should send another request. If there are fewer than 1000 records, there are no more records to query. If we get a particular error code, we know there are no more records to query. This unfortunate API does allow us to check how many requests we have left, but that check also counts against our limit.

With that added context out there, I will take the advice above, but will likely limit the scope of rate limiting to enforcing the daily constraint for now. If you all have any recommendations on how to move forward with this batch strategy, I'd love to hear more.
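That offset/short-page strategy is the classic pagination loop. A sketch with a stubbed `query_table` (hypothetical; the real call would build the XML request with auth, table name, date range, and limit):

```python
PAGE_SIZE = 1000  # the API's per-request record cap


def fetch_all_records(query_table, table: str):
    """Yield every record for `table`, advancing the offset by PAGE_SIZE.

    `query_table(table, offset, limit)` is a hypothetical stand-in for
    the real XML API call; it returns a list of records. A short page
    (fewer than PAGE_SIZE records) means we've reached the end. Note
    that when the total is an exact multiple of PAGE_SIZE, this costs
    one extra (empty) request to discover the end.
    """
    offset = 0
    while True:
        batch = query_table(table, offset, PAGE_SIZE)
        yield from batch
        if len(batch) < PAGE_SIZE:
            break  # short (or empty) page: no more records
        offset += PAGE_SIZE
```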