# plugins-general
r
Bit of an abstract question, but we have a lot of custom sources that are highly rate limited, so we cache calls (because we don't trust checkpointing alone enough). If we see the API calls as the E and the caching as the L in ETL, then we basically have a tightly coupled E/L that is hard to decouple. Unless we see caching as part of E and do a separate L step on top. Has anyone else faced this? Is there a common approach to a "cached E"? Or should we take a different route entirely: stop caching and invest more in reliable checkpointing so we don't make the same calls twice? Would love to hear some opinions
v
The E/L framing doesn't really work well for me to understand what you're talking about. If you're looking at using Meltano / Singer to do this, I'd frame it this way:
1. I have a source system (not multiple, let's just focus on one) that is highly rate limited. How do I handle this?
2. We don't want to make the same call twice.
My questions for you:
1. Can you define "highly rate limited"?
2. Are you sure that all the calls you make are for data that never changes? Can you give a few example calls?
What I typically see is someone saying they're very cautious about the number of API calls without a technical reason for being cautious, in which case you should honestly just ignore it, query the API, wait for 429s / 500s, and handle them with backoff. That's 90% of the cases where I hear this. In the other 10% there's a legit technical reason; the Salesforce API is a good example, they really do limit your total number of API calls. Even in the Salesforce case, though, I wouldn't recommend caching unless there's some reason like everyone wants to run the tap on their local computer, you have 10 devs, and the developer server is out of API limits or something; then we could go down the rabbit hole of whether it's worth it. It's not that it's hard, it's just that API calls are so cheap it's not worth optimizing for, and when you do need to optimize for a particular API, how you should "cache" calls to be as efficient as possible while also not missing data is very, very API specific
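(For reference, a minimal sketch of the "handle 429s / 500s with backoff" approach, assuming a plain requests-based extractor; the endpoint, parameters, and retry limits are placeholders, not anything Meltano-specific:)
```python
import time
import requests

def get_with_backoff(url, params=None, max_retries=5, base_delay=1.0):
    """Call the API and retry rate-limit / server errors with exponential backoff."""
    for attempt in range(max_retries):
        response = requests.get(url, params=params, timeout=30)
        if response.status_code in (429, 500, 502, 503):
            # Honor Retry-After if the API sends it, otherwise back off exponentially.
            retry_after = response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else base_delay * (2 ** attempt)
            time.sleep(delay)
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")

# Hypothetical usage:
# records = get_with_backoff("https://api.example.com/v1/orders", params={"page": 1})
```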
r
These are good objections, and exactly the type I'm after. Sadly, it's like the Salesforce example for a lot of the APIs: we pay considerable amounts per 1,000 calls. Some APIs will give you subsequent calls to the same resource for free within a month of the last call, but plenty don't
I think I might get there with a combination of good state keeping + consistent scheduling. Still, caching would make me feel a bit better: say we make small changes to the loader, etc., then we don't have to rerun all the API calls. In our homegrown setup, the "L" is the cache, and it's easy to check if something needs re-extracting because the E talks to the L. But I understand that in this idealistic world of EL there's a separation.
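(For context, the "E talks to the L" pattern above might look roughly like this cache-aside sketch; the SQLite file, table layout, and freshness window are illustrative assumptions, not the actual homegrown setup:)
```python
import json
import sqlite3
import time

CACHE_DB = "api_cache.sqlite"       # hypothetical cache location
MAX_AGE_SECONDS = 30 * 24 * 3600    # treat responses younger than ~a month as fresh

def _cache_conn():
    conn = sqlite3.connect(CACHE_DB)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses (url TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)"
    )
    return conn

def extract(url, fetch):
    """Return the cached payload if fresh enough, otherwise call `fetch(url)` and cache it."""
    conn = _cache_conn()
    row = conn.execute(
        "SELECT payload, fetched_at FROM responses WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[1] < MAX_AGE_SECONDS:
        return json.loads(row[0])  # no API call needed

    payload = fetch(url)  # e.g. the backoff helper sketched above
    conn.execute(
        "INSERT OR REPLACE INTO responses (url, payload, fetched_at) VALUES (?, ?, ?)",
        (url, json.dumps(payload), time.time()),
    )
    conn.commit()
    return payload
```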
v
Sure, if you want to front-end a cache, do API -> Caching Layer -> Database. In Singer world that's tap-api -> target-cache, then tap-cache -> target-postgres, or something like that. That can work for sure! You can definitely cache all the API calls; it sounds like you really want to do that, so you should just do it https://realpython.com/caching-external-api-requests/
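(If you cache at the HTTP layer, a library like requests-cache, which I believe is the sort of thing the linked article covers, does most of the work; the cache name and expiry below are just illustrative:)
```python
from datetime import timedelta

import requests_cache

# Transparent HTTP-level cache: responses are stored in a local SQLite file
# and reused until they expire, so repeated runs don't burn API calls.
session = requests_cache.CachedSession(
    "api_cache",                      # hypothetical cache name -> api_cache.sqlite
    expire_after=timedelta(days=30),  # matches the "free within a month" window mentioned above
)

response = session.get("https://api.example.com/v1/orders")  # placeholder endpoint
print(response.from_cache)  # True on the second call within the expiry window
```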
I would look hard at what a "considerable" amount is for 1,000 calls. Is it $10,000, $1, or $0.10? Make business choices based on that 🤷
r
Hahaha love this
> You can definitely cache all the API calls; it sounds like you really want to do that, so you should just do it
So you’re suggesting framing the cache as a target. Fair enough : )
Ty for the link. This is a refactoring of an existing system into Meltano, so we already have the cache, etc.
v
Instead of doing a massive refactor you could just write the tap-cache and you'd be off to the races, but I think you'd hit more issues doing it that way 🤷
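(For what it's worth, a bare-bones tap-cache built on the Meltano Singer SDK might look something like the sketch below; the cache table layout and the `cache_path` config option are assumptions, and the stream schema would need to match whatever the cache actually stores:)
```python
import sqlite3

from singer_sdk import Stream, Tap
from singer_sdk import typing as th


class CachedCallsStream(Stream):
    """Replays rows from the local API-call cache as Singer records."""

    name = "cached_calls"
    primary_keys = ["url"]
    schema = th.PropertiesList(
        th.Property("url", th.StringType),
        th.Property("payload", th.StringType),
        th.Property("fetched_at", th.NumberType),
    ).to_dict()

    def get_records(self, context):
        conn = sqlite3.connect(self.config["cache_path"])
        for url, payload, fetched_at in conn.execute(
            "SELECT url, payload, fetched_at FROM responses"
        ):
            yield {"url": url, "payload": payload, "fetched_at": fetched_at}


class TapCache(Tap):
    name = "tap-cache"
    config_jsonschema = th.PropertiesList(
        th.Property("cache_path", th.StringType, required=True),
    ).to_dict()

    def discover_streams(self):
        return [CachedCallsStream(self)]


if __name__ == "__main__":
    TapCache.cli()
```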
r
Just thinking out loud, I guess that second workflow could also be a dbt run instead of a tap->target run 🤔