# plugins-general
r
Bit of an abstract question, but we have a lot of custom sources that are highly rate limited, so we cache calls (because we don't trust checkpointing alone enough). If we see the API calls as the E and the caching as the L in ETL, then we basically have a tightly coupled E/L that is hard to decouple. Unless we see caching as part of E and do a separate L step on top. Has anyone else faced this? Is there a common approach to a "cached E"? Or should we take a different route entirely: stop caching and invest more in reliable checkpointing so we don't make the same calls twice? Would love to hear some opinions
v
The E/L framing doesn't really work well for me to understand what you're talking about. If you're looking at using Meltano / Singer to do this, I'd frame it this way:
1. I have a source system (not multiple, let's just focus on one) that is highly rate limited. How do I handle this?
2. We don't want to make the same call twice.
My questions for you:
1. Can you define "highly rate limited"?
2. Are you sure that all the calls you make are for data that never changes? Can you give a few example calls?
What I typically see is someone saying they're very cautious about the number of API calls without a technical reason for being cautious, in which case you should honestly just ignore it, query the API, wait for 429s / 500s, and handle them with backoff. That's 90% of the cases where I hear this. In the other 10% there's a legit technical reason; the Salesforce API is a good example, they really do limit your total number of API calls. Even in the Salesforce case, though, I wouldn't recommend caching unless there's some reason like everyone wants to run the tap on their local computer, you have 10 devs, and the developer server is out of API limits or something; then we could go down the rabbit hole of whether it's worth it. It's not that it's hard, it's just that API calls are so cheap it's not worth optimizing for, and when you do need to optimize for a particular API, how you should "cache" calls to be as efficient as possible while also not missing data is very, very API specific
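(For reference, a minimal sketch of the "handle 429s / 500s with backoff" approach, assuming a plain requests-based extractor; the endpoint, parameters, and retry limits are placeholders, not anything Meltano-specific:)
```python
import time
import requests

def get_with_backoff(url, params=None, max_retries=5, base_delay=1.0):
    """Call the API and retry rate-limit / server errors with exponential backoff."""
    for attempt in range(max_retries):
        response = requests.get(url, params=params, timeout=30)
        if response.status_code in (429, 500, 502, 503):
            # Honor Retry-After if the API sends it, otherwise back off exponentially.
            retry_after = response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else base_delay * (2 ** attempt)
            time.sleep(delay)
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")

# Hypothetical usage:
# records = get_with_backoff("https://api.example.com/v1/orders", params={"page": 1})
```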
r
These are good objections, and exactly the type I'm after. Sadly, it's like the Salesforce example for a lot of the APIs: we pay considerable amounts per 1,000 calls. Some APIs will give you subsequent calls to the same resource for free within a month of the last call, but plenty don't
I think I might get there with a combination of good state keeping + consistent scheduling. Still, caching would make me feel a bit better: say we make small changes to the loader, etc., then we don't have to rerun all the API calls. In our homegrown setup, the "L" is the cache, and it's easy to check if something needs re-extracting because the E talks to the L. But I understand that in this idealistic world of EL there's a separation.
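(For context, the "E talks to the L" pattern above might look roughly like this cache-aside sketch; the SQLite file, table layout, and freshness window are illustrative assumptions, not the actual homegrown setup:)
```python
import json
import sqlite3
import time

CACHE_DB = "api_cache.sqlite"       # hypothetical cache location
MAX_AGE_SECONDS = 30 * 24 * 3600    # treat responses younger than ~a month as fresh

def _cache_conn():
    conn = sqlite3.connect(CACHE_DB)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses (url TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)"
    )
    return conn

def extract(url, fetch):
    """Return the cached payload if fresh enough, otherwise call `fetch(url)` and cache it."""
    conn = _cache_conn()
    row = conn.execute(
        "SELECT payload, fetched_at FROM responses WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[1] < MAX_AGE_SECONDS:
        return json.loads(row[0])  # no API call needed

    payload = fetch(url)  # e.g. the backoff helper sketched above
    conn.execute(
        "INSERT OR REPLACE INTO responses (url, payload, fetched_at) VALUES (?, ?, ?)",
        (url, json.dumps(payload), time.time()),
    )
    conn.commit()
    return payload
```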
v
Sure, if you want to front-end a cache, do API -> Caching Layer -> Database. In Singer world that's tap-api -> target-cache, then tap-cache -> target-postgres, or something like that. That can work for sure! You can definitely cache all the API calls; it sounds like you really want to do that, so you should just do it https://realpython.com/caching-external-api-requests/
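(If you cache at the HTTP layer, a library like requests-cache, which I believe is the sort of thing the linked article covers, does most of the work; the cache name and expiry below are just illustrative:)
```python
from datetime import timedelta

import requests_cache

# Transparent HTTP-level cache: responses are stored in a local SQLite file
# and reused until they expire, so repeated runs don't burn API calls.
session = requests_cache.CachedSession(
    "api_cache",                      # hypothetical cache name -> api_cache.sqlite
    expire_after=timedelta(days=30),  # matches the "free within a month" window mentioned above
)

response = session.get("https://api.example.com/v1/orders")  # placeholder endpoint
print(response.from_cache)  # True on the second call within the expiry window
```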
I would look hard at what a "considerable" amount is for 1,000 calls. Is it $10,000, $1, or $0.10? Make business choices based on that 🤷
r
Hahaha love this
> You can definitely cache all the API calls; it sounds like you really want to do that, so you should just do it
So you’re suggesting framing the cache as a target. Fair enough : )
Ty for the link. This is a refactoring of an existing system into Meltano, so we already have the cache, etc.
v
Instead of doing a massive refactor you could just write the tap-cache and you'd be off to the races, but I think you'd hit more issues doing it that way 🤷
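(For what it's worth, a bare-bones tap-cache built on the Meltano Singer SDK might look something like the sketch below; the cache table layout and the `cache_path` config option are assumptions, and the stream schema would need to match whatever the cache actually stores:)
```python
import sqlite3

from singer_sdk import Stream, Tap
from singer_sdk import typing as th


class CachedCallsStream(Stream):
    """Replays rows from the local API-call cache as Singer records."""

    name = "cached_calls"
    primary_keys = ["url"]
    schema = th.PropertiesList(
        th.Property("url", th.StringType),
        th.Property("payload", th.StringType),
        th.Property("fetched_at", th.NumberType),
    ).to_dict()

    def get_records(self, context):
        conn = sqlite3.connect(self.config["cache_path"])
        for url, payload, fetched_at in conn.execute(
            "SELECT url, payload, fetched_at FROM responses"
        ):
            yield {"url": url, "payload": payload, "fetched_at": fetched_at}


class TapCache(Tap):
    name = "tap-cache"
    config_jsonschema = th.PropertiesList(
        th.Property("cache_path", th.StringType, required=True),
    ).to_dict()

    def discover_streams(self):
        return [CachedCallsStream(self)]


if __name__ == "__main__":
    TapCache.cli()
```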
r
Just thinking out loud, I guess that second workflow could also be a dbt run instead of a tap->target run 🤔