# singer-tap-development
i
What do you guys do in cases where streams take forever to process? I have a few streams in my custom tap for a REST API that are unexpectedly slow (less data than my other streams, but they take ~5x as long). I use date range pagination and have tried cutting the pagination buckets down into more digestible pieces for my tap, but this particular stream still takes forever even though it's only loading a total of ~100k rows over a two-year timeframe. Each response also takes about 5 minutes to come back. Is this an issue I should bring up with the engineers of the source system, or is it likely an issue on my end? Also, one of the streams just gets completely stuck at 5-16 (the replication key gets set to that value over and over). My other streams with basically the same logic (more data, 2-week date range pagination) run in probably 1/5th the time. How do you guys go about troubleshooting these things?
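For the troubleshooting question, a useful first step is to measure where the time actually goes. Here's a minimal sketch, assuming the tap is built on the Meltano Singer SDK: override `parse_response` to log how long each request spends waiting on the server versus how many records each page yields. The stream name, URL, and schema below are all hypothetical placeholders.

```python
import requests
from singer_sdk.streams import RESTStream


class TimedStream(RESTStream):
    """Hypothetical stream that logs per-page timing to isolate slowness."""

    name = "slow_stream"                      # hypothetical stream name
    url_base = "https://api.example.com"      # hypothetical API
    path = "/records"
    schema = {"type": "object", "properties": {}}  # placeholder schema

    def parse_response(self, response: requests.Response):
        # response.elapsed measures send-to-first-byte, i.e. time spent
        # waiting on the server, not time spent processing in the tap.
        records = list(super().parse_response(response))
        self.logger.info(
            "page: %.1fs waiting on server, %d records returned",
            response.elapsed.total_seconds(),
            len(records),
        )
        yield from records
```

If `response.elapsed` dominates (e.g. ~5 minutes per page), the slowness is server-side and worth raising with the API's engineers; if not, the time is going to record processing inside the tap.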
v
I can't find the write-up quickly, but Aaronsteers did a great write-up of a "best practice" for how to split streams. The idea is to split those "pesky" streams into separate taps so that your main pipeline (the 90% that are good to go) stays working. That's a good start, since it lets everything else keep running. How to speed up long-running streams is a separate question, and it all depends on the source you're pulling data from. The ideas you have are good general ones, but I tend to start by reading the docs on that source API to see if there's some other way to pull or batch data from that system; sometimes there are bulk options that speed things up 100x. If there's genuinely no other way to pull the data, then binning is the right approach, which is what you're doing, and you can add logic on the orchestration side so that you're not syncing that stream as often and "babying" it.
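One way to implement that split in code, as a minimal sketch assuming a Singer SDK tap: gate which streams a given tap instance exposes behind a config key, then run two instances, the healthy streams on the main schedule and the pesky one on its own. The `enabled_streams` key and both stream classes are hypothetical; with Meltano you can get a similar effect declaratively via `inherit_from` plus `select` in `meltano.yml`.

```python
from singer_sdk import Tap
from singer_sdk.streams import RESTStream


class OrdersStream(RESTStream):  # hypothetical healthy stream
    name = "orders"
    url_base = "https://api.example.com"
    path = "/orders"
    schema = {"type": "object", "properties": {}}


class PeskyStream(RESTStream):  # hypothetical slow stream
    name = "pesky"
    url_base = "https://api.example.com"
    path = "/pesky"
    schema = {"type": "object", "properties": {}}


class TapExample(Tap):
    name = "tap-example"

    def discover_streams(self):
        streams = [OrdersStream(self), PeskyStream(self)]
        # "enabled_streams" is a hypothetical config key: list the stream
        # names this instance should run, so the slow stream can live in
        # its own pipeline without blocking the rest.
        enabled = self.config.get("enabled_streams")
        if enabled:
            streams = [s for s in streams if s.name in enabled]
        return streams
```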
i
I solved this for one of my streams by overriding the pagination method to chunk it into 4-day buckets (sketched below), and that worked fine. It made sense there because that table's row total was in the millions, not the 100-500k of my other streams. But I just ran this one for an hour with the same logic and it returned a total of ~15k records, which seems kind of absurd for how long it took. That's why I'm led to believe this is a distinct issue with this particular API endpoint. Are you sure it isn't something I could just bug the source's engineering team about? lol. They do not have great API documentation (any). Right now I'm just not deploying the pesky pipelines since there's no downstream report being demanded, but there will be eventually, so I just want to get ahead of it.
I'll take a look at Aaron's write-up
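For reference, the bucket override described above can look like the following. This is a minimal sketch assuming a Singer SDK tap: a paginator that steps through fixed four-day windows between a start date and now, with the stream feeding each window into the request params. The class names, query param names (`updated_since`/`updated_before`), URL, and dates are all hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

import requests
from singer_sdk.pagination import BaseAPIPaginator
from singer_sdk.streams import RESTStream


class DateWindowPaginator(BaseAPIPaginator):
    """Steps through fixed-size date windows instead of API-provided cursors."""

    def __init__(self, start: datetime, end: datetime, window_days: int = 4):
        super().__init__(start_value=start)
        self.end = end
        self.window = timedelta(days=window_days)

    def get_next(self, response: requests.Response) -> datetime | None:
        # Advance to the next window; returning None stops pagination.
        next_start = self.current_value + self.window
        return next_start if next_start < self.end else None


class ChunkedStream(RESTStream):  # hypothetical slow stream
    name = "chunked"
    url_base = "https://api.example.com"  # hypothetical API
    path = "/records"
    schema = {"type": "object", "properties": {}}

    def get_new_paginator(self) -> DateWindowPaginator:
        start = datetime(2023, 1, 1, tzinfo=timezone.utc)  # hypothetical start date
        return DateWindowPaginator(start, datetime.now(timezone.utc), window_days=4)

    def get_url_params(self, context, next_page_token):
        # In current SDK versions the paginator's current value is passed
        # in as next_page_token, so each request covers exactly one window.
        window_start = next_page_token
        return {
            "updated_since": window_start.isoformat(),  # hypothetical param
            "updated_before": (window_start + timedelta(days=4)).isoformat(),  # hypothetical param
        }
```

A paginator like this won't fix a five-minute server response by itself; it just keeps each request small enough that a timeout or failure costs one window instead of the whole sync.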