# singer-tap-development
j
I rebuilt one of my SDK-based taps last week (Nov 22 2021) and ran several pipelines based on the rebuilt tap. Various runs stalled indefinitely at the beginning of a new stream, before any records were synced. The last log line I see when a run gets hung up is (even with `--log-level=debug`):
```
tap-pendo        | extractor | time=2021-11-29 16:56:24 name=tap-pendo level=INFO message=Beginning incremental sync of 'guideEvents' with context: {'guideId': '<abcDEF>'}...
```
And then it will run for days (literally) without ever emitting another log line. At first, I assumed this had something to do with the recent release of the SDK (this commit in particular). But I can see other places in the logs where error logs from the new `validate_response` method are being produced. I don't know exactly how to reproduce this because the API I'm using doesn't produce this behavior on a predictable basis. All I know is that the tap is built for the Pendo API. Without any useful logs, I don't know where else to look to debug. Any thoughts?
a
Probably stating the obvious, but sounds like a circular reference or infinite loop somewhere. If this is available on GitHub or similar, is there a repo link you can share by chance? cc @edgar_ramirez_mondragon
e
Yup, that'd help. The default decorator by itself should not cause an infinite loop 🤔
@aaronsteers @edgar_ramirez_mondragon ☝️ Strange this would start happening now. It's been running for a month+ with no failure like this. Now every 2-4 runs is getting stuck like this.
Now that I'm thinking about it, this only seems to be getting hung up on child streams (i.e., featureEvents, pageEvents, guideEvents, TrackEvents, pollEvents in this tap)
a
I'm not sure if this is the issue or not, but I believe this private method may no longer be the correct override point.
With the latest error handling in place, the private method is now called `_request` and the decorator is added by overriding `request_decorator`: singer_sdk/streams/rest.py · main · Meltano / Meltano SDK for Singer Taps and Targets · GitLab
Whenever overriding private methods, there's an increased risk of them breaking since those are not part of the formal API. Admittedly, you had the correct solution given what we had previously made available. Hopefully with this latest update, you can get the same functionality without having to override private members. 🤞
And to be clear, even if the above "fixes" your problem (and I'm not yet 100% sure it will), there's probably something on our side we can+should do to prevent the silent failure or loop that's happening here.
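For reference, a minimal sketch of what that public override point might look like, assuming the `request_decorator` hook described above; the class name, `url_base`, `path`, and retry parameters are illustrative, not taken from the tap in this thread:
```python
# Illustrative sketch only: override the public `request_decorator`
# hook instead of the private request method.
import backoff
from singer_sdk.exceptions import RetriableAPIError
from singer_sdk.streams import RESTStream


class GuideEventsStream(RESTStream):
    """Hypothetical child stream for illustration."""

    name = "guideEvents"
    url_base = "https://app.pendo.io"  # illustrative base URL
    path = "/aggregation"              # illustrative path

    def request_decorator(self, func):
        # Retry transient API errors with exponential backoff; the SDK
        # applies the returned decorator around its own request call.
        return backoff.on_exception(
            backoff.expo,
            RetriableAPIError,
            max_tries=5,
            factor=2,
        )(func)
```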
j
Yeah, I saw that code change you guys made, and it actually worked in at least one run that I can identify. For example, I could see an error message in the logs with:
```
tap-pendo        | extractor | time=2021-11-29 21:18:49 name=backoff level=INFO message=Backing off _request(...) for 0.4s (singer_sdk.exceptions.RetriableAPIError: 502 Server Error: Bad Gateway for path: /aggregation)
```
My previous override was working just like this. Since that private method no longer exists, I doubt this is causing the hang-up, but just in case, I'll remove it and try again. If anything, the only significant difference I can see between your new backoff code and my old override is that I've added the `giveup` param:
```python
giveup=lambda e: e.response is not None and 400 <= e.response.status_code < 500,
```
but I'm not confident this explains the behavior.
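For context, a sketch of where a `giveup` predicate like that slots into a `backoff.on_exception` decorator; it assumes the exception exposes the original `requests` response, as the quoted lambda does, and the retry parameters are illustrative:
```python
# Sketch: stop retrying on 4xx client errors, keep backing off on
# everything else the decorator catches.
import backoff
from singer_sdk.exceptions import RetriableAPIError


def _give_up_on_client_error(exc) -> bool:
    # Assumes the exception carries the original `requests` response.
    response = getattr(exc, "response", None)
    return response is not None and 400 <= response.status_code < 500


retry = backoff.on_exception(
    backoff.expo,
    RetriableAPIError,
    max_tries=5,  # illustrative
    factor=2,     # illustrative
    giveup=_give_up_on_client_error,
)
```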
e
@aaronsteers
> to prevent the silent failure or loop that's happening here.
A quick thing would be to add some logging there.
j
@aaronsteers @edgar_ramirez_mondragon Just to bring this discussion full circle: I played around a bit more to see if I could find a fix. I suspected that the tap was hanging on a dangling request. The `requests` package does not implement a default timeout, so I overrode the `_request` function to include a timeout and added the timeout exception to the backoff decorator. Since doing that, I have not seen any more hangs at this location in the logs. It also sped up the average time to pipeline completion 🙂. Having done this, I wonder if it makes sense to add a default (but easily configurable) timeout within the SDK. Something high like 5 or 10 minutes, of course, but at least that way developers would eventually get some log of what happened and could then adjust the timeout limit to their liking. Just a thought.
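A hedged sketch of that workaround: send the request with an explicit timeout so a dangling request raises instead of hanging, and let the retry decorator also back off on timeouts. `_request` is a private SDK method whose signature can differ between versions, and the class name, base URL, and 300-second value here are assumptions for the example, not the exact code used above:
```python
# Simplified sketch; the real private method also does things like
# logging request duration, which are omitted here.
import backoff
import requests
from singer_sdk.exceptions import RetriableAPIError
from singer_sdk.streams import RESTStream

REQUEST_TIMEOUT_SECONDS = 300  # hypothetical default; tune per API


class TimeoutAwareStream(RESTStream):
    url_base = "https://app.pendo.io"  # illustrative

    def _request(self, prepared_request, context):
        # Pass an explicit timeout so the request can't hang forever.
        response = self.requests_session.send(
            prepared_request,
            timeout=REQUEST_TIMEOUT_SECONDS,
        )
        self.validate_response(response)
        return response

    def request_decorator(self, func):
        # Retry on timeouts as well as retriable API errors.
        return backoff.on_exception(
            backoff.expo,
            (RetriableAPIError, requests.exceptions.Timeout),
            max_tries=5,
            factor=2,
        )(func)
```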
a
Very cool. Thanks for this update! And yes, I really like the idea of adding a default timeout if we don't have one already in the latest implementation. Do you want to log an issue for this? And if you had time to draft an MR also, this sounds like something we could get merged pretty quickly. @edgar_ramirez_mondragon - Wdyt?
e
> I wonder if it makes sense to add a default (but easily configurable) timeout within the SDK. Something high like 5 or 10 minutes of course
That sounds like a good idea