# singer-tap-development
j
I rebuilt one of my SDK-based taps last week (Nov 22 2021) and ran several pipelines based on the rebuilt tap. Various runs stalled indefinitely at the beginning of a new stream, before any records were synced. The last log line I see when a run gets hung up is (even with `--log-level=debug`):
```
tap-pendo        | extractor | time=2021-11-29 16:56:24 name=tap-pendo level=INFO message=Beginning incremental sync of 'guideEvents' with context: {'guideId': '<abcDEF>'}...
```
And then it will run for days (literally) without ever emitting another log line. At first, I assumed this had something to do with the recent release of the SDK (this commit in particular). But I can see other places in the logs where error logs from the new `validate_response` method are being produced. I don't know exactly how to reproduce this because the API I'm using doesn't produce this behavior on a predictable basis. All I know is that the tap is built for the Pendo API. Without any useful logs, I don't know where else to look to debug. Any thoughts?
a
Probably stating the obvious, but sounds like a circular reference or infinite loop somewhere. If this is available on GitHub or similar, is there a repo link you can share by chance? cc @edgar_ramirez_mondragon
e
Yup, that'd help. The default decorator by itself should not cause an infinite loop 🤔
@aaronsteers @edgar_ramirez_mondragon ☝️ Strange this would start happening now. It's been running for a month+ with no failure like this. Now every 2-4 runs is getting stuck like this.
Now that I'm thinking about it, this only seems to be getting hung up on child streams (i.e., featureEvents, pageEvents, guideEvents, TrackEvents, pollEvents in this tap)
a
I'm not sure if this is the issue or not, but I believe this private method may no longer be the correct override point.
With the latest error handling in place, the private method is now called `_request` and the decorator is added by overriding `request_decorator`: singer_sdk/streams/rest.py · main · Meltano / Meltano SDK for Singer Taps and Targets · GitLab
Whenever overriding private methods, there's an increased risk of them breaking since those are not part of the formal API. Admittedly, you had the correct solution given what we had previously made available. Hopefully with this latest update, you can get the same functionality without having to override private members. 🤞
And to be clear, even if the above "fixes" your problem (and I'm not yet 100% sure it will), there's probably something on our side we can+should do to prevent the silent failure or loop that's happening here.
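For reference, a minimal sketch of what that public override point might look like, assuming the `request_decorator` hook described above; the class name, `url_base`, `path`, and retry parameters are illustrative, not taken from the tap in this thread:
```python
# Illustrative sketch only: override the public `request_decorator`
# hook instead of the private request method.
import backoff
from singer_sdk.exceptions import RetriableAPIError
from singer_sdk.streams import RESTStream


class GuideEventsStream(RESTStream):
    """Hypothetical child stream for illustration."""

    name = "guideEvents"
    url_base = "https://app.pendo.io"  # illustrative base URL
    path = "/aggregation"              # illustrative path

    def request_decorator(self, func):
        # Retry transient API errors with exponential backoff; the SDK
        # applies the returned decorator around its own request call.
        return backoff.on_exception(
            backoff.expo,
            RetriableAPIError,
            max_tries=5,
            factor=2,
        )(func)
```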
j
Yeah, I saw that code change you guys made, and it actually worked in at least one run that I can identify. For example, I could see an error message in the logs with:
```
tap-pendo        | extractor | time=2021-11-29 21:18:49 name=backoff level=INFO message=Backing off _request(...) for 0.4s (singer_sdk.exceptions.RetriableAPIError: 502 Server Error: Bad Gateway for path: /aggregation)
```
My previous override was working just like this. Since that private method no longer exists, I doubt this is causing the hang-up, but just in case, I'll remove it and try again. If anything, the only significant difference I can see between your new backoff code and my old override is that I've added the `giveup` param:
```python
giveup=lambda e: e.response is not None and 400 <= e.response.status_code < 500,
```
but I'm not confident this explains the behavior.
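For context, a sketch of where a `giveup` predicate like that slots into a `backoff.on_exception` decorator; it assumes the exception exposes the original `requests` response, as the quoted lambda does, and the retry parameters are illustrative:
```python
# Sketch: stop retrying on 4xx client errors, keep backing off on
# everything else the decorator catches.
import backoff
from singer_sdk.exceptions import RetriableAPIError


def _give_up_on_client_error(exc) -> bool:
    # Assumes the exception carries the original `requests` response.
    response = getattr(exc, "response", None)
    return response is not None and 400 <= response.status_code < 500


retry = backoff.on_exception(
    backoff.expo,
    RetriableAPIError,
    max_tries=5,  # illustrative
    factor=2,     # illustrative
    giveup=_give_up_on_client_error,
)
```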
e
@aaronsteers
> to prevent the silent failure or loop that's happening here.
A quick thing would be to add some logging there.
j
@aaronsteers @edgar_ramirez_mondragon Just to bring this discussion full circle: I played around a bit more to see if I could find a fix. I suspected that the tap was hanging on a dangling request. The `requests` package does not implement a default timeout, so I overrode the `_request` function to include a timeout and added the timeout exception to the backoff decorator. Since doing that, I have not seen any more hangs at this location in the logs. It also sped up the average time to pipeline completion 🙂. Having done this, I wonder if it makes sense to add a default (but easily configurable) timeout within the SDK. Something high like 5 or 10 minutes, of course, but at least that way developers would eventually get some log of what happened and could then adjust the timeout limit to their liking. Just a thought.
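A hedged sketch of that workaround: send the request with an explicit timeout so a dangling request raises instead of hanging, and let the retry decorator also back off on timeouts. `_request` is a private SDK method whose signature can differ between versions, and the class name, base URL, and 300-second value here are assumptions for the example, not the exact code used above:
```python
# Simplified sketch; the real private method also does things like
# logging request duration, which are omitted here.
import backoff
import requests
from singer_sdk.exceptions import RetriableAPIError
from singer_sdk.streams import RESTStream

REQUEST_TIMEOUT_SECONDS = 300  # hypothetical default; tune per API


class TimeoutAwareStream(RESTStream):
    url_base = "https://app.pendo.io"  # illustrative

    def _request(self, prepared_request, context):
        # Pass an explicit timeout so the request can't hang forever.
        response = self.requests_session.send(
            prepared_request,
            timeout=REQUEST_TIMEOUT_SECONDS,
        )
        self.validate_response(response)
        return response

    def request_decorator(self, func):
        # Retry on timeouts as well as retriable API errors.
        return backoff.on_exception(
            backoff.expo,
            (RetriableAPIError, requests.exceptions.Timeout),
            max_tries=5,
            factor=2,
        )(func)
```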
a
Very cool. Thanks for this update! And yes, I really like the idea of adding a default timeout if we don't have one already in the latest implementation. Do you want to log an issue for this? And if you had time to draft an MR also, this sounds like something we could get merged pretty quickly. @edgar_ramirez_mondragon - Wdyt?
e
> I wonder if it makes sense to add a default (but easily configurable) timeout within the SDK. Something high like 5 or 10 minutes of course
That sounds like a good idea