# troubleshooting
x
Hi everyone, I'm using the MeltanoLabs variant of tap-github. Here's my config:
```yaml
- name: tap-github
  variant: meltanolabs
  config:
    flattening_enabled: false
    repositories:
      - XXXXXX
      - XXXXXX
    start_date: '2020-01-01'
  select:
    - commits.*
    - events.*
    - reviews.*
    - issues.*
    - pull_request_commits.*
    - pull_requests.*
```
It runs fine for about 40 minutes then hits the following error.
```
singer_sdk.exceptions.RetriableAPIError: 403 Client Error: b'{"message":"API rate limit exceeded for user ID 12345. If you reach out to GitHub Support for help, please include the request ID XXXXXXX and timestamp 2025-07-01 15:15:50 UTC.","documentation_url":"https://docs.github.com/rest/overview/rate-limits-for-the-rest-api","status":"403"}' (Reason: Forbidden) for path: /repos/XXXXXXX/pulls/12345/commits
```
I understand this is due to a GitHub API rate limit being hit. The question is: is there a way around this while still pulling the full history of data? tap-github doesn't seem to support an `end_date` parameter, and it doesn't seem to support a `limit` either. What I'm hoping for is for the task to run successfully and for the bookmark to increment, so that even if it takes multiple runs over many hours, it's possible to get through the full history chunk by chunk. It's also not super clear to me what `rate_limit_buffer` does.
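For what it's worth, here's where I'd expect that setting to sit; my working assumption (not verified against the tap's source) is that it's the number of API quota points to keep in reserve per token:
```yaml
- name: tap-github
  variant: meltanolabs
  config:
    # assumption: the tap stops using a token once its remaining
    # quota drops below this buffer, rather than draining it to zero
    rate_limit_buffer: 1000
```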
v
Have you read through the readme here https://github.com/MeltanoLabs/tap-github ?
x
Yes, I've read the README. It isn't very clear on what the rate limit buffer does, and that appears to be the only mechanism to limit a run. There is no parameter like `end_date` or `max_records`. I thought I'd ask here before digging into the code and experimenting with that rate limit buffer parameter.
👍 2
v
Have you tried `additional_auth_tokens`?
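Something roughly like this (a sketch; the env var names are placeholders):
```yaml
- name: tap-github
  variant: meltanolabs
  config:
    auth_token: ${GITHUB_TOKEN}  # primary personal access token
    additional_auth_tokens:      # extra tokens the tap can rotate through
      - ${GITHUB_TOKEN_2}
      - ${GITHUB_TOKEN_3}
```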
x
I have actually used that, but the rate limit is hitting at the user level, so additional tokens created by me don't help. Yes, I could find other people to create tokens and give them to me, or create an org token that has a higher rate limit, but that felt like a very clumsy way around the issue. I was hoping for some mechanism to do the one-off backfill in chunks.
👍 1
v
What I'd do is try one of the streams and see if it works. Then maybe try a GitHub App instead; I think the limits are larger.
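For reference, this is GitHub's documented flow for minting an app installation token; the app ID, installation ID, and key path are placeholders, and you'd hand the resulting token to the tap (e.g. as `auth_token`):
```python
# Mint a GitHub App installation token (GitHub's documented App auth flow).
# Installation tokens draw on a separate, often larger, rate-limit pool
# than a user's personal access tokens.
# Requires: pip install pyjwt cryptography requests
import time

import jwt  # PyJWT
import requests

APP_ID = "123456"            # placeholder: your GitHub App ID
INSTALLATION_ID = "7890123"  # placeholder: the app's installation on your org

with open("github-app.private-key.pem", "rb") as f:
    private_key = f.read()

now = int(time.time())
app_jwt = jwt.encode(
    {"iat": now - 60, "exp": now + 540, "iss": APP_ID},  # JWT must live <= 10 min
    private_key,
    algorithm="RS256",
)

resp = requests.post(
    f"https://api.github.com/app/installations/{INSTALLATION_ID}/access_tokens",
    headers={
        "Authorization": f"Bearer {app_jwt}",
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["token"])  # short-lived token; pass it to the tap
```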
e
@xiaozhou_wang do you happen to know if the records in any or all of these streams come sorted by the replication key? If that's the case, then we could set `is_sorted = True` in their stream classes to make interruptions safer.
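Roughly what I mean, sketched against the Meltano SDK (the class and field names here are illustrative, not tap-github's actual code):
```python
# If a stream's records really do arrive ordered by the replication key,
# declaring is_sorted lets the SDK advance the bookmark as records flow,
# so an interrupted sync resumes from the last record instead of the
# start of the stream. The SDK errors on out-of-order records, so only
# set this when the API guarantees ordering.
from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class CommitsStream(RESTStream):
    name = "commits"
    url_base = "https://api.github.com"
    path = "/repos/{org}/{repo}/commits"  # illustrative path
    primary_keys = ["sha"]
    replication_key = "commit_timestamp"  # illustrative field
    is_sorted = True                      # records sorted by replication key

    schema = th.PropertiesList(
        th.Property("sha", th.StringType),
        th.Property("commit_timestamp", th.DateTimeType),
    ).to_dict()
```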
x
@visch Thanks, I think those are good suggestions. A GitHub App is still not an elegant solution, but at least it's moving in the direction of being less hacky than multiple personal access tokens. I'll give those a go.
@Edgar Ramírez (Arch.dev) I just had a look at the underlying GitHub APIs. I think the challenge is that there isn't much consistency across them, so I can see it's a pain from a developer point of view: https://docs.github.com/en/rest?apiVersion=2022-11-28 List Commits uses `since` and `until`, which are timestamp filters. List Pull Requests uses `sort` plus page size / page number. List Reviews has neither (although typically there will be fewer of these than PRs).
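To illustrate, this is the kind of windowed backfill the commits endpoint allows with its documented `since`/`until` parameters (the repo and token are placeholders, and persisting the records is elided):
```python
# Walk the commit history in 30-day windows so each chunk stays small
# and a failed run only loses the current window, not the whole backfill.
from datetime import datetime, timedelta, timezone

import requests

TOKEN = "ghp_..."    # placeholder
REPO = "owner/repo"  # placeholder

window_start = datetime(2020, 1, 1, tzinfo=timezone.utc)
now = datetime.now(timezone.utc)

while window_start < now:
    window_end = min(window_start + timedelta(days=30), now)
    page = 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{REPO}/commits",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={
                "since": window_start.isoformat(),
                "until": window_end.isoformat(),
                "per_page": 100,
                "page": page,
            },
            timeout=30,
        )
        resp.raise_for_status()
        commits = resp.json()
        if not commits:
            break
        # ...persist commits here, then record window_end as the bookmark...
        page += 1
    window_start = window_end
```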
Likely the smallest-lift way to tackle this issue is just a parameter that enables backoff. It should come with a warning that it might need to wait an hour or more for the API limit to reset, resulting in a job that runs continuously for hours. However, that only wastes server time, which is usually better than wasting human time.
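Roughly the behaviour I have in mind, sketched with GitHub's documented rate-limit headers (the tap would do this internally; this is just an illustration):
```python
# On a rate-limited response, sleep until the X-RateLimit-Reset epoch
# instead of failing the run. The wait can approach an hour.
import time

import requests


def get_with_rate_limit_backoff(url: str, token: str) -> requests.Response:
    while True:
        resp = requests.get(
            url,
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        rate_limited = (
            resp.status_code in (403, 429)
            and resp.headers.get("X-RateLimit-Remaining") == "0"
        )
        if rate_limited:
            reset_at = int(resp.headers.get("X-RateLimit-Reset", "0"))
            wait = max(reset_at - time.time(), 0) + 5  # small safety margin
            print(f"Rate limited; sleeping {wait:.0f}s until the quota resets")
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp
```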
💯 1