
429 Too Many Requests #367

Closed
dcroote opened this issue Oct 22, 2021 · 14 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@dcroote
Contributor

dcroote commented Oct 22, 2021

We've intermittently seen HTTP status client error (429 Too Many Requests) for url errors for github.com URLs, even when using the GITHUB_TOKEN env variable and reducing concurrency to 10 threads. A few examples of Action runs where this occurred: 1, 2, 3, 4, and here is the workflow itself. Note the workflow uses a cached installation of lychee installed via cargo from a specific commit sha rather than the v0.7.1 marketplace Action, because of additional functionality that has been added since that release.

I think the ideal solution would be, as described in #348, to avoid hitting the same URL repeatedly from different files (which is indeed the case with the URLs causing failures in the Actions runs above), but I can see this is complex (#349).

Another theoretical idea could be running lychee twice, once with high concurrency using --exclude github.com and again with minimal concurrency and --include github.com.

A third might be identifying github.com URLs and applying different internal handling to how they're requested?

Would love to get your thoughts and also confirm it isn't a usage issue.

Overall though, great tool - it's helpful for quickly checking both internal and external URLs as well as images / img tags.

@MichaIng
Member

I personally simply use -a 429. The error is specific enough to accept in general, and on every run different GitHub URLs are affected, so all of them still get checked sufficiently regularly 😄.

URL/request caching across files may indeed help, but it does not rule out this issue completely, since GitHub enforces a rate limit even with a token.

@mre
Member

mre commented Oct 26, 2021

The only proper solution is to understand and respect the rate-limiting mechanisms of different websites.
GitHub sends rate-limiting headers with every response:

$ curl -I https://api.github.com/users/octocat/orgs

> HTTP/2 200
> Server: nginx
> Date: Fri, 12 Oct 2012 23:33:14 GMT
> Content-Type: application/json; charset=utf-8
> ETag: "a00049ba79152d03380c34652f2cb612"
> X-GitHub-Media-Type: github.v3
> X-RateLimit-Limit: 5000
> X-RateLimit-Remaining: 4987
> X-RateLimit-Reset: 1350085394
> Content-Length: 5
> Cache-Control: max-age=0, private, must-revalidate
> X-Content-Type-Options: nosniff

This is not a standard, however; other services might use different headers and formats. More info here (⚠️ Medium link).
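For illustration, here is a minimal sketch of reading those GitHub-specific headers off a response, assuming reqwest is used for the requests (the function name and the returned tuple are made up for this example, not lychee's actual code):

    // Minimal sketch, assuming reqwest; the X-RateLimit-* names are GitHub-specific,
    // not a standard, so this only helps for GitHub responses.
    fn github_rate_limit(resp: &reqwest::Response) -> Option<(u64, u64)> {
        let parse = |name: &str| -> Option<u64> {
            resp.headers().get(name)?.to_str().ok()?.parse().ok()
        };
        // Requests left in the current window, and the Unix timestamp at which it resets.
        let remaining = parse("x-ratelimit-remaining")?;
        let reset = parse("x-ratelimit-reset")?;
        Some((remaining, reset))
    }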

A simpler solution is to use a per-website rate-limiting factor. Something like --rate-limit github.com,1, where the first part is a domain (a regular expression, really) and the second part is the maximum number of requests per second. It could be below 1, such as 0.1 for at most 1 request every 10 seconds.
This does not fully solve the problem of course, but it would be a start. There could also be a global rate-limiting factor that is set to unlimited in the beginning and gets overridden per domain. That would probably fix 80% of the problems, at the cost of slowing down execution when many links point to a website that could handle higher concurrency.
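As a rough sketch of how such a --rate-limit <domain>,<requests-per-second> setting could be enforced internally (the struct, field names, and the 1-second example are made up, not lychee's internals; assumes tokio for the async sleep):

    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    // Hypothetical per-host limiter: each host maps to a minimum interval between
    // requests, e.g. "github.com" -> 1s for 1 request per second.
    struct DomainLimiter {
        min_interval: HashMap<String, Duration>,
        next_allowed: HashMap<String, Instant>,
    }

    impl DomainLimiter {
        async fn wait_for(&mut self, host: &str) {
            let Some(interval) = self.min_interval.get(host).copied() else {
                return; // no limit configured for this host
            };
            let now = Instant::now();
            // Earliest moment this host may be hit again; push it forward for the next caller.
            let at = self.next_allowed.get(host).copied().unwrap_or(now).max(now);
            self.next_allowed.insert(host.to_string(), at + interval);
            tokio::time::sleep_until(at.into()).await;
        }
    }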

If we used tower for the request handling, this would be an optional feature: https://docs.rs/tower/0.4.10/tower/limit/rate/struct.RateLimit.html, which is really cool. However, that is still a long way off, and I'm not even sure if it provides everything else we'd need. Given that we need some way to rate-limit requests per domain at some point anyway, I'd say we could start with that.
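For reference, a minimal, self-contained example of tower's rate limiting, assuming the tokio runtime and tower built with the "limit" and "util" features; the check service here is just a stand-in for the real request logic:

    use std::time::Duration;
    use tower::{Service, ServiceBuilder, ServiceExt, service_fn};

    #[tokio::main]
    async fn main() {
        // Stand-in for the real link check; just echoes the URL back.
        let check = service_fn(|url: String| async move {
            Ok::<_, std::convert::Infallible>(format!("checked {url}"))
        });

        // Allow at most 1 call per second through the wrapped service.
        let mut limited = ServiceBuilder::new()
            .rate_limit(1, Duration::from_secs(1))
            .service(check);

        for url in ["https://example.com/a", "https://example.com/b"] {
            // `ready()` waits until the rate limiter permits the next call.
            let reply = limited.ready().await.unwrap().call(url.to_string()).await.unwrap();
            println!("{reply}");
        }
    }

The RateLimit layer just delays poll_ready, so the calling code only has to await ready() before each call.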

@mre
Member

mre commented Oct 26, 2021

Here is some code showing how I envision rate limiting could look in the future: https://github.com/james7132/Hourai/blob/4a4b343f237f726ae1d2e2943e0bed58406f2cf6/logger/hourai-feeds/src/reddit/rate_limiter.rs
If anyone wants to give it a shot, post a comment here for help.

@mre
Member

mre commented Oct 26, 2021

See also seanmonstar/reqwest#491

@JAremko

JAremko commented Nov 3, 2021

I test files in batches with separate lychee-actions. They run ~2 hours apart from each other. It would be cool if there were an arg that could set the root directory. --base doesn't work in the action for some reason, so I ended up encoding it in the file path doc/**/*.html

I think gists should also be tested with tokens. See lycheeverse/lychee-action#54

On the other hand, there might be a way to fix it for good by splitting the link checking process into 2 actions.
The first one collects all the unique URLs (normalizing, removing # etc.) and stores them somehow. Maybe simply output them in a parseable form as part of the CI log. Or maybe in an issue.
The second action goes through this list and tests the links; if it hits a recoverable error, it generates another list and the cycle continues 😁 It would be much less "hacky" if there were a way to schedule an action from an action 🤔
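For the "collects all the unique URLs" step, a small sketch of the normalization/deduplication, assuming the url crate; the function name and input slice are made up for illustration:

    use std::collections::HashSet;
    use url::Url;

    // Dedupe raw links: parse each one, drop the fragment so "page#a" and "page#b"
    // count as one URL, and collect into a set.
    fn unique_urls(raw_links: &[&str]) -> HashSet<Url> {
        let mut seen = HashSet::new();
        for raw in raw_links {
            if let Ok(mut url) = Url::parse(raw) {
                url.set_fragment(None);
                seen.insert(url);
            }
        }
        seen
    }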

@JAremko

JAremko commented Nov 3, 2021

Actually, there is the https://github.com/marketplace/actions/schedule-job-action thingy. But it seems to require a personal token, so it wouldn't work for my use case.

@JAremko

JAremko commented Nov 3, 2021

Another solution would be the introduction of "URL health". A "healthy" URL is one that has been successfully tested in the last 24 hours and doesn't need to be tested again for some time. A 🤒 URL is one that has hit a recoverable error or timed out fewer than X times in a row. And a 💀 URL is everything else.
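A rough sketch of how those three states could be modelled; the names, the 24-hour window, and the helper function are made up for illustration:

    use std::time::{Duration, SystemTime};

    // Hypothetical health states for a URL, mirroring the proposal above.
    enum UrlHealth {
        // Checked successfully within the last 24 hours; can be skipped for a while.
        Healthy { last_ok: SystemTime },
        // Hit a recoverable error or timed out fewer than X times in a row.
        Sick { consecutive_failures: u32 },
        // Everything else.
        Dead,
    }

    fn needs_check(health: &UrlHealth) -> bool {
        match health {
            UrlHealth::Healthy { last_ok } => last_ok
                .elapsed()
                .map_or(true, |age| age > Duration::from_secs(24 * 60 * 60)),
            UrlHealth::Sick { .. } | UrlHealth::Dead => true,
        }
    }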

@MichaIng
Member

MichaIng commented Nov 3, 2021

--base doesn't work in the action for some reason.

Note that this option does not define the root directory for input files, but the root path or URL for the contained relative/internal links, like <a href="/subdir/page.html">. For the inputs you hence still need to define file names, a regex, or a URL you want to check links on.

I think gists should also be tested with tokens.

Makes sense, probably all subdomains of github.com should use the token as well.

@JAremko

JAremko commented Nov 3, 2021

Actually, if the link checker had some kind of state shared between runs, I like the idea of storing links and their statuses (health) in an issue for further inspection. Maybe even using such an issue as a config file of sorts.
The action could update that config issue with the list of ignored URLs, the dates when each URL was last tested, which URLs are in the "temporary outage" list, and which links it will test on the next run. It would be much easier to see what's going on 🤔

@MichaIng Should've RTFM 🤕 🌴

@mre
Member

mre commented Nov 4, 2021

Other solution would be an introduction of "URL health". "Healthy" URL is an URL that has been successfully tested in last 24 hours and it doesn't need to be tested again for some time. 🤒

Yeah that is planned. #163

@rigtorp
Contributor

rigtorp commented Nov 23, 2021

Honoring rate limit headers would be a good idea and not too difficult; here's how it's done in pytest-check-links:
https://github.com/jupyterlab/pytest-check-links/blob/master/pytest_check_links/plugin.py#L301

        if response.status_code >= 400:
            # If retries remain and self.sleep() managed to honor a Retry-After
            # header, drop the cached result and try the same URL again.
            if retries and self.sleep(response.headers):
                self.uncache_url(url_no_anchor)
                return self.fetch_with_retries(url, retries=retries - 1)

and self.sleep tries to sleep if the Retry-After header is present (https://github.com/jupyterlab/pytest-check-links/blob/master/pytest_check_links/plugin.py#L243).

Here's the spec for the Retry-After header https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After
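An equivalent retry loop in Rust, assuming reqwest and tokio; the retry count and the 60-second fallback are arbitrary, and the HTTP-date form of Retry-After is not handled in this sketch:

    use std::time::Duration;

    // Retry a HEAD request a few times, sleeping for the number of seconds the
    // server asks for in Retry-After whenever it answers 429.
    async fn head_with_retry(client: &reqwest::Client, url: &str) -> reqwest::Result<reqwest::StatusCode> {
        let mut status = reqwest::StatusCode::TOO_MANY_REQUESTS;
        for _ in 0..3 {
            let resp = client.head(url).send().await?;
            status = resp.status();
            if status != reqwest::StatusCode::TOO_MANY_REQUESTS {
                break;
            }
            // Retry-After as delay-seconds; the HTTP-date variant is ignored here.
            let wait = resp
                .headers()
                .get(reqwest::header::RETRY_AFTER)
                .and_then(|v| v.to_str().ok())
                .and_then(|s| s.parse::<u64>().ok())
                .unwrap_or(60);
            tokio::time::sleep(Duration::from_secs(wait)).await;
        }
        Ok(status)
    }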

@dcroote
Contributor Author

dcroote commented Nov 27, 2021

@rigtorp - thanks for the resources. This would be helpful for sites that include a Retry-After header, but unfortunately GitHub has its own set of headers. See the above comment by @mre.

@rigtorp
Contributor

rigtorp commented Dec 1, 2021

@dcroote They do support it; it's a standard header implemented by all the reverse proxies and caches, so it's the more generic solution. The rate limit headers are typically only implemented for APIs, where an API consumer can use them to throttle work items pulled from a queue, for example.

First, to generate the rate limit:

while true; do curl --head "https://github.com/rigtorp/MPMCQueue"; done

After some time:

HTTP/2 429 
date: Wed, 01 Dec 2021 21:51:55 GMT
server: Varnish
strict-transport-security: max-age=31536000; includeSubdomains; preload
x-content-type-options: nosniff
x-frame-options: deny
x-xss-protection: 1; mode=block
content-security-policy: default-src 'none'; style-src 'unsafe-inline'
content-type: text/html; charset=utf-8
retry-after: 60
content-length: 1474
x-github-request-id: B2AC:C7D8:6E231:76832:61A7EE7B

Here GitHub is asking us to wait 60 seconds.

@mre
Member

mre commented Mar 28, 2022

Current status for people coming in just now:

  • Response caching was implemented in #163 (Response caching?). Enable it with --cache.
  • If you're getting 429s on GitHub, check that you have a GITHUB_TOKEN set. This can significantly improve the situation by allowing way more requests.
  • You can exclude domains with --exclude <pattern>, e.g. --exclude github.com.
  • You can reduce the number of concurrent requests to avoid running into the rate limiter: --max-concurrency.
  • You can run lychee multiple times on a subset of the requests, as @JAremko described above.
  • I wrote a crate for parsing rate limit headers: https://github.com/mre/rate-limits. Planning to use it in combination with #36 (Look into tower as a replacement for deadpool + channels) to respect the server settings.

With that, I guess we can close this and focus on the open issues for the missing functionality.
