gracefully handle HTTP 429 return codes (Too Many Requests) #618

Closed
miabbott opened this issue Apr 24, 2019 · 23 comments

@miabbott

Downstream BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1702519

In the above BZ, the RHCOS nodes are attempting to pull a container from quay.io via skopeo and the registry is returning HTTP 429 (Too Many Requests). skopeo reacts with:

msg="Error determining repository tags: Invalid status code returned when fetching tags list 429 (Too Many Requests)"

I think the error message is misleading, since 429 is a valid status code (https://tools.ietf.org/html/rfc6585#section-4), so at a minimum the message should be changed.

cc: @runcom

@vrothberg
Member

Hi @miabbott, thanks for opening the issue.

[...], so at a minimum the message should be changed.

How about "Error determining repository tags: too many requests"? Alternatively, we can do a simple "stop DOSing the registry" ;-)

@miabbott
Author

@vrothberg I think that message is a good start; it might be useful to include the HTTP code, too.

Do you think there is anything else that can be done in this scenario besides just erroring more cleanly?

@vrothberg
Member

Do you think there is anything else that can be done in this scenario besides just erroring more cleanly?

I don't know the penalty (i.e., how long the client is rejected) but if that's ~predictable we could try to play smart and retry after a certain amount of time. @mtrmac do you know the (default) penalty?

@miabbott
Author

miabbott commented Apr 24, 2019

The spec says:

...MAY include a Retry-After header indicating how long to wait before making a new request.

So it is unfortunately up to the registry/server to decide whether to tell the client how long to back off. Maybe retry according to Retry-After if it is present in the response?
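
For illustration, honoring Retry-After when it is present could look roughly like the sketch below (plain net/http; parseRetryAfter is a hypothetical helper, not an existing c/image function). RFC 7231 allows the header to carry either delta-seconds or an HTTP-date, so both forms are handled:

import (
    "net/http"
    "strconv"
    "time"
)

// parseRetryAfter is a hypothetical helper: it reports how long the
// Retry-After header asks the client to wait, or false if the header is
// absent or unparsable.
func parseRetryAfter(res *http.Response) (time.Duration, bool) {
    v := res.Header.Get("Retry-After")
    if v == "" {
        return 0, false
    }
    if secs, err := strconv.Atoi(v); err == nil && secs >= 0 {
        return time.Duration(secs) * time.Second, true
    }
    if t, err := http.ParseTime(v); err == nil {
        return time.Until(t), true
    }
    return 0, false
}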

@vrothberg
Member

Sounds great, thanks @miabbott !

@wking
Contributor

wking commented Apr 24, 2019

And probably do some sort of exponential backoff (60s, doubling the wait between attempts, give up after... 3? attempts) if there is no Retry-After.
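
A rough sketch of that fallback, with a hypothetical doRequest function standing in for the actual registry call (the 60s / doubling / 3-attempt numbers are just the ones suggested above, not anything c/image implements today):

import "time"

// retryWithBackoff retries doRequest up to 3 times, waiting 60s before the
// second attempt and doubling the wait after each failure.
func retryWithBackoff(doRequest func() error) error {
    wait := 60 * time.Second
    var err error
    for attempt := 1; attempt <= 3; attempt++ {
        if err = doRequest(); err == nil {
            return nil
        }
        if attempt < 3 {
            time.Sleep(wait)
            wait *= 2
        }
    }
    return err
}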

@josephschorr

We can add a Retry-After header too, if that helps

@mtrmac
Collaborator

mtrmac commented May 7, 2019

[...], so at a minimum the message should be changed.

How about "Error determining repository tags: too many requests"? Alternatively, we can do a simple "stop DOSing the registry" ;-)

This is not raw HTTP; the docker/distribution protocol has an integrated error reporting mechanism: https://github.com/docker/distribution/blob/master/docs/spec/api.md#on-failure-too-many-requests-1 .

We should probably call github.com/docker/distribution/registry/client.HandleErrorResponse or something similar in GetRepositoryTags, instead of just blindly checking for 200.
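
Roughly what that could look like where the tags-list response is checked today (a sketch against the docker/distribution client package, not the actual c/image code; res stands for the *http.Response of the tags request):

import (
    "net/http"

    "github.com/docker/distribution/registry/client"
)

// ...inside GetRepositoryTags, after the tags-list request has returned res:
if res.StatusCode != http.StatusOK {
    // HandleErrorResponse decodes the registry's structured JSON error body
    // (including the "too many requests" code behind a 429) into a typed
    // error, instead of the current generic "Invalid status code" message.
    return nil, client.HandleErrorResponse(res)
}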

@mtrmac
Collaborator

mtrmac commented May 7, 2019

Do you think there is anything else that can be done in this scenario besides just erroring more cleanly?

I don't know the penalty (i.e., how long the client is rejected) but if that's ~predictable we could try to play smart and retry after a certain amount of time. @mtrmac do you know the (default) penalty?

It would be a bit weird to retry only on this single code path. I’m not sure whether it is wanted/desired for c/image to automatically retry at all (IIRC the kubelet, and the image builder, do retry at least image pulls already, so compounding the retries might be bad for error detection latency).

If we don’t retry, we definitely should return a recognizable error type (which the HandleErrorResponse call would do); should we also make the Retry-After header value available in the error? The conversation above suggests that it is not implemented in the relevant registry, and the spec does not describe it either (though, of course, using the HTTP-specified header would not be wrong :) ).
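
For illustration only, a recognizable error type carrying the Retry-After value could look something like this (entirely hypothetical, not an existing c/image type; a caller could detect it with errors.As and schedule its own retry):

import (
    "fmt"
    "time"
)

// ErrTooManyRequests is a hypothetical error type for HTTP 429 responses.
type ErrTooManyRequests struct {
    // RetryAfter is zero when the registry did not send a Retry-After header.
    RetryAfter time.Duration
}

func (e ErrTooManyRequests) Error() string {
    if e.RetryAfter > 0 {
        return fmt.Sprintf("registry returned 429 Too Many Requests, retry after %s", e.RetryAfter)
    }
    return "registry returned 429 Too Many Requests"
}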

@mtrmac
Collaborator

mtrmac commented May 7, 2019

Is there a public registry, and an easy way to cause it to return 429 without affecting other users, available for testing possible fixes? Or an easy way to set up an instance locally?

Failing that, a HTTP headers+body dump of the response would probably be sufficient.

@thomasmckay

We'll be adding a 'Retry-After' header to Quay; is there tooling that will honor it? Ideally we'd make use of this immediately with skopeo; will a fix for this issue make it into a skopeo build?

@rhatdan
Member

rhatdan commented Aug 7, 2019

Yes, we update containers/image in skopeo on each containers/image release.

@mtrmac
Collaborator

mtrmac commented Aug 7, 2019

Right now, c/image makes a one-shot attempt; there is no retry logic in it and no special handling of 429/Retry-After.

CRI-O/the kubelet and openshift/builder do ultimately retry at a higher level, but by that time the Retry-After information is lost.

So, no, right now there isn’t tooling and there would be no “immediate” benefit.


(Adding some kind of support to c/image, whether retrying or just relaying the error in a machine-detectable way, would certainly make it into skopeo in a short while. OTOH the “machine-detectable” aspect of an error object is a bit compromised by a CLI.)

@twaugh
Contributor

twaugh commented Sep 10, 2019

We are running into this when using skopeo from the command line, called from a Python program that wants to manipulate tags in Quay.io.

Is anyone working on fixing this? Is there anything I can do to help?

@vrothberg
Member

I have some cycles now and can look into it.

@vrothberg
Member

Is there a public registry, and an easy way to cause it to return 429 without affecting other users, available for testing possible fixes? Or an easy way to set up an instance locally?

@twaugh, do you know?

@twaugh
Contributor

twaugh commented Sep 13, 2019

If you send enough requests in a brief burst to Quay.io, you'll get a 429 response -- other users not bursting requests will not.

Alternatively, perhaps set up a local nginx proxy with Quay.io as the backend. Something like:

events {}

http {
    limit_req_zone $binary_remote_addr zone=mylimit:1m rate=1r/s;
 
    server {
        listen 443;
        location / {
            limit_req zone=mylimit;
            limit_req_status 429;
            limit_conn_status 429;
            proxy_pass https://quay.io;
        }
    }
}

(may need some tweaking)

and podman run --name nginx -v .../nginx.conf:/etc/nginx/nginx.conf:Z -p 8443:443 nginx

@vrothberg vrothberg self-assigned this Sep 17, 2019
@cgwalters
Contributor

If you send enough requests in a brief burst to Quay.io

Are you really sure you need to enumerate the tags? Note for example in openshift/pivot#51 we stopped using skopeo entirely because we didn't need it.

And if you do need to enumerate the tags, why would you do it in a burst? Rather, pass the data down through a pipeline or cache it.

@vrothberg
Member

@twaugh, thanks. Do you have a registry or proxy setting the Retry-After header? I forced a local nginx to return it for testing purposes but would love to test against a production endpoint.

@twaugh
Contributor

twaugh commented Sep 17, 2019

Are you really sure you need to enumerate the tags?

In fact, for our case, no we don't. We filed #698 about that.

But our team is not the only one with automation that uses Skopeo. Other teams use the registry our team decides on, and we have an authenticating proxy in front of Quay.io that makes 429 responses more likely, so switching to Quay.io will be painful for teams using Skopeo until this issue is addressed.

We are separately investigating making the authenticating proxy automatically retry on 429 responses, but have not had any luck so far. In any case, given that Quay.io can return 429 responses, Skopeo ought to handle that.

@twaugh, thanks. Do you have a registry or proxy setting the Retry-After header? I forced a local nginx to return it for testing purposes but would love to test against a production endpoint.

No. Quay.io's 429 responses do not set Retry-After.

@vrothberg
Member

Please have a look at the proposed fix: #703

It does not look for a Retry-After header; it performs a simple exponential backoff starting at 20 seconds.

@cuishuang

There is an official introduction: https://docs.quay.io/issues/429.html

@vrothberg
Member

The issue was fixed by #703, closing. Thanks to everybody involved for the feedback and for testing!
