Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ratelimit code seems to fail #1041

Closed
2 tasks done
tlimoncelli opened this issue Aug 15, 2022 · 10 comments · Fixed by #1043
Closed
2 tasks done

Ratelimit code seems to fail #1041

tlimoncelli opened this issue Aug 15, 2022 · 10 comments · Fixed by #1043
Milestone

Comments

@tlimoncelli
Copy link

Confirmation

  • My issue isn't already found on the issue tracker.
  • I have replicated my issue using the latest version of the library and it is still present.

cloudflare-go version

v0.46.0

Go environment

GO111MODULE=""
GOARCH="arm64"
GOBIN="/Users/tlimoncelli/bin"
GOCACHE="/Users/tlimoncelli/Library/Caches/go-build"
GOENV="/Users/tlimoncelli/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/tlimoncelli/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/tlimoncelli/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.19"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/tlimoncelli/git/dnscontrol/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/j8/8pxbmmjs6xjbd4nxjpd_q0rm0000gp/T/go-build3635307803=/tmp/go-build -gno-record-gcc-switches -fno-common"

Expected output

I'm not entirely sure if I'm going in the right direction and can use some help.

This module is used by DNSControl, which seems to fail its integration tests when ratelimiting kicks in:

https://app.circleci.com/pipelines/github/StackExchange/dnscontrol/1335/workflows/407541cf-dcf9-416e-8aea-799e1505050d/jobs/10332

I think I'm seeing a failure that I think is related to the ratelimiting code added in e1f3c42 by @benjvi

https://github.com/cloudflare/cloudflare-go/blob/master/cloudflare.go#L260

My question is:
On line 260, why do we always set respErr instead of only setting it if there is an error?

Actual output

could not read response body: %!w(<nil>)

Code demonstrating the issue

https://github.com/StackExchange/dnscontrol

Steps to reproduce

FYI: I apologize for the incomplete bug report. I'm recording what I have now and will add to it as I research it more.

  1. Run enough queries fast enough to trigger rate limiting

References

Bug in DNSControl: StackExchange/dnscontrol#1711

CC: @tresni

@jacobbednarz
Copy link
Member

Why do you think the rate limiting code is incorrect? By the looks, this is hitting those rate limits and running out of retries; the error message could definitely be improved to state that instead of failing on the body read.

@jacobbednarz
Copy link
Member

Also, if the DNSControl tests are always run from the same account you can provide me with the account ID and I can take a look at bumping the allowed rate limits. It won't mean you can run it nonstop so you'll need to limit real API calls to when they are actually needed.

@tlimoncelli
Copy link
Author

Thank you for the offer to increase the allowed rate. I may take you up on that offer but first I'd like to make sure the rate limiting code works, which is easier to do at the lower limit. (It could be a bug on my side too!)

To be clear, I'm not 100% certain there is a bug. I'm still trying to understand the code.

In particular, in this code respErr is set without first checking if err is not null. Why does the code assume that ioutil.ReadAll(resp.Body) will always fail? Is ioutil.ReadAll() just being used to get to the EOF so we can close the descriptor?

				respBody, err = ioutil.ReadAll(resp.Body)
				resp.Body.Close()

				respErr = fmt.Errorf("could not read response body: %w", err)

@jacobbednarz
Copy link
Member

from a brief look at the code, this is inside of a block that checks if respErr (from api.request) is not nil and then falls into the following condition

                // retry if the server is rate limiting us or if it failed
		// assumes server operations are rolled back on failure
		if respErr != nil || resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500 {

from there, it attempts to read the body but also sets respErr to the read error since it is nil.

i feel like the logic here is a little odd however, the more likely scenario is that the rate limited condition is just throwing the wrong error message.

this code is rather confusing to follow which is why i swapped out this for retryable-http for the experimental client in #977. tbh, i wouldn't mind doing the same here however, i'd have to first unpick the web that it is to make sure we don't break anything else too.

@jacobbednarz
Copy link
Member

using a mitmproxy, i can see the retry/backoff logic is working as expected.

SwtpClP9-Rth2m3Xm-XAeNdgGi

@jacobbednarz
Copy link
Member

the following patch propagates the out of retries message as expected which might be the clearer solution here

diff --git cloudflare.go cloudflare.go
index b8bd301..5c17da0 100644
--- cloudflare.go
+++ cloudflare.go
@@ -251,6 +251,10 @@ func (api *API) makeRequestWithAuthTypeAndHeaders(ctx context.Context, method, u
 		// retry if the server is rate limiting us or if it failed
 		// assumes server operations are rolled back on failure
 		if respErr != nil || resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500 {
+			if resp.StatusCode == http.StatusTooManyRequests {
+				respErr = errors.New("ran out of rate limit retries")
+			}
+
 			// if we got a valid http response, try to read body so we can reuse the connection
 			// see https://golang.org/pkg/net/http/#Client.Do
 			if respErr == nil {

@tlimoncelli
Copy link
Author

Yes, that would be an improvement!

I have some other suggestions:

  1. The err in the other message shouldn't be the err from ReadAll.
  2. The default policy (MaxRetries: 3) feels a little stingy. Is that enough retries to even make it to 30? (I didn't do the math but it seems likely). Anyway... my suggestion is to change the MaxRetries default to be higher.. IMHO if the MaxRetryDelay is high (more than a minute) then it is safe to have MaxRetires set to 20 or more. I'd gladly submit a PR

@jacobbednarz
Copy link
Member

  1. The err in the other message shouldn't be the err from ReadAll.

no but at the moment, if the API 5xx's it is very unlikely to be parsable. i think it's safe (for now) to leave this. there is work to address this but it's not super straight forward due to a few constraints we are working on.

2. The default policy (MaxRetries: 3) feels a little stingy. Is that enough retries to even make it to 30? (I didn't do the math but it seems likely).

the problem with the retries is that the backoff basically doubles each retry so but 3 retries, you're over 10s in waiting for a response. for most, this isn't a great experience as if you don't have the logger in place getting updates, you just think it is sitting there not working. you can control this individually using the UsingRetryPolicy method to configure if you have a specific use case in mind.

the other factor to consider here is that the rate limit is per 5 minute window so retrying for 10-30s may not solve it and i doubt anyone would want to sit there for up to 4 minutes 59 seconds waiting for it to retry.

@jacobbednarz
Copy link
Member

have made this clearer with #1043

@github-actions github-actions bot added this to the v0.47.0 milestone Aug 15, 2022
@github-actions
Copy link

This functionality has been released in v0.47.0.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants