Diagnose if rate limit timeouts are due to pipeline timeouts in 100 name orders #7846

Open
beautifulentropy opened this issue Nov 25, 2024 · 0 comments
Member

beautifulentropy commented Nov 25, 2024

By default, go-redis will retry a request 3 times. We need to check whether retries are applied to individual requests in a pipeline or to the entire pipeline. The theory we are trying to test is that a timeout on 1 or 2 keys in a 100-name order results in the whole pipeline being retried, and thus considerably more load on those same shards.

We can add a label to our metrics to bin transactions by count. With an upper limit of ~105 for new-order rate limit checks, including per-name checks, we could use bins like: 1-25, 26-50, 51-75, 76-105, 106+. With these deployed, we should be able to correlate timeouts with transaction size.

aarongable added this to the Sprint 2024-12-03 milestone Dec 3, 2024
jsha added a commit that referenced this issue Dec 5, 2024
For batch operations, include the operation and the number of keys in
the error message. This should help diagnose whether we are getting `i/o
timeout` errors disproportionately for larger requests, or for certain
operations.

Also, make the ignored errors part of the overall WFE request logs,
which allows us to get additional context, like whether certain
requesters or domain names are getting disproportionately many errors.

Related to #7846.
3 participants