Adds the ability to hedge storage requests. #4826
Conversation
Signed-off-by: Cyril Tovena <[email protected]>
I really want this feature and I think the implementation looks solid; I've added a few small questions.
I think we should not merge this though until we have a way to deal with the following situation:
GCS/S3/whatever is experiencing a partial outage, which is causing its latency to spike, adding 200ms to all requests. This additional latency causes all of our requests to get hedged, which ironically does more harm than good here since it increases tail latencies across the board.
AFAICS from this PR, there is no protection against this.
I'm also not sure if any metrics have been added to see how many requests are hedged - did I miss something?
One dumb solution would be to add a config option for the maximum number of hedged requests per querier per second/minute.
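For illustration only (none of this is in the PR), one way such a per-querier cap could be expressed, assuming golang.org/x/time/rate for the limiter and hypothetical config field names:

```go
package hedging

import (
	"time"

	"golang.org/x/time/rate"
)

// Config sketches what a hedging budget could look like.
// Field names are illustrative, not taken from this PR.
type Config struct {
	At           time.Duration // delay before a hedged attempt is sent
	UpTo         int           // max attempts per request, including the first
	MaxPerSecond float64       // cap on hedged attempts issued by this querier
}

// HedgeLimiter returns a limiter a hedging wrapper could consult with
// Allow() before sending the extra attempt; when the budget is exhausted,
// the hedge is skipped and the original request is left to complete.
func (c Config) HedgeLimiter() *rate.Limiter {
	// A burst of 1 keeps hedges strictly paced during sustained latency spikes.
	return rate.NewLimiter(rate.Limit(c.MaxPerSecond), 1)
}
```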
Signed-off-by: Cyril Tovena <[email protected]>
Unfortunately, all of those asks require improvements to the library upstream. I guess we could do that. If we all think we should, then I can look into it.
Signed-off-by: Cyril Tovena <[email protected]>
Does it have to be changed upstream? We could create a layer of indirection which both tracks all hedged requests and cancels them if there have been too many. Having it upstream would be nice, too.
The library uses the http.RoundTripper pattern.
It might be possible by wrapping before AND after, but that seems complex. I'll give it a go; if you don't hear from me on that, it means I couldn't :)
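As a hypothetical illustration of the "wrap before AND after" idea (not code from this PR): one counting RoundTripper placed above the hedging transport sees logical requests, another placed below it sees physical attempts, and the difference between the two counters is the number of hedged attempts; a limiter consulted in the lower wrapper could then refuse to send further hedges.

```go
package hedging

import (
	"net/http"
	"sync/atomic"
)

// countingTransport only counts what passes through it. Wiring one instance
// above the hedging RoundTripper and one below it exposes both request and
// attempt counts, which is enough to derive (and cap) hedged attempts.
type countingTransport struct {
	next  http.RoundTripper
	count *atomic.Int64
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.count.Add(1)
	return t.next.RoundTrip(req)
}
```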
This looks good and I like @dannykopping's suggestion to limit the number of hedged requests to a certain percentage of total requests, but I also don't want perfect to be the enemy of good, and I think this is beneficial enough on its own.
Instead of making this a per-store option, could we implement it one level higher by wrapping the storage client interface to create a hedging client? That would also allow us to expose only one hedging config block, rather than one per backend.
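A sketch of what that wiring might look like; hedgingRoundTripper is just a placeholder for the real hedging transport, and WrapWithHedging is a hypothetical helper name, not something from this PR:

```go
package storage

import (
	"net/http"
	"time"
)

// hedgingRoundTripper stands in for the real hedging transport; the point
// of this sketch is the wiring, not the hedging logic itself.
type hedgingRoundTripper struct {
	delay time.Duration
	upTo  int
	next  http.RoundTripper
}

func (h *hedgingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	// The real hedging logic lives in the library; pass through here.
	return h.next.RoundTrip(req)
}

// WrapWithHedging is a hypothetical single entry point: every object-store
// backend (GCS, S3, Azure, Swift) would pass its *http.Client through here,
// so one hedging config block covers all backends instead of one per store.
func WrapWithHedging(client *http.Client, delay time.Duration, upTo int) *http.Client {
	rt := client.Transport
	if rt == nil {
		rt = http.DefaultTransport
	}
	client.Transport = &hedgingRoundTripper{delay: delay, upTo: upTo, next: rt}
	return client
}
```

Each backend would then build its HTTP client as usual and pass it through this helper once, driven by a single shared config block.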
FYI I asked this before submitting this PR.
I hesitated to do this because I realized it won't be applicable to some other backends like gRPC or local. But if we think that doesn't matter, I'm in for making it broader.
Yeah, I really like that, if possible.
Signed-off-by: Cyril Tovena <[email protected]>
Done ✨
Approving; we'll add the hedging rate-limiting in a follow-up PR.
Hedges GCS/S3/Azure/Swift requests using this library.
There are 2 minor caveats:
What this PR does / why we need it:
This allows reducing tail latency; see the paper "The Tail at Scale" by Jeffrey Dean and Luiz André Barroso. In short: the client first sends one request, then sends an additional request after a timeout if the previous one hasn't returned an answer in the expected time. The client cancels the remaining requests once the first result is received.
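A minimal sketch of that behaviour (illustrative only, not the library's implementation; hedgedGet is a hypothetical name, and it assumes a body-less GET with a single hedged attempt):

```go
package hedging

import (
	"context"
	"net/http"
	"time"
)

// hedgedGet sends one GET, sends a single hedged GET if no answer has
// arrived within delay, and returns whichever response comes back first.
func hedgedGet(ctx context.Context, client *http.Client, url string, delay time.Duration) (*http.Response, error) {
	type result struct {
		resp   *http.Response
		err    error
		cancel context.CancelFunc
	}
	results := make(chan result, 2)

	send := func() {
		attemptCtx, cancel := context.WithCancel(ctx)
		req, err := http.NewRequestWithContext(attemptCtx, http.MethodGet, url, nil)
		if err != nil {
			cancel()
			results <- result{err: err}
			return
		}
		resp, err := client.Do(req)
		results <- result{resp: resp, err: err, cancel: cancel}
	}

	go send() // first attempt goes out immediately

	timer := time.NewTimer(delay)
	defer timer.Stop()

	sent, failed := 1, 0
	var lastErr error
	for {
		select {
		case <-timer.C:
			if sent < 2 { // no answer yet: send the hedged attempt
				go send()
				sent++
			}
		case r := <-results:
			if r.err == nil {
				// First success wins. The losing attempt is abandoned here and
				// torn down when ctx ends; a real implementation would cancel
				// it explicitly once the winning body has been consumed.
				return r.resp, nil
			}
			if r.cancel != nil {
				r.cancel()
			}
			lastErr = r.err
			failed++
			if failed == 2 {
				return nil, lastErr
			}
		}
	}
}
```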
Special notes for your reviewer:
Checklist
Add an entry in the CHANGELOG.md about the changes.