
Adds the ability to hedge storage requests. #4826

Merged (14 commits) on Nov 30, 2021

Conversation

Contributor

@cyriltovena commented Nov 25, 2021

Hedges GCS/S3/Azure/Swift requests using this library.

There are two minor caveats:

  • It doesn't work for the ruler.
  • For OpenStack, the implementation also hedges the auth request; avoiding that would require an upstream patch.

What this PR does / why we need it:

This allows us to reduce tail latency; see the paper The Tail at Scale by Jeffrey Dean and Luiz André Barroso. In short: the client first sends one request, then sends an additional request after a timeout if the previous one hasn't returned an answer in the expected time. The client cancels the remaining requests once the first result is received.

Special notes for your reviewer:

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

Signed-off-by: Cyril Tovena <[email protected]>
Contributor

@dannykopping left a comment

I really want this feature and I think the implementation looks solid; I've added a few small questions.

I think we should not merge this though until we have a way to deal with the following situation:

GCS/S3/whatever is experiencing a partial outage, which is causing its latency to spike. This is adding 200ms to all requests. This additional latency causes all of our requests to get hedged, which actually does more harm than good here since it increases tail latencies across the board ironically.

AFAICS from this PR, there is no protection against this.
I'm also not sure if any metrics have been added to see how many requests are hedged - did I miss something?

One dumb solution would be to add a config option for the maximum number of hedged requests per querier per second/minute.

Review comment on pkg/storage/chunk/aws/s3_storage_client_test.go (resolved)
@cyriltovena
Contributor Author

> I really want this feature and I think the implementation looks solid; I've added a few small questions.
>
> I think we should not merge this though until we have a way to deal with the following situation:
>
> GCS/S3/whatever is experiencing a partial outage, which is causing its latency to spike. This is adding 200ms to all requests. This additional latency causes all of our requests to get hedged, which actually does more harm than good here since it increases tail latencies across the board, ironically.
>
> AFAICS from this PR, there is no protection against this. I'm also not sure if any metrics have been added to see how many requests are hedged - did I miss something?
>
> One dumb solution would be to add a config option for the maximum number of hedged requests per querier per second/minute.

Unfortunately, all of those asks require improvements to the library upstream. I guess we could do that.

If we all think we should do it, then I can look into it.

@dannykopping
Contributor

>> I really want this feature and I think the implementation looks solid; I've added a few small questions.
>>
>> I think we should not merge this though until we have a way to deal with the following situation:
>>
>> GCS/S3/whatever is experiencing a partial outage, which is causing its latency to spike. This is adding 200ms to all requests. This additional latency causes all of our requests to get hedged, which actually does more harm than good here since it increases tail latencies across the board, ironically.
>>
>> AFAICS from this PR, there is no protection against this. I'm also not sure if any metrics have been added to see how many requests are hedged - did I miss something?
>>
>> One dumb solution would be to add a config option for the maximum number of hedged requests per querier per second/minute.
>
> Unfortunately, all of those asks require improvements to the library upstream. I guess we could do that.
>
> If we all think we should do it, then I can look into it.

Does it have to be changed upstream? We could create a layer of indirection which both tracks all hedged requests and cancels them if there have been too many.

Having it upstream would be nice, too.
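
A minimal sketch of that layer of indirection, assuming a simple in-flight budget. The type and method names here are hypothetical, not part of hedgedhttp or this PR; it only illustrates how a cap on concurrent hedged requests could prevent a provider-wide latency spike from doubling every request:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// hedgeBudget caps how many hedged requests may be in flight at once.
// Callers ask Allow before issuing a hedged request and call Done when
// it finishes; when the budget is exhausted, hedging is simply skipped.
type hedgeBudget struct {
	inflight int64
	max      int64
}

func newHedgeBudget(max int64) *hedgeBudget { return &hedgeBudget{max: max} }

// Allow reports whether another hedged request may start now.
func (b *hedgeBudget) Allow() bool {
	if atomic.AddInt64(&b.inflight, 1) > b.max {
		atomic.AddInt64(&b.inflight, -1) // over budget: undo the increment and refuse
		return false
	}
	return true
}

// Done releases a slot once a hedged request completes.
func (b *hedgeBudget) Done() { atomic.AddInt64(&b.inflight, -1) }

func main() {
	b := newHedgeBudget(2)
	fmt.Println(b.Allow(), b.Allow(), b.Allow()) // prints "true true false"
	b.Done()
	fmt.Println(b.Allow()) // a slot freed up: prints "true"
}
```

A per-second/minute rate limit, as suggested above, would work the same way with a token bucket in place of the in-flight counter.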

@cyriltovena
Contributor Author

cyriltovena commented Nov 26, 2021

> Does it have to be changed upstream? We could create a layer of indirection which both tracks all hedged requests and cancels them if there have been too many.
>
> Having it upstream would be nice, too.

The library uses the http.RoundTripper pattern:

  • if you wrap after, you don't know whether you're handling a hedged request or the original one.
  • if you wrap before, you can't predict whether the request will be hedged or not.

It might be possible by wrapping both before and after, but that seems complex. I'll give it a go; if you don't hear from me on this, it means I couldn't :)

Member

@owen-d left a comment

This looks good, and I like @dannykopping's suggestion to limit the number of hedged requests to a certain percentage of total requests, but I also don't want perfect to be the enemy of good and think this is beneficial enough on its own.

Instead of making this a per-store option, could we implement it a level higher by wrapping the storage client interface to create a hedging client? That would also allow us to expose only one hedging config block, rather than one per backend.

@cyriltovena
Contributor Author

FYI I asked this before submitting this PR.

cristalhq/hedgedhttp#17

@cyriltovena
Contributor Author

> Instead of making this a per-store option, could we implement it a level higher by wrapping the storage client interface to create a hedging client? That would also allow us to expose only one hedging config block, rather than one per backend.

I hesitated to do this, although I realized it won't be applicable to some other backends like gRPC or local storage. But if we think that doesn't matter, I'm up for making it broader.

@dannykopping
Contributor

> Instead of making this a per-store option, could we implement it a level higher by wrapping the storage client interface to create a hedging client? That would also allow us to expose only one hedging config block, rather than one per backend.

Yeah, I really like that, if possible.

@cyriltovena
Contributor Author

> Yeah, I really like that, if possible.

Done ✨

Contributor

@dannykopping left a comment

Approving; we'll add the hedging rate-limiting in a follow-up PR.

@cyriltovena enabled auto-merge (squash) November 30, 2021 15:42