
Adds the ability to hedge storage requests. #4826

Merged (14 commits) on Nov 30, 2021

Conversation

Contributor

@cyriltovena commented Nov 25, 2021

Hedges GCS/S3/Azure/Swift requests using this library.

There are two minor caveats:

  • It doesn't work for the ruler.
  • For OpenStack, the implementation also hedges the auth request; avoiding that would require an upstream patch.

What this PR does / why we need it:

This allows us to reduce tail latency; see the paper The Tail at Scale by Jeffrey Dean and Luiz André Barroso. In short: the client first sends one request, then sends an additional request after a timeout if the previous one hasn't returned an answer in the expected time. The client cancels the remaining requests once the first result is received.

Special notes for your reviewer:

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

Signed-off-by: Cyril Tovena <[email protected]>
Contributor

@dannykopping left a comment

I really want this feature and I think the implementation looks solid; I've added a few small questions.

I think we should not merge this though until we have a way to deal with the following situation:

GCS/S3/whatever is experiencing a partial outage, which is causing its latency to spike. This is adding 200ms to all requests. This additional latency causes all of our requests to get hedged, which actually does more harm than good here since it increases tail latencies across the board ironically.

AFAICS from this PR, there is no protection against this.
I'm also not sure if any metrics have been added to see how many requests are hedged - did I miss something?

One dumb solution would be to add a config option for the maximum number of hedged requests per querier per second/minute.

Review comment on pkg/storage/chunk/aws/s3_storage_client_test.go (resolved)
@cyriltovena
Contributor Author

> I really want this feature and I think the implementation looks solid; I've added a few small questions.
>
> I think we should not merge this though until we have a way to deal with the following situation:
>
> GCS/S3/whatever is experiencing a partial outage, which is causing its latency to spike. This is adding 200ms to all requests. This additional latency causes all of our requests to get hedged, which actually does more harm than good here since it increases tail latencies across the board, ironically.
>
> AFAICS from this PR, there is no protection against this. I'm also not sure if any metrics have been added to see how many requests are hedged - did I miss something?
>
> One dumb solution would be to add a config option for the maximum number of hedged requests per querier per second/minute.

Unfortunately, all of those asks require improvements to the library upstream. I guess we could do that.

If we all think we should do it, then I can look into it.

@dannykopping
Contributor

>> I really want this feature and I think the implementation looks solid; I've added a few small questions.
>>
>> I think we should not merge this though until we have a way to deal with the following situation:
>>
>> GCS/S3/whatever is experiencing a partial outage, which is causing its latency to spike. This is adding 200ms to all requests. This additional latency causes all of our requests to get hedged, which actually does more harm than good here since it increases tail latencies across the board, ironically.
>>
>> AFAICS from this PR, there is no protection against this. I'm also not sure if any metrics have been added to see how many requests are hedged - did I miss something?
>>
>> One dumb solution would be to add a config option for the maximum number of hedged requests per querier per second/minute.
>
> Unfortunately, all of those asks require improvements to the library upstream. I guess we could do that.
>
> If we all think we should do it, then I can look into it.

Does it have to be changed upstream? We could create a layer of indirection which both tracks all hedged requests and cancels them if there have been too many.

Having it upstream would be nice, too.
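
A minimal sketch of that layer of indirection, assuming a simple in-flight budget. The type and method names here are hypothetical, not part of hedgedhttp or this PR; it only illustrates how a cap on concurrent hedged requests could prevent a provider-wide latency spike from doubling every request:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// hedgeBudget caps how many hedged requests may be in flight at once.
// Callers ask Allow before issuing a hedged request and call Done when
// it finishes; when the budget is exhausted, hedging is simply skipped.
type hedgeBudget struct {
	inflight int64
	max      int64
}

func newHedgeBudget(max int64) *hedgeBudget { return &hedgeBudget{max: max} }

// Allow reports whether another hedged request may start now.
func (b *hedgeBudget) Allow() bool {
	if atomic.AddInt64(&b.inflight, 1) > b.max {
		atomic.AddInt64(&b.inflight, -1) // over budget: undo the increment and refuse
		return false
	}
	return true
}

// Done releases a slot once a hedged request completes.
func (b *hedgeBudget) Done() { atomic.AddInt64(&b.inflight, -1) }

func main() {
	b := newHedgeBudget(2)
	fmt.Println(b.Allow(), b.Allow(), b.Allow()) // prints "true true false"
	b.Done()
	fmt.Println(b.Allow()) // a slot freed up: prints "true"
}
```

A per-second/minute rate limit, as suggested above, would work the same way with a token bucket in place of the in-flight counter.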

@cyriltovena
Contributor Author

cyriltovena commented Nov 26, 2021

> Does it have to be changed upstream? We could create a layer of indirection which both tracks all hedged requests and cancels them if there have been too many.
>
> Having it upstream would be nice, too.

The library uses the http.RoundTripper pattern:

  • if you wrap after, you don't know whether you're handling a hedged request or the original one.
  • if you wrap before, you can't predict whether the request will be hedged or not.

It might be possible by wrapping both before and after, but that seems complex. I'll give it a go; if you don't hear from me on this, it means I couldn't :)

Member

@owen-d left a comment

This looks good, and I like @dannykopping's suggestion to limit the number of hedged requests to a certain percentage of total requests, but I also don't want perfect to be the enemy of good and think this is beneficial enough on its own.

Instead of making this a per-store option, could we implement it a level higher by wrapping the storage client interface to create a hedging client? That would also allow us to expose only one hedging config block, rather than one per backend.

@cyriltovena
Contributor Author

FYI I asked this before submitting this PR.

cristalhq/hedgedhttp#17

@cyriltovena
Contributor Author

> Instead of making this a per-store option, could we implement it a level higher by wrapping the storage client interface to create a hedging client? That would also allow us to expose only one hedging config block, rather than one per backend.

I hesitated to do this, although I realized it won't be applicable to some other backends like gRPC or local storage. But if we think that doesn't matter, I'm up for making it broader.

@dannykopping
Contributor

> Instead of making this a per-store option, could we implement it a level higher by wrapping the storage client interface to create a hedging client? That would also allow us to expose only one hedging config block, rather than one per backend.

Yeah, I really like that, if possible.

@cyriltovena
Contributor Author

> Yeah, I really like that, if possible.

Done ✨

Contributor

@dannykopping left a comment

Approving; we'll add the hedging rate-limiting in a follow-up PR.

@cyriltovena enabled auto-merge (squash) November 30, 2021 15:42