Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use rpcHoldTimeout to calculate blocking timeout #15541

Merged
merged 1 commit into from
Nov 24, 2022

Conversation

kisunji
Copy link
Contributor

@kisunji kisunji commented Nov 23, 2022

Description

Fixes #15246
Fixes #15537

Reverts a bug I introduced while refactoring:
https://github.com/hashicorp/consul/pull/14965/files#diff-8f12aa7f72e8647fddfac936ab541982aae517babd3064a581bbcdcc2595c2dcL347-R360

Note that Timeout included rpcHoldTimeout but I extracted it out of Timeout into HasTimedOut. But we still need to consider it when calculating BlockingTimeout.

It's hard to write a test for this due to timeouts being very timing + random jitter dependent. It should not break any existing tests and at most it adds 7s to existing timeouts.

The problem I'm trying to solve is:

/v1/catalog/services?wait=2s&index=16165

client uses BlockingTimeout to calculate read timeout
2s + 2s/16 = 2.125s

server adds max possible jitter
2s + 2s/16 = 2.125s

if timeouts are nearly identical it is likelier for the client to timeout the connection before the server has had a chance to respond (or time out the blocking query).

With this PR:

GET /v1/catalog/services?wait=2s&index=16165

client uses (*QueryOptions).BlockingTimeout to calculate read timeout
2s + 2s/16 + 7s = 9.125s
             ^ RPCHoldTimeout buffer

server adds max possible jitter in (*Server).rpcQueryTimeout
2s + 2s/16 = 2.125s

Server will always be the one to apply timeout and return from a blocking query early

@kisunji kisunji force-pushed the kisunji/blocking-timeout-bugfix branch from 009ce31 to 606b72a Compare November 23, 2022 17:48
@kisunji kisunji requested a review from banks November 23, 2022 17:52
@kisunji kisunji force-pushed the kisunji/blocking-timeout-bugfix branch from 606b72a to d10fbff Compare November 23, 2022 22:06
@kisunji kisunji requested a review from a team November 23, 2022 22:08
Comment on lines +636 to +638
// Override the default client timeout but add RPCHoldTimeout
// as a buffer for retries during leadership changes.
timeout = blockingTimeout + p.RPCHoldTimeout
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all this timeout calculation is confusing, in a nutshell I am re-adding RPCHoldTimeout that I removed in https://github.com/hashicorp/consul/pull/14965/files#diff-8f12aa7f72e8647fddfac936ab541982aae517babd3064a581bbcdcc2595c2dcL347-R360

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EXCEPT in the non-blocking case, which is controlled by a different config rpc_client_timeout and should not have rpc_hold_timeout included (since it's only relevant for blocking queries)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also guarantees that the server timeout is longer than the clients in case of leader rotation, which I guess is the whole point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other way around: assuming client and server calculate timeout the same way, it guarantees clients will always have the longer timeout (servers will timeout first, which is desired)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right, yep, reversed that in my head

@kisunji kisunji force-pushed the kisunji/blocking-timeout-bugfix branch from d10fbff to b3f544a Compare November 23, 2022 22:20
Copy link
Contributor

@wilkermichael wilkermichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great job

err := c1.RPC("Long.Wait", &structs.NodeSpecificRequest{
QueryOptions: structs.QueryOptions{
MinQueryIndex: 1,
MaxQueryTime: 20 * time.Millisecond,
},
}, &out)
require.Error(t, err)
require.Contains(t, err.Error(), "rpc error making call: i/o deadline reached")
require.ErrorContains(t, err, "rpc error making call: i/o deadline reached")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that's nice, I didn't know about ErrorContains

Comment on lines +636 to +638
// Override the default client timeout but add RPCHoldTimeout
// as a buffer for retries during leadership changes.
timeout = blockingTimeout + p.RPCHoldTimeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also guarantees that the server timeout is longer than the clients in case of leader rotation, which I guess is the whole point?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants