Enforce bounds on MaxQueryTime #9064
The MaxQueryTime value used in QueryOptions.HasTimedOut() can be set to an invalid value that would throw off how RPC requests are retried. This fix uses the same logic that enforces the MaxQueryTime bounds in the blockingRPC() call.
Hi @pierreca! Thanks for opening this PR!
I'm not totally sure I understand the circumstances of the issue here.
I did a search for callers of that HasTimedOut method and that's only being called in the client RPC (which turns out to be from your original PR here #8921). The default configuration for that RPC hold timeout is much lower than it is for the server: only 5 seconds. Can you explain a bit more about what you're trying to do here?
Say you have a long-poll RPC request between a client and a server, and MaxQueryTime is not set by the user of the API. The default behavior of the server is to pin it to a default value (300 seconds), but the client uses it without checking its bounds, so an unset value stays at 0. IMHO this is problematic for 2 reasons:
This PR fixes both problems, and it is the most minimal way I have found to do it. This is part of a larger scenario that I'm trying to fix: a user should be able to have a long-poll request opened with a client survive a server dying and a leader election taking place. There's a very easy way to test this: set up a minimal cluster with 3 or 5 servers and a client, then make a long-poll request to the client, for node state for example. Kill the leader and see the request fail with a wrapped RPC error. Right now, this does not work for multiple reasons (and IMHO this breaks Nomad's HA promise):
I'm not sure that can be made to work, regardless of timeouts. In that scenario, the request is being forwarded through the leader (as all RPCs are) to the server connected to that client node, and from there forwarded to the client. If the leader is killed, there's no way to maintain that connection; it must be retried by the API client by sending the request again (either to the new leader or to be forwarded to the leader).
I agree with the path taken by the request. This is my understanding of a typical request path, with an optional follower in the middle:
and the response follows the same path in the opposite direction.
There is existing code in the client that rotates the server and tries to survive leader election: https://github.com/hashicorp/nomad/blob/master/client/rpc.go#L83 (see also line 222 in c14c616).
I think in order to uphold the high availability promise of Nomad, you have to make it the client's responsibility to retry, not the user's: the client has all the necessary information AND pre-existing code supporting this, and that's what I'm trying to fix :)
(posting that whole comment here: config.go#L222-L226)
So by my reading of this, the only thing extending the RPCHoldTimeout gets us is longer timeouts if the leader election takes more than 5s?
@tgross I'm not suggesting to extend Any retry attempt on a long-poll request has to deal with 3 variables:
What I'm trying to make consistent is 1. On the server side ( On the client side (
Oh ok, I understand now. Apologies for my density, @pierreca, and thanks for walking me through it! 😀
LGTM 👍
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.