client: defensive against getting stale alloc updates #5906
Conversation
When fetching node alloc assignments, be defensive against a stale read before killing the local node's allocs.

The bug: when both the client and the servers are restarting and the client requests the allocations for its node, it might get stale data because the server hasn't finished applying all of the restored raft transactions to its state store. Consequently, the client would kill and destroy the allocs locally, only to fetch them again moments later once the server store is up to date.

The bug can be reproduced quite reliably with a single node setup (configured with persistence). I suspect it's too edge-casey to occur in a production cluster with multiple servers, but we may need to examine leader failover scenarios more closely.

In this commit, we only remove and destroy allocs if the removal index is more recent than the alloc index. This seems like a cheap resiliency fix, and it's the same check we already use for detecting alloc updates.

A more proper fix would be to ensure that a Nomad server only serves RPC calls once its state store is fully restored, or up to date in leadership transition cases.
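For illustration, here is a minimal Go sketch of the guard described above; the function and variable names are hypothetical, not Nomad's actual client code. An alloc that is absent from the server's response is only removed when the response index is newer than the alloc's own modify index, so a stale read cannot trigger a removal.

```go
package sketch

// removeStaleAllocs returns the IDs of locally running allocs that should be
// torn down, given a server response at index respIndex.
// localAllocModifyIndex maps alloc ID -> the alloc's ModifyIndex as known
// locally; serverAllocs is the set of alloc IDs the server still assigns to
// this node. All names here are illustrative.
func removeStaleAllocs(localAllocModifyIndex map[string]uint64, serverAllocs map[string]struct{}, respIndex uint64) []string {
	var toRemove []string
	for allocID, modifyIndex := range localAllocModifyIndex {
		if _, stillAssigned := serverAllocs[allocID]; stillAssigned {
			continue // the server still knows about this alloc
		}
		// The alloc is missing from the response. Only treat that as a real
		// removal if the response is newer than the alloc itself; otherwise
		// the read may be stale (e.g. the server is still restoring raft state).
		if respIndex > modifyIndex {
			toRemove = append(toRemove, allocID)
		}
	}
	return toRemove
}
```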
Great find! Since we remove allocs based on their absence, there's no `ModifyIndex` to check for freshness. This appears to bring alloc removal correctness in line with alloc updates.
@@ -1944,6 +1947,7 @@ OUTER:
 		filtered:      filtered,
 		pulled:        pulledAllocs,
 		migrateTokens: resp.MigrateTokens,
+		index:         resp.Index,
L1942 updates `req.MinQueryIndex` if and only if `resp.Index` is greater, so I wonder if there's some reason we should use `req.MinQueryIndex` here instead. I'm honestly not sure L1942 is reachable. Perhaps there's a timeout that could cause a response before `resp.Index` is greater than `MinQueryIndex`?

Not a blocker, as I think at worst it's an edge case of an edge case that, when hit, will negate the correctness improvement of this PR. It can't make the behavior worse than before the PR, AFAICT.
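For context, here is a hedged sketch of the blocking-query loop being discussed, with stand-in types rather than Nomad's real request/response structs. The request's `MinQueryIndex` only advances when the response index is greater, which is why `resp.Index` and `req.MinQueryIndex` can differ after a response (for example, when a blocking query times out and returns before the index has moved).

```go
package sketch

type allocsRequest struct{ MinQueryIndex uint64 }
type allocsResponse struct{ Index uint64 }

// pollAllocs illustrates the loop shape: fetch blocks server-side until the
// state index exceeds req.MinQueryIndex or a timeout fires, and the client
// reconciles against whatever index the response actually carries.
func pollAllocs(fetch func(allocsRequest) allocsResponse, reconcile func(allocsResponse)) {
	req := allocsRequest{}
	for {
		resp := fetch(req)
		// Only advance the blocking index when the server returned newer
		// state; a timed-out query can come back with an unchanged index.
		if resp.Index > req.MinQueryIndex {
			req.MinQueryIndex = resp.Index
		}
		reconcile(resp) // compare local allocs against server state at resp.Index
	}
}
```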
I don't think we should be using `req.MinQueryIndex` here. It's simpler to reason about reconciling local state against the pulled state (at index `resp.Index`) without worrying about the indirection or interference of `req.MinQueryIndex` (i.e. if `resp.Index` is earlier than `req.MinQueryIndex`, using `req.MinQueryIndex` risks us believing the server state is more recent than it actually is). We expect the reconciler to work even if `resp.Index` went back in time unexpectedly.

As for `req.MinQueryIndex`, it seems that we are protecting against server state going back in time! That feels quite odd, and I wonder if it's just being defensive or a case we hit at some point.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.