-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure that the index processed by the client is at least as new as the last one processed. #18269
Ensure that the index processed by the client is at least as new as the last one processed. #18269
Conversation
…he last index processed so that stale data does not impact the running allocations.
I don't think the UI tests are related here. May I ask someone to re-run the tests or help me understand the failure mode? |
Thanks for the PR and detailed issue description @stswidwinski.
Yeah, this part of the code is very tricky to test reliably since there's a lot going on, causing tests to be flaky sometimes.
I'm not sure what's going with the this failure. I suspect it may be because this is an external PR the Percy token may not be exposed. But we can safely ignore this error as that step only uploads UI visual diff testing results, which this PR wouldn't affect. The changes look good to me, but I will just chat with the rest of the team to make sure I'm not missing anything. Thanks again for the detailed work! |
Testing for absence is a rather difficult task without hooks into the code to assert that given code paths have been hit. In this test we simply make sure that a transaction with a lower raft index does not make it to the client and depend on the timeout to make sure that this is the case. If the invariant is broken we expect the test to become either flaky or entirely broken. I acknowledge there may be a better way to test here, but lacking context this seems to be a minimal attempt to ensure that a similar issue doesn't creep back in. Feedback welcome.
1. error is ignored in WaitForResult 2. existing timeouts in blocking queries cannot be overridden and would make the tests always pass 3. the scheduler seems to ignore raft transactions with lower ids As such, we would need more infrastructure to write such tests. Do you have an idea is there is a precedent I can use here no to reinvent the wheel? In general we need: 1. To reduce the raft index of scheduler (purge the state or reassign the state: This can be done with regular assignment into a newly created state 2. Induce a timeout on the blocking query to ensure that the query result is passed to client -> This I see no easy way of achieving
I'm beginning to think that these tests are not quite possible without broader changes to timeout settings. Please see the recent commit comments for some more context. Do you know if there is an easy way to override the timeouts used for blocking queries by Agent? What do you think? |
@lgfa29 please let me know how you'd like to proceed |
Hi @stswidwinski 👋 Apologies for the delay, but I don't have a good answer yet 😅 I'm checking with the rest of the team for any additional ideas and I will let you know if something comes up. But since this is such a tricky behaviour to trigger in a test we may need do without out for now. |
Hey @lgfa29! Thank you for the update and sorry to keep bugging you. Let me know if I can help in some way or if any part of the bug report (or PR) makes this more confusing. Otherwise, I'll sit tight :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a fine shortcircuit to add, but I am struggling to see where it could fix misbehavior.
I'm beginning to think that these tests are not quite possible without broader changes to timeout settings.
I think you're correct here. One approach we could take is adding a rpcQueryWaitTime time.Duration
to the Client struct
and overriding it during tests to at least exercise the "blocking query timeout" case. At (normal) runtime the zero value should get converted to the default in Client.rpc(...)
, but it's probably best to explicitly set it to structs.DefaultBlockingRPCQueryTime
in client.NewClient(...)
.
The main downside I can see is that this is yet-another-field on the already enormous Client struct, and a test-only one at that. I can't think of a less invasive change we could make to exercise some of this logic though.
I think moving Client.RPC(...)
and dependencies into a new struct that implements RPCer interface{ RPC(...) error }
or similar might provide us with a reasonable hook to manipulate RPC behavior in tests, but (a) that's a lot of work, and (b) we could easily mock so much that we're not actually testing realistic runtime scenarios anymore.
Another option would be to refactor the function being changed here into multiple functions and unit test them. This greatly increases the (b) not testing realistic scenarios
risk above, but there may be opportunities to assert some useful things.
Long story short: I'd love someone to explore better testing options for Client RPCs, but I don't think we need to hold up this PR to wait for them.
I think the basic tenant of the breakage right now is that right now we accept the answer of the Scheduler independent of whether it is new or if it is stale. In the case of staleness we would like to avoid applying the changes that such an answer would imply (killing allocations which are "unexpected" and garbage collecting them). The stale replies may exist if the server we are using to get information from (we allow staleness so this is not the leader) times out awaiting for the raft state to catch up with an index which would allow it to process the query correctly. This happens more frequently than one would expect because of the retry logic within the client. The issue submitted (#18267) is quite verbose about this, but I'm happy to provide any more detail if it's helpful. The change in a somewhat trivial way discards the data returned by the scheduler are requires that the response corresponds to the minimal index requested. |
Use identifier-style keys for structured logs Co-authored-by: Michael Schurter <[email protected]>
Ah you're right! I was thinking this check was sufficient: https://github.com/hashicorp/nomad/pull/18269/files#diff-bd3a55a72186f59e2e63efb4951573b2f9e4a7cc98086e922b0859f8ccc1dd09R2396 However that check is only performed for allocs that are observable by the server returning the response. If the server is too stale to see the alloc exists, the Client won't even make it to that check and think the alloc has been GC'd by the server and therefore should be GC'd locally. Yikes. I'll add a changelog entry and get this merged. Thanks @stswidwinski! |
Similar to #18269, it is possible that even if Node.GetClientAllocs retrieves fresh allocs that the subsequent Alloc.GetAllocs call retrieves stale allocs. While `diffAlloc(existing, updated)` properly ignores stale alloc *updates*, alloc deletions have no such check. So if a client retrieves an alloc created at index 123, and then a subsequent Alloc.GetAllocs call hits a new server which returns results at index 100, the client will stop the alloc created at 123 because it will be missing from the stale response. This change applies the same logic as #18269 and ensures only fresh responses are used. Glossary: * fresh - modified at an index > the query index * stale - modified at an index <= the query index
Similar to #18269, it is possible that even if Node.GetClientAllocs retrieves fresh allocs that the subsequent Alloc.GetAllocs call retrieves stale allocs. While `diffAlloc(existing, updated)` properly ignores stale alloc *updates*, alloc deletions have no such check. So if a client retrieves an alloc created at index 123, and then a subsequent Alloc.GetAllocs call hits a new server which returns results at index 100, the client will stop the alloc created at 123 because it will be missing from the stale response. This change applies the same logic as #18269 and ensures only fresh responses are used. Glossary: * fresh - modified at an index > the query index * stale - modified at an index <= the query index
When waiting on a previous alloc we must query against the leader before switching to a stale query with index set. Also check to ensure the response is fresh before using it like #18269
When waiting on a previous alloc we must query against the leader before switching to a stale query with index set. Also check to ensure the response is fresh before using it like #18269
Similar to #18269, it is possible that even if Node.GetClientAllocs retrieves fresh allocs that the subsequent Alloc.GetAllocs call retrieves stale allocs. While `diffAlloc(existing, updated)` properly ignores stale alloc *updates*, alloc deletions have no such check. So if a client retrieves an alloc created at index 123, and then a subsequent Alloc.GetAllocs call hits a new server which returns results at index 100, the client will stop the alloc created at 123 because it will be missing from the stale response. This change applies the same logic as #18269 and ensures only fresh responses are used. Glossary: * fresh - modified at an index > the query index * stale - modified at an index <= the query index
Similar to #18269, it is possible that even if Node.GetClientAllocs retrieves fresh allocs that the subsequent Alloc.GetAllocs call retrieves stale allocs. While `diffAlloc(existing, updated)` properly ignores stale alloc *updates*, alloc deletions have no such check. So if a client retrieves an alloc created at index 123, and then a subsequent Alloc.GetAllocs call hits a new server which returns results at index 100, the client will stop the alloc created at 123 because it will be missing from the stale response. This change applies the same logic as #18269 and ensures only fresh responses are used. Glossary: * fresh - modified at an index > the query index * stale - modified at an index <= the query index
Similar to #18269, it is possible that even if Node.GetClientAllocs retrieves fresh allocs that the subsequent Alloc.GetAllocs call retrieves stale allocs. While `diffAlloc(existing, updated)` properly ignores stale alloc *updates*, alloc deletions have no such check. So if a client retrieves an alloc created at index 123, and then a subsequent Alloc.GetAllocs call hits a new server which returns results at index 100, the client will stop the alloc created at 123 because it will be missing from the stale response. This change applies the same logic as #18269 and ensures only fresh responses are used. Glossary: * fresh - modified at an index > the query index * stale - modified at an index <= the query index
When waiting on a previous alloc we must query against the leader before switching to a stale query with index set. Also check to ensure the response is fresh before using it like #18269
When waiting on a previous alloc we must query against the leader before switching to a stale query with index set. Also check to ensure the response is fresh before using it like #18269
When waiting on a previous alloc we must query against the leader before switching to a stale query with index set. Also check to ensure the response is fresh before using it like #18269
A fix for #18267, ensuring that "time travel" of the client state is not possible.
I could not find tests which may exercise this logic meaningfully. Please point me the right direction and I'm happy to add them
Please note that this isn't an actual good fix to the underlying issue, but rather a stop-gap measure.