-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update eval modify index as part of plan apply. #3669
Conversation
nomad/structs/structs.go
Outdated
// EvalID is the eval ID of the plan being applied. We also update the modify | ||
// index of the eval ID as part of applying plan results. This is to ensure that | ||
// other workers that are dequeing evaluations don't miss updates that can affect | ||
// scheduling decisions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"This is to ensure..." -> The modify index of the evaluation is updated as part of applying the plan to ensure that subsequent scheduling events for the same job will wait for the index that last produced state changes. This is necessary for blocked evaluations since they can be processed many times, potentially making state updates, without the state of the evaluation itself being updated.
nomad/eval_endpoint_test.go
Outdated
@@ -286,6 +286,73 @@ func TestEvalEndpoint_Dequeue_WaitIndex(t *testing.T) { | |||
} | |||
} | |||
|
|||
func TestEvalEndpoint_Dequeue_UpdateWaitIndex(t *testing.T) { | |||
// test enqueing an eval, updating a plan result for the same eval and dequeing the eval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dequeuing
nomad/state/state_store.go
Outdated
} | ||
if existing == nil { | ||
// return if there isn't an eval with this ID. | ||
// In some cases (like snapshot restores), we process evals that are not already in the state store. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Snapshot restores don't go through the normal upsert codepath. They have there own, direct insertion to avoid these problems.
nomad/state/state_store.go
Outdated
if existing == nil { | ||
// return if there isn't an eval with this ID. | ||
// In some cases (like snapshot restores), we process evals that are not already in the state store. | ||
s.logger.Printf("[WARN] state_store: unable to find eval ID %v, cannot update modify index ", evalID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do %q instead and there is an extra space on the end.
nomad/state/state_store.go
Outdated
} | ||
eval := existing.(*structs.Evaluation).Copy() | ||
// Update the indexes | ||
eval.CreateIndex = existing.(*structs.Evaluation).CreateIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the copy capture this one?
nomad/state/state_store_test.go
Outdated
evalOut, err := state.EvalByID(ws, eval.ID) | ||
assert.Nil(err) | ||
assert.NotNil(evalOut) | ||
assert.Equal(uint64(1000), evalOut.ModifyIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert.EqualValues(1000, evalOut.ModifyIndex)
1223158
to
a49db95
Compare
nomad/state/state_store.go
Outdated
return fmt.Errorf("eval lookup failed: %v", err) | ||
} | ||
if existing == nil { | ||
return fmt.Errorf("[ERR] state_store: unable to find eval id %q", evalID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err := fmt.Errorf("unable to find eval id %q", evalID)
s.logger.Printf("[ERR] state_store: %v", err)
return err
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this one, do you mind if the second line was just s.logger.Printf(err)
so it doesn't double print [ERR] state_store?
What does this actually fix in symptoms and problems? Would be nice with some more context of the root cause and fix for it |
@jippi It fixes a rare race that shows up under high load and limited resources. This manifested as subtle confusing bugs like duplicate deployments created, or reusing the same alloc index for a different allocation. Bug: During high load and limited resources, if the scheduler made some placements but was blocked on creating more, and then got unblocked, it could miss the placements it already made. Root cause: Scheduler workers processing an evaluation and not seeing state updates from previous plans that were applied associated with the same evaluation. Fix : Update the eval's last modify index when plans are applied, so that scheduler workers can wait for the raft index to catch up before starting to process the eval. |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
No description provided.