periodic: always reset periodic children status #10145
Conversation
LGTM.
I took a pass through the rest of structs.Job to see if there was anything else we needed to hit. Looks like ModifyIndex and CreateIndex will get handled when we upsert into raft. (Multiregion also gave me pause, but a job that's been registered will already have the "multiregion interpolated" version.)
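For illustration, a minimal sketch of the kind of reset being discussed, using github.com/hashicorp/nomad/nomad/structs; the helper name is hypothetical and this is not the PR's actual diff:

// Sketch only: clear the parent's status fields on the derived copy so the
// state store computes a fresh status when the child job is upserted.
// CreateIndex/ModifyIndex are assigned on the raft upsert, so they need no
// explicit handling here.
func resetDerivedJobStatus(parent *structs.Job) *structs.Job {
	child := parent.Copy()
	child.Status = ""
	child.StatusDescription = ""
	return child
}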
dispatched := m.dispatchedJobs(job)
require.NotEmpty(t, dispatched)
require.Empty(t, dispatched[0].Status)
Up to you, but would verifying that the status eventually does get set correctly be a lot of lift for this test? Just to make sure we're not relying on a side-effect of the bug.
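A low-lift version of that check might be a polling assertion along these lines (a sketch, not code from this PR; jobStatus is a hypothetical helper that reads the dispatched child back from the server's state store, and the timeouts are arbitrary):

// Sketch: wait for the child's status to settle on a real value instead of
// relying on the carried-over parent status.
require.Eventually(t, func() bool {
	status := jobStatus(dispatched[0].ID) // hypothetical accessor
	return status == structs.JobStatusPending || status == structs.JobStatusRunning
}, 5*time.Second, 100*time.Millisecond)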
I have tried to write an e2e test or a higher-level integration test, but I fear it'll be flaky and slow: it depends on a periodic job triggering and running after a leadership transition.
Alternatively, we could force the conditions by manipulating the job store directly: force the inserted job into the running state and force a job update. Even after that scaffolding, it's not obvious to me that the resulting test protects against a regression any better than this one, and it would be slower.
If you rebase on
LGTM, thanks for the great info and the quick work!
Potentially related issue: #10222
8d0d26c to 032945b Compare
14b5511 to d81df3d Compare
I've updated the PR to reset the status for dispatch jobs as well. It's another manifestation of the 1.0.3 regression bug, with a failing test added in 032945b; the failure can be seen in https://app.circleci.com/pipelines/github/hashicorp/nomad/15268/workflows/a938267c-e127-49f9-9cef-707ae434d486/jobs/144218 . Apparently, in some cases, we did reset the job status in
LGTM. Hitting this bug with the dispatch jobs really helps, because we don't need the non-determinism of a leader election to reproduce it.
@@ -4,10 +4,12 @@ job "periodic" {

  constraint {
    attribute = "${attr.kernel.name}"
-   value    = "linux"
+   operator = "set_contains_any"
+   value    = "darwin,linux"
That's handy, good catch. There's probably a bunch of tests where this is safe to do. Something for a separate PR though.
Interestingly, set_contains_any isn't documented in https://www.nomadproject.io/docs/job-specification/constraint - it's only documented for affinities, so we should update that too.
Also, for future readers: I considered basing the constraint on the Docker os attribute, but chose not to. Windows is a bit slow, and I didn't want to run tests on LCOW Windows clients if we add some.
periodic: always reset periodic children status
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Fixes a bug where Nomad reports negative or incorrect running children counts for periodic jobs.

The periodic dispatcher derives a child job without resetting the status. If the periodic job has a running status, the derived job starts with running status and then transitions to pending. Since this is an unexpected transition, the counting in StateStore.setJobSummary gets out of sync and results in negative/incorrect values.

Note that this only affects periodic jobs after a leader transition. During the first job registration, the job is added with pending or "" status. However, after a leader transition, the new leader repopulates the dispatcher heap with "running" status and triggers the bug.
Alternative implementation
I have considered updating the FSM handler so that the job registration event always resets the job status. I'm nervous about a change like that, because it would re-interpret already committed log entries, resulting in unintended changes or discrepancies in server state during upgrades that may depend on the state of raft logs and snapshots.
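Roughly, that rejected alternative would have looked something like the sketch below (illustration only, not actual Nomad FSM code; the function name is hypothetical):

// Sketch of the alternative: clearing status while applying a job-register
// log entry. Because the FSM replays already committed entries (for example
// on restart or during upgrades), this would also rewrite the status of
// historical registrations, which is the risk described above.
func applyJobRegister(req *structs.JobRegisterRequest, index uint64) error {
	req.Job.Status = ""
	req.Job.StatusDescription = ""
	// ... upsert req.Job into the state store at the given raft index ...
	return nil
}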
Debugging notes
An internal test cluster encountered this bug. The job horizontal_hey started having negative running children counts at index 1609. The following output highlights the issue, and also demonstrates that index 1609 is the first job registration after leader election (term 3) and shows a "running" status, unlike the empty status in term 2.