
periodic: always reset periodic children status #10145

Merged 3 commits into main from b-periodic-init-status on Mar 26, 2021
Conversation

@notnoop (Contributor) commented Mar 9, 2021

Fixes a bug where Nomad reports negative or incorrect running children
counts for periodic jobs.

The periodic dispatcher derives a child job without resetting the status.
If the periodic job has a running status, the derived job starts with
running status and then transitions to pending. Since this is an
unexpected transition, the counting in StateStore.setJobSummary gets out
of sync and results in negative/incorrect values.

Note that this only affects periodic jobs after a leader transition.
During the first job registration, the job is added with pending or
"" status. However, after a leader transition, the new leader
repopulates the dispatcher heap with "running" status and triggers the
bug.
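
For illustration, a minimal sketch of the shape of the fix (the Job type and deriveChild function here are simplified stand-ins, not Nomad's actual structs.Job or dispatcher code):

package main

import "fmt"

// Job is a simplified stand-in for Nomad's structs.Job.
type Job struct {
	ID     string
	Status string // "", "pending", "running", "dead"
}

// deriveChild copies the periodic parent into a child job. The point of the
// fix: reset the status instead of inheriting it, so the child always starts
// from the empty/pending state even when the parent was repopulated into the
// dispatcher heap as "running" after a leader transition.
func deriveChild(parent *Job, childID string) *Job {
	child := *parent // copy the parent's fields
	child.ID = childID
	child.Status = "" // do not carry over the parent's status
	return &child
}

func main() {
	parent := &Job{ID: "horizontal_hey", Status: "running"}
	child := deriveChild(parent, "horizontal_hey/periodic-1609")
	fmt.Printf("%q\n", child.Status) // "": the child starts without a status
}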

Alternative implementation

I considered updating the FSM handler so that the job registration event always resets the job status. I'm nervous about such a change: it would re-interpret already committed log entries, potentially causing unintended changes or discrepancies in server state during upgrades, which may depend on the state of Raft logs and snapshots.

Debugging notes

An internal test cluster encountered this bug. The job horizontal_hey started having negative running children counts at index 1609. The following output highlights the issue, and also demonstrates that index 1609 is the first job registration after leader election (term 3) and shows a "running" status, unlike the empty status in term 2.

$ IDX=1608; nomad operator raft _state --last-index $IDX . | jq '.JobSummaries[] | select(.JobID=="horizontal_hey") | .Children'
{
  "Pending": 0,
  "Running": 0,
  "Dead": 12
}
$ IDX=1609; nomad operator raft _state --last-index $IDX . | jq '.JobSummaries[] | select(.JobID=="horizontal_hey") | .Children'
{
  "Pending": 1,
  "Running": -1,
  "Dead": 12
}
$ nomad operator raft _logs . \
>   | jq -r '.[] | select(.CommandType == "JobRegisterRequestType")
>                | select(.Body.Job.ParentID == "horizontal_hey")
>                | [.Term, .Index, .Body.Job.Status]
>                | @csv' \
>   | head -n30
2,121,""
2,248,""
2,351,""
2,460,""
2,573,""
2,850,""
2,955,""
2,1063,""
2,1173,""
2,1281,""
2,1395,""
2,1504,""
3,1609,"running"
3,1668,"running"
3,1747,"running"
3,1829,"running"
3,1917,"running"
3,2002,"running"
3,2084,"running"
3,2170,"running"
3,2248,"running"
3,2334,"running"
3,2435,"running"
3,2518,"running"
3,2609,"running"
3,2698,"running"
3,2784,"running"
3,2871,"running"
3,2959,"running"
3,3045,"running"
$ IDX=3045; nomad operator raft _state --last-index $IDX . | jq '.JobSummaries[] | select(.JobID=="horizontal_hey") | .Children'
{
  "Pending": 1,
  "Running": -18,
  "Dead": 29
}
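
The negative counter follows directly from the bookkeeping: on a status transition, the summary decrements the bucket for the old status and increments the bucket for the new one. A toy model (not Nomad's actual StateStore.setJobSummary code) reproduces the jump from index 1608 to 1609:

package main

import "fmt"

// children mirrors the Pending/Running/Dead counters in a job summary.
type children struct {
	Pending, Running, Dead int
}

// transition applies decrement-old/increment-new bookkeeping for a child
// moving between statuses.
func (c *children) transition(oldStatus, newStatus string) {
	buckets := map[string]*int{
		"pending": &c.Pending,
		"running": &c.Running,
		"dead":    &c.Dead,
	}
	if b, ok := buckets[oldStatus]; ok {
		*b--
	}
	if b, ok := buckets[newStatus]; ok {
		*b++
	}
}

func main() {
	c := &children{Dead: 12} // the state at index 1608 above

	// Buggy path: the derived child carries the parent's "running" status
	// at registration and immediately transitions to "pending". It was
	// never actually counted in Running, so Running goes to -1.
	c.transition("running", "pending")
	fmt.Printf("%+v\n", c) // {Pending:1 Running:-1 Dead:12}, as at index 1609
}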

@tgross (Member) left a comment:

LGTM.

I took a pass through the rest of structs.Job to see if there was anything else we needed to hit. Looks like ModifyIndex and CreateIndex will get handled when we upsert into raft. (Multiregion also gave me pause, but a job that's been registered will already have the "multiregion interpolated" version).


// the dispatched child must not inherit the parent's status
dispatched := m.dispatchedJobs(job)
require.NotEmpty(t, dispatched)
require.Empty(t, dispatched[0].Status)
@tgross (Member) commented:

Up to you, but would verifying that the status eventually does get set correctly be a lot of lift for this test? Just to make sure we're not relying on a side-effect of the bug.

@notnoop (Contributor, Author) replied:

I tried to write an e2e or higher-level integration test, but I fear it would be flaky and slow: it depends on a periodic job triggering and running after a leadership transition.

Alternatively, we could force the conditions by manipulating the job store directly: force the inserted job into running state, then force a job update. After that scaffolding, it's not obvious to me that the resulting test protects against a regression any better than this one, and it would be slower. A rough sketch of what it could look like is below.
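
A sketch of that forced-conditions test, using toy stand-ins for the summary bookkeeping rather than Nomad's real state store or mock helpers (everything here is hypothetical):

package periodic_test

import "testing"

// summary is a toy stand-in for the job-summary children counters;
// none of this is Nomad's actual test API.
type summary struct{ Pending, Running, Dead int }

func (s *summary) bucket(status string) *int {
	switch status {
	case "pending":
		return &s.Pending
	case "running":
		return &s.Running
	case "dead":
		return &s.Dead
	}
	return nil
}

// register counts a newly registered child as pending, decrementing the
// bucket for whatever status the inserted job carried. resetStatus models
// the fix under test: derived children never carry a status.
func register(s *summary, carriedStatus string, resetStatus bool) {
	if resetStatus {
		carriedStatus = ""
	}
	if b := s.bucket(carriedStatus); b != nil {
		*b--
	}
	s.Pending++
}

func TestForcedRunningStatus(t *testing.T) {
	s := &summary{}
	// Force the condition a new leader creates: the child is inserted
	// already carrying the parent's "running" status.
	register(s, "running", true)
	if s.Running < 0 {
		t.Fatalf("Running counter went negative: %d", s.Running)
	}
}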

@tgross (Member) commented Mar 9, 2021

If you rebase on main you'll get the fix for lint-go @notnoop

@jrasell (Member) left a comment:

LGTM, thanks for the great info and the quick work!

@tgross (Member) commented Mar 24, 2021

Potentially related issue: #10222

Mahmood Ali added 2 commits March 25, 2021 11:27 (the commit messages repeat the PR description above).
@notnoop (Contributor, Author) commented Mar 25, 2021

I've updated the PR to reset the status for dispatch jobs as well. It's another manifestation of the 1.0.3 regression bug, with a failing test added in 032945b; the failure can be seen in https://app.circleci.com/pipelines/github/hashicorp/nomad/15268/workflows/a938267c-e127-49f9-9cef-707ae434d486/jobs/144218 .

Apparently, in some cases, we did reset the job status in the setJobStatus handler before, but the logic changed in https://github.com/hashicorp/nomad/pull/9768/files#diff-53c7d524897a192ece819816a9dcf006dee46747fbb853e7f638601ea556a72cL4430-L4432 . Sadly, I'm failing to get the tests passing if I unilaterally set the job status to "", so I will dig in more.

@notnoop requested a review from tgross March 25, 2021 23:18
@tgross (Member) left a comment:

LGTM. Hitting this bug with the dispatch jobs really helps, because we don't need the non-determinism of a leader election to reproduce it.

@@ -4,10 +4,12 @@ job "periodic" {

   constraint {
     attribute = "${attr.kernel.name}"
-    value     = "linux"
+    operator  = "set_contains_any"
+    value     = "darwin,linux"
@tgross (Member) commented:
That's handy, good catch. There's probably a bunch of tests where this is safe to do. Something for a separate PR though.

@notnoop (Contributor, Author) replied:

Interestingly, set_contains_any isn't documented in https://www.nomadproject.io/docs/job-specification/constraint - it's only documented for affinities, so we should update that too.

Also, for future readers: I considered basing the constraint on the docker OS attribute, but chose not to. Windows is a bit slow, and I didn't want to run tests on LCOW Windows clients if we add some.

@notnoop merged commit 274e795 into main Mar 26, 2021
@notnoop deleted the b-periodic-init-status branch March 26, 2021 13:19
tgross added a commit that referenced this pull request Mar 30, 2021
* Fixed order
* Added missing changelog entry for GH-10145
* Fixed misclassified entry for GH-10158
@tgross added this to the 1.1.0 milestone Apr 28, 2021
schmichael pushed a commit that referenced this pull request May 14, 2021
periodic: always reset periodic children status
@lgfa29 mentioned this pull request Sep 2, 2021

@github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022
3 participants