task runner to avoid running task if terminal #5890
Conversation
Force-pushed from 8603028 to f42aa1e
```go
dead := tr.state.State == structs.TaskStateDead
tr.stateLock.RUnlock()

if dead {
```
Here I only check if the task itself is dead. I suspect we should be checking whether the restored alloc had a terminal alloc state instead. I suspect an alloc whose tasks have mixed statuses causes some complications?
Actually, this is the right behavior. An alloc is still considered running if one task completes, and all of its tasks will be killed if the leader task dies or a task fails enough times. Until that happens, we should treat the other tasks as running.
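To make that rule concrete, here is a minimal hypothetical sketch (not Nomad's actual code) of deriving the alloc-level decision from per-task states; `allocStillRunning` is an illustrative name:

```go
package allocrunner

import "github.com/hashicorp/nomad/nomad/structs"

// allocStillRunning illustrates the rule above: an alloc counts as
// running while any of its tasks has not yet reached the dead state.
func allocStillRunning(states map[string]*structs.TaskState) bool {
	for _, st := range states {
		if st.State != structs.TaskStateDead {
			return true
		}
	}
	return false
}
```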
This change fixes a bug where Nomad would avoid running alloc tasks if the alloc is client terminal but the server copy on the client isn't marked as running.

Here, we fix the case by having the task runner use allocRunner.shouldRun() instead of only checking the server-updated alloc.

We preserve the invariant that `tr.Run()` is always run, and don't change the overall alloc runner and task runner lifecycles.

Fixes #5883
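A pared-down sketch of the shape of that fix, with hypothetical field names (`allocShouldRun` stands in for the wiring to `allocRunner.shouldRun()`; the real `TaskRunner` struct is far larger). The point is the order of the checks: a task that is already dead never runs again, and otherwise the alloc runner decides.

```go
package taskrunner

import (
	"sync"

	"github.com/hashicorp/nomad/nomad/structs"
)

type TaskRunner struct {
	stateLock sync.RWMutex
	state     *structs.TaskState

	// allocShouldRun is assumed to be wired to allocRunner.shouldRun(),
	// so the task runner defers the alloc-level decision to the alloc
	// runner instead of inspecting the server-updated alloc itself.
	allocShouldRun func() bool
}

// shouldRestoreRun sketches the restore-time decision.
func (tr *TaskRunner) shouldRestoreRun() bool {
	tr.stateLock.RLock()
	dead := tr.state.State == structs.TaskStateDead
	tr.stateLock.RUnlock()

	if dead {
		return false
	}
	return tr.allocShouldRun()
}
```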
Force-pushed from f42aa1e to f3c944a
Great catch!
```go
// TestAllocRunner_Restore_Completed asserts that restoring a completed
// batch alloc doesn't run it again
func TestAllocRunner_Restore_CompletedBatch(t *testing.T) {
```
Name/comment mismatch: the comment says TestAllocRunner_Restore_Completed but the function is named TestAllocRunner_Restore_CompletedBatch.
Test looks good, but just to verify:
- Does it fail without your fixes?
- Does it pass with `-race`?
Yes, it passes with `-race`, and it was failing before; here is a sample build failure [1] from adding the test alone. The failure snippet is:
```
goroutine 87 [chan receive, 14 minutes]:
github.com/hashicorp/nomad/client/allocrunner.destroy(0xc000342780)
	/home/travis/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner_test.go:27 +0x54
runtime.Goexit()
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/runtime/panic.go:406 +0xed
testing.(*common).FailNow(0xc000449b00)
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/testing/testing.go:609 +0x39
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/require.Fail(0x18348e0, 0xc000449b00, 0x15fc0e0, 0x1a, 0x0, 0x0, 0x0)
	/home/travis/gopath/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/require/require.go:285 +0xf0
github.com/hashicorp/nomad/client/allocrunner.TestAllocRunner_Restore_CompletedBatch(0xc000449b00)
	/home/travis/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner_unix_test.go:204 +0xb22
testing.tRunner(0xc000449b00, 0x1639ae0)
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/testing/testing.go:865 +0xc0
created by testing.(*T).Run
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/testing/testing.go:916 +0x35a
```
As seen in the stack trace, we fail at line 204 [1] because AR.wait() times out, and then the deferred destroy call times out again. I'll follow up in another PR to change the deferred destroy call so that it errors rather than blocking indefinitely on failure, which will make these errors easier to track (see the sketch after the links below).
[1] https://travis-ci.org/hashicorp/nomad/jobs/553113545
[2] https://github.com/hashicorp/nomad/compare/b-dont-start-completed-allocs-2-test-only?expand=1
[3] https://github.com/hashicorp/nomad/compare/b-dont-start-completed-allocs-2-test-only?expand=1#diff-41decefd2f35059b5c0b95166e275653R204
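A minimal sketch of that follow-up idea, assuming `waitCh` is closed when the runner's Destroy completes (the real helper's signature may differ): the test fails after a deadline instead of hanging for the full CI timeout.

```go
package allocrunner

import (
	"testing"
	"time"
)

// destroyWithTimeout fails the test if the alloc runner does not finish
// destroying within the given deadline, rather than blocking forever.
func destroyWithTimeout(t *testing.T, waitCh <-chan struct{}, timeout time.Duration) {
	t.Helper()

	select {
	case <-waitCh:
		// Runner destroyed cleanly.
	case <-time.After(timeout):
		t.Fatalf("alloc runner failed to destroy within %s", timeout)
	}
}
```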
```go
if err := tr.stop(); err != nil {
	tr.logger.Error("stop failed on terminal task", "error", err)
}
return
```
I think we may also want to call tr.TaskStateUpdated() since task states are persisted before the AR is notified. Therefore I think the following could happen:
- 2 tasks in an alloc start: a leader service, and a sidecar
- Leader task exits, persists TaskStateDead
- agent crashes before TaskStateUpdated is called
- agent restarts, returns here due to TaskStateDead
At this point I do not think anything will have told the sidecar service to exit despite its leader dying. If you call TaskStateUpdated here, then all of the leader-died detection logic in AR will be run: https://github.com/hashicorp/nomad/blob/master/client/allocrunner/alloc_runner.go#L415-L438
This could be done in a followup PR as well since I think your changes improve the situation.
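A sketch of that suggestion, reusing the names from the snippet above (the exact wiring inside `tr.Run()` may differ):

```go
if dead {
	if err := tr.stop(); err != nil {
		tr.logger.Error("stop failed on terminal task", "error", err)
	}

	// Notify the alloc runner even on this restore path: if the agent
	// crashed after persisting TaskStateDead but before the original
	// notification, this replays AR's leader-died detection so sidecar
	// tasks are stopped too.
	tr.TaskStateUpdated()
	return
}
```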
This fixes a bug where allocs that have been GCed get re-run after the client is restarted. A heavily-used client may launch thousands of allocs on startup and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in the client's alloc runner set, and it periodically gets persisted until the alloc is GCed by the server. During that time, the client DB contains the alloc but not its individual task states or completed state, so on client restart the client assumes the alloc is in the pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB.

This is a short-term fix; we should consider revamping client state management. Storing alloc and task information non-transactionally and non-atomically, concurrently while the alloc runner is running and potentially changing state, is a recipe for bugs.

Fixes #5984
Related to #5890
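A minimal sketch of that guard, with hypothetical field names (`persist` stands in for the real state DB write): once destroyed, an alloc runner must never write itself back into the client state DB, or a restarted client will see a task-less "pending" alloc and re-run it.

```go
package allocrunner

import "sync"

type allocRunner struct {
	destroyed     bool
	destroyedLock sync.Mutex

	persist func() error // stands in for the real state DB write
}

func (ar *allocRunner) PersistState() error {
	ar.destroyedLock.Lock()
	defer ar.destroyedLock.Unlock()

	if ar.destroyed {
		// State was already purged from the state DB; persisting now
		// would resurrect the alloc without its task state.
		return nil
	}
	return ar.persist()
}
```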