Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of fix deadlock in plan_apply into release/1.1.x #13469

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #13407 to be assessed for backporting due to the inclusion of the label backport/1.1.x.

The below text is copied from the body of the original PR.


The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the indexCh forever, and
all schedulers will block in Plan.Submit.

Closing the indexCh when the asyncPlanWait is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.


Note for reviewers: the test code here can't actually exercise this bug,
because there's no interface for us to introduce this timeout. But this
code path was not covered by unit testing, so with these tests we at least
exercise the normal state.

6fab937 is how we verified the behavior and that the fix works, but we'll
want to circle back later to build out some fault injection into the state
store code at some point.

The deadlocked goroutine stack looks like this:

goroutine 625541538 [chan receive, 514 minutes]:
github.com/hashicorp/nomad/nomad.(*planner).planApply(0xc000a410a0)
	github.com/hashicorp/nomad/nomad/plan_apply.go:156 +0x35e
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
	github.com/hashicorp/nomad/nomad/leader.go:250 +0x21e

cc @jazzyfresh @lgfa29 @mikenomitch
Fixes #10289

@hc-github-team-nomad-core hc-github-team-nomad-core merged commit 9f14e71 into release/1.1.x Jun 23, 2022
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/plan-apply-deadlock/precisely-huge-anemone branch June 23, 2022 16:07
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants