Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of fix deadlock in plan_apply into release/1.3.x #13471

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #13407 to be assessed for backporting due to the inclusion of the label backport/1.3.x.

The below text is copied from the body of the original PR.


The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the indexCh forever, and
all schedulers will block in Plan.Submit.

Closing the indexCh when the asyncPlanWait is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.


Note for reviewers: the test code here can't actually exercise this bug,
because there's no interface for us to introduce this timeout. But this
code path was not covered by unit testing, so with these tests we at least
exercise the normal state.

6fab937 is how we verified the behavior and that the fix works, but we'll
want to circle back later to build out some fault injection into the state
store code at some point.

The deadlocked goroutine stack looks like this:

goroutine 625541538 [chan receive, 514 minutes]:
github.com/hashicorp/nomad/nomad.(*planner).planApply(0xc000a410a0)
	github.com/hashicorp/nomad/nomad/plan_apply.go:156 +0x35e
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
	github.com/hashicorp/nomad/nomad/leader.go:250 +0x21e

cc @jazzyfresh @lgfa29 @mikenomitch
Fixes #10289

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/plan-apply-deadlock/optionally-ace-lacewing branch from 02bc3cb to 646b6fa Compare June 23, 2022 16:07
@hc-github-team-nomad-core hc-github-team-nomad-core merged commit 86fd5d9 into release/1.3.x Jun 23, 2022
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/plan-apply-deadlock/optionally-ace-lacewing branch June 23, 2022 16:07
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants