consul: fix deadlock in check-based restarts #5975
Changes from 1 commit
@@ -9,6 +9,7 @@ import (
	"github.com/hashicorp/consul/api"
	"github.com/hashicorp/nomad/helper/testlog"
	"github.com/hashicorp/nomad/nomad/structs"
	"github.com/stretchr/testify/require"
)

// checkRestartRecord is used by a testFakeCtx to record when restarts occur
@@ -194,6 +195,28 @@ func TestCheckWatcher_Healthy(t *testing.T) {
	}
}

// TestCheckWatcher_Unhealthy asserts unhealthy tasks are restarted exactly once.
func TestCheckWatcher_Unhealthy(t *testing.T) {

Review thread on this line:

Reviewer: I'm missing some context for this test - does it trigger the deadlock issue here? Is it a relatively easy thing to test for?

Author: No, this test just asserts checks are only restarted once. I added a new test for the deadlock in 1763672 and confirmed it does cause the deadlock before my changes.

	t.Parallel()

	fakeAPI, cw := testWatcherSetup(t)

	check1 := testCheck()
	restarter1 := newFakeCheckRestarter(cw, "testalloc1", "testtask1", "testcheck1", check1)
	cw.Watch("testalloc1", "testtask1", "testcheck1", check1, restarter1)

	// Check has always been failing
	fakeAPI.add("testcheck1", "critical", time.Time{})

	// Run
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	cw.Run(ctx)

	// Ensure restart was called exactly once
	require.Len(t, restarter1.restarts, 1)
}

// TestCheckWatcher_HealthyWarning asserts checks in warning with
// ignore_warnings=true do not restart tasks.
func TestCheckWatcher_HealthyWarning(t *testing.T) {

Review thread:

Reviewer: Question: does it make sense to use a semaphore or channel blocking technique used elsewhere, so we don't call task.Restart concurrently, and if we get a spike of Restart applies, we only restart once?
Author: Good question!

The checkWatcher.Run loop removes a check after Restart is called, so the same task won't be restarted more than once (until it completes the restart and re-registers the check).

In 0.8, TR.Restart just ticked a chan and so was async without having to create a new goroutine. This seemed like the least risky way of replicating that behavior.

Tasks that fail in a tight loop (check_restart.grace=0 and restart.delay=0) could in theory spin up lots of goroutines, but the goroutines for a single task should rarely if ever overlap, and restarting a task already involves creating a lot of resources more expensive than a goroutine.

That being said, I hate this "fire and forget" pattern, so I'm open to ideas as long as they can't block checkWatcher.Run / checkRestart.apply. (checkWatcher should probably be refactored to separate Watch/Unwatch mutations from check watching, but that seemed way too big a risk for a point release.)
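
For context on the fire-and-forget pattern discussed above, here is a minimal sketch of the idea, not the actual Nomad implementation: the type names (checkWatcher, TaskRestarter), the unhealthy() callback, and the field layout are all illustrative assumptions. The watch loop triggers each restart in its own goroutine so a slow restart can never block the loop, and it drops the check from its watched set immediately so the task is not restarted again until the restart finishes and the check is re-registered via Watch.

package checkwatcher

import (
	"context"
	"fmt"
	"time"
)

// TaskRestarter is a stand-in for whatever the watcher calls to restart a
// task; the real call may block while the task is stopped and started again.
type TaskRestarter interface {
	Restart(ctx context.Context, reason string)
}

// checkWatcher is a simplified, illustrative watcher that polls check
// results and restarts tasks whose checks are unhealthy.
type checkWatcher struct {
	checks map[string]TaskRestarter // check ID -> restarter for its task
}

// Run polls for unhealthy checks until ctx is cancelled. Restarts are fired
// in new goroutines (fire and forget) so a blocked restart can never stall
// this loop, and each check is removed from the watched set immediately so
// the same task is not restarted again while its first restart is in flight.
func (w *checkWatcher) Run(ctx context.Context, unhealthy func() []string) {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, checkID := range unhealthy() {
				restarter, ok := w.checks[checkID]
				if !ok {
					continue
				}

				// Async restart: similar in spirit to 0.8's "tick a chan",
				// but it cannot block even if the task runner is busy.
				go restarter.Restart(ctx, fmt.Sprintf("check %q unhealthy", checkID))

				// Stop watching until the restarted task re-registers the
				// check via Watch.
				delete(w.checks, checkID)
			}
		}
	}
}

The trade-off raised above shows up directly in this sketch: every unhealthy result costs one goroutine, but because the check is deleted from the watched set before the next tick, a given task normally has at most one restart goroutine outstanding at a time.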