Multi task-groups jobs hit canary progress deadline in promote #7058
Comments
I am encountering the same issue as well. We have a particular job with 20 allocations of one task group and another task group with a single task/allocation. We typically deploy a canary (with a single alloc of each task group) and then try to promote the next day. The behavior we see is that the deployment fails immediately with the same error message.
Hey there! Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look. Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍
Stalebot hit this and shouldn't have. We need to do some follow-up here.
Ok, so I've spent a good bit of the week trying to figure this one out, and with some clues from my colleague @rc407's initial investigation I think I've finally got it. I want to brain-dump what I know at this point.

An important detail to note here is that Nomad tracks the progress deadline for each task group separately, even though there's only a single deployment for a given version of a job. You can configure each task group with its own progress_deadline.

In the easy case without canaries, once we pass the progress deadline the deployment watcher marks the deployment as failed. When canaries are in play, each group's deadline is set when that group's canary becomes healthy. Any update to a deployment triggers the deployment status update channel in our watcher, which recomputes the deadline across the task groups. The problem is that when the group that set the deadline finishes, the next recalculation only has the remaining group's stale value to work with.

Consider a job with two task groups:

job "example" {
group "group1" {
count = 1
update {
auto_revert = false
canary = 1
progress_deadine = "10m"
}
}
group "group2" {
count = 10
update {
auto_revert = false
canary = 1
progress_deadine = "10m"
}
}
}

The first group's canary becomes healthy at T01:00 and sets the group's deadline ten minutes out, at T11:00. The second group's canary becomes healthy a minute later and sets its group's deadline at T12:00.

The progress deadline expires, and some time later, at T15:00, we promote the deployment. The promotion triggers the deployment update channel, but note that we don't yet have any new allocations. When we compare the groups, we query the first group and find out it's done, so it doesn't contribute a value. That leaves only the second group to set the deadline, and this is where we hit the bug: that stale value is a change from what we had, so we update the deadline and set it into the past! This immediately triggers our timeout and the deployment fails.

So what do we do about it? I think we can fix this by having manual promotion immediately reset the progress deadline. Another option would be to change the deadline recalculation itself.

I've now finally got a failing test which proves the problem out. I'm at the end of my week here, but I'll be able to drop a patch sometime next week; it looks like it should be a pretty small intervention. My working branch contains the following test patch:

diff --git a/nomad/deploymentwatcher/deployments_watcher_test.go b/nomad/deploymentwatcher/deployments_watcher_test.go
index 8777a398f..8fa635577 100644
--- a/nomad/deploymentwatcher/deployments_watcher_test.go
+++ b/nomad/deploymentwatcher/deployments_watcher_test.go
@@ -1334,6 +1334,216 @@ func TestDeploymentWatcher_PromotedCanary_UpdatedAllocs(t *testing.T) {
})
}
+func TestDeploymentWatcher_ProgressDeadline_LatePromote(t *testing.T) {
+ t.Parallel()
+ require := require.New(t)
+ mtype := structs.MsgTypeTestSetup
+
+ w, m := defaultTestDeploymentWatcher(t)
+ w.SetEnabled(true, m.state)
+
+ m.On("UpdateDeploymentStatus", mocker.MatchedBy(func(args *structs.DeploymentStatusUpdateRequest) bool {
+ return true
+ })).Return(nil).Maybe()
+
+ progressTimeout := time.Millisecond * 10000
+ j := mock.Job()
+ j.TaskGroups[0].Name = "group1"
+ j.TaskGroups[0].Update = structs.DefaultUpdateStrategy.Copy()
+ j.TaskGroups[0].Update.MaxParallel = 2
+ j.TaskGroups[0].Update.AutoRevert = false
+ j.TaskGroups[0].Update.ProgressDeadline = progressTimeout
+ j.TaskGroups = append(j.TaskGroups, j.TaskGroups[0].Copy())
+ j.TaskGroups[0].Name = "group2"
+
+ d := mock.Deployment()
+ d.JobID = j.ID
+ d.TaskGroups = map[string]*structs.DeploymentState{
+ "group1": {
+ ProgressDeadline: progressTimeout,
+ Promoted: false,
+ PlacedCanaries: []string{},
+ DesiredCanaries: 1,
+ DesiredTotal: 3,
+ PlacedAllocs: 0,
+ HealthyAllocs: 0,
+ UnhealthyAllocs: 0,
+ },
+ "group2": {
+ ProgressDeadline: progressTimeout,
+ Promoted: false,
+ PlacedCanaries: []string{},
+ DesiredCanaries: 1,
+ DesiredTotal: 1,
+ PlacedAllocs: 0,
+ HealthyAllocs: 0,
+ UnhealthyAllocs: 0,
+ },
+ }
+
+ require.NoError(m.state.UpsertJob(mtype, m.nextIndex(), j))
+ require.NoError(m.state.UpsertDeployment(m.nextIndex(), d))
+
+ // require that we get a call to UpsertDeploymentPromotion
+ matchConfig := &matchDeploymentPromoteRequestConfig{
+ Promotion: &structs.DeploymentPromoteRequest{
+ DeploymentID: d.ID,
+ All: true,
+ },
+ Eval: true,
+ }
+ matcher := matchDeploymentPromoteRequest(matchConfig)
+ m.On("UpdateDeploymentPromotion", mocker.MatchedBy(matcher)).Return(nil).Run(func(args mocker.Arguments) {
+ fmt.Println(d.ID)
+ // spew.Dump(d)
+ })
+
+ m1 := matchUpdateAllocDesiredTransitions([]string{d.ID})
+ m.On("UpdateAllocDesiredTransition", mocker.MatchedBy(m1)).Return(nil)
+
+ // create canaries
+
+ now := time.Now()
+
+ canary1 := mock.Alloc()
+ canary1.Job = j
+ canary1.DeploymentID = d.ID
+ canary1.TaskGroup = "group1"
+ canary1.DesiredStatus = structs.AllocDesiredStatusRun
+ canary1.ModifyTime = now.UnixNano()
+
+ canary2 := mock.Alloc()
+ canary2.Job = j
+ canary2.DeploymentID = d.ID
+ canary2.TaskGroup = "group2"
+ canary2.DesiredStatus = structs.AllocDesiredStatusRun
+ canary2.ModifyTime = now.UnixNano()
+
+ allocs := []*structs.Allocation{canary1, canary2}
+ err := m.state.UpsertAllocs(mtype, m.nextIndex(), allocs)
+ require.NoError(err)
+
+ // 2nd group's canary becomes healthy
+
+ now = time.Now()
+
+ canary2 = canary2.Copy()
+ canary2.ModifyTime = now.UnixNano()
+ canary2.DeploymentStatus = &structs.AllocDeploymentStatus{
+ Canary: true,
+ Healthy: helper.BoolToPtr(true),
+ Timestamp: now,
+ }
+
+ allocs = []*structs.Allocation{canary2}
+ err = m.state.UpdateAllocsFromClient(mtype, m.nextIndex(), allocs)
+ require.NoError(err)
+
+ // wait for long enough to ensure we read deployment update channel
+ // this sleep creates the race condition associated with #7058
+ time.Sleep(50 * time.Millisecond)
+
+ // 1st group's canary becomes healthy
+ now = time.Now()
+
+ canary1 = canary1.Copy()
+ canary1.ModifyTime = now.UnixNano()
+ canary1.DeploymentStatus = &structs.AllocDeploymentStatus{
+ Canary: true,
+ Healthy: helper.BoolToPtr(true),
+ Timestamp: now,
+ }
+
+ allocs = []*structs.Allocation{canary1}
+ err = m.state.UpdateAllocsFromClient(mtype, m.nextIndex(), allocs)
+ require.NoError(err)
+
+ // ensure progress_deadline has definitely expired
+ time.Sleep(progressTimeout)
+
+ // promote the deployment
+
+ req := &structs.DeploymentPromoteRequest{
+ DeploymentID: d.ID,
+ All: true,
+ }
+ err = w.PromoteDeployment(req, &structs.DeploymentUpdateResponse{})
+ require.NoError(err)
+
+ // wait for long enough to ensure we read deployment update channel
+ time.Sleep(50 * time.Millisecond)
+
+ // create new allocs for promoted deployment
+ // (these come from plan_apply, not a client update)
+ now = time.Now()
+
+ alloc1a := mock.Alloc()
+ alloc1a.Job = j
+ alloc1a.DeploymentID = d.ID
+ alloc1a.TaskGroup = "group1"
+ alloc1a.ClientStatus = structs.AllocClientStatusPending
+ alloc1a.DesiredStatus = structs.AllocDesiredStatusRun
+ alloc1a.ModifyTime = now.UnixNano()
+
+ alloc1b := mock.Alloc()
+ alloc1b.Job = j
+ alloc1b.DeploymentID = d.ID
+ alloc1b.TaskGroup = "group1"
+ alloc1b.ClientStatus = structs.AllocClientStatusPending
+ alloc1b.DesiredStatus = structs.AllocDesiredStatusRun
+ alloc1b.ModifyTime = now.UnixNano()
+
+ allocs = []*structs.Allocation{alloc1a, alloc1b}
+ err = m.state.UpsertAllocs(mtype, m.nextIndex(), allocs)
+ require.NoError(err)
+
+ // allocs become healthy
+
+ now = time.Now()
+
+ alloc1a = alloc1a.Copy()
+ alloc1a.ClientStatus = structs.AllocClientStatusRunning
+ alloc1a.ModifyTime = now.UnixNano()
+ alloc1a.DeploymentStatus = &structs.AllocDeploymentStatus{
+ Canary: false,
+ Healthy: helper.BoolToPtr(true),
+ Timestamp: now,
+ }
+
+ alloc1b = alloc1b.Copy()
+ alloc1b.ClientStatus = structs.AllocClientStatusRunning
+ alloc1b.ModifyTime = now.UnixNano()
+ alloc1b.DeploymentStatus = &structs.AllocDeploymentStatus{
+ Canary: false,
+ Healthy: helper.BoolToPtr(true),
+ Timestamp: now,
+ }
+
+ allocs = []*structs.Allocation{alloc1a, alloc1b}
+ err = m.state.UpdateAllocsFromClient(mtype, m.nextIndex(), allocs)
+ require.NoError(err)
+
+ // ensure any progress deadline has expired
+ time.Sleep(progressTimeout * 1)
+
+ // without a scheduler running we'll never mark the deployment as
+ // successful, so test that healthy == desired and that we haven't failed
+ deployment, err := m.state.DeploymentByID(nil, d.ID)
+ require.NoError(err)
+ require.Equal(structs.DeploymentStatusRunning, deployment.Status)
+
+ group1 := deployment.TaskGroups["group1"]
+
+ require.Equal(group1.DesiredTotal, group1.HealthyAllocs, "not enough healthy")
+ require.Equal(group1.DesiredTotal, group1.PlacedAllocs, "not enough placed")
+ require.Equal(0, group1.UnhealthyAllocs)
+
+ group2 := deployment.TaskGroups["group2"]
+ require.Equal(group2.DesiredTotal, group2.HealthyAllocs, "not enough healthy")
+ require.Equal(group2.DesiredTotal, group2.PlacedAllocs, "not enough placed")
+ require.Equal(0, group2.UnhealthyAllocs)
+}
+
// Test scenario where deployment initially has no progress deadline
// After the deployment is updated, a failed alloc's DesiredTransition should be set
 func TestDeploymentWatcher_Watch_StartWithoutProgressDeadline(t *testing.T) {

Failing results for that test demonstrate the bug.
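To make the failure mode concrete, here's a minimal Go sketch of the recalculation described above. The groupState type and nextCutoff function are hypothetical stand-ins rather than Nomad's actual deployment-watcher code; the sketch only shows how dropping a finished group and reusing the remaining group's stale timestamp produces a cutoff in the past, and how resetting the baseline at promotion time would avoid that.

package main

import (
	"fmt"
	"time"
)

// groupState is a hypothetical stand-in for per-group deployment state.
type groupState struct {
	done             bool          // group has reached its desired healthy count
	lastHealthy      time.Time     // timestamp of the most recent healthy allocation
	progressDeadline time.Duration // per-group progress_deadline
}

// nextCutoff mirrors the buggy recalculation: only groups that are not done
// contribute, so after promotion the finished group drops out and the
// remaining group's stale timestamp alone sets the cutoff.
func nextCutoff(groups map[string]*groupState) time.Time {
	var cutoff time.Time
	for _, g := range groups {
		if g.done {
			continue
		}
		c := g.lastHealthy.Add(g.progressDeadline)
		if cutoff.IsZero() || c.Before(cutoff) {
			cutoff = c
		}
	}
	return cutoff
}

func main() {
	deadline := 10 * time.Minute
	now := time.Now() // stands in for the promotion time, T15:00 in the timeline above

	groups := map[string]*groupState{
		// group1's canary became healthy ~14 minutes ago and the group is
		// considered done, so it no longer contributes to the cutoff.
		"group1": {done: true, lastHealthy: now.Add(-14 * time.Minute), progressDeadline: deadline},
		// group2's canary became healthy ~13 minutes ago; its stale timestamp
		// is all that remains after promotion.
		"group2": {done: false, lastHealthy: now.Add(-13 * time.Minute), progressDeadline: deadline},
	}

	fmt.Println("cutoff already expired?", nextCutoff(groups).Before(now)) // true: deployment fails immediately

	// Proposed fix: treat promotion as progress and reset the baseline for
	// every group, so the next cutoff is measured from the promotion time.
	for _, g := range groups {
		g.lastHealthy = now
	}
	fmt.Println("cutoff already expired?", nextCutoff(groups).Before(now)) // false: deployment keeps running
}

The design point of the sketch is that only the baseline used at promotion time needs to change, not the per-group bookkeeping itself, which lines up with the expectation above that the fix should be a pretty small intervention.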
Fantastic troubleshooting here @tgross, thanks so much for digging into this. This bug has plagued us when we deploy one of our largest services with two task groups and canaries, where we want to wait a day before canary promotion to allow more time to determine whether we want to promote our new version.
Nomad version
Nomad v0.9.5
Operating system and Environment details
Ubuntu 16.04
Issue
Multi task-group jobs are failing due to the progress deadline
Reproduction steps
Create a job with 2 task groups, using the default progress deadline of 10m and a canary deployment for both task groups. Start the deployment; after the canaries are running healthy, wait until you hit the deadline (which should not be in effect by now, as it refers to healthy instances). Promote the deployment. The deployment fails with "Failed due to progress deadline".
I am not able to reproduce this on every attempt, which hints at a possible secondary root cause that I haven't identified just yet.
We initially thought it was due to #4738, but we have now upgraded to a version that includes that fix and we are still seeing this. Furthermore, I was able to reproduce this for deployments where no canary instances had failed.
Last but not least, I haven't been able to reproduce this at all for jobs with only one task group.
Job file
Solutions discussion
I've already verified that promoting prior to the deadline and setting the deadline to 0 (job will fail whenever a new instance fails) work as expected.
I am going to try and provide you all with some DEBUG logs.