Issue
When a deployment fails, the reversion is treated like a new deployment that goes through the canary phase. To make things worse, it triggers the bug in #10098, so it leaves the bad allocations intact!
I expect a reversion to be immediate. Also, considering that the target version was marked "Stable", I'd expect the rollback to bypass the canary phase to get the service up quickly. It may even be reasonable to skip the auto_revert logic: if an allocation becomes unhealthy spuriously during the rollback, we probably shouldn't revert to an even earlier version.
That being said, reverts can be complicated in real life. If the earlier version isn't forward compatible with the reverted deployment (e.g. the new deployment migrated a database or committed data that is incompatible with the old one), the previous stable version will fail to start and will require manual intervention. I'm not sure what we can do there, though.
Reproduction steps
1. nomad job run example.hcl with an example job file that uses canaries and max_parallel (a sketch of such a job follows this list).
2. Change the job's value from value0 to value1 and submit the updated job.
3. Promote the new deployment with nomad deployment promote ... after the canaries pass.
4. Fail the deployment shortly after promotion with nomad deployment fail.
Here we mimic discovering a subtle bug that wasn't detected by health checks or canary smoke testing.
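The example.hcl itself isn't attached to this report, so the following is only a minimal sketch of the kind of job it describes: a service job with an update stanza using canaries, max_parallel, and auto_revert. The count and canary values are inferred from the deployment output below; the driver, image, max_parallel, and deadlines are illustrative assumptions.

```hcl
# Hypothetical sketch of the example.hcl referenced above; the real file is not
# included in the issue. Counts are inferred from the CLI output below
# (Desired = 10, Canaries = 2); everything else is a guess.
job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 10

    update {
      max_parallel      = 5      # assumed; any value > 1 is enough for the repro
      canary            = 2      # matches the 2 canary allocations in the output
      auto_revert       = true   # matches "Auto Revert = true" in the deployment status
      auto_promote      = false  # canaries require a manual "nomad deployment promote"
      health_check      = "task_states"
      healthy_deadline  = "1m"
      progress_deadline = "2m"
    }

    task "web" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-text", "value0"] # step 2 of the repro changes this to "value1"
      }
    }
  }
}
```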
Expected Result
We immediately revert to version 0: the version 1 allocations are stopped and replaced by version 0 allocations.
Actual Result
A new deployment for Job Version 2 (a copy of Version 0) is created. The deployment places 2 canary allocations, and all version 1 allocations are left intact until the operator promotes the version 2 deployment.
mars:gh-10098 notnoop$ nomad deployment fail e7d21958
Deployment "e7d21958-7c86-4ad0-52f1-6ef03c79bfcd" failed. Auto-reverted to job version 0.
==> 2021-07-09T13:15:21-04:00: Monitoring evaluation "203219de"
2021-07-09T13:15:21-04:00: Evaluation triggered by job "example"
2021-07-09T13:15:21-04:00: Evaluation within deployment: "e7d21958"
==> 2021-07-09T13:15:22-04:00: Monitoring evaluation "203219de"
2021-07-09T13:15:22-04:00: Allocation "ac07d510" created: node "c6c45bc1", group "web"
2021-07-09T13:15:22-04:00: Allocation "670ea311" modified: node "c6c45bc1", group "web"
2021-07-09T13:15:22-04:00: Allocation "9382e28d" modified: node "c6c45bc1", group "web"
2021-07-09T13:15:22-04:00: Allocation "e6e6a226" modified: node "c6c45bc1", group "web"
2021-07-09T13:15:22-04:00: Allocation "6f4c7bb2" created: node "c6c45bc1", group "web"
2021-07-09T13:15:22-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-09T13:15:22-04:00: Evaluation "203219de" finished with status "complete"
==> 2021-07-09T13:15:22-04:00: Monitoring deployment "70946bcb"
⠋ Deployment "70946bcb" in progress...
2021-07-09T13:25:58-04:00
ID = 70946bcb
Job ID = example
Job Version = 2
Status = running
Description = Deployment is running but requires manual promotion
Deployed
Task Group Auto Revert Promoted Desired Canaries Placed Healthy Unhealthy Progress Deadline
web true false 10 2 5 5 0 2021-07-09T13:16:32-04:00^C
mars:gh-10098 notnoop$ nomad job status example
ID = example
Name = example
Submit Date = 2021-07-09T13:14:13-04:00
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
web 0 0 12 0 7 0
Latest Deployment
ID = 70946bcb
Status = running
Description = Deployment is running but requires manual promotion
Deployed
Task Group Auto Revert Promoted Desired Canaries Placed Healthy Unhealthy Progress Deadline
web true false 10 2 5 5 0 2021-07-09T13:16:32-04:00
Allocations
ID Node ID Task Group Version Desired Status Created Modified
ac07d510 c6c45bc1 web 2 run running 10m55s ago 10m44s ago
6f4c7bb2 c6c45bc1 web 2 run running 10m55s ago 10m44s ago
6b9087e1 c6c45bc1 web 1 run running 11m5s ago 10m48s ago
bd057f14 c6c45bc1 web 1 run running 11m5s ago 10m48s ago
ad4596af c6c45bc1 web 1 run running 11m5s ago 10m48s ago
5794a9ce c6c45bc1 web 1 run running 11m5s ago 10m48s ago
1c44c3f3 c6c45bc1 web 1 run running 11m5s ago 10m48s ago
e1b7fc6b c6c45bc1 web 1 run running 11m42s ago 11m31s ago
1d0ed499 c6c45bc1 web 1 run running 11m42s ago 11m31s ago
01031df3 c6c45bc1 web 0 stop complete 12m3s ago 10m59s ago
670ea311 c6c45bc1 web 2 run running 12m3s ago 10m45s ago
9382e28d c6c45bc1 web 2 run running 12m3s ago 10m45s ago
93dc930a c6c45bc1 web 0 stop complete 12m3s ago 10m59s ago
94912128 c6c45bc1 web 0 stop complete 12m3s ago 10m59s ago
640ee17b c6c45bc1 web 0 stop complete 12m3s ago 10m59s ago
55e628cb c6c45bc1 web 0 stop complete 12m3s ago 10m59s ago
201ac1fb c6c45bc1 web 0 stop complete 12m3s ago 10m59s ago
1d1c0d4b c6c45bc1 web 0 stop complete 12m3s ago 10m59s ago
e6e6a226 c6c45bc1 web 2 run running 12m3s ago 10m45s ago