'deployment fail' is not immediate and goes through canary phase #10882

Open
notnoop opened this issue Jul 9, 2021 · 0 comments

notnoop (Contributor) commented Jul 9, 2021

Issue

When a deployment fails, the resulting reversion is treated like a new deployment that goes through the canary phase. To make things worse, it triggers the bug in #10098, so the bad allocations are left intact!

I expect a reversion to be immediate. Also, considering that the target version was marked "Stable", I'd expect the rollback to bypass the canary phase so the service comes back up quickly. It may even be reasonable to skip the auto_revert logic for the rollback itself: if an allocation becomes unhealthy spuriously, we probably shouldn't revert yet again to an even earlier version.

That being said, reverts can be complicated in real life. If the earlier version isn't forward compatible with the failed deployment (e.g. the new deployment migrated the database or committed data that is incompatible with the old version), the previous stable version will fail to start and will require manual intervention. I'm not sure what we can do there, though.
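Until this is fixed, one operational workaround (a rough sketch, not an official recommendation; the deployment ID below is a placeholder) is to locate the revert deployment that `nomad deployment fail` spawns and promote it manually so the reverted allocations actually replace the bad ones:

# List deployments for the job and note the new "revert" deployment,
# which sits in status "running" waiting for canary promotion
nomad job deployments example

# Inspect it to confirm it is blocked on manual promotion
nomad deployment status <revert-deployment-id>

# Promote it; the remaining reverted allocations are then placed and
# the version 1 allocations get replaced
nomad deployment promote <revert-deployment-id>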

Reproduction steps

  1. Submit the job: nomad job run example.hcl
Example job file with canaries and max_parallel:
"testn.hcl" 50L, 825B written
job "example" {
  datacenters = ["dc1"]

  meta {
    key = "value1"
  }

  update {
    max_parallel      = 5
    # min_healthy_time  = "10s"    <- default
    healthy_deadline  = "30s"
    progress_deadline = "1m"
    auto_revert       = true
    # auto_promote      = false <- default
    canary            = 2
  }

  group "web" {

    count = 10

    network {
      port "www" {
        to = 8001
      }
    }

    task "httpd" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "httpd -v -f -p 8001 -h /local"]
        ports   = ["www"]
      }

      template {
        data        = "<html>hello, world</html>"
        destination = "local/index.html"
      }

      resources {
        cpu    = 128
        memory = 64
      }
    }
  }
}
  2. Wait until all allocations are healthy
  3. Update value0 to value1 and submit the updated job
  4. Promote the new deployment after the canaries pass: nomad deployment promote ...
  5. Fail the deployment shortly after promotion: nomad deployment fail ...

Here we mimic discovering a subtle bug that wasn't detected by health checks or canary smoke testing. A scripted version of these steps is sketched below.
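A rough script of the reproduction, assuming example.hcl is the job file above with the meta key initially set to value0, and with the deployment IDs filled in from the CLI output:

# Steps 1-2: submit the job and wait until all allocations are healthy
nomad job run example.hcl
nomad job status example        # repeat until the "web" group is fully running and healthy

# Step 3: edit example.hcl, changing value0 to value1, then resubmit
nomad job run example.hcl

# Step 4: once the canaries pass, promote the new deployment
nomad job deployments example                 # note the running deployment ID
nomad deployment promote <deployment-id>

# Step 5: fail the deployment shortly after promotion
nomad deployment fail <deployment-id>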

Expected Result

We immediately revert to version 0: the version 1 allocations are stopped and replaced by version 0 allocations.

Actual Result

A new deployment for Job Version 2 (a copy of Version 0) is created. The deployment places 2 canary allocations, and all version 1 allocations are left intact until the operator promotes the version 2 deployment.

mars:gh-10098 notnoop$ nomad deployment fail e7d21958
Deployment "e7d21958-7c86-4ad0-52f1-6ef03c79bfcd" failed. Auto-reverted to job version 0.

==> 2021-07-09T13:15:21-04:00: Monitoring evaluation "203219de"
    2021-07-09T13:15:21-04:00: Evaluation triggered by job "example"
    2021-07-09T13:15:21-04:00: Evaluation within deployment: "e7d21958"
==> 2021-07-09T13:15:22-04:00: Monitoring evaluation "203219de"
    2021-07-09T13:15:22-04:00: Allocation "ac07d510" created: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Allocation "670ea311" modified: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Allocation "9382e28d" modified: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Allocation "e6e6a226" modified: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Allocation "6f4c7bb2" created: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-09T13:15:22-04:00: Evaluation "203219de" finished with status "complete"
==> 2021-07-09T13:15:22-04:00: Monitoring deployment "70946bcb"
  ⠋ Deployment "70946bcb" in progress...

    2021-07-09T13:25:58-04:00
    ID          = 70946bcb
    Job ID      = example
    Job Version = 2
    Status      = running
    Description = Deployment is running but requires manual promotion

    Deployed
    Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
    web         true         false     10       2         5       5        0          2021-07-09T13:16:32-04:00^C
mars:gh-10098 notnoop$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2021-07-09T13:14:13-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         0       0         12       0       7         0

Latest Deployment
ID          = 70946bcb
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
web         true         false     10       2         5       5        0          2021-07-09T13:16:32-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
ac07d510  c6c45bc1  web         2        run      running   10m55s ago  10m44s ago
6f4c7bb2  c6c45bc1  web         2        run      running   10m55s ago  10m44s ago
6b9087e1  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
bd057f14  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
ad4596af  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
5794a9ce  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
1c44c3f3  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
e1b7fc6b  c6c45bc1  web         1        run      running   11m42s ago  11m31s ago
1d0ed499  c6c45bc1  web         1        run      running   11m42s ago  11m31s ago
01031df3  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
670ea311  c6c45bc1  web         2        run      running   12m3s ago   10m45s ago
9382e28d  c6c45bc1  web         2        run      running   12m3s ago   10m45s ago
93dc930a  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
94912128  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
640ee17b  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
55e628cb  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
201ac1fb  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
1d1c0d4b  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
e6e6a226  c6c45bc1  web         2        run      running   12m3s ago   10m45s ago
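
The same stuck state can also be confirmed through the HTTP API (a sketch, assuming a local agent on the default address and jq installed):

# Deployment 70946bcb is still "running" and waiting for promotion
curl -s http://localhost:4646/v1/deployment/70946bcb | jq '{Status, StatusDescription}'

# Running allocations grouped by job version: the version 1 allocations
# are still present alongside the version 2 canaries
curl -s http://localhost:4646/v1/job/example/allocations \
  | jq '[.[] | select(.ClientStatus == "running") | .JobVersion] | group_by(.) | map({version: .[0], count: length})'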
tgross changed the title from "deployment reverts are not immediate and go through canary phase" to "'deployment fail' is not immediate and goes through canary phase" on Aug 22, 2022
tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage on Jun 24, 2024