'deployment fail' is not immediate and goes through canary phase #10882

Open
notnoop opened this issue Jul 9, 2021 · 0 comments

notnoop (Contributor) commented Jul 9, 2021

Issue

When a deployment fails, the resulting reversion is treated like a new deployment that goes through the canary phase. To make things worse, it triggers the bug in #10098, so the bad allocations are left intact!

I expect a reversion to be immediate. Also, considering that the target version was marked "Stable", I'd expect the rollback to bypass the canary phase so the service comes back up quickly. It may even be reasonable to skip the auto_revert logic for the rollback itself: if an allocation becomes unhealthy spuriously, we probably shouldn't revert yet again to an even earlier version.

That being said, reverts can be complicated in real life. If the earlier version isn't forward compatible with the failed deployment (e.g. the new deployment migrated the database or committed data that is incompatible with the old version), the previous stable version will fail to start and will require manual intervention. I'm not sure what we can do there, though.
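Until this is fixed, one operational workaround (a rough sketch, not an official recommendation; the deployment ID below is a placeholder) is to locate the revert deployment that `nomad deployment fail` spawns and promote it manually so the reverted allocations actually replace the bad ones:

# List deployments for the job and note the new "revert" deployment,
# which sits in status "running" waiting for canary promotion
nomad job deployments example

# Inspect it to confirm it is blocked on manual promotion
nomad deployment status <revert-deployment-id>

# Promote it; the remaining reverted allocations are then placed and
# the version 1 allocations get replaced
nomad deployment promote <revert-deployment-id>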

Reproduction steps

  1. Submit the job: nomad job run example.hcl
Example job file with canaries and max_parallel:
"testn.hcl" 50L, 825B written
job "example" {
  datacenters = ["dc1"]

  meta {
    key = "value1"
  }

  update {
    max_parallel      = 5
    # min_healthy_time  = "10s"    <- default
    healthy_deadline  = "30s"
    progress_deadline = "1m"
    auto_revert       = true
    # auto_promote      = false <- default
    canary            = 2
  }

  group "web" {

    count = 10

    network {
      port "www" {
        to = 8001
      }
    }

    task "httpd" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "httpd -v -f -p 8001 -h /local"]
        ports   = ["www"]
      }

      template {
        data        = "<html>hello, world</html>"
        destination = "local/index.html"
      }

      resources {
        cpu    = 128
        memory = 64
      }
    }
  }
}
  2. Wait until all allocations are healthy
  3. Update value0 to value1 and submit the updated job
  4. Promote the new deployment after the canaries pass: nomad deployment promote ...
  5. Fail the deployment shortly after promotion: nomad deployment fail ...

Here we mimic discovering a subtle bug that wasn't detected by health checks or canary smoke testing. A scripted version of these steps is sketched below.
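A rough script of the reproduction, assuming example.hcl is the job file above with the meta key initially set to value0, and with the deployment IDs filled in from the CLI output:

# Steps 1-2: submit the job and wait until all allocations are healthy
nomad job run example.hcl
nomad job status example        # repeat until the "web" group is fully running and healthy

# Step 3: edit example.hcl, changing value0 to value1, then resubmit
nomad job run example.hcl

# Step 4: once the canaries pass, promote the new deployment
nomad job deployments example                 # note the running deployment ID
nomad deployment promote <deployment-id>

# Step 5: fail the deployment shortly after promotion
nomad deployment fail <deployment-id>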

Expected Result

We immediately revert to version 0: the version 1 allocations are stopped and replaced by version 0 allocations.

Actual Result

A new deployment for Job Version 2 (a copy of Version 0) is created. The deployment places 2 canary allocations, and all version 1 allocations are left intact until the operator promotes the version 2 deployment.

mars:gh-10098 notnoop$ nomad deployment fail e7d21958
Deployment "e7d21958-7c86-4ad0-52f1-6ef03c79bfcd" failed. Auto-reverted to job version 0.

==> 2021-07-09T13:15:21-04:00: Monitoring evaluation "203219de"
    2021-07-09T13:15:21-04:00: Evaluation triggered by job "example"
    2021-07-09T13:15:21-04:00: Evaluation within deployment: "e7d21958"
==> 2021-07-09T13:15:22-04:00: Monitoring evaluation "203219de"
    2021-07-09T13:15:22-04:00: Allocation "ac07d510" created: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Allocation "670ea311" modified: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Allocation "9382e28d" modified: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Allocation "e6e6a226" modified: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Allocation "6f4c7bb2" created: node "c6c45bc1", group "web"
    2021-07-09T13:15:22-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-09T13:15:22-04:00: Evaluation "203219de" finished with status "complete"
==> 2021-07-09T13:15:22-04:00: Monitoring deployment "70946bcb"
  ⠋ Deployment "70946bcb" in progress...

    2021-07-09T13:25:58-04:00
    ID          = 70946bcb
    Job ID      = example
    Job Version = 2
    Status      = running
    Description = Deployment is running but requires manual promotion

    Deployed
    Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
    web         true         false     10       2         5       5        0          2021-07-09T13:16:32-04:00^C
mars:gh-10098 notnoop$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2021-07-09T13:14:13-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         0       0         12       0       7         0

Latest Deployment
ID          = 70946bcb
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
web         true         false     10       2         5       5        0          2021-07-09T13:16:32-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
ac07d510  c6c45bc1  web         2        run      running   10m55s ago  10m44s ago
6f4c7bb2  c6c45bc1  web         2        run      running   10m55s ago  10m44s ago
6b9087e1  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
bd057f14  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
ad4596af  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
5794a9ce  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
1c44c3f3  c6c45bc1  web         1        run      running   11m5s ago   10m48s ago
e1b7fc6b  c6c45bc1  web         1        run      running   11m42s ago  11m31s ago
1d0ed499  c6c45bc1  web         1        run      running   11m42s ago  11m31s ago
01031df3  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
670ea311  c6c45bc1  web         2        run      running   12m3s ago   10m45s ago
9382e28d  c6c45bc1  web         2        run      running   12m3s ago   10m45s ago
93dc930a  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
94912128  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
640ee17b  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
55e628cb  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
201ac1fb  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
1d1c0d4b  c6c45bc1  web         0        stop     complete  12m3s ago   10m59s ago
e6e6a226  c6c45bc1  web         2        run      running   12m3s ago   10m45s ago
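
The same stuck state can also be confirmed through the HTTP API (a sketch, assuming a local agent on the default address and jq installed):

# Deployment 70946bcb is still "running" and waiting for promotion
curl -s http://localhost:4646/v1/deployment/70946bcb | jq '{Status, StatusDescription}'

# Running allocations grouped by job version: the version 1 allocations
# are still present alongside the version 2 canaries
curl -s http://localhost:4646/v1/job/example/allocations \
  | jq '[.[] | select(.ClientStatus == "running") | .JobVersion] | group_by(.) | map({version: .[0], count: length})'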
tgross changed the title from "deployment reverts are not immediate and go through canary phase" to "'deployment fail' is not immediate and goes through canary phase" on Aug 22, 2022
tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage on Jun 24, 2024