submitting jobs during deployment promotion results in 3 versions #10098

Closed
tgross opened this issue Feb 26, 2021 · 2 comments

tgross (Member) commented Feb 26, 2021

When a canary deployment is pending manual promotion, submitting another version of the job can result in 3 versions of the job running simultaneously until the most recent deployment is promoted. What the correct behavior should be here is a little complex to determine, but in any case we should make sure we at least understand it.

This has been reported on Nomad versions as early as 0.10.5 and 0.12.0, and I've just verified it on the current development head (1.0.4-dev). Possibly related to #6939 and #8439.

To reproduce on a current version of Nomad, consider the following job spec with canary deployments and manual promotion:

job "example" {
  datacenters = ["dc1"]

  meta {
    key = "value0"
  }

  update {
    max_parallel      = 3
    # health_check      = "checks" <- default
    # min_healthy_time  = "10s"    <- default
    healthy_deadline  = "30s"
    progress_deadline = "1m"
    auto_revert       = true
    # auto_promote      = false <- default
    canary            = 1
  }

  group "web" {

    count = 10

    network {
      port "www" {
        to = 8001
      }
    }

    task "httpd" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      template {
        data        = "<html>hello, world</html>"
        destination = "local/index.html"
      }

      resources {
        cpu    = 128
        memory = 64
      }
    }
  }
}
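
A minimal sketch of the initial run and wait, assuming the spec above is saved as ./example.nomad:

$ nomad job run ./example.nomad
$ nomad job status example    # repeat until the latest deployment reports "Deployment completed successfully"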

Once the deployment is marked successful, make the following modification to the jobspec:

   meta {
-    key = "value0"
+    key = "value1"
   }

Run the job with the modified spec and wait for the canary to become healthy.
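
For example, submitting the updated file:

$ nomad job run ./example.nomad

Then check the job status: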

$ nomad job status ex
...

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         0       0         11       0       0         0

Latest Deployment
ID          = 17fc028b
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
web         true         false     10       1         1       1        0          2021-02-26T19:45:49Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
083e2f83  2bca72d3  web         1        run      running  21s ago   10s ago
0cc54602  2bca72d3  web         0        run      running  1m7s ago  55s ago
3c6a746a  2bca72d3  web         0        run      running  1m7s ago  55s ago
532759a9  2bca72d3  web         0        run      running  1m7s ago  55s ago
5b1bf2c2  2bca72d3  web         0        run      running  1m7s ago  55s ago
641bf9c1  2bca72d3  web         0        run      running  1m7s ago  55s ago
64585431  2bca72d3  web         0        run      running  1m7s ago  55s ago
aaee1e47  2bca72d3  web         0        run      running  1m7s ago  55s ago
b69655d0  2bca72d3  web         0        run      running  1m7s ago  55s ago
cb41f7ab  2bca72d3  web         0        run      running  1m7s ago  55s ago
df518015  2bca72d3  web         0        run      running  1m7s ago  55s ago

Then modify the jobspec again (but don't run it yet!):

   meta {
-    key = "value1"
+    key = "value2"
   }

Promote the deployment, and immediately run the job with the new jobspec:

$ nomad deployment promote 17f; nomad job run ./example.nomad
==> Monitoring evaluation "2b71a0e3"
    Evaluation triggered by job "example"
    Evaluation within deployment: "17fc028b"
    Allocation "08059832" created: node "2bca72d3", group "web"
    Allocation "bf0ebf00" created: node "2bca72d3", group "web"
    Allocation "db194b08" created: node "2bca72d3", group "web"
==> Monitoring evaluation "2b71a0e3"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "2b71a0e3" finished with status "complete"
==> Monitoring evaluation "75e263c0"
    Evaluation triggered by job "example"
    Allocation "01d8aad2" created: node "2bca72d3", group "web"
==> Monitoring evaluation "75e263c0"
    Evaluation within deployment: "582a89dd"
    Allocation "01d8aad2" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "75e263c0" finished with status "complete"

The job status at this point is a mix of version 0, version 1, and version 2 allocations:

$ nomad job status ex
...
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         0       0         11       0       4         0

Latest Deployment
ID          = 582a89dd
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
web         true         false     10       1         1       1        0          2021-02-26T19:46:52Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
01d8aad2  2bca72d3  web         2        run      running   11s ago    0s ago
08059832  2bca72d3  web         1        run      running   12s ago    5s ago
db194b08  2bca72d3  web         1        run      running   12s ago    5s ago
bf0ebf00  2bca72d3  web         1        run      running   12s ago    5s ago
083e2f83  2bca72d3  web         1        run      running   1m14s ago  1m3s ago
641bf9c1  2bca72d3  web         0        stop     complete  2m ago     6s ago
5b1bf2c2  2bca72d3  web         0        run      running   2m ago     1m48s ago
532759a9  2bca72d3  web         0        run      running   2m ago     1m48s ago
64585431  2bca72d3  web         0        stop     complete  2m ago     6s ago
aaee1e47  2bca72d3  web         0        run      running   2m ago     1m48s ago
b69655d0  2bca72d3  web         0        stop     complete  2m ago     6s ago
3c6a746a  2bca72d3  web         0        run      running   2m ago     1m48s ago
cb41f7ab  2bca72d3  web         0        run      running   2m ago     1m48s ago
0cc54602  2bca72d3  web         0        run      running   2m ago     1m48s ago
df518015  2bca72d3  web         0        stop     complete  2m ago     6s ago
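
The three concurrent versions also show up in the job's version history (illustrative; output elided here):

$ nomad job history example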

Left alone, this will gradually stabilize to a mix of version 1 and version 2 allocations. Fortunately, if at any point the latest deployment is promoted, the job appears to converge to version 2 with a successful deployment, as expected.
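
For example, promoting the latest deployment by the ID shown above:

$ nomad deployment promote 582a89dd

After which the job status eventually looks like the following: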

ID            = example
Name          = example
Submit Date   = 2021-02-26T19:45:42Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         0       0         10       0       14        0

Latest Deployment
ID          = 582a89dd
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
web         true         true      10       1         10      10       0          2021-02-26T19:48:29Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
3c424f6a  2bca72d3  web         2        run      running   19s ago    3s ago
896a1b5c  2bca72d3  web         2        run      running   19s ago    3s ago
1b727979  2bca72d3  web         2        run      running   19s ago    3s ago
393b4e30  2bca72d3  web         2        run      running   37s ago    21s ago
6ecd4eff  2bca72d3  web         2        run      running   37s ago    21s ago
3371412b  2bca72d3  web         2        run      running   37s ago    21s ago
ff7a26c5  2bca72d3  web         2        run      running   55s ago    38s ago
ed32a383  2bca72d3  web         2        run      running   55s ago    38s ago
a10d49a6  2bca72d3  web         2        run      running   55s ago    38s ago
01d8aad2  2bca72d3  web         2        run      running   1m51s ago  1m40s ago
08059832  2bca72d3  web         1        stop     complete  1m52s ago  49s ago
db194b08  2bca72d3  web         1        stop     complete  1m52s ago  49s ago
bf0ebf00  2bca72d3  web         1        stop     complete  1m52s ago  49s ago
083e2f83  2bca72d3  web         1        stop     complete  2m54s ago  49s ago
5b1bf2c2  2bca72d3  web         0        stop     complete  3m40s ago  14s ago
0cc54602  2bca72d3  web         0        stop     complete  3m40s ago  31s ago
aaee1e47  2bca72d3  web         0        stop     complete  3m40s ago  14s ago
b69655d0  2bca72d3  web         0        stop     complete  3m40s ago  1m46s ago
3c6a746a  2bca72d3  web         0        stop     complete  3m40s ago  31s ago
cb41f7ab  2bca72d3  web         0        stop     complete  3m40s ago  31s ago
532759a9  2bca72d3  web         0        stop     complete  3m40s ago  13s ago
df518015  2bca72d3  web         0        stop     complete  3m40s ago  1m46s ago
641bf9c1  2bca72d3  web         0        stop     complete  3m40s ago  1m46s ago
64585431  2bca72d3  web         0        stop     complete  3m40s ago  1m46s ago

Juanadelacuesta (Member) commented:

This behaviour was modified by introducing a blocking query for deployments in PR #10661.
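
(For context, and not code from the PR itself: Nomad's HTTP API supports blocking queries against the deployment read endpoint via the standard index and wait query parameters, roughly as below; the index value here is only a placeholder for the last index observed for the deployment.)

$ curl "http://localhost:4646/v1/deployment/582a89dd?index=125&wait=60s"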

Juanadelacuesta removed the stage/accepted label ("Confirmed, and intend to work on. No timeline commitment though.") on Jan 8, 2024.