System Jobs never marked as stable and redeploys occur #9804
Comments
Thanks for reporting this. We'll see if we can make a minimal reproduction from these notes.
I was only able to partially reproduce this issue. I used the following minimal reproduction jobspec:
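For context, a minimal system jobspec of the shape being discussed might look like the sketch below; the names, image, and update stanza are assumptions, not the actual reproduction spec.

```hcl
# Illustrative minimal system job, not the exact reproduction spec.
job "repro-system" {
  datacenters = ["dc1"]
  type        = "system" # one allocation per eligible client node

  # Assumption: an update stanza like this is what draws the
  # scheduler warnings mentioned below.
  update {
    max_parallel = 1
  }

  group "app" {
    task "echo" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-text", "ok"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
```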
I ran the job and got the warnings I'd expect to see for the
The allocations come up and are running, as we'd expect (detailed allocation status):
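That level of detail comes from the standard CLI check; the allocation ID here is a placeholder:

```sh
# verbose status for a single allocation (ID is hypothetical)
nomad alloc status -verbose 4aa98a82
```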
So far so good. If we look at the deployment list and job history, we indeed see that we're not marked as stable and we don't see the deployment in the list. That being said, system jobs don't actually support deployments, so this may be a combination of some bad UX and a documentation gap.
The more troubling symptom you reported was that the job was re-running. I wasn't able to reproduce this.
I tried that a few times and got the same result every time. I've tested that on both 1.0.1 and 1.0.3. In the meantime, I'll get a fix for the stability marking onto the backlog.
@tgross - thanks for looking into this. Do you think the different behavior might be explained by the fact that we are always using the HCL1 parser flag? We are also on 1.0.2 (if that helps).
That's a good idea to check! But I just checked that (both making sure I was testing against the 1.0.2 release and/or with
@tgross - I am totally at a loss now, lol. We can confirm the template is not changing between runs (it does render every time in our pipeline, but the resulting file is the same every time). Our allocs are littered with things like the below. A diff between 2 versions shows the following 4 fields as different: JobModifyIndex
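A version diff like the one described can be pulled from the CLI; "fabio" assumes the job name from this report:

```sh
# list job versions, showing the diff between each version
# and its predecessor
nomad job history -p fabio
```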
Ok, I think my next step is to try to verify that the Vault token isn't an issue here. This is beginning to remind me of #6471, which I recently spent some time on unsuccessfully reproducing; the issue there was similar in that the scheduler would make a placement even though there was no diff. I did just discover a bug in the diffs for volumes (#9973), but maybe there's a similar bug elsewhere that we're just not hitting in most cases. I see things like
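For concreteness, the kind of Vault-backed template under suspicion looks roughly like this sketch; the policy name, secret path, and image are hypothetical:

```hcl
# Illustrative only, not the reporter's spec; these stanzas sit
# inside a task block in the full jobspec.
task "fabio" {
  driver = "docker"

  config {
    image = "fabiolb/fabio"
  }

  vault {
    policies = ["fabio"] # hypothetical policy
  }

  template {
    # hypothetical KV v2 secret path
    data        = "{{ with secret \"secret/data/fabio\" }}{{ .Data.data.api_key }}{{ end }}"
    destination = "local/fabio.env"
    env         = true
  }
}
```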
Hey @tgross - that's just the Levant stuff we use for rendering. I was wondering if the Vault template was maybe a suspect... that's an interesting observation.
I did a first-pass test using our E2E environment to stand up a cluster with Vault, and checked both static (
@luckymike it looks like the changelog got accidentally updated incorrectly in #8808. The original pointed to #8559, which was fixed by #8773. That fix only adds the
(Also, just a heads up: I'm no longer a maintainer on Nomad or at HashiCorp, although I may continue to contribute now and then in my spare time... you'll want to ping other folks to nudge issues along from here on out 😀)
So - a little more data. We're seeing this behavior at times on more than just system jobs. Here's a snapshot of the zookeeper job I just happened to run. Is it possible this is related to token changes or something behind the scenes we are missing? It seems like it's something at the group level?
Edit: This did NOT happen the next time I tried to deploy the job with no changes.
Sorry to hear you are not a maintainer, @tgross. We appreciate all that you have done for the project!
Hi @BijanJohn!
Can you please verify whether you are still running into the same problem and, if so, provide us with more context?
@BijanJohn I'm gonna go ahead and close this ticket. If you are still running into difficulties, don't hesitate to reach out to the team again.
Nomad version
Nomad v1.0.1 (c9c68aa)
Operating system and Environment details
PhotonOS 3
Issue
We have noticed that system jobs are not being marked as stable (Fabio in our case) even when they are running as expected.
In addition (we suspect this is related), subsequent attempts to deploy the same job with no changes result in multiple versions in the deployment history (again with no changes).
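One way to confirm a resubmission really carries no changes is to plan it first; the file name here is illustrative:

```sh
# a no-change plan should report nothing to update before re-running
nomad job plan fabio.hcl
```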
Reproduction steps
First we create a Nomad cluster and deploy a system job (an example command is sketched below). Note we randomly generate a file name; there's nothing sensitive here.
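A minimal sketch of a deploy step of this shape (file names are illustrative):

```sh
# copy the jobspec to a randomly named file, then submit it
JOB_FILE="fabio-$(date +%s).hcl"
cp fabio.hcl "$JOB_FILE"
nomad job run "$JOB_FILE"
```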
Next, let's get the status of the system job. In this case we have 2 potential clients, 1 Windows and 1 Linux, but we don't really care if there are no nodes (in this case, for Windows).
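The status check, assuming the job name used throughout this report:

```sh
# summary and allocations for the system job
nomad job status fabio
```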
and that results in an output like we'd expect
So far, so good. Now let's get the deployment status:
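The deployment check (job name assumed as above):

```sh
# list deployments for the job; per the discussion above, system
# jobs don't create deployments, so this comes back empty
nomad job deployments fabio
```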
and we get back
Ok - that's not what we expect, but let's check out the UI. Interestingly, we see "stable" marked as false here even though everything looks healthy and happy.
Just in case, we add the Windows node and we see Fabio get scheduled there
and get back what we'd expect
BUT when we run
we see this... just like the UI shows. Not stable, but everything is healthy. Consul health checks are happy and Nomad is happily running the job.
Nomad
Consul
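For reference, per-version stability is also visible from the CLI via the history view (the exact command used isn't shown; job name assumed):

```sh
# each version in the history reports a Stable = true/false field
nomad job history fabio
```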
Now, if we run the deploy again (a few times, with ZERO changes to the job file), we start to see deployments in the UI that look like this, AND the allocations all get restarted over and over again with each deploy.
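That re-run step is just resubmitting the identical file, e.g.:

```sh
# resubmit the unchanged jobspec several times
for i in 1 2 3; do
  nomad job run "$JOB_FILE"
done
```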
Let's run
and now we see (kind of as we expect, since I'm guessing the scheduler is just trying to get this to "stable true")
and the UI shows this
Job file (if appropriate)
Here's the Fabio job file (the Windows group removed for brevity, but it's the same for both groups).