Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System Jobs never marked as stable and redeploys occur #9804

Closed
idrennanvmware opened this issue Jan 13, 2021 · 15 comments
Closed

System Jobs never marked as stable and redeploys occur #9804

idrennanvmware opened this issue Jan 13, 2021 · 15 comments
Assignees
Labels
hcc/cst Admin - internal theme/bad-ux theme/deployments theme/docs Documentation issues and enhancements type/bug

Comments

@idrennanvmware
Copy link
Contributor

idrennanvmware commented Jan 13, 2021

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

PhotonOS3

Issue

We have noticed that system jobs are not being marked as stable (Fabio in our case) even when they are running as expected.
In addition, we suspect related to this, subsequent attempts to deploy the same job with no changes results in multiple deployment history versions (with no changes)

Reproduction steps

First we Create a Nomad Cluster and deploy a system job (example command we are using). Note we randomly generate a file name, there's nothing sensitive here.

nomad job run -hcl1 -vault-token=[redacted] -output /services/fabio/tmp/fabio.nomad.hcl > 6d056518-ea90-4c10-b73a-3c1200405356.json

Next, lets get the status of the system job. In this case we have 2 potential clients, 1 windows and 1 Linux but we don't really care if there are no nodes (in this case for windows)

./nomad job status fabio

and that results in an output like we'd expect

ID            = fabio
Name          = fabio
Submit Date   = 2021-01-13T17:32:03Z
Type          = system
Priority      = 50
Datacenters   = dev-setup
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost
fabio-linux    0       0         1        0       0         0
fabio-windows  0       0         0        0       0         0

Allocations
ID        Node ID   Task Group   Version  Desired  Status   Created  Modified
28024d3d  ba872e8e  fabio-linux  0        run      running  29s ago  17s ago

So far, so good. Now lets get the deployment status:

/nomad job deployments fabio

and we get back

No deployments found

Ok - that's not what we expect, but lets check out the UI. Interesting we see "stable" marked as false here even though everything looks healthy and happy

Screen Shot 2021-01-13 at 12 33 33 PM

Just in case, we add the windows node and we see Fabio get scheduled there

./nomad job status fabio

and get back what we'd expect

ID            = fabio
Name          = fabio
Submit Date   = 2021-01-13T17:53:49Z
Type          = system
Priority      = 50
Datacenters   = dev-setup
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost
fabio-linux    0       0         1        0       0         0
fabio-windows  0       0         1        0       0         0

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created     Modified
bd59c72c  6bfd758f  fabio-windows  3        run      running  12m47s ago  6m47s ago
28024d3d  ba872e8e  fabio-linux    3        run      running  28m33s ago  6m47s ago

BUT when we run

 ./nomad job history fabio

we see this....just like the UI shows. Not stable but everything is healthy. Consul HC's are happy and Nomad is happily running the job

Version     = 0
Stable      = false
Submit Date = 2021-01-13T17:32:03Z

Nomad
Screen Shot 2021-01-13 at 12 51 38 PM

Consul
Screen Shot 2021-01-13 at 12 52 04 PM

Now, if we run deploy again (a few times with ZERO changes to the job file) we start to see deployments in the UI that look like this AND the allocations all get started over and over again with each deploy

Screen Shot 2021-01-13 at 12 55 09 PM

Lets run

./nomad job history fabio

and now we see (kinda as we expect since I'm guessing the scheduler is just trying to get this to "stable true"

Version     = 3
Stable      = false
Submit Date = 2021-01-13T17:53:49Z

Version     = 2
Stable      = false
Submit Date = 2021-01-13T17:53:28Z

Version     = 1
Stable      = false
Submit Date = 2021-01-13T17:53:16Z

Version     = 0
Stable      = false
Submit Date = 2021-01-13T17:32:03Z

and the UI shows this
Screen Shot 2021-01-13 at 12 55 34 PM

Job file (if appropriate)

Here's the fabio job file (windows removed for brevity but it's the same for both groups)

job "fabio" {
  region      = "global"
  datacenters = ["[[ .cluster_name ]]"]
  type        = "system"

  update {
    stagger      = "11s"
    max_parallel = 1
  }

  vault {
    policies = ["service-fabio"]
  }

  group "fabio-linux" {

    constraint {
      attribute = "${attr.kernel.name}"
      value     = "linux"
    }

    task "fabio-linux" {

      driver = "docker"

      template {
        data = <<EOH
          {{ with secret "consul/creds/service-fabio" }}
          registry.consul.token = {{ .Data.token }}
          {{ end }}
          registry.consul.kvpath = <redacted>/config
          registry.consul.noroutehtmlpath = <redacted>/noroute.html
          proxy.addr = :{{ env "NOMAD_PORT_lb" }}
          ui.addr = :{{ env "NOMAD_PORT_ui" }}
          registry.consul.register.addr = {{ env "NOMAD_ADDR_ui" }}
          metrics.target = statsd
          metrics.statsd.addr = 127.0.0.1:8125
          metrics.prefix = fabio
          metrics.names = services.\{\{clean .Service\}\}
        EOH

        destination = "secrets/fabio.properties"
      }

      config {
        image        = "[[ .default_docker_registry ]]/<redacted>"
        network_mode = "host"
        args = ["-cfg", "secrets/fabio.properties"]
      }

      resources {
        cpu    = "[[ .fabio.resources.cpu ]]"    # Mhz
        memory = "[[ .fabio.resources.memory ]]" # MB

        network {

          port "lb" {
            static = 29999
          }

          port "ui" {
            static = 29998
          }
        }
      }
    }
  }
}

@tgross
Copy link
Member

tgross commented Jan 14, 2021

Thanks for reporting this. We'll see if we can make a minimal reproduction from these notes.

@tgross tgross self-assigned this Feb 3, 2021
@tgross
Copy link
Member

tgross commented Feb 3, 2021

I was able to only partially reproduce this issue. I used the following minimal reproduction jobspec:

jobspec
job "example" {
  datacenters = ["dc1"]
  type        = "system"

  update {
    stagger      = "11s"
    max_parallel = 1
  }

  group "group" {

    task "task" {

      driver = "docker"

      template {
        data        = "<html>hello, world</html>"
        destination = "index.html"
      }

      config {
        image        = "busybox:1"
        network_mode = "host"
        command      = "httpd"
        args         = ["-v", "-f", "-p", "8001", "-h", "/local"]
      }

      resources {
        cpu    = 256
        memory = 128

        network {

          port "lb" {
            static = 8001
          }

          port "ui" {
            static = 29998
          }
        }
      }
    }
  }
}

I ran the job and got the warnings I'd expect to see for the network block:

$ nomad job run ./example.nomad
Job Warnings:
1 warning(s):

* Group "group" has warnings: 1 error occurred:
        * 1 error occurred:
        * Task "task": task network resources have been deprecated as of Nomad 0.12.0. Please configure networking via group network block.


==> Monitoring evaluation "2e1acd06"
    Evaluation triggered by job "example"
    Allocation "cf56c3a4" created: node "aa0bae38", group "group"
    Allocation "f9548a68" created: node "f46a408a", group "group"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "2e1acd06" finished with status "complete"

The allocations come up and are running, as we'd expect:

detailed allocation status
$ nomad alloc status cf56
ID                  = cf56c3a4-bba3-0083-21b2-8d8dfe8e9999
Eval ID             = 2e1acd06
Name                = example.group[0]
Node ID             = aa0bae38
Node Name           = ip-172-31-3-49
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 8s ago
Modified            = 5s ago

Task "task" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/256 MHz  124 KiB/128 MiB  300 MiB  lb: 172.31.3.49:8001
                                     ui: 172.31.3.49:29998

Task Events:
Started At     = 2021-02-03T14:54:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-02-03T09:54:09-05:00  Started     Task started by client
2021-02-03T09:54:07-05:00  Driver      Downloading image
2021-02-03T09:54:07-05:00  Task Setup  Building Task Directory
2021-02-03T09:54:07-05:00  Received    Task received by client


$ nomad alloc status f954
ID                  = f9548a68-35b8-8a39-df10-b272af5f3f34
Eval ID             = 2e1acd06
Name                = example.group[0]
Node ID             = f46a408a
Node Name           = ip-172-31-5-141
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 24s ago
Modified            = 21s ago

Task "task" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/256 MHz  100 KiB/128 MiB  300 MiB  lb: 172.31.5.141:8001
                                     ui: 172.31.5.141:29998

Task Events:
Started At     = 2021-02-03T14:54:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-02-03T09:54:09-05:00  Started     Task started by client
2021-02-03T09:54:07-05:00  Driver      Downloading image
2021-02-03T09:54:07-05:00  Task Setup  Building Task Directory
2021-02-03T09:54:07-05:00  Received    Task received by client

So far so good. If we look at the deployment list and job history, we indeed see that we're not marked as stable and we don't see the deployment in the list. That being said, system jobs don't actually support deployments, so this may be a combination of some bad UX and a documentation gap.

$ nomad deployment list
No deployments found

$ nomad job history example
Version     = 0
Stable      = false
Submit Date = 2021-02-03T09:54:07-05:00

The more troubling symptom you reported was that the job was re-running. I wasn't able to reproduce this.

$ nomad job run ./example.nomad
Job Warnings:
1 warning(s):

* Group "group" has warnings: 1 error occurred:
        * 1 error occurred:
        * Task "task": task network resources have been deprecated as of Nomad 0.12.0. Please configure networking via group network block.


==> Monitoring evaluation "c9b9cf86"
    Evaluation triggered by job "example"
==> Monitoring evaluation "c9b9cf86"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "c9b9cf86" finished with status "complete"

$ nomad job history example
Version     = 0
Stable      = false
Submit Date = 2021-02-03T09:54:07-05:00

I tried that a few times and got the same result every time. I've tested that on both 1.0.1 and 1.0.3

In the meantime, I'll get fixing the stability marked onto the backlog.

@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/bad-ux theme/docs Documentation issues and enhancements and removed stage/needs-investigation labels Feb 3, 2021
@tgross tgross removed their assignment Feb 3, 2021
@idrennanvmware
Copy link
Contributor Author

idrennanvmware commented Feb 3, 2021

@tgross - thanks for looking in to this. Do you think the different behavior may be explained due to the fact that we are always using the HCL1 parser flag? We are also on 1.0.2 (if that helps)

@tgross
Copy link
Member

tgross commented Feb 3, 2021

That's a good idea to check! But I just checked that (both making sure I was testing against the 1.0.2 release and/or with -hcl1) and it didn't change the behavior. I'd be curious to see if the output of nomad job inspect :job_id is changing between runs. It might also be worth verifying that the template data isn't changing between job runs.

@idrennanvmware
Copy link
Contributor Author

@tgross - I am totally at a loss now lol.

We can confirm the template is not changing between runs (it does render every time in our pipeline but the resulting file is the same every time).

Our allocs are littered with things like below

Screen Shot 2021-02-05 at 6 36 14 AM

A diff between 2 versions /usr/local/bin/nomad job inspect -version=484 fabio and /usr/local/bin/nomad job inspect -version=485 fabio

Show the following 4 fields as different

JobModifyIndex
ModifyIndex
SubmitTime
Version

@tgross
Copy link
Member

tgross commented Feb 5, 2021

Ok, I think my next step is to try to verify that the Vault token isn't an issue here. This is beginning to remind me of #6471 which I recently spent some time on unsuccessfully reproducing; the issue there was similar that the scheduler would make a placement even though there was no diff. I did just discover a bug in the diffs for volumes (#9973) but maybe there's a similar bug elsewhere that we're just not hitting in most cases.

I see things like [[ .cluster_name ]] in your jobspec... are you using a tool for templating that, or is that just redaction for this issue?

@idrennanvmware
Copy link
Contributor Author

idrennanvmware commented Feb 5, 2021

Hey @tgross - that's just the levant stuff we use for rendering.

I was wondering if the vault template was maybe a suspect....that's an interesting observation

@tgross
Copy link
Member

tgross commented Feb 5, 2021

I did a first-pass test using our E2E environment to stand up a cluster with Vault, and checked both static (kv-2) and dynamic (pki) secrets with a job like the one here in the vaultsecrets.go E2E test, and still wasn't able to reproduce. I have trouble thinking that it would be specific to the Consul backend, but I'll check that next (need to get a Vault->Consul ACL setup). Probably will get to that early next week.

@tgross tgross self-assigned this Mar 4, 2021
@luckymike
Copy link

We're seeing the same behavior on 0.12.

image

The job in our example makes no use of Vault.

@tgross tgross removed their assignment May 20, 2021
@luckymike
Copy link

@tgross Based on the changelog, I wonder if this bug was introduced in 12.4 via #8687 which is called out in the changelog as plan: Fixed a bug where plans always included a change for the NomadTokenID though the Pull Request links to an issue (#8200) with a completely different description.

@tgross
Copy link
Member

tgross commented Jul 11, 2021

@luckymike it looks like the changelog got accidentally updated incorrectly in #8808. The original pointed to #8559 which was fixed by #8773. That fix only adds the NomadTokenID to the list of fields we filter when detecting a change... I don't really see a mechanism where that could cause extra diffs.

(Also, just a heads up, I'm no longer a maintainer on Nomad or at HashiCorp, although I may continue to contribute now and then in my spare time... you'll want to ping other folks to nudge issues along from here on out 😀 )

@shoenig shoenig self-assigned this Jul 12, 2021
@idrennanvmware
Copy link
Contributor Author

idrennanvmware commented Jul 16, 2021

So - little more data. We're seeing this behavior at times on more than just system jobs. Here's a snapshot of the zookeeper job I just happened to run. Is it possible this is related to token changes or something behind the scene's we are missing?

Seems like it's something at the group level?

Edit: This did NOT happen the next time I tried to deploy the job with no changes

Screen Shot 2021-07-16 at 12 08 17 PM

@davemay99 davemay99 added the hcc/cst Admin - internal label Jul 16, 2021
@BijanJohn
Copy link
Contributor

r a maintainer on Nomad or at HashiCorp, although I may continue to contribute now and then in my spare time... you'll want to ping

sorry to hear you are not a maintainer @tgross . We appreciate all that you have done for the project!

@Juanadelacuesta
Copy link
Member

Hi @BijanJohn!
I've been trying to reproduce this bug unsuccessfully. Your report contained two problems:

  • Jobs marked as unstable: The stable field refers to deployments, which is a concept only applicable to service job type. system jobs don't have deployments, hence the false value on the stable status. We will remove this field from the UI to avoid confusions in the future.

  • Unstable jobs are redeployed: I wasn't able to reproduce this part at all, we tried following you instructions and with some variations that could potentially trigger a bug, but it was fruitless.

Can you please verify if you are still running into the same problem and if you do, provide us with more context?

@Juanadelacuesta Juanadelacuesta added stage/waiting-for-reply and removed stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Jan 3, 2024
@Juanadelacuesta
Copy link
Member

@BijanJohn Im gonna go ahead and close this ticket, if you are still running into difficulties don't hesitate to reach out to the team again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hcc/cst Admin - internal theme/bad-ux theme/deployments theme/docs Documentation issues and enhancements type/bug
Projects
None yet
Development

No branches or pull requests

7 participants