Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restarting nomad agent restarts successfully completed allocs #5883

Closed
notnoop opened this issue Jun 25, 2019 · 1 comment · Fixed by #5890
Closed

Restarting nomad agent restarts successfully completed allocs #5883

notnoop opened this issue Jun 25, 2019 · 1 comment · Fixed by #5890

Comments

@notnoop
Copy link
Contributor

notnoop commented Jun 25, 2019

Nomad version

Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)

Operating system and Environment details

macOS, but OS is irrelevant

Issue

When a client restarts, it may restart successfully completed batch allocs unexpectedly. When the alloc is restarted, it will keep running until completion without being terminated. This seems to affect allocations that haven't been garbage collection.

This is a regression introduced in 0.9.0. Nomad 0.8.7 did not attempt to re-run such tasks. Nomad 0.9.0 and 0.9.1/0.9.2 attempt to re-run tasks, though 0.9.0 had logmon reattachment related bugs that complicates the matter a bit.

The cause seems to be an interaction of client state tracking, client storage, and garv

Reproduction steps

  1. Start nomad agent with the client config attached below: nomad agent --config=./config.hcl
  2. Submit the attached job nomad job run ./example-restored.hcl
  3. Wait until alloc finishes, then kill nomad agent
  4. Start nomad again, and inspect alloc status
  5. Inspect alloc status and file /tmp/debug-alloc-restore.

One would expect alloc to only run once, but one would notice that the alloc was re-running again and moved from a dead status into a running one. /tmp/debug-alloc-restore ran twice.

Job file (if appropriate)

job "example-restored" {
  datacenters = ["dc1"]
  type = "batch"

  group "helloworld" {
    count = 1

    task "helloworld" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        args = ["-c", "echo launched; echo $NOMAD_ALLOC_ID  I am alive at $(date) >> /tmp/debug-alloc-restore; sleep 20"]
      }

      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
      }

    }
  }
}

Nomad config file

data_dir  = "/tmp/nomad-1"
log_level = "debug"

bind_addr = "127.0.0.1"

advertise {
  # Defaults to the first private IP address.
  http = "127.0.0.1:4646"
  rpc  = "127.0.0.1:4647"
  serf = "127.0.0.1:4648" # non-default ports may be specified
}


ports {
  http = 4646
  rpc  = 4647
  serf = 4648
}

server {
  enabled          = true
  bootstrap_expect = 1

  server_join {
    retry_join = [ "127.0.0.1" ]
  }
}

client {
  enabled = true
  options {
    "driver.raw_exec.enable" = "true"
  }
}

Reproduction script

Here is a full script and the output, note that on restarting the nomad agent, the alloc will be re-run to a full completion and /tmp/debug-alloc-restore contains the alloc id twice:

mars-2:0.9.2 notnoop$ rm -rf /tmp/nomad-1; rm -rf /tmp/debug-alloc-restore
mars-2:0.9.2 notnoop$ ./nomad agent -config=./agent-config.hcl -network-interface=lo0  >> nomad.log 2>&1 &
[1] 14953
mars-2:0.9.2 notnoop$ NOMAD_PID=$!
mars-2:0.9.2 notnoop$ nomad job run ./example-restored.hcl
==> Monitoring evaluation "92ac71fd"
    Evaluation triggered by job "example-restored"
    Allocation "c7015319" created: node "aaa94871", group "helloworld"
    Allocation "c7015319" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "92ac71fd" finished with status "complete"
mars-2:0.9.2 notnoop$ nomad alloc status c7015319
ID                  = c7015319
Eval ID             = 92ac71fd
Name                = example-restored.helloworld[0]
Node ID             = aaa94871
Node Name           = mars-2.local
Job ID              = example-restored
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 6s ago
Modified            = 6s ago

Task "helloworld" is "running"
Task Resources
CPU         Memory          Disk     Addresses
43/500 MHz  25 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2019-06-25T11:08:46Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2019-06-25T19:08:46+08:00  Started     Task started by client
2019-06-25T19:08:46+08:00  Task Setup  Building Task Directory
2019-06-25T19:08:46+08:00  Received    Task received by client
mars-2:0.9.2 notnoop$ cat /tmp/debug-alloc-restore
c7015319-4d82-0cf4-04d3-172709c21d66 I am alive at Tue Jun 25 19:08:46 PST 2019
mars-2:0.9.2 notnoop$ sleep 20
mars-2:0.9.2 notnoop$ nomad alloc status c7015319
ID                  = c7015319
Eval ID             = 92ac71fd
Name                = example-restored.helloworld[0]
Node ID             = aaa94871
Node Name           = mars-2.local
Job ID              = example-restored
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 39s ago
Modified            = 19s ago

Task "helloworld" is "dead"
Task Resources
CPU         Memory          Disk     Addresses
21/500 MHz  25 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2019-06-25T11:08:46Z
Finished At    = 2019-06-25T11:09:06Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2019-06-25T19:09:06+08:00  Terminated  Exit Code: 0
2019-06-25T19:08:46+08:00  Started     Task started by client
2019-06-25T19:08:46+08:00  Task Setup  Building Task Directory
2019-06-25T19:08:46+08:00  Received    Task received by client
mars-2:0.9.2 notnoop$ kill $NOMAD_PID
mars-2:0.9.2 notnoop$
[1]+  Exit 1                  ./nomad agent -config=./agent-config.hcl -network-interface=lo0 >> nomad.log 2>&1
mars-2:0.9.2 notnoop$ # restart process again
mars-2:0.9.2 notnoop$ ./nomad agent -config=./agent-config.hcl -network-interface=lo0  >> nomad.log 2>&1 &
[1] 15245
mars-2:0.9.2 notnoop$ nomad alloc status c7015319
ID                  = c7015319
Eval ID             = 92ac71fd
Name                = example-restored.helloworld[0]
Node ID             = aaa94871
Node Name           = mars-2.local
Job ID              = example-restored
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 1m16s ago
Modified            = 2s ago

Task "helloworld" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  23 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2019-06-25T11:09:59Z
Finished At    = 2019-06-25T11:09:06Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2019-06-25T19:09:59+08:00  Started     Task started by client
2019-06-25T19:09:06+08:00  Terminated  Exit Code: 0
2019-06-25T19:08:46+08:00  Started     Task started by client
2019-06-25T19:08:46+08:00  Task Setup  Building Task Directory
2019-06-25T19:08:46+08:00  Received    Task received by client
mars-2:0.9.2 notnoop$ sleep 20
mars-2:0.9.2 notnoop$ nomad alloc status c7015319
ID                  = c7015319
Eval ID             = 92ac71fd
Name                = example-restored.helloworld[0]
Node ID             = aaa94871
Node Name           = mars-2.local
Job ID              = example-restored
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 1m45s ago
Modified            = 11s ago

Task "helloworld" is "dead"
Task Resources
CPU         Memory          Disk     Addresses
44/500 MHz  26 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2019-06-25T11:09:59Z
Finished At    = 2019-06-25T11:09:06Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2019-06-25T19:10:19+08:00  Terminated  Exit Code: 0
2019-06-25T19:09:59+08:00  Started     Task started by client
2019-06-25T19:09:06+08:00  Terminated  Exit Code: 0
2019-06-25T19:08:46+08:00  Started     Task started by client
2019-06-25T19:08:46+08:00  Task Setup  Building Task Directory
2019-06-25T19:08:46+08:00  Received    Task received by client
mars-2:0.9.2 notnoop$ cat /tmp/debug-alloc-restore
c7015319-4d82-0cf4-04d3-172709c21d66 I am alive at Tue Jun 25 19:08:46 PST 2019
c7015319-4d82-0cf4-04d3-172709c21d66 I am alive at Tue Jun 25 19:09:59 PST 2019
notnoop pushed a commit that referenced this issue Jun 26, 2019
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having task runner uses the
allocRunner.shouldRun() instead of only checking the server updated
alloc.

Here, we preserve much of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes #5883
notnoop pushed a commit that referenced this issue Jun 27, 2019
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having task runner uses the
allocRunner.shouldRun() instead of only checking the server updated
alloc.

Here, we preserve much of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes #5883
chrisboulton pushed a commit to bigcommerce/nomad that referenced this issue Jul 15, 2019
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having task runner uses the
allocRunner.shouldRun() instead of only checking the server updated
alloc.

Here, we preserve much of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes hashicorp#5883
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant