Restarting nomad agent restarts successfully completed allocs #5883
notnoop pushed a commit that referenced this issue on Jun 26, 2019:
This change fixes a bug where Nomad would avoid running alloc tasks if the alloc is client-terminal but the server copy on the client isn't marked as running. We fix this case by having the task runner use allocRunner.shouldRun() instead of only checking the server-updated alloc. We preserve much of the existing invariants, such that `tr.Run()` is always run, and don't change the overall alloc runner and task runner lifecycles. Fixes #5883
notnoop pushed a commit with the same change that referenced this issue on Jun 27, 2019.
chrisboulton pushed a commit with the same change to bigcommerce/nomad that referenced this issue on Jul 15, 2019 (fixes hashicorp#5883).
Nomad version
Operating system and Environment details
macOS, but OS is irrelevant
Issue
When a client restarts, it may unexpectedly restart successfully completed batch allocs. When an alloc is restarted, it will keep running until completion without being terminated. This seems to affect allocations that haven't been garbage collected.
This is a regression introduced in 0.9.0. Nomad 0.8.7 did not attempt to re-run such tasks. Nomad 0.9.0, 0.9.1, and 0.9.2 attempt to re-run tasks, though 0.9.0 had logmon reattachment bugs that complicate the matter a bit.
The cause seems to be an interaction of client state tracking, client storage, and garbage collection.
Reproduction steps
1. Start a Nomad agent: `nomad agent --config=./config.hcl`
2. Run the job: `nomad job run ./example-restored.hcl`
3. Wait for the alloc to complete; the job records its alloc ID in /tmp/debug-alloc-restore.
4. Restart the Nomad agent.
One would expect the alloc to only run once, but one will notice that the alloc re-runs and moves from a dead status into a running one, so /tmp/debug-alloc-restore shows it ran twice.
Job file (if appropriate)
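The referenced example-restored.hcl isn't reproduced here; the following is a minimal sketch of a batch job that appends its alloc ID to /tmp/debug-alloc-restore, assuming the raw_exec driver and hypothetical group/task names rather than the reporter's original file:

```hcl
# Hypothetical stand-in for example-restored.hcl: a batch job whose only task
# appends its allocation ID to /tmp/debug-alloc-restore and then exits.
job "example-restored" {
  datacenters = ["dc1"]
  type        = "batch"

  group "debug" {
    task "record-alloc" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        # The shell expands $NOMAD_ALLOC_ID at runtime, so each run of the
        # alloc adds one line to the marker file.
        args = ["-c", "echo $NOMAD_ALLOC_ID >> /tmp/debug-alloc-restore"]
      }
    }
  }
}
```

With a job along these lines, a second line with the same alloc ID appearing after the agent restart would indicate the alloc was re-run.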
Nomad config file
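The config.hcl passed to the agent isn't shown either; below is a minimal single-node sketch with an assumed data_dir path and a raw_exec plugin block added so the job above can run. The persistent data_dir (as opposed to -dev mode) is the relevant part here, since the issue involves client state that survives the agent restart:

```hcl
# Hypothetical single-node agent config. A persistent data_dir lets client
# state survive the agent restart.
data_dir = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
}

# raw_exec is disabled by default; enable it for the sketch job above.
plugin "raw_exec" {
  config {
    enabled = true
  }
}
```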
Reproduction script
A full reproduction script and its output show that, on restarting the Nomad agent, the alloc is re-run to full completion and /tmp/debug-alloc-restore contains the alloc id twice.