
Nomad server panics on evaluation of higher-priority system job #5777

Closed
cheeseprocedure opened this issue Jun 4, 2019 · 4 comments


@cheeseprocedure
Contributor

Nomad version

$ nomad version
Nomad v0.9.1 (4b2bdbd9ab68a27b10c2ee781cceaaf62e114399)

Operating system and Environment details

Ubuntu 16.04 LTS amd64

Issue

When attempting to run a new system job, the Nomad cluster leader panics if a lower-priority system job is already running that occupies nearly all Nomad client resources. The panic then recurs on each newly elected leader in turn, effectively rendering the cluster unmanageable.

Running nomad plan against the new system job definition triggers a similar panic. In our case Nomad is respawned by systemd and the cluster otherwise stays healthy, but the job is never scheduled.

Reproduction steps

Cluster setup: 3x Nomad servers, 5x Nomad clients w/ 6192 MHz CPU available for tasks on each client. (Scale the resources in the job definitions below accordingly.)
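To make the resource math concrete, here is a quick sketch using only the figures from this report (6192 MHz per client from the cluster setup, 3 groups of 2000 MHz from the job files, 5 clients):

```shell
#!/bin/sh
# systemjob1 places 3 groups of 2000 MHz on every client (it is a system job),
# leaving each 6192 MHz client with almost no headroom for systemjob2.
available=6192   # MHz per client, from the cluster setup above
per_group=2000   # MHz per task group, from the job files below
groups=3
clients=5

echo "headroom per client: $((available - groups * per_group)) MHz"
echo "placements attempted for systemjob2: $((clients * groups))"
```

The second figure lines up with the place=15 in the reconcile lines of the server logs below: the scheduler is attempting a full set of preempting placements when it panics.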

Execute nomad run systemjob1.hcl, then execute nomad run systemjob2.hcl or nomad plan systemjob2.hcl.

Job file (if appropriate)

systemjob1.hcl

job "systemjob1" {
  datacenters = ["usw2-lab"]
  type = "system"
  priority = 1
  group "group1" {
    count = 1
    task "dummy_task" {
      kill_signal = "SIGKILL"
      driver = "raw_exec"
      config {
        command = "/bin/bash"
        args = ["-c", "while true ; do date ; sleep 300 ; done" ]
      }
      resources {
        cpu    = 2000
        memory = 50
      }
    }
  }
  group "group2" {
    count = 1
    task "dummy_task" {
      kill_signal = "SIGKILL"
      driver = "raw_exec"
      config {
        command = "/bin/bash"
        args = ["-c", "while true ; do date ; sleep 300 ; done" ]
      }
      resources {
        cpu    = 2000
        memory = 50
      }
    }
  }
  group "group3" {
    count = 1
    task "dummy_task" {
      kill_signal = "SIGKILL"
      driver = "raw_exec"
      config {
        command = "/bin/bash"
        args = ["-c", "while true ; do date ; sleep 300 ; done" ]
      }
      resources {
        cpu    = 2000
        memory = 50
      }
    }
  }
}

systemjob2.hcl

job "systemjob2" {
  datacenters = ["usw2-lab"]
  type = "system"
  priority = 99
  group "group1" {
    count = 1
    task "dummy_task" {
      kill_signal = "SIGKILL"
      driver = "raw_exec"
      config {
        command = "/bin/bash"
        args = ["-c", "while true ; do date ; sleep 300 ; done" ]
      }
      resources {
        cpu    = 2000
        memory = 50
      }
    }
  }
  group "group2" {
    count = 1
    task "dummy_task" {
      kill_signal = "SIGKILL"
      driver = "raw_exec"
      config {
        command = "/bin/bash"
        args = ["-c", "while true ; do date ; sleep 300 ; done" ]
      }
      resources {
        cpu    = 2000
        memory = 50
      }
    }
  }
  group "group3" {
    count = 1
    task "dummy_task" {
      kill_signal = "SIGKILL"
      driver = "raw_exec"
      config {
        command = "/bin/bash"
        args = ["-c", "while true ; do date ; sleep 300 ; done" ]
      }
      resources {
        cpu    = 2000
        memory = 50
      }
    }
  }
}

Nomad Client logs (if appropriate)

n/a

Nomad Server logs (if appropriate)

After executing nomad plan systemjob2.hcl and receiving Error during plan: Unexpected response code: 500 (rpc error: EOF), the Nomad cluster leader logs the following:

Jun  4 21:08:24 c2 nomad[31706]:     2019-06-04T21:08:24.215Z [DEBUG] nomad.job.system_sched: reconciled current state with desired state: eval_id=28c9f688-c9ff-c24a-3961-4f5d49d5b897 job_id=systemjob2 namespace=default place=15 update=0 migrate=0 stop=0 ignore=0 lost=0
Jun  4 21:08:24 c2 nomad[31706]: nomad.job.system_sched: reconciled current state with desired state: eval_id=28c9f688-c9ff-c24a-3961-4f5d49d5b897 job_id=systemjob2 namespace=default place=15 update=0 migrate=0 stop=0 ignore=0 lost=0
Jun  4 21:08:24 c2 nomad[31706]: panic: runtime error: invalid memory address or nil pointer dereference
Jun  4 21:08:24 c2 nomad[31706]: [signal SIGSEGV: segmentation violation code=0x1 addr=0xd8 pc=0xb23106]
Jun  4 21:08:24 c2 nomad[31706]: goroutine 3540 [running]:
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/nomad/structs.(*Job).LookupTaskGroup(0x0, 0xc0017a2610, 0x6, 0x24)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/structs/structs.go:3357 +0x26
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*Preemptor).SetCandidates(0xc00179ce10, 0xc0005df100, 0x3, 0x3)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/preemption.go:146 +0xa0
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*BinPackIterator).Next(0xc0008f2100, 0x40eced)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:360 +0x21b0
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc0005dea80, 0x222a780)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:623 +0x38
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemStack).Select(0xc000ecb960, 0xc000514fd0, 0x0, 0x24)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/stack.go:339 +0x186
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computePlacements(0xc000c7cb40, 0xc000995b00, 0xf, 0x18, 0xc0012179c8, 0x0)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:284 +0x2ea
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computeJobAllocs(0xc000c7cb40, 0xc000bbab40, 0xc000ecb960)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:262 +0x101a
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process(0xc000c7cb40, 0x1a7c0a0, 0xc0017d6090, 0x1a7c0a0)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:128 +0x42f
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process-fm(0xc0017d60b0, 0x0, 0x0)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:74 +0x2a
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc00179d678, 0xc00179d688, 0xc, 0x0)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x40
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).Process(0xc000c7cb40, 0xc000fb66e0, 0x22538e0, 0xc0002cede0)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:74 +0x2b2
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/nomad.(*Job).Plan(0xc00028eb80, 0xc000a5bef0, 0xc000ecb8f0, 0x0, 0x0)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/job_endpoint.go:1221 +0x91a
Jun  4 21:08:24 c2 nomad[31706]: reflect.Value.call(0xc0004e67e0, 0xc0007baf18, 0x13, 0x1f96f71, 0x4, 0xc0004c9da8, 0x3, 0x3, 0x1e4a620, 0x19, ...)
Jun  4 21:08:24 c2 nomad[31706]: #011/usr/local/go/src/reflect/value.go:447 +0x454
Jun  4 21:08:24 c2 nomad[31706]: reflect.Value.Call(0xc0004e67e0, 0xc0007baf18, 0x13, 0xc0004c9da8, 0x3, 0x3, 0x16, 0x16, 0x16)
Jun  4 21:08:24 c2 nomad[31706]: #011/usr/local/go/src/reflect/value.go:308 +0xa4
Jun  4 21:08:24 c2 nomad[31706]: net/rpc.(*service).call(0xc000adf7c0, 0xc00066f8b0, 0xc0012b68d8, 0x0, 0xc00001fa00, 0xc0006278a0, 0x1cff060, 0xc000a5bef0, 0x16, 0x19d3800, ...)
Jun  4 21:08:24 c2 nomad[31706]: #011/usr/local/go/src/net/rpc/server.go:384 +0x14e
Jun  4 21:08:24 c2 nomad[31706]: net/rpc.(*Server).ServeRequest(0xc00066f8b0, 0x2240700, 0xc0001b2e80, 0x2, 0x0)
Jun  4 21:08:24 c2 nomad[31706]: #011/usr/local/go/src/net/rpc/server.go:505 +0x242
Jun  4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/nomad.(*rpcHandler).handleNomadConn(0xc00028e7a0, 0x223f5c0, 0xc0003065c0, 0x224d360, 0xc00012e480, 0xc00066f8b0)
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:290 +0x1f0
Jun  4 21:08:24 c2 nomad[31706]: created by github.com/hashicorp/nomad/nomad.(*rpcHandler).handleMultiplex
Jun  4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:272 +0x306
Jun  4 21:08:24 c2 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

nomad run systemjob2.hcl hangs at Evaluation triggered by job "systemjob2" while a similar trace appears in the leader's logs:

Jun  4 21:18:52 c2 nomad[7297]: worker.system_sched: reconciled current state with desired state: eval_id=bba8476a-e6aa-08d0-fb69-d56c3220bc31 job_id=systemjob2 namespace=default place=15 update=0 migrate=0 stop=0 ignore=0 lost=0
Jun  4 21:18:52 c2 nomad[7297]: panic: runtime error: invalid memory address or nil pointer dereference
Jun  4 21:18:52 c2 nomad[7297]: [signal SIGSEGV: segmentation violation code=0x1 addr=0xd8 pc=0xb23106]
Jun  4 21:18:52 c2 nomad[7297]: goroutine 155 [running]:
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/nomad/structs.(*Job).LookupTaskGroup(0x0, 0xc000f62a60, 0x6, 0x24)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/structs/structs.go:3357 +0x26
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*Preemptor).SetCandidates(0xc000631560, 0xc000017a80, 0x3, 0x3)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/preemption.go:146 +0xa0
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*BinPackIterator).Next(0xc000397840, 0x40eced)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:360 +0x21b0
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc000017580, 0x222a780)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:623 +0x38
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemStack).Select(0xc000d651f0, 0xc000905130, 0x0, 0x24)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/stack.go:339 +0x186
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computePlacements(0xc000376b40, 0xc000172c00, 0xf, 0x18, 0xc00095f268, 0x0)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:284 +0x2ea
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computeJobAllocs(0xc000376b40, 0xc0003923c0, 0xc000d651f0)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:262 +0x101a
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process(0xc000376b40, 0x1a7c0a0, 0xc000e1d580, 0x1a7c0a0)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:128 +0x42f
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process-fm(0xc000e1d5a0, 0x0, 0x0)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:74 +0x2a
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc000631dc8, 0xc000631dd8, 0xc, 0x0)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x40
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).Process(0xc000376b40, 0xc00023fa20, 0x22538e0, 0xc0006d22a0)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:74 +0x2b2
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc00034a150, 0xc00023fa20, 0xc000849200, 0x24, 0x0, 0x0)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:294 +0x419
Jun  4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/nomad.(*Worker).run(0xc00034a150)
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:129 +0x2d3
Jun  4 21:18:52 c2 nomad[7297]: created by github.com/hashicorp/nomad/nomad.NewWorker
Jun  4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x14f
Jun  4 21:18:52 c2 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

The outcome here is much worse than with nomad plan, as the panic recurs on whichever server next becomes cluster leader:

Jun  4 21:22:31 c3 nomad[13261]: nomad: cluster leadership acquired
Jun  4 21:22:31 c3 nomad[13261]: panic: runtime error: invalid memory address or nil pointer dereference
Jun  4 21:22:35 c3 nomad[13284]: nomad: cluster leadership acquired
Jun  4 21:22:35 c3 nomad[13284]: panic: runtime error: invalid memory address or nil pointer dereference
Jun  4 21:22:37 c1 nomad[10371]: nomad: cluster leadership acquired
Jun  4 21:22:37 c1 nomad[10371]: panic: runtime error: invalid memory address or nil pointer dereference
@notnoop
Contributor

notnoop commented Jun 5, 2019

Thanks @cheeseprocedure for reporting this issue; this is a serious regression. We believe it was addressed incidentally in GH-5545. The fix will ship in 0.9.2 (to be released shortly), and you can test against the 0.9.2-rc1 binaries at https://releases.hashicorp.com/nomad/0.9.2-rc1/ . Please let us know your findings.
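For anyone else hitting this, a sketch of fetching the rc1 build from the release URL above. Note the zip filename here assumes HashiCorp's usual nomad_VERSION_OS_ARCH.zip naming; verify it against the directory listing:

```shell
#!/bin/sh
ver="0.9.2-rc1"
# Assumed filename pattern; check the listing at
# https://releases.hashicorp.com/nomad/0.9.2-rc1/ if it differs.
url="https://releases.hashicorp.com/nomad/${ver}/nomad_${ver}_linux_amd64.zip"
echo "$url"
# Uncomment to actually download and unpack:
# curl -fsSLO "$url" && unzip "nomad_${ver}_linux_amd64.zip"
```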

@cheeseprocedure
Contributor Author

Thanks very much @notnoop - we'll try it with 0.9.2-rc1 shortly!

@cheeseprocedure
Contributor Author

@notnoop looks good - I was unable to reproduce this with the Nomad 0.9.2 release. Thanks again!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 22, 2022