Nomad version
$ nomad version
Nomad v0.9.1 (4b2bdbd9ab68a27b10c2ee781cceaaf62e114399)
Operating system and Environment details
Ubuntu 16.04 LTS amd64
Issue
When attempting to run a new system job, the Nomad cluster leader panics if another system job is already running which occupies nearly all Nomad client resources and has a lower priority than the new job. The panic then recurs on each newly elected leader, effectively rendering the cluster unmanageable.
Running nomad plan against the new system job definition triggers a similar panic. In that case, Nomad is respawned by systemd and the cluster remains otherwise healthy, but the job is never scheduled.
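The traces below point at structs.(*Job).LookupTaskGroup being invoked on a nil *Job receiver from the preemption candidate-selection path. As a minimal sketch of that failure mode (the struct shapes here are simplified stand-ins for illustration, not the real Nomad definitions), calling a method that touches fields of a nil pointer produces exactly the panic message seen in the logs:

```go
package main

import "fmt"

// TaskGroup and Job loosely mirror the Nomad structs involved in the crash;
// the real definitions live in nomad/structs/structs.go.
type TaskGroup struct{ Name string }

type Job struct{ TaskGroups []*TaskGroup }

// LookupTaskGroup reads j.TaskGroups without a nil check. Go permits calling
// a pointer-receiver method on a nil pointer, so the dereference only blows
// up here, inside the method body.
func (j *Job) LookupTaskGroup(name string) *TaskGroup {
	for _, tg := range j.TaskGroups { // panics when j == nil
		if tg.Name == name {
			return tg
		}
	}
	return nil
}

// trigger calls LookupTaskGroup on a nil *Job and reports the recovered panic.
func trigger() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	var j *Job // nil, standing in for a missing job on a preemption candidate
	j.LookupTaskGroup("web")
	return "no panic"
}

func main() {
	// prints: runtime error: invalid memory address or nil pointer dereference
	fmt.Println(trigger())
}
```

Without the recover, the runtime kills the whole process, which is why a single bad evaluation takes down each leader that dequeues it.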
Reproduction steps
Cluster setup: 3x Nomad servers, 5x Nomad clients with 6192 MHz of CPU available for tasks. (Scale resources in the job definitions below accordingly.)
Execute nomad run systemjob1.hcl, then execute nomad run systemjob2.hcl or nomad plan systemjob2.hcl.
After executing nomad plan systemjob2.hcl and receiving Error during plan: Unexpected response code: 500 (rpc error: EOF), the Nomad cluster leader logs the following:
Jun 4 21:08:24 c2 nomad[31706]: 2019-06-04T21:08:24.215Z [DEBUG] nomad.job.system_sched: reconciled current state with desired state: eval_id=28c9f688-c9ff-c24a-3961-4f5d49d5b897 job_id=systemjob2 namespace=default place=15 update=0 migrate=0 stop=0 ignore=0 lost=0
Jun 4 21:08:24 c2 nomad[31706]: nomad.job.system_sched: reconciled current state with desired state: eval_id=28c9f688-c9ff-c24a-3961-4f5d49d5b897 job_id=systemjob2 namespace=default place=15 update=0 migrate=0 stop=0 ignore=0 lost=0
Jun 4 21:08:24 c2 nomad[31706]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 4 21:08:24 c2 nomad[31706]: [signal SIGSEGV: segmentation violation code=0x1 addr=0xd8 pc=0xb23106]
Jun 4 21:08:24 c2 nomad[31706]: goroutine 3540 [running]:
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/nomad/structs.(*Job).LookupTaskGroup(0x0, 0xc0017a2610, 0x6, 0x24)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/structs/structs.go:3357 +0x26
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*Preemptor).SetCandidates(0xc00179ce10, 0xc0005df100, 0x3, 0x3)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/preemption.go:146 +0xa0
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*BinPackIterator).Next(0xc0008f2100, 0x40eced)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:360 +0x21b0
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc0005dea80, 0x222a780)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:623 +0x38
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemStack).Select(0xc000ecb960, 0xc000514fd0, 0x0, 0x24)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/stack.go:339 +0x186
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computePlacements(0xc000c7cb40, 0xc000995b00, 0xf, 0x18, 0xc0012179c8, 0x0)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:284 +0x2ea
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computeJobAllocs(0xc000c7cb40, 0xc000bbab40, 0xc000ecb960)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:262 +0x101a
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process(0xc000c7cb40, 0x1a7c0a0, 0xc0017d6090, 0x1a7c0a0)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:128 +0x42f
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process-fm(0xc0017d60b0, 0x0, 0x0)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:74 +0x2a
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc00179d678, 0xc00179d688, 0xc, 0x0)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x40
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).Process(0xc000c7cb40, 0xc000fb66e0, 0x22538e0, 0xc0002cede0)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:74 +0x2b2
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/nomad.(*Job).Plan(0xc00028eb80, 0xc000a5bef0, 0xc000ecb8f0, 0x0, 0x0)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/job_endpoint.go:1221 +0x91a
Jun 4 21:08:24 c2 nomad[31706]: reflect.Value.call(0xc0004e67e0, 0xc0007baf18, 0x13, 0x1f96f71, 0x4, 0xc0004c9da8, 0x3, 0x3, 0x1e4a620, 0x19, ...)
Jun 4 21:08:24 c2 nomad[31706]: #011/usr/local/go/src/reflect/value.go:447 +0x454
Jun 4 21:08:24 c2 nomad[31706]: reflect.Value.Call(0xc0004e67e0, 0xc0007baf18, 0x13, 0xc0004c9da8, 0x3, 0x3, 0x16, 0x16, 0x16)
Jun 4 21:08:24 c2 nomad[31706]: #011/usr/local/go/src/reflect/value.go:308 +0xa4
Jun 4 21:08:24 c2 nomad[31706]: net/rpc.(*service).call(0xc000adf7c0, 0xc00066f8b0, 0xc0012b68d8, 0x0, 0xc00001fa00, 0xc0006278a0, 0x1cff060, 0xc000a5bef0, 0x16, 0x19d3800, ...)
Jun 4 21:08:24 c2 nomad[31706]: #011/usr/local/go/src/net/rpc/server.go:384 +0x14e
Jun 4 21:08:24 c2 nomad[31706]: net/rpc.(*Server).ServeRequest(0xc00066f8b0, 0x2240700, 0xc0001b2e80, 0x2, 0x0)
Jun 4 21:08:24 c2 nomad[31706]: #011/usr/local/go/src/net/rpc/server.go:505 +0x242
Jun 4 21:08:24 c2 nomad[31706]: github.com/hashicorp/nomad/nomad.(*rpcHandler).handleNomadConn(0xc00028e7a0, 0x223f5c0, 0xc0003065c0, 0x224d360, 0xc00012e480, 0xc00066f8b0)
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:290 +0x1f0
Jun 4 21:08:24 c2 nomad[31706]: created by github.com/hashicorp/nomad/nomad.(*rpcHandler).handleMultiplex
Jun 4 21:08:24 c2 nomad[31706]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:272 +0x306
Jun 4 21:08:24 c2 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
nomad run systemjob2.hcl hangs at Evaluation triggered by job "systemjob2", while a similar trace appears in the leader's logs:
Jun 4 21:18:52 c2 nomad[7297]: worker.system_sched: reconciled current state with desired state: eval_id=bba8476a-e6aa-08d0-fb69-d56c3220bc31 job_id=systemjob2 namespace=default place=15 update=0 migrate=0 stop=0 ignore=0 lost=0
Jun 4 21:18:52 c2 nomad[7297]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 4 21:18:52 c2 nomad[7297]: [signal SIGSEGV: segmentation violation code=0x1 addr=0xd8 pc=0xb23106]
Jun 4 21:18:52 c2 nomad[7297]: goroutine 155 [running]:
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/nomad/structs.(*Job).LookupTaskGroup(0x0, 0xc000f62a60, 0x6, 0x24)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/structs/structs.go:3357 +0x26
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*Preemptor).SetCandidates(0xc000631560, 0xc000017a80, 0x3, 0x3)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/preemption.go:146 +0xa0
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*BinPackIterator).Next(0xc000397840, 0x40eced)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:360 +0x21b0
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc000017580, 0x222a780)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:623 +0x38
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemStack).Select(0xc000d651f0, 0xc000905130, 0x0, 0x24)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/stack.go:339 +0x186
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computePlacements(0xc000376b40, 0xc000172c00, 0xf, 0x18, 0xc00095f268, 0x0)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:284 +0x2ea
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computeJobAllocs(0xc000376b40, 0xc0003923c0, 0xc000d651f0)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:262 +0x101a
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process(0xc000376b40, 0x1a7c0a0, 0xc000e1d580, 0x1a7c0a0)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:128 +0x42f
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process-fm(0xc000e1d5a0, 0x0, 0x0)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:74 +0x2a
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc000631dc8, 0xc000631dd8, 0xc, 0x0)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x40
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).Process(0xc000376b40, 0xc00023fa20, 0x22538e0, 0xc0006d22a0)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:74 +0x2b2
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc00034a150, 0xc00023fa20, 0xc000849200, 0x24, 0x0, 0x0)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:294 +0x419
Jun 4 21:18:52 c2 nomad[7297]: github.com/hashicorp/nomad/nomad.(*Worker).run(0xc00034a150)
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:129 +0x2d3
Jun 4 21:18:52 c2 nomad[7297]: created by github.com/hashicorp/nomad/nomad.NewWorker
Jun 4 21:18:52 c2 nomad[7297]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x14f
Jun 4 21:18:52 c2 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
The outcome here is much worse than the nomad plan case, as the panic recurs on each newly elected cluster leader:
Jun 4 21:22:31 c3 nomad[13261]: nomad: cluster leadership acquired
Jun 4 21:22:31 c3 nomad[13261]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 4 21:22:35 c3 nomad[13284]: nomad: cluster leadership acquired
Jun 4 21:22:35 c3 nomad[13284]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 4 21:22:37 c1 nomad[10371]: nomad: cluster leadership acquired
Jun 4 21:22:37 c1 nomad[10371]: panic: runtime error: invalid memory address or nil pointer dereference
Thanks @cheeseprocedure for reporting this issue; this is a serious regression. We believe it was addressed incidentally in GH-5545. The fix will ship in 0.9.2 (to be released shortly), and you can test it against the 0.9.2-rc1 binaries at https://releases.hashicorp.com/nomad/0.9.2-rc1/ . Please let us know of your findings.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.