Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Panic in nomad server #2583

Closed
justenwalker opened this issue Apr 25, 2017 · 10 comments
Closed

[Bug] Panic in nomad server #2583

justenwalker opened this issue Apr 25, 2017 · 10 comments

Comments

@justenwalker
Copy link
Contributor

justenwalker commented Apr 25, 2017

Nomad version

Nomad v0.5.6

Operating system and Environment details

$ uname -a
Linux server-01 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"

Issue

Nomad server panics under some (as yet unknown) conditions

Reproduction steps

Possibly in a split-brain or network situation - no concrete reproduction steps yet.

Nomad Client logs

    2017/04/25 17:13:51.794568 [INFO] client.consul: discovered following Servers: 10.140.6.24:4647,10.140.6.23:4647,10.
140.6.22:4647
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x90 pc=0xa1318d]

goroutine 346 [running]:
github.com/hashicorp/nomad/nomad/structs.(*Job).LookupTaskGroup(0x0, 0xc04241f420, 0x17, 0xc042556f00)
        /opt/gopath/src/github.com/hashicorp/nomad/nomad/structs/structs.go:1316 +0x2d
github.com/hashicorp/nomad/client.(*Client).blockForRemoteAlloc(0xc042366000, 0xc042556f00)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1566 +0x97
created by github.com/hashicorp/nomad/client.(*Client).runAllocs
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1523 +0xac4

Nomad Server logs

Apr 25 17:01:21 server-01 nomad[40143]:     2017/04/25 17:01:21.037375 [WARN] nomad.heartbeat: node '3faf980a-4a79-4c70-a01a-ebbfa130ccb8' TTL expired
Apr 25 17:01:21 server-01 nomad[40143]: panic: runtime error: invalid memory address or nil pointer dereference
Apr 25 17:01:21 server-01 nomad[40143]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xceffa3]
Apr 25 17:01:21 server-01 nomad[40143]: goroutine 5987 [running]:
Apr 25 17:01:21 server-01 nomad[40143]: github.com/hashicorp/nomad/nomad.(*Node).createNodeEvals(0xc4203a7540, 0xc43e25a0f0, 0x24, 0x9ac1dd, 0x1, 0x0, 0x9ac1dd, 0x0, 0x0,0xc451485c00)
Apr 25 17:01:21 server-01 nomad[40143]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/node_endpoint.go:847 +0x753
Apr 25 17:01:21 server-01 nomad[40143]: github.com/hashicorp/nomad/nomad.(*Node).UpdateStatus(0xc4203a7540, 0xc43caebe60, 0xc436db1700, 0x0, 0x0)
Apr 25 17:01:21 server-01 nomad[40143]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/node_endpoint.go:326 +0x625
Apr 25 17:01:21 server-01 nomad[40143]: github.com/hashicorp/nomad/nomad.(*Server).invalidateHeartbeat(0xc4203d24e0, 0xc43e25a0f0, 0x24)
Apr 25 17:01:21 server-01 nomad[40143]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/heartbeat.go:103 +0x32d
Apr 25 17:01:21 server-01 nomad[40143]: github.com/hashicorp/nomad/nomad.(*Server).resetHeartbeatTimerLocked.func1()
Apr 25 17:01:21 server-01 nomad[40143]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/heartbeat.go:79 +0x3c
Apr 25 17:01:21 server-01 nomad[40143]: created by time.goFunc
Apr 25 17:01:21 server-01 nomad[40143]:         /opt/go/src/time/sleep.go:170 +0x44
@dadgar
Copy link
Contributor

dadgar commented Apr 25, 2017

@justenwalker Can we get more logs around both panics.

@dadgar
Copy link
Contributor

dadgar commented Apr 25, 2017

Can you also give all the allocations for the node that TTL'd: 3faf980a-4a79-4c70-a01a-ebbfa130ccb8

curl http://127.0.0.1:4646/v1/node/3faf980a-4a79-4c70-a01a-ebbfa130ccb8/allocations

@dadgar
Copy link
Contributor

dadgar commented Apr 25, 2017

Looks like both panics are from alloc.Job == nil where it shouldn't be. Would be nice to confirm via the allocation dump and the logs to show what the scheduler was doing.

@justenwalker
Copy link
Contributor Author

The client error crops up when we delete the alloc folder and start up nomad. The server error seems to randomly happen at some point - most likely due to network errors.

@dadgar
Copy link
Contributor

dadgar commented Apr 25, 2017

@justenwalker When you say you delete the alloc dir are you deleting alloc/* or deleting the data underneath each alloc?

state_dir/ 
   |-> alloc/
        |-> 3faf980a-4a79-4c70-a01a-ebbfa130ccb8
             |-> state.json
        |-> repeated ... 

@justenwalker
Copy link
Contributor Author

On the client node: both data_dir/alloc and data_dir/client/alloc

@dadgar
Copy link
Contributor

dadgar commented Apr 25, 2017

Hmm okay. Any chance of getting the logs and alloc dump?

@dadgar
Copy link
Contributor

dadgar commented Apr 26, 2017

@justenwalker Are you running with both client and server enabled?

@dadgar
Copy link
Contributor

dadgar commented Apr 27, 2017

For posterities sake a description of the issue:

// TTLs that happened roughly before crash
"Apr 25 15:29:36 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:29:36.611148 [WARN] nomad.heartbeat: node '46831227-14f9-4dff-8a88-39736d2d1329' TTL expired","2017-04-25T11:29:36.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:29:37 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:29:37.708765 [WARN] nomad.heartbeat: node 'f6e7a9cb-9936-4076-b9be-e8a2c3527e5b' TTL expired","2017-04-25T11:29:37.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:29:48 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:29:48.449245 [WARN] nomad.heartbeat: node '2386546b-e891-a046-b005-d420b54433ba' TTL expired","2017-04-25T11:29:48.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:29:50 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:29:50.634437 [WARN] nomad.heartbeat: node 'b97293b5-e9bb-4d07-8f55-e9d4d44d95bc' TTL expired","2017-04-25T11:29:50.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:29:51 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:29:51.896766 [WARN] nomad.heartbeat: node '6e65723e-d8c1-4192-b944-3c2f624a6daa' TTL expired","2017-04-25T11:29:51.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:29:54 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:29:54.763682 [WARN] nomad.heartbeat: node 'f0f78462-f12f-457f-9542-038aa7203876' TTL expired","2017-04-25T11:29:54.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:02 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:30:02.198775 [WARN] nomad.heartbeat: node '8b94a75a-a2b3-4d8a-a356-a3a6f1fd18ce' TTL expired","2017-04-25T11:30:02.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"

// There was a force GC roughly 20 seconds before. However the partitioned reap takes
// 23 seconds to complete as indicated by the force node GC.
"Apr 25 15:29:43 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:29:43.909522 [DEBUG] sched.core: eval GC: 23350 evaluations, 4318 allocs eligible","2017-04-25T11:29:43.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:29:44 q2-gog-nmd-ns2 nomad[26617]:     2017/04/25 15:29:44 [DEBUG] memberlist: TCP connection from=10.140.6.23:34888","2017-04-25T11:29:44.000-0400","q2-gog-nmd-ns2","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:29:56 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:29:56.710780 [DEBUG] sched.core: forced node GC","2017-04-25T11:29:56.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"

// Scheduler sets an eval to lost
"Apr 25 15:30:03 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:03.476190 [DEBUG] worker: dequeued evaluation 04b3cc1f-9988-dfb8-d0c9-3b2d3b6c2b5f","2017-04-25T11:30:03.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:03 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:03.476337 [DEBUG] sched: <Eval '04b3cc1f-9988-dfb8-d0c9-3b2d3b6c2b5f' JobID: 'thor/inventory-accuracy/fc000/periodic-1493071200'>: allocs: (place 0) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 1)","2017-04-25T11:30:03.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:03 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:03.612990 [DEBUG] worker: submitted plan for evaluation 04b3cc1f-9988-dfb8-d0c9-3b2d3b6c2b5f","2017-04-25T11:30:03.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:03 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:03.613291 [DEBUG] sched: <Eval '04b3cc1f-9988-dfb8-d0c9-3b2d3b6c2b5f' JobID: 'thor/inventory-accuracy/fc000/periodic-1493071200'>: setting status to complete","2017-04-25T11:30:03.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:03 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:03.831116 [DEBUG] worker: updated evaluation <Eval '04b3cc1f-9988-dfb8-d0c9-3b2d3b6c2b5f' JobID: 'thor/inventory-accuracy/fc000/periodic-1493071200'>","2017-04-25T11:30:03.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:03 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:03.832269 [DEBUG] worker: ack for evaluation 04b3cc1f-9988-dfb8-d0c9-3b2d3b6c2b5f","2017-04-25T11:30:03.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"

// Panic
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.475748 [ERR] worker: failed to update evaluation <Eval '761f90aa-c7f0-95f6-9d0b-8928891be941' JobID: 'thor/inventory-accuracy/fc000/periodic-1493071200'>: rpc error: EOF","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.476524 [ERR] worker: failed to nack evaluation '761f90aa-c7f0-95f6-9d0b-8928891be941': rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.476737 [ERR] worker: failed to dequeue evaluation: rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.477383 [ERR] worker: failed to update evaluation <Eval 'bff5ff23-99cf-7f61-cc7a-23a23ed89197' JobID: 'batman/webapi/native'>: rpc error: EOF","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.477578 [ERR] worker: failed to nack evaluation 'bff5ff23-99cf-7f61-cc7a-23a23ed89197': rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.477825 [ERR] worker: failed to dequeue evaluation: rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.497118 [ERR] worker: failed to dequeue evaluation: rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.498574 [ERR] worker: failed to dequeue evaluation: rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.577429 [ERR] worker: failed to dequeue evaluation: rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.578923 [ERR] worker: failed to dequeue evaluation: rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.897938 [ERR] worker: failed to dequeue evaluation: rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns1 nomad[21222]:     2017/04/25 15:30:04.899297 [ERR] worker: failed to dequeue evaluation: rpc error: stream closed","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns1","qa-guardians",1,"/var/log/syslog",syslog,"idx12.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:30:04.113837 [DEBUG] worker: updated evaluation <Eval '1d4e73e9-3c61-1bdc-7c15-bfef56b74771' JobID: 'batman/webapi/native/1.0.3075'>","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:30:04.113909 [DEBUG] worker: ack for evaluation 1d4e73e9-3c61-1bdc-7c15-bfef56b74771","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:30:04.114048 [DEBUG] worker: dequeued evaluation c3b89c27-2e3a-b39d-c750-e5db943136c4","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:30:04.114208 [DEBUG] sched: <Eval 'c3b89c27-2e3a-b39d-c750-e5db943136c4' JobID: 'batman/webapi/native/1.0.3075'>: allocs: (place 0) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]:     2017/04/25 15:30:04.114222 [DEBUG] sched: <Eval 'c3b89c27-2e3a-b39d-c750-e5db943136c4' JobID: 'batman/webapi/native/1.0.3075'>: setting status to complete","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: panic: runtime error: invalid memory address or nil pointer dereference","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xceffa3]","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: goroutine 14073 [running]:","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: github.com/hashicorp/nomad/nomad.(*Node).createNodeEvals(0xc42010b340, 0xc43dbbd2c0, 0x24, 0x99d66a, 0x0, 0x0, 0x99d66a, 0x0, 0x0, 0x402400)","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/node_endpoint.go:847 +0x753","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: github.com/hashicorp/nomad/nomad.(*Node).UpdateStatus(0xc42010b340, 0xc43dbbc930, 0xc4243d7800, 0x0, 0x0)","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/node_endpoint.go:326 +0x625","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: reflect.Value.call(0xc42041daa0, 0xc420108570, 0x13, 0x12615e4, 0x4, 0xc471c67e28, 0x3, 0x3, 0x116fee0, 0xc4243d7800, ...)","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: #011/opt/go/src/reflect/value.go:434 +0x91f","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: reflect.Value.Call(0xc42041daa0, 0xc420108570, 0x13, 0xc471c67e28, 0x3, 0x3, 0xc4400cdfa0, 0xc43dcd6a01, 0x0)","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: #011/opt/go/src/reflect/value.go:302 +0xa4","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: net/rpc.(*service).call(0xc42010b440, 0xc42010b300, 0xc441ff3af8, 0xc420422580, 0xc4400cdfa0, 0x116fee0, 0xc43dbbc930, 0x16, 0x110aca0, 0xc4243d7800, ...)","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: #011/opt/go/src/net/rpc/server.go:387 +0x144","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: net/rpc.(*Server).ServeRequest(0xc42010b300, 0x1a68580, 0xc445cf86c0, 0x3f800000, 0x0)","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: #011/opt/go/src/net/rpc/server.go:502 +0x267","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc420112d00, 0x1a6d700, 0xc458a68f70)","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:165 +0x11f","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"
"Apr 25 15:30:04 q2-gog-nmd-ns0 nomad[49852]: #011/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:150 +0x19e","2017-04-25T11:30:04.000-0400","q2-gog-nmd-ns0","qa-guardians",1,"/var/log/syslog",syslog,"idx11.jet.splunkcloud.com"

The issue is a racy interaction between garbage collection and the scheduler. The issue is garbage collection removes an allocation and then the scheduler updates its status to terminal state. The way the job is normalized in the plan results in this to upsert the allocation without a job. This was not guarded properly in the case that it is a new allocation.

That breaks an invariant which caused panics. The PR that fixes this guards this and will result in the scheduler failing the evaluation which will force it to be retried with updated state, thereby avoiding this issue.

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants