
Nil pointer when an allocation's task group is no longer found on the job #4560

Closed
jippi opened this issue Aug 7, 2018 · 4 comments

jippi (Contributor) commented Aug 7, 2018

Nomad version

Nomad v0.8.4 (dbee1d7)

Issue

Nil pointer panic on all servers when they acquire Raft leadership.

I'm not sure what was done to get into this state, but a nil pointer should never happen :)

Created Deployment: "2039a10e-5984-c912-346c-ac0d700603f5"
Desired Changes for "server": (place 1) (inplace 1) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x99a734]

goroutine 84 [running]:
github.com/hashicorp/nomad/nomad/structs.(*Allocation).NextDelay(0xc43a76db00, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/structs/structs.go:6024 +0x34
github.com/hashicorp/nomad/scheduler.updateRescheduleTracker(0xc43e17f800, 0xc43a76db00, 0xbed28a80200d1d66, 0x7a9387161, 0x2b96de0)
	/opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:588 +0x331
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computePlacements(0xc434fe3400, 0x2bb8aa0, 0x0, 0x0, 0xc432b94b00, 0x1, 0x1, 0x1, 0x9)
	/opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:497 +0xd1f
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc434fe3400, 0xc43a68ab60, 0xc434234e40)
	/opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:410 +0x1586
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc434fe3400, 0xe, 0xc42598f070, 0x7)
	/opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:245 +0x471
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).(github.com/hashicorp/nomad/scheduler.process)-fm(0xc43e035750, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:144 +0x2a
github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc43e0358b0, 0xc43e0358c0, 0xc, 0xffffffffffffffff)
	/opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x43
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc434fe3400, 0xc434205b80, 0xc420200b90, 0x1d7cde0)
	/opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:144 +0x123
github.com/hashicorp/nomad/nomad.(*nomadFSM).reconcileQueuedAllocations(0xc420387080, 0x1ca3ae7, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:1383 +0x8ed
github.com/hashicorp/nomad/nomad.(*nomadFSM).applyReconcileSummaries(0xc420387080, 0xc434302075, 0x8, 0x8, 0x1ca3ae7, 0xc42e5aff76, 0x2b96de0)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:758 +0x7e
github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc420387080, 0xc42be382d0, 0x2b96de0, 0x3)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:210 +0x6f1
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc43c2275a0)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:57 +0x15a
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM(0xc42018e000)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:120 +0x2fa
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.runFSM)-fm()
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/api.go:506 +0x2a
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc42018e000, 0xc42021a710)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66
preetapan (Contributor) commented

@jippi Digging through the code, it looks like this happens when an allocation's task group is no longer found on the job: https://github.com/hashicorp/nomad/blob/v0.8.4/nomad/structs/structs.go#L5987. I would be curious to see the alloc/job details if you have them.

Nomad should not have panicked, though; we will fix this in the upcoming release.
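
For anyone else hitting this, here is a minimal standalone sketch of the failure mode, using simplified stand-in types rather than the real Nomad structs (the real NextDelay does more than this, but the nil dereference has the same shape):

// Simplified stand-in types, not the real Nomad structs. The alloc still
// references a task group that was removed from the job, ReschedulePolicy()
// returns nil, and the caller dereferences it.
package main

import (
	"fmt"
	"time"
)

type ReschedulePolicy struct {
	Delay time.Duration
}

type TaskGroup struct {
	Name             string
	ReschedulePolicy *ReschedulePolicy
}

type Job struct {
	TaskGroups []*TaskGroup
}

// LookupTaskGroup returns nil when the named group no longer exists on the job.
func (j *Job) LookupTaskGroup(name string) *TaskGroup {
	for _, tg := range j.TaskGroups {
		if tg.Name == name {
			return tg
		}
	}
	return nil
}

type Allocation struct {
	Job       *Job
	TaskGroup string
}

// ReschedulePolicy mirrors the shape of the v0.8.4 code shown in the diff
// below: nil when the alloc's task group is gone.
func (a *Allocation) ReschedulePolicy() *ReschedulePolicy {
	tg := a.Job.LookupTaskGroup(a.TaskGroup)
	if tg == nil {
		return nil
	}
	return tg.ReschedulePolicy
}

// NextDelay dereferences the policy without a nil check, as the frame at
// structs.go:6024 in the trace above suggests.
func (a *Allocation) NextDelay() time.Duration {
	return a.ReschedulePolicy().Delay
}

func main() {
	// The job was updated and no longer carries the "server" task group,
	// but an older allocation still references it.
	alloc := &Allocation{
		Job:       &Job{TaskGroups: []*TaskGroup{{Name: "web"}}},
		TaskGroup: "server",
	}
	fmt.Println(alloc.NextDelay()) // panics: nil pointer dereference
}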

jippi (Contributor, Author) commented Aug 7, 2018

My current fix is this:

diff --git a/nomad/structs/structs.go b/nomad/structs/structs.go
index 969f11338..e8d36df06 100644
--- a/nomad/structs/structs.go
+++ b/nomad/structs/structs.go
@@ -5984,8 +5984,11 @@ func (a *Allocation) LastEventTime() time.Time {
 // ReschedulePolicy returns the reschedule policy based on the task group
 func (a *Allocation) ReschedulePolicy() *ReschedulePolicy {
        tg := a.Job.LookupTaskGroup(a.TaskGroup)
-       if tg == nil {
-               return nil
+       if tg == nil || tg.ReschedulePolicy == nil {
+               if a.Job.Type == JobTypeService {
+                       return &DefaultServiceJobReschedulePolicy
+               }
+               return &DefaultBatchJobReschedulePolicy
        }
        return tg.ReschedulePolicy
 }
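
With that patch applied, a quick regression test along these lines should keep the panic from coming back (hypothetical test; the field and constant names are taken from the v0.8.4 structs package):

package structs

import "testing"

// An alloc whose task group was removed from the job should still get a
// fallback reschedule policy instead of nil.
func TestAllocation_ReschedulePolicy_MissingTaskGroup(t *testing.T) {
	alloc := &Allocation{
		TaskGroup: "removed",
		Job: &Job{
			Type:       JobTypeService,
			TaskGroups: []*TaskGroup{{Name: "server"}},
		},
	}
	if alloc.ReschedulePolicy() == nil {
		t.Fatal("expected a fallback reschedule policy for a removed task group, got nil")
	}
}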

jippi (Contributor, Author) commented Aug 7, 2018

@preetapan Based on the stack trace and the logs just before it, I can't find any reference to which job is causing the crash. Guidance is welcome if you have some pro tips :)
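
In the meantime, I might hunt for the offending alloc with something like this rough one-off against the HTTP API (hypothetical helper, assuming the stock github.com/hashicorp/nomad/api client and NOMAD_ADDR pointing at the cluster; it just flags allocs whose task group is missing from the current version of their job):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	allocs, _, err := client.Allocations().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, alloc := range allocs {
		// Re-fetching the job per alloc keeps the sketch simple; a real
		// helper would cache jobs by ID.
		job, _, err := client.Jobs().Info(alloc.JobID, nil)
		if err != nil {
			log.Printf("job %s: %v", alloc.JobID, err)
			continue
		}

		found := false
		for _, tg := range job.TaskGroups {
			if tg.Name != nil && *tg.Name == alloc.TaskGroup {
				found = true
				break
			}
		}
		if !found {
			fmt.Printf("alloc %s references task group %q, which is missing from job %s\n",
				alloc.ID, alloc.TaskGroup, alloc.JobID)
		}
	}
}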

chelseakomlo changed the title from "nil pointer on all servers when they get raft leadership" to "Nil pointer when an allocation's task group is no longer found on the job" on Aug 14, 2018
github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Nov 28, 2022