-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix panic while deleting CSI plugins for missing job #7758
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
But I think I want to tag @langmartin on this to see this was intended as an assert that we shouldn't otherwise be seeing.
@@ -1187,15 +1187,14 @@ func (s *StateStore) deleteJobFromPlugin(index uint64, txn *memdb.Txn, job *stru | |||
plugins := map[string]*structs.CSIPlugin{} | |||
|
|||
for _, a := range allocs { | |||
tg := job.LookupTaskGroup(a.TaskGroup) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @michaeldwan I think the error we made here is that this line should have read:
tg := a.job.LookupTaskGroup(a.TaskGroup)
It seemed strange to me that a job can not have the taskgroup for one of its own allocs, but my colleague @notnoop has pointed out to me that the alloc can have a view of the job that's stale relative to the job we've pulled from the state store, and that's why we're running into this panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's roughly what we think happened too. We were actually working on a network issue that impacted consul but nomad seemed healthy through it.
I can make that change here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gah, I typoed my example. It's a.Job
(capital letter) 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made the change. There's a few other places LookupTaskGroup is being called but they appear to get both the job and alloc from the same snapshot and should be okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the same typo and pushed before waiting for tests in vagrant to run :)
3785251
to
a956a17
Compare
a956a17
to
e08b70f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix is LGTM. Thank you so much for reporting and fixing the issue! We probably should also follow up with more tests covering this area and we, the Nomad team, can take care of it.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This is what we deployed to production to fix the panic described in #7757.