Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix panic while deleting CSI plugins for missing job #7758

Merged
merged 1 commit into from
Apr 20, 2020

Conversation

michaeldwan
Copy link
Contributor

@michaeldwan michaeldwan commented Apr 20, 2020

This is what we deployed to production to fix the panic described in #7757.

@hashicorp-cla
Copy link

hashicorp-cla commented Apr 20, 2020

CLA assistant check
All committers have signed the CLA.

@tgross tgross requested a review from langmartin April 20, 2020 17:41
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

But I think I want to tag @langmartin on this to see this was intended as an assert that we shouldn't otherwise be seeing.

@@ -1187,15 +1187,14 @@ func (s *StateStore) deleteJobFromPlugin(index uint64, txn *memdb.Txn, job *stru
plugins := map[string]*structs.CSIPlugin{}

for _, a := range allocs {
tg := job.LookupTaskGroup(a.TaskGroup)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @michaeldwan I think the error we made here is that this line should have read:

tg := a.job.LookupTaskGroup(a.TaskGroup)

It seemed strange to me that a job can not have the taskgroup for one of its own allocs, but my colleague @notnoop has pointed out to me that the alloc can have a view of the job that's stale relative to the job we've pulled from the state store, and that's why we're running into this panic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's roughly what we think happened too. We were actually working on a network issue that impacted consul but nomad seemed healthy through it.

I can make that change here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah, I typoed my example. It's a.Job (capital letter) 😊

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made the change. There's a few other places LookupTaskGroup is being called but they appear to get both the job and alloc from the same snapshot and should be okay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the same typo and pushed before waiting for tests in vagrant to run :)

@michaeldwan michaeldwan force-pushed the fix-deregister-job-panic branch from 3785251 to a956a17 Compare April 20, 2020 20:29
@michaeldwan michaeldwan force-pushed the fix-deregister-job-panic branch from a956a17 to e08b70f Compare April 20, 2020 20:35
Copy link
Contributor

@langmartin langmartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix looks good to me.

Copy link
Contributor

@notnoop notnoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is LGTM. Thank you so much for reporting and fixing the issue! We probably should also follow up with more tests covering this area and we, the Nomad team, can take care of it.

@tgross tgross merged commit c364732 into hashicorp:master Apr 20, 2020
@tgross tgross mentioned this pull request May 4, 2020
@michaeldwan michaeldwan deleted the fix-deregister-job-panic branch June 5, 2020 04:45
@github-actions
Copy link

github-actions bot commented Jan 3, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants