scheduler: recover from panic #12009

Merged: 2 commits merged into main from recoverable-scheduler-processing on Feb 7, 2022

Member

@tgross tgross commented Feb 4, 2022

If processing a specific evaluation causes the scheduler (and
therefore the entire server) to panic, that evaluation never gets a
chance to be nack'd and cleared from the state store. Another
scheduler then dequeues it, causing that server to panic too, and so
forth until all servers are in a panic loop. This prevents the
operator from intervening to remove the evaluation or update the
state.

Recover from the panic in the top-level `Process` method of each
scheduler so that this condition can be detected without crashing the
server process. The scheduler goroutine will keep panicking and
recovering in a loop until the eval can be removed or nack'd, but
that's much better than taking downtime.
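
For readers unfamiliar with the pattern, here is a minimal sketch of recovering from a panic inside a top-level `Process` method. It is not the code from this PR: the `Evaluation` type, `buggyScheduler`, the `process` helper, and the "poison" ID are all hypothetical names made up for illustration. It only assumes Go's standard `defer`/`recover` behavior.

```go
package main

import "fmt"

// Evaluation is a hypothetical stand-in for a scheduler evaluation.
type Evaluation struct {
	ID string
}

// buggyScheduler is a hypothetical scheduler that panics on one specific eval.
type buggyScheduler struct{}

// Process is the top-level entry point. The named result lets the deferred
// function turn a recovered panic into an ordinary error for the caller.
func (buggyScheduler) Process(eval *Evaluation) (err error) {
	defer func() {
		// recover returns non-nil only while this goroutine is panicking.
		if r := recover(); r != nil {
			err = fmt.Errorf("processing eval %q panicked: %v", eval.ID, r)
		}
	}()
	return process(eval)
}

// process simulates scheduler logic with a bug triggered by one evaluation.
func process(eval *Evaluation) error {
	if eval.ID == "poison" {
		var m map[string]int
		m["x"] = 1 // runtime panic: assignment to entry in nil map
	}
	return nil
}

func main() {
	s := buggyScheduler{}
	for _, id := range []string{"ok", "poison", "ok"} {
		if err := s.Process(&Evaluation{ID: id}); err != nil {
			// The process survives, so the caller can report or nack the eval.
			fmt.Println("recovered:", err)
			continue
		}
		fmt.Println("processed", id)
	}
}
```

In this sketch the panic surfaces as a returned error, so the program keeps running and goes on to handle the next evaluation. What the caller does with that error (for example, nacking the evaluation) is left out here.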

@tgross tgross force-pushed the recoverable-scheduler-processing branch from a17c49d to 674a1b5 Compare February 4, 2022 18:48
@tgross tgross requested a review from schmichael February 4, 2022 18:48
@tgross tgross requested review from shoenig and jazzyfresh February 4, 2022 18:48
@tgross tgross marked this pull request as ready for review February 4, 2022 18:48
Contributor

@shoenig shoenig left a comment

LGTM; It's been so long I had to read up on how recover works

https://go.dev/blog/defer-panic-and-recover

@tgross tgross merged commit f811169 into main Feb 7, 2022
@tgross tgross deleted the recoverable-scheduler-processing branch February 7, 2022 16:47
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 29, 2022