0.11 panic deregistering job #7757
Comments
Hi @michaeldwan! Sorry to hear about this. I ran into this just an hour ago myself while working on #7708. Thanks for the PR!
Out of curiosity, is there a way to remove an entry from the raft log that's preventing servers from starting?
I can't say that I've ever seen it done, but it's not impossible. It's just much safer to fix it with a patch that's aware of the application's schema so that you don't get dangling cross-references. Our raft implementation uses https://github.com/hashicorp/raft-boltdb as the backing store, and hypothetically it'd be possible to edit the backing store directly (on a stopped server, wiping out the other servers and syncing the results to them manually). There'd be a change to the "bucket" (table) for the object you want to edit, and another to the index table for that object type. It'd be pretty dangerous, so you'd want to back up the store beforehand.
The patch for this will ship in the 0.11.1 release.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.11
Operating system and Environment details
Issue
Our Nomad servers are crashing while applying a delete job request from the raft log.
Based on our logs, a request to deregister and purge a job failed with a 500, though I can't find any more details on why. Within a few seconds we began seeing errors like this on each server:
Followed by panics:
Based on the stack trace, this line (and the corresponding comment above it) was the culprit. We're not using any CSI plugins, and this job had no other plugin configurations.
We were able to bring our servers back online with a patch that checked for nil task groups before ranging.
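The kind of guard described above can be sketched in Go as follows. The offending code is not reproduced in this thread, so the `Job`, `TaskGroup`, and `Task` types, the `PluginID` field, and the `pluginIDs` function are simplified, hypothetical stand-ins rather than Nomad's actual structs or the actual patch. The point is only that each pointer is checked before it is dereferenced or ranged over.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins; not Nomad's real types.
type Task struct {
	Name     string
	PluginID string // empty when the task registers no plugin
}

type TaskGroup struct {
	Name  string
	Tasks []*Task
}

type Job struct {
	ID         string
	TaskGroups []*TaskGroup
}

// pluginIDs collects the plugin IDs referenced by a job.
// A nil job, nil task group, or nil task is skipped instead of
// being dereferenced, which would otherwise panic.
func pluginIDs(job *Job) []string {
	var ids []string
	if job == nil {
		return ids
	}
	for _, tg := range job.TaskGroups {
		if tg == nil {
			continue
		}
		for _, t := range tg.Tasks {
			if t == nil || t.PluginID == "" {
				continue
			}
			ids = append(ids, t.PluginID)
		}
	}
	return ids
}

func main() {
	// A job with one nil task group and one task without a plugin.
	job := &Job{
		ID: "example",
		TaskGroups: []*TaskGroup{
			nil,
			{Name: "web", Tasks: []*Task{{Name: "app"}}},
		},
	}
	fmt.Println(pluginIDs(job), pluginIDs(nil))
}
```

Note that ranging over a nil slice is safe in Go; it is the dereference of a nil pointer (`job.TaskGroups` on a nil `job`, or `tg.Tasks` on a nil `tg`) that panics, which is why the checks sit on the pointers rather than on the slices.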
Reproduction steps
Job file (if appropriate)
Nomad Client logs (if appropriate)
Nomad Server logs (if appropriate)
I can share logs before and after the patch if you need.