[Bug] Panic in nomad server #2583
Comments
@justenwalker Can we get more logs around both panics? Can you also give all the allocations for the node that TTL'd:
Looks like both panics are from
The client error crops up when we delete the alloc folder and start up Nomad. The server error seems to happen randomly at some point, most likely due to network errors.
@justenwalker When you say you delete the alloc dir, are you deleting
On the client node: both
Hmm, okay. Any chance of getting the logs and alloc dump?
@justenwalker Are you running with both client and server enabled?
For posterity's sake, a description of the issue:
The issue is a racy interaction between garbage collection and the scheduler: garbage collection removes an allocation, and then the scheduler updates that allocation's status to a terminal state. Because of how the job is normalized in the plan, this update upserts the allocation without a job. That case was not properly guarded when the allocation was new, which broke an invariant and caused the panics. The PR that fixes this adds the guard and makes the scheduler fail the evaluation instead, forcing it to be retried with updated state and thereby avoiding the issue.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.5.6
Operating system and Environment details
Issue
Nomad server panics under some (as yet unknown) conditions
Reproduction steps
Possibly in a split-brain or network situation - no concrete reproduction steps yet.
Nomad Client logs
Nomad Server logs