Nomad server failed and fails to recover - panic on start #4207
Just upgraded the binary to 0.8.1 and tried again: Nomad v0.8.1 (46aa11b)
I was able to get past this by building a binary with an added check; see PR #4208.
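The comment above is truncated, so the exact condition isn't visible in this thread. For readers landing here, below is a rough, hypothetical sketch of the kind of nil guard described; the names are illustrative only and are not the actual Nomad code or the diff in PR #4208.

```go
package main

import "fmt"

// Deployment stands in for the state object whose absence triggered the panic;
// the real type lives in Nomad's state store and is not reproduced here.
type Deployment struct {
	ID string
}

// restoreDeployments illustrates the defensive pattern: skip nil entries found
// in persisted state instead of dereferencing them and panicking on startup.
func restoreDeployments(entries []*Deployment) []string {
	var restored []string
	for _, d := range entries {
		if d == nil {
			continue // tolerate the bad state rather than crash
		}
		restored = append(restored, d.ID)
	}
	return restored
}

func main() {
	state := []*Deployment{{ID: "web"}, nil, {ID: "api"}}
	fmt.Println(restoreDeployments(state)) // prints [web api]
}
```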
Thanks for the PR, and the detailed description. Would you mind also including your agent and job configuration? We would like to reproduce on our end as well. Were you starting from a fresh cluster state?
We were not. This cluster was active and had tens of jobs running, with over 100 allocations. We did have an autoscaling agent here (Replicator), which is responsible for scaling jobs and nodes in and out. I believe draining an allocation and terminating a node consecutively caused the bad state, though that's a pretty normal operation that has worked hundreds of times over the last few months.
@chelseakomlo Looks like that deployment is in that bad state again. I need to fix it by doing what I did yesterday with the dev binary, but I should have logs to help out if needed.
This is where it all came crashing down; not sure if these logs are useful, though...
This server then retried elections unsuccessfully until I corrected it a few minutes ago. One server reported:
The third and final server reported:
I believe 10.2.11.3 was the leader, as it had the most recent log line stating:
vs
and
EDIT:
Only after 10.2.19.112 crashed did 10.2.11.3 acquire leadership.
Just received this in a production cluster =(
We're working on getting a release out with this fix in it soon (on the order of days).
One thing to add: one of the Nomad clients was, for some reason, added to the Raft peer list:
Update:
How old are we talking, @burdandrei? Terminated just hours/days ago? Is this something the garbage collector usually catches, but the IP was reused too soon?
The IP was reused within hours.
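For anyone cleaning up the same symptom, the peer set can be inspected and the stray entry dropped through the operator API. Here is a minimal sketch using the Go API client; the stray address is a placeholder, and the `nomad operator raft list-peers` / `nomad operator raft remove-peer` CLI commands cover the same ground.

```go
package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	// Talks to the agent pointed at by NOMAD_ADDR (defaults to 127.0.0.1:4646).
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List the current Raft peers so a stray client address stands out.
	cfg, err := client.Operator().RaftGetConfiguration(nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range cfg.Servers {
		fmt.Printf("node=%s address=%s leader=%t voter=%t\n", s.Node, s.Address, s.Leader, s.Voter)
	}

	// Drop the peer that should not be there. The address below is a
	// placeholder, not one of the hosts from this thread.
	const strayPeer = "10.0.0.99:4647"
	if err := client.Operator().RaftRemovePeerByAddress(strayPeer, nil); err != nil {
		log.Fatal(err)
	}
}
```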
Nomad version
Nomad v0.8.0 (751b42a)
Operating system and Environment details
Issue
We had a development (thank goodness) environment go down all of a sudden. Not really sure what happened, but none of the servers are able to establish leadership. Attempting to restart Nomad shows the following panic:
Can't get a leader up, can't get things running, and Nomad outage recovery doesn't work either. Looks like there's bad state somewhere.
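For context, manual outage recovery boils down to stopping every server and writing a peers.json file into each server's Raft directory before restarting. Below is a minimal sketch of generating that file, assuming the default data_dir layout and the plain address-array format of Raft protocol version 2; the paths and addresses are placeholders, not the hosts from this issue.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Placeholders: the real data_dir comes from the agent config, and the
	// peers are the RPC addresses (port 4647) of the surviving servers.
	dataDir := "/var/nomad"
	peers := []string{"10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"}

	out, err := json.Marshal(peers)
	if err != nil {
		log.Fatal(err)
	}

	// Every server must be stopped before this file is written; each server
	// reads peers.json once on startup and then removes it.
	target := filepath.Join(dataDir, "server", "raft", "peers.json")
	if err := os.WriteFile(target, out, 0o600); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote %s", target)
}
```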
Reproduction steps
Not really sure how to reproduce this; we found the cluster already in this state.