nomad server panic: runtime error: invalid memory address or nil pointer dereference #4463
Comments
I think I might have it narrowed down to a job in our cluster causing it - but I'm not sure how to delete/kill this job since the servers aren't up long enough for me to stop it.
@dcparker88 I'm looking into this now. What about the job makes you think it's causing it?
I might be way off - but I turned on debug logs, and it lists out jobs but never lists out a batch job that we have. The batch job also seems to be "flapping" - appearing and disappearing in the job status list, etc. This could just be a symptom of the nomad servers constantly restarting, however.
here is a full log coming from a clean start (deleted everything in the nomad data dir and started it fresh)
Looking at the code, it seems to look up a node based on the id of the allocation (nomad/structs/structs.go, line 1431 in 1eedb77). Some inconsistency somewhere; to get your servers (probably) running again you could add a nil check around that lookup.
But maybe hashicorp wants you to try other stuff first. :)
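For illustration, here is a minimal, self-contained Go sketch of the kind of nil guard being suggested. All of the names (Node, Alloc, lookupNode, nodeForAlloc) are hypothetical stand-ins, not the actual code at that line in Nomad; the point is only that a state-store lookup which can return a nil node must be checked before the result is dereferenced:

```go
package main

import (
	"errors"
	"fmt"
)

// Node and Alloc stand in for Nomad's structs.Node and structs.Allocation;
// this whole sketch is illustrative, not the actual Nomad source.
type Node struct {
	ID         string
	Datacenter string
}

type Alloc struct {
	ID     string
	NodeID string
}

// lookupNode mimics a state-store lookup that returns (nil, nil) when the
// node is simply absent, rather than treating absence as an error.
func lookupNode(nodes map[string]*Node, id string) (*Node, error) {
	return nodes[id], nil
}

// nodeForAlloc shows the guard: check the lookup result for nil before
// dereferencing it.
func nodeForAlloc(nodes map[string]*Node, alloc *Alloc) (*Node, error) {
	node, err := lookupNode(nodes, alloc.NodeID)
	if err != nil {
		return nil, err
	}
	if node == nil {
		// Without this branch, any later access such as node.Datacenter
		// panics with "invalid memory address or nil pointer dereference",
		// matching the crash reported in this issue.
		return nil, errors.New("node " + alloc.NodeID + " not found in state store for alloc " + alloc.ID)
	}
	return node, nil
}

func main() {
	nodes := map[string]*Node{"n1": {ID: "n1", Datacenter: "dc1"}}
	if _, err := nodeForAlloc(nodes, &Alloc{ID: "a1", NodeID: "missing"}); err != nil {
		fmt.Println("guarded:", err)
	}
}
```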
@dcparker88 I think I found the bug, but I don't have a workaround for fixing your state yet (and I'm not sure if I will, but I'm trying!). I think a job (maybe the problematic one you mentioned) is trying to get scheduled to a node that for some reason doesn't exist in the state store. You could ultimately fix this by stopping all nomad server nodes, wiping the datadir, and starting them back up. I'll let you know as I get more info.
thanks - resetting the data dir on all my servers did work. I lost all my jobs - but that's ok for now since we can recreate them quickly in terraform.
@dcparker88 glad you're working again and it wasn't too much of an impact; never a route you should have to take, though. The current hypothesis is that it's related to sticky volumes. Did the job you mentioned have a sticky-enabled volume by chance?
one of our jobs does, yes. The one I thought was the cause did not, but again I might be wrong about which job it actually was. The sticky job also has a distinct_hosts constraint turned on.
@dcparker88 Can you please include the job files for the one which requires sticky volumes, and the other job that you thought was suspect, mentioned above?
yeah - here is the relevant group (with the sticky volumes): https://gist.github.com/dcparker88/2f450f8976a43490db0654e738b4e5ba
the one I thought was potentially causing it is here: https://gist.github.com/dcparker88/705effd1b374bfc51399e3c54f25e571
If you have a question, prepend your issue with [question] or preferably use the nomad mailing list. If filing a bug please include the following:
Nomad version
Nomad v0.8.3 (c85483d)
Operating system and Environment details
Linux nomad-97d52edaa6767264 2.6.32-696.30.1.el6.centos.plus.x86_64 #1 SMP Wed May 23 20:32:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Issue
Our Nomad cluster went into a weird state over the weekend: all 3 servers started crashing on startup with the panic in the title (runtime error: invalid memory address or nil pointer dereference).
The servers join together in a cluster, and a leader is elected, but the Nomad boxes crash instantly afterward.
peers.json recovery doesn't seem to work either; it crashes with the same error. I am assuming I can fix this by fully cleaning my data-dir and restarting, but ideally we wouldn't need to do that.
Reproduction steps
This is the only time this has happened to us, so I'm not sure what the reproduction steps would be.
Nomad Server logs (if appropriate)
posted above - can post more if needed.
Nomad Client logs (if appropriate)
Job file (if appropriate)