Nomad 0.5.1 rc2 panic - alloc runner #2089
Comments
Hey @justinwalz, thanks for filing this! Would you mind getting the logs of the clients/servers from when they were upgraded to when this behavior was exhibited? Thanks,
I just rotated the machines out to get our cluster back online. For the most part, the logs looked fine; see the original post for the client machine errors. Server logs are really long; if you grep for ERR and WARN, you see:
What I wanted to get a glimpse at is what the clients looked like before. Did they just come up, were they just restarted, or were they just sitting there when the panics happened?
Pretty much just sitting there. They had been up for about 6 hours, and everything was running smoothly (host metrics and file usage), with not much in the logs up until that point. If/when it happens again I'll zip it up and attach.
I'm also getting this error on Nomad version v0.5.0
This PR should fix it. We cut a release with this fix; would you mind testing with it?
@dadgar Sure, I'll roll it out to our test env right now.
@dadgar To follow up, things look alright with dev running 0.5.2-rc1. We'll continue to monitor, but I will close for the time being. Thanks.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Hi,
There’s a problem with nomad-0.5.1-rc2. I was testing the bug fix (ref #2024); I am a colleague of @sheldonkwok.
All but one of our nomad clients have exhibited this behavior - I’ve pasted the relevant logs for three unique cases. Happy to provide more info, just let me know what you need. This also happened for a wide variety of services (not specific to one job), I just replaced the names with .
-Justin
/cc @dadgar @diptanu
Nomad version
$ nomad -v
Nomad v0.5.1-rc2 ('6f2ccf22be738a31cb2153c7e43422c4ba9a0e3f+CHANGES')
Operating system and Environment details
$ uname -a
Linux ip-10-XXX-X-XX 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Reproduction steps
After around 6 hours successfully running the release candidate, this occurred simultaneously on most machines.
Nomad Server logs (if appropriate)
At crash time, nothing unexpected; mostly this:
nomad.heartbeat: node 'b1c470b9-3129-537e-ffba-10f5ebd0c1d2' TTL expired
Nomad Client logs (if appropriate)
See below
Machine A
When trying to remove the alloc_dir of one with issues…
When trying to remove the entire /var/lib/nomad directory (all four were different apps).