0.6.0 [master] - Nil job on allocation #2605
Comments
I am not running in dev mode, but I do have every node configured as both a client and a server in a 5-node cluster.
@clinta can you share the commit you are running?
When we first encountered this issue I was running a build of 976f58b. Since then I have tried running d1bb92e, which is v0.5.6 with #2535 cherry-picked. I continued to get this nil-pointer exception, but it was still caused by the allocation that was created under 0.6, so at this point I don't know whether v0.5.6 is capable of creating the broken allocation or not. Right now I'm running 2000f61, which is a workaround for the nil pointer issue so that we can keep this cluster running. Also, I've looked at the state.json for the broken allocation and it appears to have the correct job data. I can share that privately if it would help.
Okay, thanks for sharing! I believe it is all server side, so the state.json won't be of too much help, but thank you!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Somehow a nil job is being attached to an allocation, which can cause various panics.
One such report:
I'm running from master, because I need the feature in #2535 which has not yet made it into a release.
Here are some logs that provide context for the allocation that got into this state. It was for the task group hdfs-namenode2. This container wouldn't start due to an issue with a Docker volume driver. It appears that after it failed to start, Nomad did not remove the container and tried to create it again, with the subsequent create operations failing because the old container already existed.