Nomad doesn't re-schedule jobs if Docker daemon is restarted #629
Yeah, this will hopefully be fixed by Nomad 0.3, and in a generic way too. We currently don't migrate failed allocs to new nodes, and once we do, this problem should be fixed.
@dadgar is this still on track for 0.3?
No, this will not be fixed in 0.3. We decided not to tackle server-side restarts in this release.
@dadgar, Nomad not guaranteeing that the requested number of instances is always running is the only thing stopping me from using Nomad in production instead of Kubernetes. Are there any plans to fix this in the short/mid term?
@c4milo So I just restarted a Docker daemon that was running containers started by Nomad. Nomad detected it and restarted the task locally.
There are two separate issues:
@ketzacoatl: Nomad actually does both of these. There is a node-failure detection window: once a node hasn't talked to the servers for a period of time, Nomad will reschedule what it was running.
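(Editor's note: in later Nomad versions, from roughly 0.8 onward, this server-side rescheduling became configurable per task group via a `reschedule` stanza. The job and group names below are illustrative, not from this thread; a sketch under that assumption:)

```hcl
job "example" {
  group "web" {
    # Reschedule failed allocations onto other eligible nodes.
    # These attribute names follow the Nomad 0.8+ job specification.
    reschedule {
      delay          = "30s"          # wait before the first reschedule attempt
      delay_function = "exponential"  # back off between attempts
      max_delay      = "1h"
      unlimited      = true           # keep trying rather than giving up
    }
  }
}
```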
@c4milo can you reproduce the behavior you were experiencing? If not, we should probably close this.
@dadgar I haven't tested but feel free to close it. I will re-open it if I'm able to reproduce.
When I last tested, Nomad did not reschedule the job from a node that failed. Maybe that has changed; I will squeeze in a test when I can. But I agree that is not what this issue is really about.
Cool! Thanks guys!
Original issue description: I've been testing the ability of a cluster to recover its state, especially when the Docker daemon is restarted. But every time this happens, the allocations move to "failed" status and Nomad never restarts them, not even if I re-run the job. To fix it, I had to stop the job and start over.