System scheduler does not support rescheduling #4267
Hi @shantanugadgil. Thank you so much for your patience - wondering if this is still an issue. I have tested a failing system job and it seems to be honoring the restart policy now, the same way as a service job. When the alloc start attempts fail, Nomad will stop attempting to start the alloc any further until a new job is posted or an evaluation is forced. The number of restarts is controlled by the restart policy. With service and batch jobs, Nomad attempts to reschedule failed allocations according to the reschedule stanza.

With the 0.12.0 binary, I have tested a system job with a bad image with (and without) a custom restart policy. I noticed that the allocation is restarted a few times. Once an allocation attempts more times than specified in the restart stanza, we mark the allocation as failed and never attempt to restart again until a job update. Does this behavior match your expectation? Are you suggesting that we default system jobs to restart an infinite number of times (with a delay) rather than giving up at some point?
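(For concreteness, a minimal sketch of the kind of job being discussed - a system job pulling from ECR with a custom restart policy. The job name, image reference, and values are illustrative, not taken from this thread:)

```hcl
job "example" {
  datacenters = ["dc1"]

  # System jobs run one allocation on every eligible client node.
  type = "system"

  group "app" {
    # The restart stanza controls restart attempts on the local node.
    restart {
      attempts = 3      # restart up to 3 times...
      interval = "10m"  # ...within a rolling 10-minute window
      delay    = "15s"  # pause between restart attempts
      mode     = "fail" # mark the alloc failed once attempts are exhausted
    }

    task "web" {
      driver = "docker"

      config {
        # Hypothetical ECR image; a bad reference here reproduces the
        # failed-pull scenario described in this issue.
        image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest"
      }
    }
  }
}
```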
@notnoop thanks for the follow-up, but the original use case that I was doing this for no longer exists. I have seen a few bugs around the main error (expired AWS ECR login) being fixed, though I wasn't sure until now whether the fixes applied to system jobs as well.

As far as I can remember, my original intent was a request for reschedule support for system jobs.
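(For context, rescheduling - placing a failed allocation on a different node, as opposed to restarting it in place - is configured per group for service and batch jobs. A sketch with illustrative values:)

```hcl
group "app" {
  # The reschedule stanza moves a failed allocation to another node.
  # It is honored for service and batch jobs, but not for system jobs,
  # which is the gap this issue was originally about.
  reschedule {
    attempts       = 10
    interval       = "1h"
    delay          = "30s"
    delay_function = "exponential" # back off between reschedule attempts
    max_delay      = "10m"
    unlimited      = false
  }
}
```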
Thanks for following up. Indeed, if the system job should retry indefinitely, you can set an appropriate restart policy. If it's OK with you, I'll close the ticket. We'd welcome new tickets if you or anyone else notices something odd. Thank you so much again for your patience.
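(A sketch of such a policy, with illustrative values: with mode = "delay", Nomad does not mark the allocation as failed once attempts are exhausted; it waits out the rest of the interval and keeps retrying, so the task is effectively restarted indefinitely:)

```hcl
restart {
  attempts = 2
  interval = "5m"
  delay    = "30s"
  # With mode = "delay", once attempts are used up Nomad waits for the
  # remainder of the interval and then restarts again, instead of
  # giving up as mode = "fail" would.
  mode     = "delay"
}
```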
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.8.3 (c85483d)
Operating system and Environment details
CentOS 7.4+
Issue
A Docker system job stays stuck in a failed state if the Docker image pull from an ECR repo fails.
Reproduction steps
Nomad: 0.8.3
Driver: "docker"
Job Type: "system" / "service"
My Nomad client is correctly set up to use
~/.docker/config.json
as the authentication information file. I have NOT done a docker login from the client yet.
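(For reference, a sketch of the agent configuration that points the Docker driver at a credentials file on Nomad 0.8; the path is illustrative:)

```hcl
client {
  enabled = true

  options {
    # Tell the Docker driver where to find registry credentials.
    "docker.auth.config" = "/root/.docker/config.json"
  }
}
```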
I submit a system job (Docker); this tries to start on the client but is then stuck in an
"irrecoverable error".
Now I perform a "docker login" on the client machine.
I expect that the job will start soon, but it never does.
(I suspect the "restart" stanza is not applicable to "system" jobs.)
Doing the usual set of things, like restarting the Nomad client or running "nomad-helper reevaluate-all", works.
(https://github.com/seatgeek/nomad-helper#reevaluate-all)
Further experiment:
I changed the job type to "service" and noticed that the "restart" stanza I had specified does come into effect.
Though I observed what I think is unexpected behavior:
Nomad doesn't seem to "pull + run" on every attempt.
It only seems to try to start the image (which doesn't exist locally) and keeps failing until the end of the current interval.
When the next interval starts, the "pull + run" works correctly.
So, am I missing something obvious, or is there any simpler way to make the system job restart?
Also, is my observation about the pull + run accurate?
Script to reset docker information during testing
Regards,
Shantanu