System scheduler does not support rescheduling #4267

Closed
shantanugadgil opened this issue May 8, 2018 · 4 comments

Comments

@shantanugadgil
Contributor

Nomad version

Nomad v0.8.3 (c85483d)

Operating system and Environment details

CentOS 7.4+

Issue

A Docker system job stays stuck in a failed state if the Docker image pull from an ECR repo fails.

Reproduction steps

Nomad: 0.8.3
Driver: "docker"
Job Type: "system" / "service"

My Nomad client is correctly set up to use ~/.docker/config.json as the information file.
I have NOT done a docker login from the client yet.
I submit a system job (docker); it tries to start on the client but then gets stuck in an "irrecoverable error".
Now I perform a "docker login" on the client machine.

I expect the job to start soon, but it never starts.

(I suspect the "restart" stanza is not applicable for "system" jobs)

Doing the usual set of things like restarting the Nomad client, or running "nomad-helper reevaluate-all" works
(https://github.com/seatgeek/nomad-helper#reevaluate-all)

Further experiment:
I changed the job type to "service" and I noticed that the "restart" stanza which I have specified does come into effect.

Though, I observed an unexpected behavior (I think):
Nomad doesn't seem to "pull + run" on every attempt.
It only seems to try and start the image (which doesn't exist) and keeps failing till the end of the current interval.
When the next interval starts, the "pull + run" works correctly.

So, am I missing something obvious, or is there any simpler way to make the system job restart?
Also, is my observation about the pull + run accurate?

Script to reset docker information during testing

# Stop Docker, wipe its local images and the stored registry credentials, then start it again.
systemctl stop docker
rm -rf /var/lib/docker/
rm -rf ~/.docker/
systemctl restart docker

Regards,
Shantanu

@dadgar changed the title from "A Docker system job stays stuck in failed mode if the docker image pull failed from an ECR repo." to "System scheduler does not support rescheduling" on May 8, 2018
@notnoop
Contributor

notnoop commented Jul 15, 2020

Hi @shantanugadgil. Thank you so much for your patience. I'm wondering if this is still an issue. I have tested a failing system job and it seems to honor the restart policy now, the same way a service job does. Once the allowed start attempts have failed, Nomad stops trying to start the alloc until a new job is posted or an evaluation is forced.

The number of restarts is controlled by the restart policy. With service and batch jobs, Nomad attempts to reschedule failed allocations according to the reschedule stanza. But system jobs are meant to run one alloc on every eligible node: if the alloc on a single node fails, it shouldn't be rescheduled onto another node, which should already be running its own.
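
To illustrate the distinction, here is a minimal sketch of a service job that carries both stanzas. The job, group, and task names and the image are hypothetical, and the values are assumed from the service-job defaults as I understand them from the Nomad documentation; they may differ between versions.

```hcl
job "example-service" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    # restart: governs local restarts of the task on the SAME node.
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail" # give up locally once attempts are exhausted
    }

    # reschedule: governs placing a replacement alloc, possibly on
    # ANOTHER node, after the local restarts have been exhausted.
    reschedule {
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "1h"
      unlimited      = true
    }

    task "app" {
      driver = "docker"

      config {
        image = "example.com/my-image:latest" # hypothetical image
      }
    }
  }
}
```

For a system job only the restart stanza is relevant, since each node already runs its own allocation.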

Using the 0.12.0 binary, I tested a system job with a bad image, with and without a custom restart policy. I noticed that the allocation is restarted a few times. Once an allocation has been attempted more times than the restart stanza allows, we mark the allocation as failed and never attempt to restart it again until a job update.

Does this behavior match your expectation? Are you suggesting that we default system jobs to restart infinite times (with a delay) rather than giving up at some point?

@shantanugadgil
Contributor Author

@notnoop thanks for the follow-up, but the original use case I was doing this for no longer exists.

I have seen a few bugs around the main error (expired AWS ECR login) being fixed, though I wasn't sure until now whether those fixes applied to system jobs as well.

> Does this behavior match your expectation? Are you suggesting that we default system jobs to restart infinite times (with a delay) rather than giving up at some point?

As far as I can remember, my original intent was a request for system jobs NOT to give up on retries (again, I'm hazy on the exact requirement), but I think the solution could be as simple as specifying the "appropriate" restart stanza to make it retry infinitely, right? (With a decent backoff time.)

@notnoop
Contributor

notnoop commented Jul 15, 2020

Thanks for following up. Indeed, if the system job should retry indefinitely, you can set an appropriate restart stanza. Potentially, restart { mode = "delay" } might suffice - Nomad would re-attempt restarting the system job after 30m (the default interval).
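
As a rough sketch of what such a stanza could look like in a system job (the job, group, and task names, the image, and the specific values are hypothetical and chosen only for illustration):

```hcl
job "example-system" {
  datacenters = ["dc1"]
  type        = "system"

  group "app" {
    restart {
      attempts = 3     # restarts allowed within one interval
      interval = "10m" # window in which attempts are counted
      delay    = "30s" # wait between consecutive restarts
      mode     = "delay" # when attempts run out, wait for the next
                         # interval and try again instead of failing
    }

    task "app" {
      driver = "docker"

      config {
        image = "example.com/my-image:latest" # hypothetical image
      }
    }
  }
}
```

With mode = "delay" the allocation is not marked as failed once the attempts are used up; the client keeps retrying in each new interval, so a shorter interval means a shorter wait before the next round of attempts.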

If it's OK with you, I'll close the ticket. We'd welcome new tickets if you or anyone else notices something odd. Thank you so much again for your patience.

@github-actions

github-actions bot commented Nov 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators on Nov 4, 2022