System scheduler does not support rescheduling #4267

shantanugadgil · 2018-05-08T19:52:22Z

Nomad version

Nomad v0.8.3 (c85483d)

Operating system and Environment details

CentOS 7.4+

Issue

A Docker system job stays stuck in failed mode if the docker image pull failed from an ECR repo.

Reproduction steps

Nomad: 0.8.3
Driver: "docker"
Job Type: "system" / "service"

1. My Nomad client is correctly set up to use ~/.docker/config.json as the information file.
2. I have NOT done a docker login from the client yet.
3. I submit a system job (docker); this tries to start on the client but is then stuck in an "irrecoverable error".
4. Now I perform a "docker login" on the client machine.

I expect that job will start soon, but it never starts.

(I suspect the "restart" stanza is not applicable for "system" jobs)

The usual workarounds, such as restarting the Nomad client or running "nomad-helper reevaluate-all", do get the job started
(https://github.com/seatgeek/nomad-helper#reevaluate-all)

Further experiment:
I changed the job type to "service" and I noticed that the "restart" stanza which I have specified does come into effect.

Though, I observed an unexpected behavior (I think):
Nomad doesn't seem to "pull + run" on every attempt.
It only seems to try and start the image (which doesn't exist) and keeps failing till the end of the current interval.
When the next interval starts, the "pull + run" works correctly.

So, am I missing something obvious, or is there any simpler way to make the system job restart?
Also, is my observation about the pull + run accurate?

Script to reset docker information during testing

systemctl stop docker
rm -rf /var/lib/docker/
rm -rf ~/.docker/
systemctl restart docker

Regards,
Shantanu


notnoop · 2020-07-15T16:09:24Z

Hi @shantanugadgil. Thank you so much for your patience. I am wondering if this is still an issue. I have tested a failing system job and it seems to be honoring the restart policy now, the same way as a service job. Once the alloc start attempts have failed, Nomad stops attempting to start the alloc until a new job is posted or an evaluation is forced.

The number of restarts is controlled by the restart policy. With service and batch jobs, Nomad attempts to reschedule failed allocations according to the reschedule stanza. But system jobs are meant to run one alloc on every eligible node: if the alloc on a single node fails, it shouldn't be rescheduled onto another node, which should already be running its own copy.
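
To make the restart/reschedule distinction concrete, here is a minimal job-spec sketch for a service job that sets both stanzas. The job, group, and task names and the image are hypothetical placeholders. The reschedule values are, to my knowledge, the documented defaults for service jobs, and the restart values are only illustrative, so check them against the docs for your Nomad version.

```hcl
# Hypothetical service job; all names and the image are placeholders.
job "example-service" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    # restart: local retries of the task on the same node.
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail" # mark the alloc as failed once attempts are exhausted
    }

    # reschedule: after the alloc is marked failed, the scheduler creates a
    # replacement alloc, preferably on a different node.
    # This applies to service and batch jobs, not to system jobs.
    reschedule {
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "1h"
      unlimited      = true
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest" # placeholder image
      }
    }
  }
}
```

For a system job only the restart stanza is relevant, which is why a failed system alloc stays failed once the restart attempts run out.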

With the 0.12.0 binary, I have tested a system job with a bad image, both with and without a custom restart policy. I noticed that the allocation is restarted a few times. Once an allocation has been attempted more times than specified in the restart stanza, we mark the allocation as failed and never attempt to restart it again until a job update.

Does this behavior match your expectation? Are you suggesting that we default system jobs to restart infinite times (with a delay) rather than giving up at some point?

shantanugadgil · 2020-07-15T16:48:45Z

@notnoop thanks for the followup, but the original use-case that I was doing this for, no longer exists.

I have seen a few bugs fixed around the main error (expired AWS ECR login), though I wasn't sure until now if the fixes applied to system jobs as well.

> Does this behavior match your expectation? Are you suggesting that we default system jobs to restart infinite times (with a delay) rather than giving up at some point?

As far as I can remember, my original intent was to request that system jobs NOT give up on retries (again, I'm hazy on the exact requirement), but I think the solution could be as simple as specifying the "appropriate" restart stanza to make it retry indefinitely, right? (with a decent backoff time)

notnoop · 2020-07-15T18:16:29Z

Thanks for following up. Indeed, if the system job should retry indefinitely, you can set an appropriate restart stanza. Potentially, restart { mode = "delay" } might suffice - Nomad would re-attempt restarting the system job after 30m (the default interval).
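
As a rough sketch of what that could look like in a full job file (the job, group, and task names and the image are hypothetical, and the attempts/interval/delay values are only illustrative, not a recommendation):

```hcl
# Hypothetical system job; all names and the image are placeholders.
job "example-system" {
  datacenters = ["dc1"]
  type        = "system"

  group "agent" {
    restart {
      attempts = 3     # restarts allowed within one interval
      interval = "30m" # window over which attempts are counted
      delay    = "15s" # wait between individual restarts
      # With mode = "delay", once the attempts are used up within the
      # interval, Nomad waits for the next interval and tries again,
      # instead of marking the alloc as failed (mode = "fail").
      mode = "delay"
    }

    task "agent" {
      driver = "docker"

      config {
        image = "example/agent:latest" # placeholder image
      }
    }
  }
}
```

With this policy a node that cannot pull the image yet (for example, before a docker login) keeps retrying each interval rather than giving up.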

If it's OK with you, I'll close the ticket. We'd welcome new tickets if you or anyone notice something odd. Thank you so much again for your patience.

github-actions · 2022-11-04T02:38:42Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

dadgar added type/enhancement theme/system-scheduler labels May 8, 2018

dadgar changed the title ~~A Docker system job stays stuck in failed mode if the docker image pull failed from an ECR repo.~~ System scheduler does not support rescheduling May 8, 2018

preetapan mentioned this issue Jul 10, 2018

Nomad does not try to restart a system job that failed due to the template renderer #4484

Closed

shantanugadgil mentioned this issue Jul 19, 2019

System scheduler blocked evals #5900

Merged

1 task

yishan-lin mentioned this issue Jun 29, 2020

Batch System Jobs #2527

Closed

shantanugadgil closed this as completed Jul 15, 2020

github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2022
