"docker image does not exist" should be a recoverable error #1406

Closed
camerondavison opened this issue Jul 11, 2016 · 20 comments
Comments

@camerondavison
Contributor

Currently this line in the Docker driver:

if imageNotFoundMatcher.MatchString(err.Error()) {

makes it so that "Error: image .+ not found" is considered an unrecoverable error. I personally feel like anything that is outside of Nomad's realm of control should be considered recoverable. The image could show up later.
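
A minimal sketch of the inversion being proposed, assuming a hypothetical recoverableError wrapper rather than Nomad's actual internal error types (only the matcher regex comes from the code above; everything else is illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

// recoverableError is a hypothetical wrapper type used for illustration;
// Nomad has its own notion of recoverable errors, and this sketch does not
// assume its exact API.
type recoverableError struct {
	err         error
	recoverable bool
}

func (r *recoverableError) Error() string { return r.err.Error() }

var imageNotFoundMatcher = regexp.MustCompile(`Error: image .+ not found`)

// classifyDockerError wraps an "image not found" error as recoverable so a
// client could retry the pull later instead of failing the allocation
// permanently, which is the behavior change this issue is asking for.
func classifyDockerError(err error) error {
	if imageNotFoundMatcher.MatchString(err.Error()) {
		return &recoverableError{err: err, recoverable: true}
	}
	return err
}

func main() {
	err := fmt.Errorf("Error: image redis:3.2 not found")
	if r, ok := classifyDockerError(err).(*recoverableError); ok && r.recoverable {
		fmt.Println("recoverable:", r.Error())
	}
}
```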

Treating most errors that are outside of Nomad's realm of control as recoverable could also help with #1191. That error should still be fixed to minimize entropy and false-alarm errors, but it would be mostly mitigated in that the task would just retry and probably bind a new port correctly.

@dadgar
Contributor

dadgar commented Jul 11, 2016

Hey,

If you are comfortable, please make a PR with tests for this. I commented on #1191; I don't think this will help with that.

@camerondavison
Contributor Author

Ah yes, correct. This issue was to fix one specific type of error so that it is considered recoverable.

But I was saying that if Nomad were to take the stance that errors outside its realm of control (such as Docker port collisions) are recoverable, that would help with #1191. Right now errors have to be explicitly marked as recoverable, but I think it should be the other way around: mark a small subset of errors as unrecoverable.

@dadgar
Contributor

dadgar commented Jul 11, 2016

@a86c6f7964 There is a very real trade-off: if the client keeps retrying on a machine that will never be able to recover (docker daemon broken, cgroups not mounted, can't access docker repo), then the task won't be rescheduled onto another node and it will cause unnecessary delays. I think we should be as accurate as possible when marking things as recoverable.

@diptanu
Contributor

diptanu commented Jul 11, 2016

@a86c6f7964 @dadgar Guys, I don't think this should be a recoverable error. I have seen people make mistakes countless times when writing the name of the image, so making this recoverable is just going to prolong the time it takes users to correct the mistake.

I understand that the image might take some time to appear, but that happens when an image is first published and is not the norm.

@dadgar
Contributor

dadgar commented Jul 11, 2016

Actually, thinking about this more, I agree with @diptanu.

@diptanu
Contributor

diptanu commented Jul 11, 2016

@a86c6f7964 If you are concerned about the initial rollout process from your CI/CD pipeline, you could use the Docker Registry HTTP API to see if the layers are available and then deploy the job on Nomad.
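
For reference, such a pre-flight check against the Docker Registry HTTP API v2 could look roughly like the sketch below: it issues a HEAD request for the image manifest and treats a 200 as "available". The registry URL, repository, and tag are placeholders, and authentication (which Docker Hub and most private registries require) is omitted:

```go
package main

import (
	"fmt"
	"net/http"
)

// imageAvailable asks a Docker Registry v2 endpoint whether a manifest for
// repo:tag exists before the job is submitted to Nomad.
func imageAvailable(registry, repo, tag string) (bool, error) {
	url := fmt.Sprintf("%s/v2/%s/manifests/%s", registry, repo, tag)
	req, err := http.NewRequest(http.MethodHead, url, nil)
	if err != nil {
		return false, err
	}
	// Request the v2 manifest media type; some registries return 404 for
	// HEAD requests without it.
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	// Placeholder registry and image; substitute your own.
	ok, err := imageAvailable("https://registry.example.com", "myteam/myapp", "v1.2.3")
	if err != nil {
		fmt.Println("registry check failed:", err)
		return
	}
	fmt.Println("image available:", ok)
}
```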

@camerondavison
Contributor Author

Sorry for the long reply, just trying to get my thoughts out there.

docker daemon broken, cgroups not mounted, can't access docker repo

These all seem recoverable to me. I would not want my service to go into a failed state and never get started again just because a machine crashed at the same time the Docker repo was down. Maybe I am missing something: once an allocation for a task goes into the failed state, is there any way to get it placed on another machine? What I have noticed is that if something like "docker daemon broken" happens on a machine, that allocation is marked as failed, never to be allocated to another node and never to be started again until I resubmit the job.

I understand that the image might take some time to appear, but that happens when an image is first published and is not the norm.

I have seen this happen sometimes if there is a hiccup talking to the docker registry.

Maybe an example will make it a little clearer how I am thinking about this, and help pinpoint where the disconnect is.

I have 3 machines running the nomad server cluster. I submit a service job to that cluster. I would like that job to run forever. I never want to have to submit that job ever again.

Now, six months from now, some misconfiguration happens and one machine is pointed at an internal Docker registry that happens to not have the image, or is stale, or is buggy, or has its clock out of sync, who knows. Maybe that machine even OOMs or has a corrupt disk; there are just so many problems that can happen. This causes a momentary hiccup in the service when it tries to run on a new machine. Given the current state of the world, I would have to intervene in the system, since the job would transition into a failed state (never to run again) until I resubmit it.

None of this is hypothetical. My Nomad cluster currently runs 328 jobs, each with one group containing one service task with a count of 2 (for HA), and about every week I have to resubmit jobs to clear some state where things transitioned into a failed state.

I have seen people make mistakes countless times when writing the name of the image, so making this recoverable is just going to prolong the time it takes users to correct the mistake.

Why would this make it take longer? There would still be a task that is flapping and never getting to the running state. It would be super easy to look at the events for something that is not running and see why it is continually flapping. Is it really better to help out these typos at the cost of a less stable system?

@dadgar
Contributor

dadgar commented Jul 12, 2016

Hey,

So I think this may be trying to paper over a deficiency in Nomad that we are aware of and will be tackling: server-side restarts. Hopefully my other comments on why you want to transition to failed will make sense after that.

Right now, when an allocation fails on the client it is marked as failed and, as you have said, no action is taken until the user resubmits the job or forces Nomad to re-evaluate it. What we need is a server-side restart policy that will react to failed allocations and replace them (preferably on a different machine).

Let me know if that makes sense

@camerondavison
Contributor Author

That totally makes sense. I am fine taking that approach instead. I did not understand that y'all were intending failures to be localized to one machine. Is there an issue for that I can either work on or at least subscribe to?


@camerondavison
Contributor Author

I would be interested to know whether y'all would then consider an "image does not exist" error to be recoverable server side, though. Does that address @diptanu's concern about typo'd image names?

@dbresson

@dadgar don't forget about system jobs. They should be retried on all hosts that are having failures.

@camerondavison
Contributor Author

Are y'all still waiting for me to reply?

@dadgar
Contributor

dadgar commented Jul 22, 2016

Hey, sorry for the delay. I do not think you would want to treat the image not existing as recoverable server side either; only something that could work on a new node should be.

Closing this in favor of: #1461

@dadgar dadgar closed this as completed Jul 22, 2016
@camerondavison
Contributor Author

I do not think you would want to treat the image not existing as recoverable server side either; only something that could work on a new node should be.

I know this is a late comment, but I want to go ahead and state my complete disagreement with the above statement. As I have already stated:

"image does not exist" could work on a new node because of:

  • replication lag in the distributed Docker registry (where one node has the image but another does not)
  • the user running the Docker registry on that node as a proxy/cache, or pointing it at an S3 location of another registry (something I have personally thought about doing), and that node's registry being misconfigured
  • a Docker bug (which I have personally hit several times but have been unable to reproduce reliably) where it randomly returns a 404 every once in a while (I think it is probably timing out on something and should return a 500, but instead returns a 404)

The fact that I can enumerate three possibilities here goes to show how hard it is going to be to reliably get valid results from an outside system. That is why I continue to personally think that any time Nomad interacts with an outside system it should assume that it can fail at any time, and just keep retrying forever until it works or a sysadmin intervenes.

I personally think that once a system job that is marked to retry forever on failure is submitted, it should continue retrying forever until a person intervenes. This is how something in systemd would work; this is how a cron would work.

@SephVelut

Nomad's deficiencies in handling failure and rescheduling are what's keeping me away right now. I was using Nomad a year ago and just got tired of constantly intervening in the system to resubmit jobs. I would really like to see more proactive policies for Nomad to deal with outside state, which changes and fails constantly.

@dadgar
Contributor

dadgar commented Aug 19, 2016

We will take this into consideration when server side restarts are implemented. Thanks!

@hgontijo
Contributor

@a86c6f7964 thanks for providing detailed scenarios regarding "Driver Failure". I'm facing the same situation here and am working around it with retries in my integration module (a Nomad API client in Java) that is responsible for triggering and tracking jobs on Nomad. It's a cumbersome process having to track Job -> Allocation -> Task events with "Driver Failure" and "Not Restarting" in order to decide on a job retry; a sketch of that walk follows below.
It would be really helpful if Nomad handled this kind of failure through the configurable restart policy.
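
For illustration, the same Job -> Allocation -> Task-event walk could be sketched with the official Nomad Go API client (hgontijo's module is Java; this is just a rough Go equivalent, and the job ID is a placeholder):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

// needsRetry walks every allocation of a job and reports whether any task
// recorded a "Driver Failure" or "Not Restarting" event, which is the
// signal being used here to decide on a job retry.
func needsRetry(client *api.Client, jobID string) (bool, error) {
	allocs, _, err := client.Jobs().Allocations(jobID, true, nil)
	if err != nil {
		return false, err
	}
	for _, alloc := range allocs {
		for _, state := range alloc.TaskStates {
			for _, ev := range state.Events {
				if ev.Type == "Driver Failure" || ev.Type == "Not Restarting" {
					return true, nil
				}
			}
		}
	}
	return false, nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	retry, err := needsRetry(client, "example-batch-job") // placeholder job ID
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("retry needed:", retry)
}
```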

@camerondavison
Contributor Author

@hgontijo I'm not sure how you are keeping track of allocations, but with the changes to the exit codes for the plan endpoint I now just loop through all of the Nomad job definitions and run nomad plan periodically to make sure that everything that was originally submitted is still trying to run (roughly the loop sketched below).
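
A rough sketch of that loop, assuming the job files live in a ./jobs directory (a placeholder) and relying on nomad plan's documented exit codes (0: no changes, 1: allocations would be created or destroyed, 255: error):

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// checkJobs runs `nomad plan` against each job file and flags any whose
// plan would change allocations, i.e. jobs that are no longer running as
// originally submitted.
func checkJobs(dir string) error {
	files, err := filepath.Glob(filepath.Join(dir, "*.nomad"))
	if err != nil {
		return err
	}
	for _, f := range files {
		cmd := exec.Command("nomad", "plan", f)
		if err := cmd.Run(); err != nil {
			if exitErr, ok := err.(*exec.ExitError); ok {
				switch exitErr.ExitCode() {
				case 1:
					fmt.Printf("%s: plan shows changes, resubmit the job\n", f)
				case 255:
					fmt.Printf("%s: nomad plan errored\n", f)
				}
				continue
			}
			return err
		}
		// Exit code 0: the cluster already matches this job file.
	}
	return nil
}

func main() {
	if err := checkJobs("./jobs"); err != nil {
		fmt.Println("check failed:", err)
	}
}
```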

@hgontijo
Contributor

@a86c6f7964 we only execute batch jobs, and we have a Java-based status tracker that holds the job IDs. This tracker runs periodically to check the status of all allocations for a given job and retries in case of a few exceptions (No cluster leader, Driver failure, Missing allocation info, Error fetching results).

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022