"docker image does not exist" should be a recoverable error #1406

Closed
camerondavison opened this issue Jul 11, 2016 · 20 comments
Comments

@camerondavison
Contributor

Currently this line in the Docker driver:

if imageNotFoundMatcher.MatchString(err.Error()) {

makes it so that "Error: image .+ not found" is considered an unrecoverable error. I personally feel like anything that is outside of Nomad's realm of control should be considered recoverable. The image could show up later.
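
A minimal sketch of the inversion being proposed, assuming a hypothetical recoverableError wrapper rather than Nomad's actual internal error types (only the matcher regex comes from the code above; everything else is illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

// recoverableError is a hypothetical wrapper type used for illustration;
// Nomad has its own notion of recoverable errors, and this sketch does not
// assume its exact API.
type recoverableError struct {
	err         error
	recoverable bool
}

func (r *recoverableError) Error() string { return r.err.Error() }

var imageNotFoundMatcher = regexp.MustCompile(`Error: image .+ not found`)

// classifyDockerError wraps an "image not found" error as recoverable so a
// client could retry the pull later instead of failing the allocation
// permanently, which is the behavior change this issue is asking for.
func classifyDockerError(err error) error {
	if imageNotFoundMatcher.MatchString(err.Error()) {
		return &recoverableError{err: err, recoverable: true}
	}
	return err
}

func main() {
	err := fmt.Errorf("Error: image redis:3.2 not found")
	if r, ok := classifyDockerError(err).(*recoverableError); ok && r.recoverable {
		fmt.Println("recoverable:", r.Error())
	}
}
```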

Treating most errors that are outside of Nomad's realm of control as recoverable could also help with #1191. That error should still be fixed to minimize entropy and false-alarm errors, but it would be mostly mitigated in that the task would just retry and probably bind a new port correctly.

@dadgar
Contributor

dadgar commented Jul 11, 2016

Hey,

If you are comfortable, please make a PR with tests for this. I commented on #1191; I don't think this will help with that.

@camerondavison
Contributor Author

Ah yes, correct. This issue was to fix one specific type of error so that it is considered recoverable.

But I was saying that if Nomad were to take the stance that errors outside its realm of control (such as Docker port collisions) are recoverable, that would help with #1191. Right now errors have to be explicitly marked as recoverable, but I think it should be the other way around: mark a small subset of errors as unrecoverable.

@dadgar
Contributor

dadgar commented Jul 11, 2016

@a86c6f7964 There is a very real trade-off: if the client keeps retrying on a machine that will never be able to recover (docker daemon broken, cgroups not mounted, can't access docker repo), then the task won't be rescheduled onto another node and it will cause unnecessary delays. I think we should be as accurate as possible when marking things as recoverable.

@diptanu
Contributor

diptanu commented Jul 11, 2016

@a86c6f7964 @dadgar Guys, I don't think this should be a recoverable error. I have seen people make mistakes countless times when writing the name of the image, so making this recoverable is just going to prolong the time it takes users to correct the mistake.

I understand that the image might take some time to appear, but that happens when an image is first published and is not the norm.

@dadgar
Contributor

dadgar commented Jul 11, 2016

Actually, thinking about this more, I agree with @diptanu.

@diptanu
Contributor

diptanu commented Jul 11, 2016

@a86c6f7964 If you are concerned about the initial rollout process from your CI/CD pipeline, you could use the Docker Registry HTTP API to see if the layers are available and then deploy the job on Nomad.
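
For reference, such a pre-flight check against the Docker Registry HTTP API v2 could look roughly like the sketch below: it issues a HEAD request for the image manifest and treats a 200 as "available". The registry URL, repository, and tag are placeholders, and authentication (which Docker Hub and most private registries require) is omitted:

```go
package main

import (
	"fmt"
	"net/http"
)

// imageAvailable asks a Docker Registry v2 endpoint whether a manifest for
// repo:tag exists before the job is submitted to Nomad.
func imageAvailable(registry, repo, tag string) (bool, error) {
	url := fmt.Sprintf("%s/v2/%s/manifests/%s", registry, repo, tag)
	req, err := http.NewRequest(http.MethodHead, url, nil)
	if err != nil {
		return false, err
	}
	// Request the v2 manifest media type; some registries return 404 for
	// HEAD requests without it.
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	// Placeholder registry and image; substitute your own.
	ok, err := imageAvailable("https://registry.example.com", "myteam/myapp", "v1.2.3")
	if err != nil {
		fmt.Println("registry check failed:", err)
		return
	}
	fmt.Println("image available:", ok)
}
```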

@camerondavison
Contributor Author

Sorry for the long reply, just trying to get my thoughts out there.

docker daemon broken, cgroups not mounted, can't access docker repo

These all seem recoverable to me. I would not want my service to go into a failed state and never get started again just because a machine crashed at the same time the Docker repo was down. Maybe I am missing something: once an allocation for a task goes into the failed state, is there any way to get it placed on another machine? What I have noticed is that if something like "docker daemon broken" happens on a machine, that allocation is marked as failed, never to be allocated to another node and never to be started again until I resubmit the job.

I understand that the image might take some time to appear, but that happens when an image is first published and is not the norm.

I have seen this happen sometimes if there is a hiccup talking to the docker registry.

Maybe an example will make it a little clearer how I am thinking about this, and help pinpoint where the disconnect is.

I have 3 machines running the nomad server cluster. I submit a service job to that cluster. I would like that job to run forever. I never want to have to submit that job ever again.

Now, six months from now, some misconfiguration happens and one machine is pointed at an internal Docker registry that happens to not have the image, or is stale, or is buggy, or has its clock out of sync, who knows. Maybe that machine even OOMs or has a corrupt disk; there are just so many problems that can happen. This causes a momentary hiccup in the service when it tries to run on a new machine. Given the current state of the world, I would have to intervene in the system, since the job would transition into a failed state (never to run again) until I resubmit it.

None of this is hypothetical. My Nomad cluster currently runs 328 jobs, each with one group containing one service task with a count of 2 (for HA), and about every week I have to resubmit jobs to clear some state where things transitioned into a failed state.

I have seen people make mistakes countless times when writing the name of the image, so making this recoverable is just going to prolong the time it takes users to correct the mistake.

Why would this make it take longer? There would still be a task that is flapping and never getting to the running state. It would be super easy to look at the events for something that is not running and see why it is continually flapping. Is it really better to help out these typos at the cost of a less stable system?

@dadgar
Contributor

dadgar commented Jul 12, 2016

Hey,

So I think this may be trying to paper over a deficiency in Nomad that we are aware of and will be tackling: server-side restarts. Hopefully my other comments on why you want to transition to failed will make sense after that.

Right now, when an allocation fails on the client it is marked as failed and, as you have said, no action is taken until the user resubmits the job or forces Nomad to re-evaluate it. What we need is a server-side restart policy that will react to failed allocations and replace them (preferably on a different machine).

Let me know if that makes sense

@camerondavison
Contributor Author

That totally makes sense. I am fine taking that approach instead. I did not understand that y'all were intending failures to be localized to one machine. Is there an issue for that I can either work on or at least subscribe to?


@camerondavison
Contributor Author

I would be interested to know whether y'all would then consider an "image does not exist" error to be recoverable server side, though. Does that address @diptanu's concern about typo'd image names?

@dbresson

@dadgar don't forget about system jobs. They should be retried on all hosts that are having failures.

@camerondavison
Contributor Author

Are y'all still waiting for me to reply?

@dadgar
Contributor

dadgar commented Jul 22, 2016

Hey, sorry for the delay. I do not think you would want to treat the image not existing as recoverable server side either; only something that could work on a new node should be.

Closing this in favor of: #1461

@dadgar dadgar closed this as completed Jul 22, 2016
@camerondavison
Contributor Author

I do not think you would want to treat the image not existing as recoverable server side either; only something that could work on a new node should be.

I know this is a late comment, but I want to go ahead and state my complete disagreement with the above statement. As I have already stated:

"image does not exist" could work on a new node because of:

  • replication lag in the distributed Docker registry (where one node has the image but another does not)
  • the user running the Docker registry on that node as a proxy/cache, or pointing it at an S3 location of another registry (something I have personally thought about doing), and that node's registry being misconfigured
  • a Docker bug (which I have personally hit several times but have been unable to reproduce reliably) where it randomly returns a 404 every once in a while (I think it is probably timing out on something and should return a 500, but instead returns a 404)

The fact that I can enumerate three possibilities here goes to show how hard it is going to be to reliably get valid results from an outside system. That is why I continue to personally think that any time Nomad interacts with an outside system it should assume that it can fail at any time, and just keep retrying forever until it works or a sysadmin intervenes.

I personally think that once a system job that is marked to retry forever on failure is submitted, it should continue retrying forever until a person intervenes. This is how something in systemd would work; this is how a cron would work.

@SephVelut

Nomad's deficiencies in handling failure and rescheduling are what's keeping me away right now. I was using Nomad a year ago and just got tired of constantly intervening in the system to resubmit jobs. I would really like to see more proactive policies for Nomad to deal with outside state, which changes and fails constantly.

@dadgar
Contributor

dadgar commented Aug 19, 2016

We will take this into consideration when server side restarts are implemented. Thanks!

@hgontijo
Contributor

@a86c6f7964 thanks for providing detailed scenarios regarding "Driver Failure". I'm facing the same situation here and am working around it with retries in my integration module (a Nomad API client in Java) that is responsible for triggering and tracking jobs on Nomad. It's a cumbersome process having to track Job -> Allocation -> Task events with "Driver Failure" and "Not Restarting" in order to decide on a job retry; a sketch of that walk follows below.
It would be really helpful if Nomad handled this kind of failure through the configurable restart policy.
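
For illustration, the same Job -> Allocation -> Task-event walk could be sketched with the official Nomad Go API client (hgontijo's module is Java; this is just a rough Go equivalent, and the job ID is a placeholder):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

// needsRetry walks every allocation of a job and reports whether any task
// recorded a "Driver Failure" or "Not Restarting" event, which is the
// signal being used here to decide on a job retry.
func needsRetry(client *api.Client, jobID string) (bool, error) {
	allocs, _, err := client.Jobs().Allocations(jobID, true, nil)
	if err != nil {
		return false, err
	}
	for _, alloc := range allocs {
		for _, state := range alloc.TaskStates {
			for _, ev := range state.Events {
				if ev.Type == "Driver Failure" || ev.Type == "Not Restarting" {
					return true, nil
				}
			}
		}
	}
	return false, nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	retry, err := needsRetry(client, "example-batch-job") // placeholder job ID
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("retry needed:", retry)
}
```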

@camerondavison
Contributor Author

@hgontijo I'm not sure how you are keeping track of allocations, but with the changes to the exit codes for the plan endpoint I now just loop through all of the Nomad job definitions and run nomad plan periodically to make sure that everything that was originally submitted is still trying to run (roughly the loop sketched below).
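
A rough sketch of that loop, assuming the job files live in a ./jobs directory (a placeholder) and relying on nomad plan's documented exit codes (0: no changes, 1: allocations would be created or destroyed, 255: error):

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// checkJobs runs `nomad plan` against each job file and flags any whose
// plan would change allocations, i.e. jobs that are no longer running as
// originally submitted.
func checkJobs(dir string) error {
	files, err := filepath.Glob(filepath.Join(dir, "*.nomad"))
	if err != nil {
		return err
	}
	for _, f := range files {
		cmd := exec.Command("nomad", "plan", f)
		if err := cmd.Run(); err != nil {
			if exitErr, ok := err.(*exec.ExitError); ok {
				switch exitErr.ExitCode() {
				case 1:
					fmt.Printf("%s: plan shows changes, resubmit the job\n", f)
				case 255:
					fmt.Printf("%s: nomad plan errored\n", f)
				}
				continue
			}
			return err
		}
		// Exit code 0: the cluster already matches this job file.
	}
	return nil
}

func main() {
	if err := checkJobs("./jobs"); err != nil {
		fmt.Println("check failed:", err)
	}
}
```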

@hgontijo
Contributor

@a86c6f7964 we only execute batch jobs, and we have a Java-based status tracker that holds the job IDs. This tracker runs periodically to check the status of all allocations for a given job and retries in case of a few exceptions (No cluster leader, Driver failure, Missing allocation info, Error fetching results).

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022