"docker image does not exist" should be a recoverable error #1406
Comments
Hey, if you are comfortable, please make a PR with tests for this. I commented on #1191; I don't think this will help with that.
Ah yes, correct. This issue was to fix one specific type of error to be considered recoverable. But I was saying that if Nomad were to take the stance that errors outside its realm of control (such as Docker port collisions) are considered recoverable, that would help with #1191.
@a86c6f7964 There is a very real trade-off: if the client keeps retrying on a machine that will never be able to recover (Docker daemon broken, cgroups not mounted, can't access the Docker repo), then the task won't be rescheduled onto another node and it will cause unnecessary delays. I think we should be as accurate as possible when marking things as recoverable.
@a86c6f7964 @dadgar Guys, I don't think this should be a recoverable error. I have seen people make mistakes countless times when writing the name of the image, so making this recoverable is just going to prolong the time it takes users to correct the mistake. I understand that the image might take some time to appear, but that happens when an image is first published and is not the norm.
Actually, thinking about this more, I agree with @diptanu.
@a86c6f7964 If you are concerned about the initial rollout process from your CI/CD pipeline, you could use the Docker Registry HTTP API to see if the layers are available and then deploy the job on Nomad. |
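To make that concrete, here is a minimal sketch of such a pre-flight check against the Docker Registry HTTP API (the v2 manifest endpoint). It assumes a registry that does not require a token; the registry URL, image name, and tag below are placeholders, not anything from this thread:

```go
// Minimal sketch: verify an image manifest exists in a registry before
// submitting the Nomad job. Assumes a v2 registry reachable without auth;
// the registry URL, image name, and tag are placeholders.
package main

import (
	"fmt"
	"net/http"
)

func imageExists(registryURL, imageName, imageTag string) (bool, error) {
	url := fmt.Sprintf("%s/v2/%s/manifests/%s", registryURL, imageName, imageTag)
	req, err := http.NewRequest("HEAD", url, nil)
	if err != nil {
		return false, err
	}
	// Request the v2 manifest media type, per the Docker Registry HTTP API.
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ok, err := imageExists("https://registry.example.internal", "myteam/myservice", "1.2.3")
	if err != nil {
		fmt.Println("registry check failed:", err)
		return
	}
	fmt.Println("image present:", ok) // only submit the Nomad job if this is true
}
```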
Sorry for the long reply, just trying to get my thoughts out there.
These are all recoverable to me? I would not want my service to go into a failed state and never get started again just because a machine crashed at the same time the Docker repo was down. Maybe I am missing something: once an allocation for a task goes into the failed state, is there any way to get it placed on another machine? What I have noticed is that if something like "docker daemon broken" happens on a machine, then that allocation is marked as failed, never to be allocated to another node and never to be started again until I resubmit the job.
I have seen this happen sometimes when there is a hiccup talking to the Docker registry. Maybe an example will make it clearer how I am thinking about this and help clear up where the disconnect is.

I have 3 machines running the Nomad server cluster. I submit a service job to that cluster. I would like that job to run forever; I never want to have to submit that job ever again. Now, 6 months from now, some misconfiguration happens and 1 machine is pointed at an internal Docker registry that happens to not have the image, or is stale, or is buggy, or clocks are out of sync, who knows. Maybe that machine even OOMs or has a corrupt disk; there are just so many problems that can happen. This causes a momentary hiccup in the service when it is trying to run on a new machine. Given the state of the world, I would have to intervene in the system, since the job would transition into a failed state (never to be run again) until I resubmit it.

All of this is not hypothetical. I currently have a Nomad cluster running 328 jobs, each with 1 group with 1 service task, each with a count of 2 (for HA). And about every week I have to resubmit jobs to clear some state where things transitioned into a failed state.
Why would this make it take longer? There would still be a task that is flapping and never reaching the running state. It would be super easy to look at the events for something that is not running and see why it is continually flapping. Is it really better to surface these typos quickly at the cost of a less stable system?
Hey, so I think this may be trying to paper over a deficiency in Nomad that we are aware of and will be tackling: server-side restarts. Hopefully my other comments on why you want to transition to failed will make sense after that. Right now, when an allocation fails on the client, it is marked as failed and, as you have said, no action is taken until the user resubmits the job or forces Nomad to re-evaluate it. What we need is a server-side restart policy that will react to failed allocations and replace them (preferably on a different machine). Let me know if that makes sense.
That totally makes sense. I am fine taking that approach instead.
I would be interested to know whether y'all would then consider an "image does not exist" error to be recoverable server side, though. Does that fix @diptanu's concern about typo'd image names?
@dadgar don't forget about system jobs. They should be retried on all hosts that are having failures.
Are y'all still waiting for me to reply?
Hey, sorry for the delay. I do not think you would want to treat the image not existing as recoverable server side either, only something that could work on a new node. Closing this in favor of #1461.
I know this is a late comment, but I want to go ahead and state my complete disagreement with the above statement. As I have already stated "image does not exist" could work on a new node because of
The fact that I can enumerate 3 possibilities here goes to show how hard it is to reliably get valid results from an outside system. That is why I continue to personally think that any time Nomad interacts with an outside system, it should assume that it will fail all of the time, and just keep retrying forever until it works or a sysadmin intervenes. Once a system job that is marked to retry on failure forever is submitted, it should continue retrying forever until a person intervenes. This is how something in systemd would work; this is how a cron would work.
Nomad's deficiencies in handling failure and rescheduling are what's keeping me away right now. I was using Nomad a year ago and just got tired of constantly intervening in the system to resubmit jobs. I would really like to see more proactive policies for Nomad to deal with outside state, which changes and fails constantly.
We will take this into consideration when server-side restarts are implemented. Thanks!
@a86c6f7964 thanks for providing detailed scenarios regarding "Driver Failure". I'm facing the same situation here, and I'm working around it with retries in my integration module (a Nomad API client in Java) that is responsible for triggering and tracking jobs on Nomad. It's a cumbersome process having to track Job -> Allocation -> Task events with "Driver Failure" and "Not Restarting" in order to decide on a job retry.
@hgontijo not sure how you are still keeping track of allocations, but with the changes to the exit codes for the plan endpoint, I now just loop through all of the Nomad definitions and run
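The exact command is cut off in the comment above; purely as an illustration of the pattern being described, here is a rough sketch that assumes `nomad plan` (documented exit codes: 0 = no changes, 1 = allocations would be created or destroyed, 255 = error) followed by `nomad run` when changes are pending. The jobs/ directory is a placeholder:

```go
// Rough sketch of a "loop over job files and resubmit when needed" flow.
// Assumes the CLI commands `nomad plan` and `nomad run` and their exit codes.
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// exitCode extracts the process exit status from an exec error.
func exitCode(err error) int {
	if err == nil {
		return 0
	}
	if ee, ok := err.(*exec.ExitError); ok {
		return ee.ExitCode()
	}
	return -1
}

func main() {
	jobFiles, err := filepath.Glob("jobs/*.nomad") // placeholder directory of job definitions
	if err != nil {
		fmt.Println("glob failed:", err)
		return
	}

	for _, f := range jobFiles {
		planErr := exec.Command("nomad", "plan", f).Run()

		switch code := exitCode(planErr); code {
		case 0:
			fmt.Println(f, "is up to date")
		case 1:
			fmt.Println(f, "has pending changes, resubmitting")
			if out, runErr := exec.Command("nomad", "run", f).CombinedOutput(); runErr != nil {
				fmt.Printf("resubmit of %s failed: %v\n%s", f, runErr, out)
			}
		default:
			fmt.Println(f, "plan failed with exit code", code)
		}
	}
}
```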
@a86c6f7964 we only execute batch jobs, and we have a Java-based status tracker that holds the job IDs. This tracker runs periodically to check the status of all allocations on a given job and retries in case of a few exceptions (No cluster leader, Driver failure, Missing allocation info, Error fetching results).
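For comparison, a minimal sketch of that kind of tracker check using Nomad's HTTP API directly (GET /v1/job/&lt;id&gt;/allocations) rather than a Java client. The Nomad address and job ID are placeholders, and it only looks for "Driver Failure" task events:

```go
// Minimal sketch: list a job's allocations and flag tasks that recorded a
// "Driver Failure" event. Only the fields needed here are decoded.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type allocStub struct {
	ID           string
	ClientStatus string
	TaskStates   map[string]struct {
		State  string
		Events []struct{ Type string }
	}
}

func main() {
	jobID := "example-batch-job" // placeholder job ID

	resp, err := http.Get("http://127.0.0.1:4646/v1/job/" + jobID + "/allocations")
	if err != nil {
		fmt.Println("error contacting Nomad:", err)
		return
	}
	defer resp.Body.Close()

	var allocs []allocStub
	if err := json.NewDecoder(resp.Body).Decode(&allocs); err != nil {
		fmt.Println("error decoding allocations:", err)
		return
	}

	for _, a := range allocs {
		for task, ts := range a.TaskStates {
			for _, ev := range ts.Events {
				if ev.Type == "Driver Failure" {
					// A real tracker would decide here whether to resubmit the job.
					fmt.Printf("alloc %s task %s (%s) hit a driver failure\n", a.ID, task, a.ClientStatus)
				}
			}
		}
	}
}
```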
Currently, the error returned at nomad/client/driver/docker.go line 577 (commit e922200) is not treated as recoverable.
Treating most errors that are outside of the realm of control of Nomad as recoverable could also help with #1191. That error should still be fixed to minimize entropy and false-alarm errors, but it would be mostly mitigated in that the task would just retry and probably bind a new port correctly.
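For illustration only (this is not Nomad's actual driver code), the general pattern being asked for is to tag errors coming from outside systems, such as the Docker daemon or a registry, with a recoverable flag so the client can retry instead of failing the allocation permanently. All names below are hypothetical:

```go
// Illustrative sketch of a "recoverable error" wrapper; names are hypothetical.
package main

import (
	"errors"
	"fmt"
)

// RecoverableError wraps an error and records whether retrying might succeed.
type RecoverableError struct {
	Err         error
	Recoverable bool
}

func (r *RecoverableError) Error() string { return r.Err.Error() }

// pullImage stands in for a driver's image pull; a missing image is reported
// as recoverable because a retry (or another node) might find it.
func pullImage(image string) error {
	err := errors.New("failed to find docker image " + image)
	return &RecoverableError{Err: err, Recoverable: true}
}

func main() {
	err := pullImage("internal-registry/myservice:latest")

	var rerr *RecoverableError
	if errors.As(err, &rerr) && rerr.Recoverable {
		fmt.Println("recoverable failure, will retry:", rerr)
		return
	}
	fmt.Println("fatal failure:", err)
}
```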