bug in auto healing #2069
Comments
Hey, if you bring the agent back up on one of the machines, it will realize the work has been migrated and kill the tasks. Once the agent is dead, though, there is nothing Nomad can do to clean it up, as there is no Nomad process running. If that answers your question, please close this issue. If not, let me know — happy to answer any questions.
The tasks themselves run under the nomad executable - that executable could verify that the agent is no longer responding and kill the underlying task.
@OferE Thanks for the kind words :) As for the
It's your choice, but I would handle this case, as it leaves a mess in the cluster. It's not that critical, but it seems more elegant.
There is also another important use case: when the "bug" happens, a periodic task will continue running and cause some mess... All of this is rare, of course, since Nomad is a stable piece of software. Anyway, if you don't consider this a bug, I will close. Again, amazing project!
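For context, the periodic-task case refers to Nomad's `periodic` stanza on batch jobs. A rough illustration (job name, schedule, and command are made up for this example):

```hcl
job "cleanup" {
  datacenters = ["dc1"]
  type        = "batch"

  # Launch a new instance every 15 minutes. If the agent on a node
  # dies, an already-running instance there has nothing to stop it,
  # while the scheduler keeps launching replacements elsewhere.
  periodic {
    cron             = "*/15 * * * *"
    prohibit_overlap = true
  }

  group "cleanup" {
    task "sweep" {
      driver = "exec"
      config {
        command = "/usr/local/bin/sweep.sh"
      }
    }
  }
}
```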
Regarding the stability of Nomad - at large scale, rare things can happen, as I'm sure you know. This is something I would handle :-)
Yeah, we expect failures and have designed the agent to reattach to the existing executors and take the correct action. I appreciate your interest in the project! For the above-mentioned reasons I am going to close the issue. Thanks,
sure :-)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.5.0
Operating system and Environment details
Not relevant.
Issue
When the Nomad agent on a client machine is removed, tasks on that machine are migrated to other machines in the cluster - which is fine - but the tasks on the lost machine are not stopped.
Reproduction steps
Just launch a job on 2 machines and kill the agent on one of the machines.