bug in auto healing #2069
Comments
Hey, if you bring the agent back up on one of the machines, it will realize the work has been migrated and kill the tasks. Once the agent is dead, though, there is nothing Nomad can do to clean it up, as there is no Nomad process running. If that answers your question, please close this issue. If not, let me know — happy to answer any questions.
The tasks themselves run under the nomad executable - that executable could verify that the agent is no longer responding and kill the underlying task.
@OferE Thanks for the kind words :) As for the
It's your choice, but I would handle this case, as it leaves a mess in the cluster. It's not that critical, but it seems more elegant.
There is also another important use case: when the "bug" happens, a periodic task will continue running and cause some mess... All of this is rare, of course, since Nomad is a stable piece of software. Anyway, if you don't consider this a bug, I will close. Again, amazing project!
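For context, the periodic-task case refers to Nomad's `periodic` stanza on batch jobs. A rough illustration (job name, schedule, and command are made up for this example):

```hcl
job "cleanup" {
  datacenters = ["dc1"]
  type        = "batch"

  # Launch a new instance every 15 minutes. If the agent on a node
  # dies, an already-running instance there has nothing to stop it,
  # while the scheduler keeps launching replacements elsewhere.
  periodic {
    cron             = "*/15 * * * *"
    prohibit_overlap = true
  }

  group "cleanup" {
    task "sweep" {
      driver = "exec"
      config {
        command = "/usr/local/bin/sweep.sh"
      }
    }
  }
}
```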
Regarding the stability of Nomad - at large scale, rare things can happen, as I'm sure you know. This is something I would handle :-)
Yeah, we expect failures and have designed the agent to reattach to the existing executors and take the correct action. I appreciate your interest in the project! For the above-mentioned reasons I am going to close the issue. Thanks,
sure :-)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.5.0
Operating system and Environment details
Not relevant.
Issue
When the Nomad agent on a client machine is removed, tasks on that machine are migrated to other machines in the cluster - which is fine - but the tasks on the lost machine are not stopped.
Reproduction steps
Just launch a job on 2 machines and kill the agent on one of the machines.