stop_on_client_after doesn't handle network partitions as expected #24679
As a side note
Hi @akamensky! Isn't the behavior you're describing on the client handled already by the `stop_on_client_after` setting?
From reading the documentation it appears to do something similar. However, this is defined in the job spec. In an environment where job specs are maintained by developers but the guarantees of the system are placed on the infra team, this is not sufficient in my opinion. We wouldn't be able to read through every change to the job specs before it is rolled out, and if we did, it would become a bottleneck in the overall process.
Ok, I just wanted to make sure it wasn't a matter of not knowing the existing options. A lot of folks use Sentinel policies for that kind of control, but obviously that's not available to everyone. I'll mark this for further discussion and roadmapping.
I have tried to use Meanwhile trying (mentioned in docs, but not documented)

Edit: upgraded my test setup to 1.9.3 and tested there as well; I am getting the same errors trying to use

Edit2: there actually seems to be a bigger problem (or do I misunderstand how it should work?): after the network partition is over and the node is connected back to the servers, it is unable to run allocations again. Every allocation first goes into "pending", then shows as "recovering", seemingly indefinitely.

It seems the only way to be able to run allocations on a client that re-joined after a network partition is to manually restart the nomad process on the agent; then it starts working as usual again.
Embarrassingly, I learned yesterday (ref #24702 (comment)) that the documentation is actually wrong on that.
Thanks for the update on that @tgross. I've tried with:
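A minimal sketch of a group-scoped `disconnect` block along these lines (the job/task names and the timeout value here are illustrative, not the exact values used above):

```hcl
job "example" {
  group "app" {
    # Group-level disconnect block (Nomad 1.8+). The timeout value below is
    # illustrative, not the one actually tested in this thread.
    disconnect {
      # Stop this group's allocations on the client after it has been
      # disconnected from the servers for this long.
      stop_on_client_after = "5m"
    }

    task "app" {
      driver = "docker"
      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["3600"]
      }
    }
  }
}
```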
and it accepts it correctly in group scope, but as above it does not seem to work on a network partition; the tasks keep running without being stopped after the timeout. In the Nomad (agent/client) logs all I see is:
Ok, thanks @akamensky. Looks like there's potentially a bug there. I'm going to re-file this issue as such so we can get it looked into.
Thanks for looking into the above issue. I think I will re-create the OG feature request in another ticket, as it is still a valid FR from my end (making sure the options defined in the job spec work is good, but not as helpful for the infra team).
Proposal
Currently, when connectivity between an agent (or agents) and the servers is lost, the servers will attempt to reschedule the job according to the job configuration. However, there is no way to configure what an agent should do in such a situation. As it currently stands, the agent will continue running the pre-existing allocations indefinitely, which in some cases may be undesirable. While some cases can be handled through application logic, this assumes that (1) the network split has an impact on the application's logic (i.e. loss of connectivity to some dependency at the same time), and (2) the application has been implemented with this case in mind. This leaves two groups of cases uncovered: when connectivity loss between Nomad agents and servers has no impact on application functionality (i.e. all dependencies are still reachable), and when the application is a legacy one that does not necessarily check for and handle this case.
Moreover, on orphaned agents, allocations that were manually killed will be restarted by the agent (as tested on Nomad 1.8.1).
The proposed change below is a high-level description of what may be considered as an addition.
To allow for better handling of such cases, I think it would be beneficial for the agent to be configurable such that it shuts down the tasks/allocations running on it if it becomes "orphaned", after a configured waiting interval. For example, two additional options in the agent configuration would allow for better handling of the cases described above:
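A sketch of what such agent-side options might look like; the option names `stop_allocs_on_disconnect` and `stop_allocs_after` are hypothetical and are not existing Nomad settings:

```hcl
# Hypothetical client configuration sketching the proposal; these options do
# not exist in Nomad today.
client {
  enabled = true

  # Hypothetical: stop local allocations when the client loses contact with
  # all servers (becomes "orphaned").
  stop_allocs_on_disconnect = true

  # Hypothetical: how long to wait after losing server contact before
  # stopping local allocations.
  stop_allocs_after = "10m"
}
```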
Use-cases
Attempted Solutions
Generally, a network partition means a loss of control over the agent:

- Currently the only way to shut down processes running on the agent is to manually kill the process.
- There is currently no way to simply stop an application on the orphaned node, as the agent will restart the application if it is manually killed (according to the job definition), which is far from ideal. In our environment we would have to implement some watchdog process running on every agent node to monitor whether it is connected to the servers and then continuously kill all running tasks by itself (there isn't even an API on the agent to list locally running tasks and stop them, which means whatever watchdog we use will have to do it via other means).