
stop_on_client_after doesn't handle network partitions as expected #24679

Open
akamensky opened this issue Dec 16, 2024 · 9 comments

@akamensky

akamensky commented Dec 16, 2024

Proposal

Currently, when connectivity between an agent (or agents) and the servers is lost, the servers will attempt to reschedule the job according to the job configuration. However, there is no way to configure what the agent itself should do in this situation. As it stands, the agent will continue running the pre-existing allocations indefinitely, which in some cases may be undesirable. While some cases can be handled through application logic, this assumes that (1) the network split actually affects the application logic (i.e. some dependency becomes unreachable at the same time), and (2) the application has been implemented with this case in mind. This leaves two groups of cases uncovered: when the loss of connectivity between Nomad agents and servers has no impact on application functionality (i.e. all dependencies are still reachable), and when the application is a legacy one that does not necessarily check for and handle this case.

Moreover, on the orphaned agents, allocations that were manually killed will be restarted by the agent (as tested in Nomad 1.8.1).

The proposal below is a high-level description of what could be considered as an addition.

To allow for better handling of such cases, I think it would be beneficial for the agent to be configurable so that it shuts down the tasks/allocations running on it if it becomes "orphaned", after a configurable wait interval. For example, two additional options in the agent configuration would allow better handling of the cases described above (a configuration sketch follows the options):

shutdown_orphaned_tasks = <boolean: default false> # enable/disable shutting down local tasks when the agent becomes orphaned. Ideally the shutdown would use the configured kill_signal of the task. Defaulting to false means no breaking changes.

shutdown_orphaned_tasks_timeout = <duration: default 60s> # a timeout that starts when the agent becomes disconnected from the servers.
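
A sketch of how these proposed options might look in the agent configuration. Note that neither option exists in Nomad today; the names and defaults come straight from the proposal above, and the surrounding client block is shown only for context:

client {
  enabled = true

  # Proposed options (do not exist yet): stop local tasks when the agent
  # has been disconnected from the servers for longer than the timeout.
  shutdown_orphaned_tasks         = true
  shutdown_orphaned_tasks_timeout = "60s"
}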

Use-cases

  1. Network separation between agents and servers, where the application should not be rescheduled while another instance is still running on the disconnected node.

Attempted Solutions

With network partitions there is generally a complete loss of control over the agent. Currently the only way to shut down processes running on the agent is to manually kill them, and there is no way to simply stop the application on the orphaned node, because the agent will restart the application if it is manually killed (according to the job definition), which is far from ideal. In our environment we would have to implement a watchdog process on every agent node that monitors whether the node is connected to the servers and then continuously kills all running tasks itself (there isn't even an API on the agent to list locally running tasks and stop them, so whatever watchdog we use would have to do this via other means).

@akamensky
Author

As a side note, nomad node drain -self ... on orphaned nodes returns 500 (rpc error: No path to region), so there is a complete loss of control over the agent during the network split.

@tgross
Member

tgross commented Dec 16, 2024

Hi @akamensky! Isn't the behavior you're describing on the client handled already by the disconnect.stop_after field?

@akamensky
Author

From reading the documentation it appears to do something similar. However, this is defined in the job spec. In an environment where job specs are maintained by developers but the guarantees of the system are placed on the infra team, this is not sufficient in my opinion. We wouldn't be able to read through every change to the job specs before it is rolled out, and if we did, it would become a bottleneck in the overall process.

@tgross
Member

tgross commented Dec 16, 2024

Ok, I just wanted to make sure it wasn't a matter of not knowing the existing options. A lot of folks use Sentinel policies for that kind of control, but obviously that's not available to everyone. I'll mark this for further discussion and roadmapping.

@akamensky
Author

akamensky commented Dec 16, 2024

I have tried to use disconnect.stop_after, however it does not seem to work on 1.8.1. Trying to add it at the group scope, I get An argument named "stop_after" is not expected here.; trying other scopes, I get Blocks of type "disconnect" are not expected here.

[screenshot: validation error]

Meanwhile, trying stop_after_client_disconnect (mentioned in the docs, but not itself documented), it accepts the argument, but nothing happens on disconnect (I see missed heartbeat messages in the log, but the allocation keeps running, I assume indefinitely).

Edit: I upgraded my test setup to 1.9.3 and tested there as well. I am getting the same errors trying to use disconnect.stop_after, and using stop_after_client_disconnect does not do anything: I am seeing heartbeat timeout errors in the Nomad logs, but the already-started allocation keeps running. I did ~10 test runs, and it actually did stop the allocation in 1 of them, but not in the other 9. I feel there is some serious issue with the detection of a node being orphaned, but I guess that needs to be looked at in a separate report.

Edit 2: there actually seems to be a bigger problem (or I misunderstand how it should work): after the network partition is over and the node is connected back to the servers, it is unable to run allocations again. Every allocation goes first into "pending", then shows as "recovering" seemingly indefinitely:

[screenshot: allocations stuck in "recovering"]

It seems the only way to run allocations again on a client that re-joined after a network partition is to manually restart the Nomad process on the agent; then it starts working as usual again.

@tgross
Member

tgross commented Dec 18, 2024

Embarrassingly, I learned yesterday (ref #24702 (comment)) that the documentation is actually wrong on that disconnect.stop_after field and it should be disconnect.stop_on_client_after.
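
For reference, a minimal sketch of the corrected field at the group level; the job and task names, the Docker driver/image, and the 5m value here are purely illustrative:

job "example" {
  group "app" {
    disconnect {
      # Stop the allocation on the client after 5 minutes without server contact.
      stop_on_client_after = "5m"
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/app:1.0"
      }
    }
  }
}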

@akamensky
Author

akamensky commented Dec 19, 2024

Thanks for the update on that @tgross, I've tried with:

disconnect {
  stop_on_client_after = "..."
}

and it accepts it correctly in the group scope, but as above it does not seem to work on network partition; the tasks keep running without being stopped after the timeout.

In the Nomad (agent/client) logs, all I see is:

2024-12-19T09:19:10.846+0800 [INFO]  client.consul: discovered following servers: servers=[10.2.19.20:4647]
2024-12-19T09:19:38.394+0800 [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 10.2.19.20:4647: connect: connection refused" rpc=Node.UpdateStatus server=10.2.19.20:4647
2024-12-19T09:19:38.395+0800 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: dial tcp 10.2.19.20:4647: connect: connection refused" rpc=Node.UpdateStatus server=10.2.19.20:4647
2024-12-19T09:19:38.395+0800 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 10.2.19.20:4647: connect: connection refused" period=29.801398812s
2024-12-19T09:19:38.401+0800 [INFO]  client.consul: discovered following servers: servers=[10.2.19.20:4647]

@tgross
Member

tgross commented Dec 19, 2024

Ok, thanks @akamensky. Looks like there's potentially a bug there. I'm going to re-file this issue as such so we can get it looked into.

@tgross tgross changed the title [feature] Agent configuration to define action on loss of connectivity stop_on_client_after doesn't handle network partitions as expected Dec 19, 2024
@akamensky
Author

Thanks for looking into the above issue. I think I will re-create the original feature request in another ticket, as it is still a valid FR from my end (making sure the options defined in the job spec work is good, but not as helpful for the infra team).
