-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
client: stop after client disconnect #7793
Conversation
75c7e6b
to
46b8b71
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 @langmartin this is mostly looking great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code overall looks good, but I wonder about the restart behavior more. Do we want to have a grace period on client restart for it to connect before killing the task?
In this implementation, it might be possible for the grace period to elapse while client is restarting, so the client may attempt to kill the alloc before the client has a chance to reconnect to the server.
Another concern is that the logic for client restart handling of allocs is getting more complex and it would depend on job type, whether it was successfully restored, whether it has heartyeet configuration, etc.
Also, it's not obvious to me how beneficial the handling is. If the client has been stopped for longer than stop_after_client_disconnect
potentially arbitrary long time, the servers will reschedule and we cannot offer any guarantees about the job being stopped already. When the client starts up, the alloc would be most likely stopped due to the server update anyway.
If so, I'd suggest having the heart-beat counter be an in-memory counter alone. It reduces the periodic IO operation (that requires 2 fsync calls!), and simplify the logic for restart conditions, without significantly weakening the already weakened behavior of clients being temporarily dead.
What do you think?
bdd9abf
to
8feac68
Compare
- track lastHeartbeat, the client local time of the last successful heartbeat round trip - track allocations with `stop_after_client_disconnect` configured - trigger allocation destroy (which handles cleanup) - restore heartbeat/killable allocs tracking when allocs are recovered from disk - on client restart, stop those allocs after a grace period if the servers are still partioned
d87d492
to
1da3031
Compare
Ok, after some followup, I've removed the stateful last heartbeat handling, and replaced it with a simplified check that, in the case the client crashes but the workload remains running and the client restarts, the client will wait for a server connection grace period and then stop all heartyeet allocs. This avoids the complexity of the state store write on every heartbeat and the state read on the client init path. We believe this to be an unusual failure mode, so keeping it simple until we know we need the additional complexity seems good. I went ahead an pre-squashed this pr, planning to merge it as two commits to keep the undone state changes available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I like the direction this went once the saved state was removed. We have a decent unit test here, but it'd probably be worth thinking about whether we can come up with a reasonable e2e test for nightly once the server-side work is done.
(Left one comment/question but feel free to merge if I'm off-base there.)
In order to minimize this change while keeping a simple version of the behavior, we set `lastOk` to the current time less the intial server connection timeout. If the client starts and never contacts the server, it will stop all configured tasks after the initial server connection grace period, on the assumption that we've been out of touch longer than any configured `stop_after_client_disconnect`. The more complex state behavior might be justified later, but we should learn about failure modes first.
1da3031
to
39d3043
Compare
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
stop after client disconnect
Add group configuration, client, and server support for stopping tasks
on disconnected clients (aka "heartyeet").
client:
stop_after_client_disconnect
keylast successful heartbeat round trip
stop_after_client_disconnect
configured