client: stop after client disconnect #7793

langmartin · 2020-04-23T18:10:58Z

stop after client disconnect

Add group configuration, client, and server support for stopping tasks
on disconnected clients (aka "heartyeet").

client:

- add the `stop_after_client_disconnect` key
- persistently track `lastHeartbeat`, the client-local time of the last successful heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs when allocs are recovered from disk
- on client restart, don't restart allocs that should have been killed

tgross

👍 @langmartin this is mostly looking great!

client/heartbeatstop.go

client/client.go

client/heartbeatstop.go

jobspec/parse_group.go

notnoop

The code overall looks good, but I wonder about the restart behavior more. Do we want to have a grace period on client restart for it to connect before killing the task?

In this implementation, the grace period could elapse while the client is restarting, so the client may attempt to kill the alloc before it has had a chance to reconnect to the server.

Another concern is that the logic for client restart handling of allocs is getting more complex and it would depend on job type, whether it was successfully restored, whether it has heartyeet configuration, etc.

Also, it's not obvious to me how beneficial the handling is. If the client has been stopped for longer than `stop_after_client_disconnect` (potentially an arbitrarily long time), the servers will reschedule and we cannot offer any guarantees about the job being stopped already. When the client starts up, the alloc would most likely be stopped by the server update anyway.

If so, I'd suggest making the heartbeat counter an in-memory counter only. That removes the periodic IO operation (which requires 2 fsync calls!) and simplifies the logic for restart conditions, without significantly weakening the already weakened behavior when clients are temporarily dead.

What do you think?

- track lastHeartbeat, the client local time of the last successful heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs tracking when allocs are recovered from disk
- on client restart, stop those allocs after a grace period if the servers are still partitioned

langmartin · 2020-04-28T21:17:30Z

OK, after some follow-up, I've removed the stateful last-heartbeat handling and replaced it with a simplified check: if the client crashes while the workload keeps running and the client then restarts, the client waits for a server connection grace period and then stops all heartyeet allocs. This avoids the complexity of a state store write on every heartbeat and a state read on the client init path. We believe this to be an unusual failure mode, so keeping it simple until we know we need the additional complexity seems good.

I went ahead and pre-squashed this PR, planning to merge it as two commits to keep the undone state changes available.

tgross

LGTM. I like the direction this went once the saved state was removed. We have a decent unit test here, but it'd probably be worth thinking about whether we can come up with a reasonable e2e test for nightly once the server-side work is done.

(Left one comment/question but feel free to merge if I'm off-base there.)

client/heartbeatstop.go

In order to minimize this change while keeping a simple version of the behavior, we set `lastOk` to the current time less the initial server connection timeout. If the client starts and never contacts the server, it will stop all configured tasks after the initial server connection grace period, on the assumption that we've been out of touch longer than any configured `stop_after_client_disconnect`. The more complex state behavior might be justified later, but we should learn about failure modes first.

github-actions · 2023-01-07T02:15:24Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

langmartin marked this pull request as ready for review April 24, 2020 14:56

langmartin requested a review from tgross April 24, 2020 14:56

langmartin force-pushed the f-client-at-most-one branch from 75c7e6b to 46b8b71 Compare April 24, 2020 15:07

tgross reviewed Apr 24, 2020

notnoop reviewed Apr 24, 2020

langmartin force-pushed the f-client-at-most-one branch from bdd9abf to 8feac68 Compare April 27, 2020 14:57

langmartin force-pushed the f-client-at-most-one branch 2 times, most recently from d87d492 to 1da3031 Compare April 28, 2020 20:58

langmartin requested review from notnoop and tgross April 28, 2020 21:17

tgross approved these changes Apr 29, 2020

langmartin force-pushed the f-client-at-most-one branch from 1da3031 to 39d3043 Compare April 29, 2020 15:34

langmartin merged commit 3477f2e into master May 1, 2020

langmartin deleted the f-client-at-most-one branch May 1, 2020 16:35

langmartin mentioned this pull request May 12, 2020

server: stop after client disconnect #7939

Merged

5 tasks

github-actions bot locked as resolved and limited conversation to collaborators Jan 7, 2023
