Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: stop after client disconnect #7793

Merged
merged 2 commits into from
May 1, 2020
Merged

Conversation

langmartin
Copy link
Contributor

@langmartin langmartin commented Apr 23, 2020

stop after client disconnect

Add group configuration, client, and server support for stopping tasks
on disconnected clients (aka "heartyeet").

client:

  • add the stop_after_client_disconnect key
  • persistently track lastHeartbeat, the client local time of the
    last successful heartbeat round trip
  • track allocations with stop_after_client_disconnect configured
  • trigger allocation destroy (which handles cleanup)
  • restore heartbeat/killable allocs when allocs are recovered from disk
  • on client restart, don't restart allocs that should have been killed

@langmartin langmartin marked this pull request as ready for review April 24, 2020 14:56
@langmartin langmartin requested a review from tgross April 24, 2020 14:56
@langmartin langmartin force-pushed the f-client-at-most-one branch from 75c7e6b to 46b8b71 Compare April 24, 2020 15:07
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 @langmartin this is mostly looking great!

client/heartbeatstop.go Outdated Show resolved Hide resolved
client/heartbeatstop.go Show resolved Hide resolved
client/heartbeatstop.go Outdated Show resolved Hide resolved
client/client.go Outdated Show resolved Hide resolved
client/heartbeatstop.go Show resolved Hide resolved
client/heartbeatstop.go Outdated Show resolved Hide resolved
jobspec/parse_group.go Outdated Show resolved Hide resolved
Copy link
Contributor

@notnoop notnoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code overall looks good, but I wonder about the restart behavior more. Do we want to have a grace period on client restart for it to connect before killing the task?

In this implementation, it might be possible for the grace period to elapse while client is restarting, so the client may attempt to kill the alloc before the client has a chance to reconnect to the server.

Another concern is that the logic for client restart handling of allocs is getting more complex and it would depend on job type, whether it was successfully restored, whether it has heartyeet configuration, etc.

Also, it's not obvious to me how beneficial the handling is. If the client has been stopped for longer than stop_after_client_disconnect potentially arbitrary long time, the servers will reschedule and we cannot offer any guarantees about the job being stopped already. When the client starts up, the alloc would be most likely stopped due to the server update anyway.

If so, I'd suggest having the heart-beat counter be an in-memory counter alone. It reduces the periodic IO operation (that requires 2 fsync calls!), and simplify the logic for restart conditions, without significantly weakening the already weakened behavior of clients being temporarily dead.

What do you think?

@langmartin langmartin force-pushed the f-client-at-most-one branch from bdd9abf to 8feac68 Compare April 27, 2020 14:57
- track lastHeartbeat, the client local time of the last successful
  heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs tracking when allocs are recovered from disk
- on client restart, stop those allocs after a grace period if the
  servers are still partioned
@langmartin langmartin force-pushed the f-client-at-most-one branch 2 times, most recently from d87d492 to 1da3031 Compare April 28, 2020 20:58
@langmartin
Copy link
Contributor Author

Ok, after some followup, I've removed the stateful last heartbeat handling, and replaced it with a simplified check that, in the case the client crashes but the workload remains running and the client restarts, the client will wait for a server connection grace period and then stop all heartyeet allocs. This avoids the complexity of the state store write on every heartbeat and the state read on the client init path. We believe this to be an unusual failure mode, so keeping it simple until we know we need the additional complexity seems good.

I went ahead an pre-squashed this pr, planning to merge it as two commits to keep the undone state changes available.

@langmartin langmartin requested review from notnoop and tgross April 28, 2020 21:17
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like the direction this went once the saved state was removed. We have a decent unit test here, but it'd probably be worth thinking about whether we can come up with a reasonable e2e test for nightly once the server-side work is done.

(Left one comment/question but feel free to merge if I'm off-base there.)

client/heartbeatstop.go Outdated Show resolved Hide resolved
In order to minimize this change while keeping a simple version of the
behavior, we set `lastOk` to the current time less the intial server
connection timeout. If the client starts and never contacts the
server, it will stop all configured tasks after the initial server
connection grace period, on the assumption that we've been out of
touch longer than any configured `stop_after_client_disconnect`.

The more complex state behavior might be justified later, but we
should learn about failure modes first.
@langmartin langmartin force-pushed the f-client-at-most-one branch from 1da3031 to 39d3043 Compare April 29, 2020 15:34
@langmartin langmartin merged commit 3477f2e into master May 1, 2020
@langmartin langmartin deleted the f-client-at-most-one branch May 1, 2020 16:35
@github-actions
Copy link

github-actions bot commented Jan 7, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants