
Restructure Worker.close #8062

Closed
hendrikmakait opened this issue Aug 2, 2023 · 3 comments · Fixed by #8066 or #8076

Comments

@hendrikmakait
Member

hendrikmakait commented Aug 2, 2023

At the moment, the order of operations in Worker.close is somewhat arbitrary and does not appear to be intentionally structured.

It reads roughly as follows:

  1. Update self.status to Status.closing
    1. Inform scheduler of status change
    2. Handle PauseEvent to prevent further tasks from being executed/gathered
  2. Stop self.periodic_callbacks
  3. self.stop()
    1. Purge workspace
    2. Close monitor
    3. Stop sync listeners
    4. Abort handshaking comms
    5. Stop async listeners in the background
  4. Run BaseWorker.close(timeout=...)
    1. Cancel self.async_instructions
    2. Wait for them to finish (with timeout)
  5. Teardown preloads
  6. Close extensions
  7. Maybe tell nanny to close_gracefully
  8. Teardown plugins
  9. Close worker-initialized clients
  10. Close scheduler RPC
  11. Stop services
  12. Wait a bit in case we’re using UCX
  13. Close batched stream by first sending close-stream, then closing with a timeout
  14. Shutdown executors (maybe wait with timeout)
  15. Close RPC
  16. Update status to Status.closed
  17. ServerNode.close()
    1. stop periodic_callbacks (again)
    2. stop monitor (again)
    3. stop listeners (again)
    4. Abort handshaking comms (again)
    5. Stop async listeners (again but blocking this time)
    6. Stop background_tasks
    7. close RPC (again)
    8. Close comms
    9. Remove local_directory from sys.path
  18. Unwind exit_stack
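
To see where the duplication comes from: Worker.close performs most of this teardown itself and then also calls the base class's close, which repeats much of it. A condensed sketch (the helper names below are simplified from the list above, not the actual implementation):

```python
class ServerNode:
    # Stubs standing in for the real teardown routines:
    def _stop_periodic_callbacks(self): ...
    def _stop_monitor(self): ...
    def _stop_listeners(self): ...
    async def _close_rpc(self): ...

    async def close(self):
        # roughly steps 17.1-17.8 from the list above
        self._stop_periodic_callbacks()
        self._stop_monitor()
        self._stop_listeners()
        await self._close_rpc()


class Worker(ServerNode):
    async def close(self):
        self._stop_periodic_callbacks()   # step 2
        self._stop_monitor()              # step 3.2
        self._stop_listeners()            # steps 3.3-3.5
        await self._close_rpc()           # step 15
        await super().close()             # step 17: repeats all of the above
```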

There are multiple issues with this:

  • The workspace gets purged before plugins get torn down, which may remove data of an in-progress P2P shuffle and cause exceptions (see the sketch after this list).
  • It's generally hard to understand what state a worker is in at any given point in time.
  • To add to the confusion, several components (periodic callbacks, the monitor, listeners, the RPC) are stopped multiple times.
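
To make the first point concrete, here is a minimal sketch of a hypothetical WorkerPlugin (invented for illustration, not an existing plugin) whose teardown still relies on files in the worker's workspace:

```python
import os

from distributed.diagnostics.plugin import WorkerPlugin


class FlushShuffleBuffers(WorkerPlugin):
    """Hypothetical plugin whose teardown still needs files in the
    worker's local directory, similar to an in-progress P2P shuffle."""

    def teardown(self, worker):
        # With the current ordering, the workspace is purged in step 3.1,
        # long before plugin teardown in step 8, so the files this loop
        # expects may already be gone.
        for name in os.listdir(worker.local_directory):
            ...  # flush / forward the remaining per-shuffle files
```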

As an alternative, I suggest the following rough order of operations:

  1. Inform the cluster that the worker is closing
    • Inform scheduler
    • Inform nanny
    • Set self.status = Status.closing
  2. Tear down any running "add-ins" while resources are still working
    • plugins
    • extensions
    • services
    • preloads
  3. Prevent new communications from being established
  4. Close existing comms
  5. Close local resources and clean up internal state

In particular, add-ins such as plugins should be torn down while the worker is still functional to allow them to take arbitrary actions such as informing the scheduler or initiating communications with other workers.
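
A rough sketch of how Worker.close could be laid out under this proposal. Only Status.closing/Status.closed and the overall grouping come from the list above; the helper methods are placeholders introduced for illustration, not the current API:

```python
from distributed.core import Status


class Worker:  # sketch only; the real class lives in distributed.worker
    # Placeholder helpers standing in for the existing teardown code:
    async def _teardown_plugins(self): ...
    async def _close_extensions(self): ...
    async def _stop_services(self): ...
    async def _teardown_preloads(self): ...
    async def _stop_listeners(self): ...
    async def _close_comms(self): ...
    async def _close_local_resources(self): ...

    async def close(self) -> None:
        # 1. Inform the cluster that the worker is closing; the status
        #    change is what the scheduler (and the nanny) react to.
        self.status = Status.closing

        # 2. Tear down add-ins while comms and the event loop still work,
        #    so they can still talk to the scheduler or to other workers.
        await self._teardown_plugins()
        await self._close_extensions()
        await self._stop_services()
        await self._teardown_preloads()

        # 3. Prevent new communications from being established.
        await self._stop_listeners()

        # 4. Close existing comms (batched stream, scheduler RPC, clients).
        await self._close_comms()

        # 5. Close local resources and clean up internal state
        #    (executors, workspace, monitor, exit stack).
        await self._close_local_resources()
        self.status = Status.closed
```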

Note:
Timeouts are also a problem but should be tackled in a dedicated PR (see #7318, #7320).

@hendrikmakait
Member Author

hendrikmakait commented Aug 2, 2023

> In particular, add-ins such as plugins should be torn down while the worker is still functional to allow them to take arbitrary actions such as informing the scheduler or initiating communications with other workers.

An alternative to this would be to remove the worker from the scheduler ASAP, mimicking the case where a worker dies. Any necessary cluster-wide coordination would then be left to SchedulerPlugins, and WorkerPlugins would be scoped to local teardown.
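
Under that alternative, cluster-wide reactions to a departing worker would live in a SchedulerPlugin's remove_worker hook (the same path taken when a worker dies), while WorkerPlugins would only do local cleanup. A minimal sketch with invented plugin names:

```python
from distributed.diagnostics.plugin import SchedulerPlugin, WorkerPlugin


class ShuffleCoordination(SchedulerPlugin):
    """Cluster-wide coordination, triggered whether the worker
    closed cleanly or died."""

    async def remove_worker(self, scheduler, worker, **kwargs):
        ...  # e.g. reassign shuffle outputs that lived on `worker`


class ShuffleLocalCleanup(WorkerPlugin):
    """Purely local teardown: no scheduler or peer communication."""

    def teardown(self, worker):
        ...  # drop local buffers, close file handles
```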

@hendrikmakait
Member Author

Fun fact: By stopping periodic_callbacks in step 2, we also stop heartbeats and keep-alives. If closing the worker takes too long, its TTL will expire and the scheduler will trigger remove_worker regardless of how far along the worker is in its shutdown.
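
For reference, the TTL in question is the scheduler-side worker TTL, which is configurable; a sketch of raising it (this only delays remove_worker rather than fixing the underlying problem):

```python
import dask

# Give workers longer before the scheduler declares them lost; heartbeats
# still stop during close, so a slow shutdown can still exceed this.
dask.config.set({"distributed.scheduler.worker-ttl": "10 minutes"})
```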

@fjetter
Member

fjetter commented Aug 3, 2023

> Inform the cluster that the worker is closing

+1

> Inform scheduler

This is done by setting the status. I don't think we should do anything else at this stage.


Overall, the proposed ordering makes sense to me. I think the periodic callbacks (PCs) have to be stopped between steps 2 and 3: the PCs need comms, so we should stop them before closing the comms.
