-
-
Notifications
You must be signed in to change notification settings - Fork 718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eliminate partially-removed-worker state on scheduler (comms open, state removed) #6390
Comments
`Scheduler.restart` used to remove every worker without closing it. This was bad practice (dask#6390), as well as incorrect: it certainly seemed the intent was only to remove non-Nanny workers. Then, Nanny workers are restarted via the `restart` RPC to the Nanny, not to the worker.
Is there any ETA for this? |
There's an extra layer of complexity added to this when close_workers=False, retire=FalseThe worker sits forever in close_workers=True, retire=FalseCalls close_workers=False, retire=TrueCalls close_workers=True, retire=TrueShut the workers and the nannies down and remove them. |
I suspect the below is purely hypothetical, but I'll note it nonetheless. At the moment, there is no simple API for graceful worker restart. For example, it would make sense for cleaning up a memory leak on a worker without losing the data on it. Currently you can do client.retire_workers([addr], close_workers=False, remove=False)
client.restart_workers([addr]) but with the removal of the flags from retire_workers, it would become impossible. |
Scheduler.remove_worker
removes state regarding the worker (self.workers[addr]
,self.stream_comms[addr]
, etc.), but does not close the actual network connections to the worker. This is even codified in theclose=False
option, which supports removing the worker state, but not telling the worker to shut down or to disconnect.Keeping the network connections open (and listening to them) is essentially a half-removed state. The scheduler no longer knows about the worker, but if the worker sends it updates over the open connection, it will respond to them (potentially invoking handlers that assume the worker state is there).
There are two things to figure out:
self.workers[addr]
exists? Or alsoself.stream_comms[addr]
, and other such fields? Is there aself.handle_worker
coroutine running for that worker too?Scheduler.remove_worker
ensure that:await
), things are in a well-defined state (worker is either "there", or "not there", or maybe even in a "closing" state, but no half-removed state like we have currently)remove_worker
coroutines run concurrently, everything remains consistentremove_worker
coroutines run concurrently, the second one does not return until the worker is actually removed (i.e. the first coroutine has completed)Addresses #6354
The text was updated successfully, but these errors were encountered: