SpecCluster correct state can cause inconsistencies #5919
it's actually sort of worse - the workers get 2 chances to start up, one (concurrently) in distributed/deploy/spec.py, lines 349 to 352 at e1e4385.

I think this is supposed to be more like:

```python
# Start every worker concurrently, then collect each one.
tasks = [asyncio.create_task(worker) for worker in workers]
await asyncio.wait(tasks)
for t in tasks:
    w = await t
    w._cluster = weakref.ref(self)
```
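Awaiting each task individually after `asyncio.wait` is what would surface a worker's startup exception here; `asyncio.wait` by itself leaves any exception unretrieved inside its task.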
Xref #4606
Prior to dask#8233, when correcting state after calling Cluster.scale, we would wait until all futures had completed before updating the mapping of workers that we knew about. This meant that a failure to boot a worker would propagate from a message on the worker side to an exception on the cluster side. With dask#8233 this order was changed, so that the workers we know about are updated before checking whether each worker booted successfully. With this change, exceptions are no longer propagated from the worker to the cluster, so we cannot easily tell whether scaling the cluster succeeded. While _correct_state has deeper issues (see dask#5919), until we can fix this properly, at least restore the old behaviour of propagating any raised exceptions to the cluster.
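For illustration only, a minimal sketch of the two orderings described above, using invented names (`known_workers`, `boot_coros`) rather than the actual `_correct_state_internal` code:

```python
import asyncio


async def open_workers_old_order(known_workers, boot_coros):
    """Pre-dask#8233 ordering (illustrative): wait for every boot to finish,
    so a failed boot raises here and never enters ``known_workers``."""
    tasks = [asyncio.create_task(coro) for coro in boot_coros.values()]
    results = await asyncio.gather(*tasks)  # re-raises the first boot failure
    known_workers.update(dict(zip(boot_coros, results)))


async def open_workers_new_order(known_workers, boot_coros):
    """Post-dask#8233 ordering (illustrative): record the workers first, then
    wait; asyncio.wait does not re-raise, so a boot failure stays inside its task."""
    tasks = {name: asyncio.create_task(coro) for name, coro in boot_coros.items()}
    known_workers.update(tasks)
    await asyncio.wait(tasks.values())
```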
There are a bunch of problems in SpecCluster._correct_state_internal, such that I believe it should be rewritten:

- _correct_state_internal is only called once, i.e. any kind of exception aborts the entire up/downscaling without any further attempt to correct the state.
- self._correct_state_waiting is actually never cancelled.
- SpecCluster.scale schedules a callback to self._correct_state. This can cause all sorts of race conditions, e.g. by creating more futures even if the cluster is already closing (see the sketch after this list).

This list probably goes on, and most of these issues could be addressed individually, but I believe we're better off rewriting this section.
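To make the last two bullets concrete, here is a toy sketch of the scale()/close() race and the missing cancellation. All names (`StateCorrectorSketch`, `closing`, the guard in `_correct_state`) are invented for illustration and do not mirror the real SpecCluster implementation:

```python
import asyncio


class StateCorrectorSketch:
    """Toy model of the scale()/_correct_state interaction described above;
    not the real SpecCluster code."""

    def __init__(self):
        self.closing = False
        self._correct_state_waiting = None  # pending corrective task, if any

    def scale(self, n):
        # Scheduling the correction as a fire-and-forget task is where the
        # races come from: it may still run after close() has started.
        self._correct_state_waiting = asyncio.create_task(self._correct_state(n))

    async def _correct_state(self, n):
        if self.closing:
            return  # guard: don't create new worker futures while closing
        ...  # open or close workers until we have n of them

    async def close(self):
        self.closing = True
        if self._correct_state_waiting is not None:
            # The issue notes that the real _correct_state_waiting is never
            # cancelled; a rewrite would need to handle this explicitly.
            self._correct_state_waiting.cancel()
            try:
                await self._correct_state_waiting
            except asyncio.CancelledError:
                pass
```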
cc @graingert