SpecCluster correct state can cause inconsistencies #5919

Open
fjetter opened this issue Mar 9, 2022 · 2 comments

Labels
bug Something is broken

Comments


fjetter commented Mar 9, 2022

There are a bunch of problems in SpecCluster._correct_state_internal, to the point that I believe it should be rewritten:

  • If one worker fails during startup, all workers are rejected. This can cause the cluster to spin up too many workers.
  • Similarly, while closing, if one worker fails but the others shut down properly, they are not removed from the internal state.
  • _correct_state_internal is only called once, i.e. any kind of exception aborts the entire up-/downscaling without any further attempt to correct the state.
  • If the cluster is closing while _correct_state is running, nothing is actually cancelled.
  • self._correct_state_waiting is actually never cancelled.
  • SpecCluster.scale schedules a callback to self._correct_state. This can cause all sorts of race conditions, e.g. creating more futures even though the cluster is already closing.

This list could probably go on. Most issues could be addressed individually, but I believe we're better off rewriting this section.
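
For illustration of the first two points, the startup path could keep per-worker bookkeeping instead of failing the whole batch on the first error. The snippet below is only a rough sketch, assuming the workers are awaitables that are started by awaiting them; the helper name _start_workers and its signature are made up, not existing SpecCluster code:

import asyncio
import weakref


async def _start_workers(cluster, workers):
    # Rough sketch only: _start_workers is a hypothetical helper, not actual
    # SpecCluster code.  Start all pending workers concurrently;
    # return_exceptions=True means one failing worker does not discard the
    # workers that did come up.
    workers = list(workers)  # fix an iteration order for the zip below
    results = await asyncio.gather(*workers, return_exceptions=True)
    started, failed = [], []
    for w, result in zip(workers, results):
        if isinstance(result, BaseException):
            failed.append((w, result))
        else:
            w._cluster = weakref.ref(cluster)
            started.append(w)
    # Only the workers that actually started get recorded; failures can be
    # retried or reported without rejecting the whole batch.
    return started, failed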

cc @graingert

fjetter added the bug label on Mar 9, 2022

graingert commented Mar 9, 2022

If one worker fails during startup, all workers are rejected. This can cause the cluster to spin up too many workers

it's actually sort of worse: the workers get two chances to start up, once (concurrently) in asyncio.wait(workers) and once more (sequentially) in await w  # for tornado gen.coroutine support:

await asyncio.wait(workers)
for w in workers:
    w._cluster = weakref.ref(self)
    await w  # for tornado gen.coroutine support

I think this is supposed to be more like

tasks = [asyncio.create_task(worker) for worker in workers]
await asyncio.wait(tasks)
for t in tasks:
    w = await t
    w._cluster = weakref.ref(self)
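
One caveat, assuming the worker objects are awaitables with __await__ rather than plain coroutines: asyncio.create_task() only accepts coroutines and would raise TypeError, whereas asyncio.ensure_future() wraps any awaitable in a task. A variant under that assumption:

tasks = [asyncio.ensure_future(worker) for worker in workers]
await asyncio.wait(tasks)
for t in tasks:
    w = await t  # already finished; just retrieves the result or re-raises
    w._cluster = weakref.ref(self)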


fjetter commented Mar 30, 2022

Xref #4606

wence- added a commit to wence-/distributed that referenced this issue on Oct 30, 2023:
Prior to dask#8233, when correcting state after calling Cluster.scale, we
would wait until all futures had completed before updating the mapping
of workers that we knew about. This meant that failure to boot a
worker would propagate from a message on the worker side to an
exception on the cluster side. With dask#8233 this order was changed, so
that the workers we know about are updated before checking if the
worker successfully booted. With this change, any exception is not
propagated from the worker to the cluster, and so we cannot easily
tell if scaling our cluster was successful. While _correct_state has
issues (see dask#5919), until we can fix this properly, at least restore
the old behaviour of propagating any raised exceptions to the cluster.