
Fix decide_worker picking a closing worker #8032

Merged: 7 commits merged into dask:main on Aug 3, 2023

Conversation

crusaderky (Collaborator):

@@ -8205,6 +8202,7 @@ def decide_worker(
         candidates = set(all_workers)
     else:
         candidates = {wws for dts in ts.dependencies for wws in dts.who_has}
+        candidates &= all_workers
crusaderky (Collaborator, Author):

This fixes #8019

Member:

FWIW I think this is a situation where an actual decide_worker unit test would be appropriate
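
For illustration, here is a self-contained sketch of the kind of check such a unit test might make. This is not the scheduler's real decide_worker; pick_candidates and the Stub* classes are hypothetical stand-ins that mirror only the candidate-selection lines in the diff above:

from dataclasses import dataclass


@dataclass(frozen=True)
class StubWorkerState:
    """Hypothetical stand-in for the scheduler's WorkerState."""
    address: str
    status: str = "running"


@dataclass(frozen=True)
class StubDep:
    """Hypothetical stand-in for a dependency's TaskState."""
    who_has: frozenset


@dataclass(frozen=True)
class StubTaskState:
    """Hypothetical stand-in for the TaskState being scheduled."""
    dependencies: tuple = ()


def pick_candidates(ts, all_workers):
    """Candidate selection as in the diff: start from the workers holding
    dependencies, then intersect with all_workers (the scheduler only passes
    workers in running status), which drops closing/closed workers."""
    if not ts.dependencies:
        return set(all_workers)
    candidates = {ws for dts in ts.dependencies for ws in dts.who_has}
    candidates &= all_workers  # the line added by this PR
    return candidates


def test_closing_worker_is_never_picked():
    a = StubWorkerState("tcp://a:1234", "running")
    b = StubWorkerState("tcp://b:1234", "closing")
    # b holds more of the dependency data, but it is not running.
    ts = StubTaskState((StubDep(frozenset({b})), StubDep(frozenset({a, b}))))
    assert pick_candidates(ts, {a}) == {a}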

@crusaderky crusaderky self-assigned this Jul 24, 2023
github-actions bot (Contributor) commented on Jul 24, 2023:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

  • 20 files ±0, 20 suites ±0, runtime 11h 58m 7s ⏱️ (+1h 37m 24s)
  • 3 751 tests (+3): 3 638 passed ✔️ (+9), 106 skipped 💤 (−2), 7 failed (−2)
  • 36 284 runs (+2 004): 34 524 passed ✔️ (+1 951), 1 748 skipped 💤 (+58), 12 failed (−3)

For more details on these failures, see this check.

Results for commit 80b36c9. ± Comparison against base commit a7f7764.

This pull request removes 5 and adds 8 tests. Note that renamed tests count towards both.
distributed.cli.tests.test_dask_worker ‑ test_listen_address_ipv6[tcp://[::1]:---nanny]
distributed.cli.tests.test_dask_worker ‑ test_listen_address_ipv6[tcp://[::1]:---no-nanny]
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[False]
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[True]
pytest ‑ internal
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[False-False-closed]
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[False-False-closing]
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[False-True-closed]
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[False-True-closing]
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[True-False-closed]
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[True-False-closing]
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[True-True-closed]
distributed.tests.test_failed_workers ‑ test_submit_after_failed_worker_async[True-True-closing]

♻️ This comment has been updated with latest results.

Comment on lines 82 to 94
in_update_graph = asyncio.Event()

async def update_graph(*args, **kwargs):
    in_update_graph.set()
    await async_poll_for(
        lambda: b_ws.status == Status.closing, timeout=5, period=0
    )
    s.update_graph(*args, **kwargs)
    nonlocal done_update_graph
    done_update_graph = True

s.stream_handlers["update-graph"] = update_graph
await in_update_graph.wait()
Member:

Given the problem, I would've expected something like

# Sketch only: c is the Client, B the worker being closed; Event is
# distributed.Event and inc comes from distributed.utils_test.
def block(arg, enter, exit):
    enter.set()
    exit.wait()
    return arg

enter = Event()
exit = Event()
enter2 = Event()
exit2 = Event()

d1 = c.submit(inc, 0, key='d1', workers=["A"])
d2 = c.submit(block, 1, enter=enter, exit=exit, key='d2', workers=["B"])

x = c.submit(sum, [d1, d2], key='x')
block_executor = c.submit(
    block, None, enter=enter2, exit=exit2, key='block_executor', workers=["B"]
)

await enter.wait()
await enter2.wait()
await asyncio.gather(
    exit.set(),
    B.close(),
)

I haven't tested the above and it may still need fine-tuning, but I would expect something like this to trigger the condition you're talking about: d2 completes while B is closing, so that when d2 finishes, x is transitioned while B is still closing.

I don't entirely understand why we need update_graph to trigger this condition

Member:

Especially considering that update_graph (right now) is still synchronous.

crusaderky (Collaborator, Author):

I'm not too happy with my test, either. But I don't think your suggestion works (read below).

What we're trying to test:

  • decide_worker_non_rootish(ts) is called on a task with workers=[b.address], allow_other_workers=True,
  • with the dependencies of the task partially on a and partially on b,
  • while b is in closing status, and
  • with decide_worker(ts, valid_workers=set(), all_workers={a}), which would pick b from ts.dependencies[...].who_has because fewer dependency bytes would need to be transferred to it, but instead picks a because b is not in all_workers.

The problem with writing the test is that we need to time update_graph to land exactly during the 1-2 event loop cycles while the worker is in closing status.
The worker transitions from closing to being removed when the batched comms collapse, here:

finally:
    if worker in self.stream_comms:
        worker_comm.abort()
        await self.remove_worker(
            worker, stimulus_id=f"handle-worker-cleanup-{time()}"
        )

As alternatives to monkey-patching update_graph, I could have:

  • monkey-patched Scheduler.remove_worker. In hindsight that's a better idea; I'll have a look at it now (a rough sketch follows below), or
  • synchronously called update_graph directly on the scheduler and updated the state on the client by hand (complicated and brittle).
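
For reference, a minimal sketch of that remove_worker monkeypatch, assuming the usual gen_cluster test fixtures (s is the scheduler, monkeypatch is the pytest fixture); the event names mirror the excerpt quoted further down:

import asyncio

in_remove_worker = asyncio.Event()
wait_remove_worker = asyncio.Event()
orig_remove_worker = s.remove_worker

async def remove_worker(*args, **kwargs):
    # Signal that the scheduler reached its worker-cleanup path, then hold
    # the worker in "closing" status until the test releases it.
    in_remove_worker.set()
    await wait_remove_worker.wait()
    return await orig_remove_worker(*args, **kwargs)

monkeypatch.setattr(s, "remove_worker", remove_worker)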

In your code:

I think the scheduler will never receive the task-finished message for d2, since that message arrives a whole two event-loop cycles after Worker.close collapses the batched comms.

I'm also not sure why you expect the scheduler to receive {op: task-finished, key: d2} deterministically after {op: worker-status-change, worker: <b>, status: closing}, yet deterministically before the collapse of the TCP channel.

Also, I don't understand the purpose of block_executor.

Member:

but deterministically before the collapse of the TCP channel?

I never said it would do so deterministically. I said I would expect something like this to trigger the condition you described. I also said it would require more fine-tuning, like distributing nbytes properly and maybe introducing an event somewhere.

crusaderky (Collaborator, Author):
I got 2 failures out of 400 runs; neither seems to be related to this PR.

Ready for review and merge.

@crusaderky crusaderky marked this pull request as ready for review July 26, 2023 14:18
    await wait_remove_worker.wait()
    return await orig_remove_worker(*args, **kwargs)

monkeypatch.setattr(s, "remove_worker", remove_worker)
Member:

I'm OK-ish with using monkeypatch here. However, just for the sake of posterity, there is also a way to use our RPC mechanism more naturally. Essentially, you want to intercept the point in time just when a request handler is called. You can make this very explicit:

async def new_remove_worker_handler_with_events(self, *args, **kwargs):
    in_remove_worker.set()
    await wait_remove_worker.wait()
    return await self.remove_worker(*args, **kwargs)

s.handlers['unregister'] = new_remove_worker_handler_with_events

Semantically, this overrides the unregister handler and replaces it with a new one. In the end it's the same thing; only the way the patch is installed differs.

crusaderky (Collaborator, Author) commented on Jul 31, 2023:

We're not arriving here from the unregister handler. We're arriving from:

finally:
    if worker in self.stream_comms:
        worker_comm.abort()
        await self.remove_worker(
            worker, stimulus_id=f"handle-worker-cleanup-{time()}"
        )

Comment on lines 65 to 67
L = c.map(
    inc,
    range(10),
Member:

Do we actually need a map for this? This feels much more difficult to control than if we used single tasks with specific placement.
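
A rough sketch of that suggestion, purely for illustration (c, a, b and inc are assumed from a gen_cluster-style test; keys are arbitrary):

# Pin each dependency to a specific worker instead of relying on where
# c.map happens to schedule things.
d1 = c.submit(inc, 1, key="d1", workers=[a.address])
d2 = c.submit(inc, 2, key="d2", workers=[b.address])
await wait([d1, d2])  # distributed.wait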

crusaderky (Collaborator, Author):

You're right. Simplified.

Comment on lines -2223 to -2226
if not self.running:
    return None
Member:

Is this related? At least the new test doesn't seem to care about this.

crusaderky (Collaborator, Author):

It's unreachable because the same condition is already tested on line 2218.
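
As a generic illustration of the pattern being removed (not the actual scheduler code):

def example(self):
    if not self.running:   # the earlier guard (line 2218 in the diff)
        return None
    pool = self.running    # nothing in between mutates self.running ...
    if not self.running:   # ... so this second guard can never trigger
        return None
    return pool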

distributed/tests/test_failed_workers.py (outdated review thread; resolved)
crusaderky (Collaborator, Author):
Code review comments have been addressed

@fjetter fjetter merged commit 84e1984 into dask:main Aug 3, 2023
@crusaderky crusaderky deleted the closing_worker branch August 3, 2023 09:44
Successfully merging this pull request may close these issues:

  • AssertionError in decide_worker_non_rootish
  • CI failing with test_submit_after_failed_worker_async

2 participants