
Ensure Nanny doesn't restart workers that fail to start, and joins subprocess #6427

Merged: gjoseph92 merged 31 commits into dask:main from nanny-close-proc-on-start-failure on Aug 5, 2022

Conversation

gjoseph92 (Collaborator)

Closes #6426

I think this could be split into multiple PRs if desired. I'd recommend reviewing commit by commit.

  • Tests added / passed
  • Passes pre-commit run --all-files

gjoseph92 added 8 commits May 23, 2022 16:40
This feels more reasonable, but may not actually matter since it's still not waiting for the process to be joined (just for the signal to be sent). Consider reverting.
`Server.start._close_on_failure` can set status to `failed` when `self.start` fails. If the process doesn't terminate until after this has happened, it may encounter this status.

A passlist here would be more sensible.
This is a big deal, because we might be leaking a process then.
We want to be able to log it.
@github-actions (Contributor)

Unit Test Results

12 files -3    12 suites -3    5h 50m 28s ⏱️ -1h 3m 58s
2 809 tests +1    2 727 passed ±0    79 skipped ±0    3 failed +1
17 649 runs -3 169    16 784 passed -3 098    859 skipped -74    6 failed +3

For more details on these failures, see this check.

Results for commit 4d61b38. ± Comparison against base commit 7665eaa.

@fjetter (Member) commented May 24, 2022

Only remotely related to this change but the name Nanny._on_exit confused me quite a bit. It took me way too long to realize that this is in fact a WorkerProcess on_exit handler and not a Nanny on_exit. I was wondering why we'd even restart the worker if the nanny is exiting...

Doesn't need to happen in this PR, but we should consider renaming these methods.

@gjoseph92 gjoseph92 self-assigned this Jun 7, 2022
gjoseph92 added 7 commits June 8, 2022 11:44
This is kinda superfluous with `raises_with_cause`, but I didn't want to refactor everywhere `raises_with_cause` is used since this syntax is a bit less readable.
I kinda like verifying that the error isn't spewed though. May revert.
@github-actions (Contributor) commented Jun 8, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0    15 suites ±0    6h 29m 23s ⏱️ -4m 44s
2 989 tests +1    2 900 passed +2    88 skipped ±0    1 failed -1
22 165 runs +8    21 121 passed +10    1 043 skipped -1    1 failed -1

For more details on these failures, see this check.

Results for commit 88abaf2. ± Comparison against base commit 4f6960a.

♻️ This comment has been updated with latest results.

f"Worker process still alive after {timeout} seconds, killing"
)
try:
await process.terminate()
gjoseph92 (Collaborator, Author)

Change: we used to first call Worker.close (via child_stop_q.put({"op": "stop"})), then send SIGTERM to the process if it didn't stop in time, then just return to the caller as soon as the terminate signal was sent, regardless of whether the process actually stopped.

Now, we still do Worker.close, but then send SIGKILL (which, unlike SIGTERM, can't be blocked), because Worker.close can already be considered our SIGTERM-like request for graceful close. We then wait for the process to actually terminate.
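
For readers skimming the thread, here is a minimal, self-contained sketch of that pattern using plain `asyncio` subprocesses on POSIX. The helper name `stop_then_kill`, the use of SIGINT as a stand-in for the `child_stop_q` close message, and the `sleep 60` child are all illustrative, not the actual Nanny/WorkerProcess code:

```python
import asyncio
import signal


async def stop_then_kill(proc: asyncio.subprocess.Process, grace: float = 5.0) -> int:
    """Gracefully ask a child process to stop, then force-kill it and wait.

    Mirrors the sequence described above: a graceful close request first
    (SIGINT here stands in for the Worker.close message sent over
    child_stop_q), SIGKILL if the child is still alive after the grace
    period, and finally a wait so the caller knows the process is gone.
    """
    proc.send_signal(signal.SIGINT)  # graceful request
    try:
        return await asyncio.wait_for(proc.wait(), timeout=grace)
    except asyncio.TimeoutError:
        pass

    # Still alive: SIGKILL cannot be blocked or handled by the child.
    proc.kill()
    return await proc.wait()


async def main() -> None:
    proc = await asyncio.create_subprocess_exec("sleep", "60")
    print("exited with return code", await stop_then_kill(proc, grace=1.0))


if __name__ == "__main__":
    asyncio.run(main())
```

The key difference from the old behaviour is the final `wait()`: the coroutine does not return until the child has actually exited.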

f"Worker process still alive after {wait_timeout} seconds, killing"
)
await process.kill()
await process.join(max(0, deadline - time()))
gjoseph92 (Collaborator, Author)

I'm a little wary of this deadline on the join. I could imagine the default 2s timeout not being long enough for the process to actually shut down in CI. kill previously didn't raise an error if the timeout expired; now it does. I could make it not raise, but I think it's really the caller's responsibility to decide what to do if the process doesn't shut down in time.

I think we should probably make the default timeout a bit longer.
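
To illustrate the "caller's responsibility" point, a hypothetical caller-side sketch. The `close_worker` helper, the `timeout=` keyword, and the assumption that `kill` raises `asyncio.TimeoutError` on expiry are stand-ins for the API being discussed, not the actual Nanny code:

```python
import asyncio


async def close_worker(worker_process, timeout: float = 10.0) -> None:
    """Hypothetical caller-side handling if kill() now raises on timeout."""
    try:
        # Stand-in for the WorkerProcess API discussed above.
        await worker_process.kill(timeout=timeout)
    except asyncio.TimeoutError:
        # The process did not terminate within the deadline.  The caller now
        # decides: log and escalate, retry with a longer timeout, or surface
        # the failure -- instead of the old behaviour of returning silently
        # while the process might still be alive.
        raise RuntimeError(
            f"Worker process did not shut down within {timeout}s"
        ) from None
```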

distributed/tests/test_nanny.py (resolved)
@@ -309,6 +309,7 @@ async def start_unsafe(self):
         await self.rpc.start()
         return self
 
+    @final
gjoseph92 (Collaborator, Author)

Drive-by: added `@final` so nobody can try to override this in the future.
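
For context, `typing.final` only has an effect for static type checkers; a toy sketch (the class names are illustrative, not the actual `distributed.core.Server`):

```python
from typing import final


class Server:
    @final
    async def start(self) -> "Server":
        """Subclasses should override start_unsafe, not start."""
        return self


class BadServer(Server):
    # Static type checkers such as mypy reject this override because
    # Server.start is marked @final; at runtime the decorator enforces nothing.
    async def start(self) -> "Server":  # error: cannot override a final method
        return self
```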

@@ -405,7 +405,7 @@ async def instantiate(self) -> Status:
         self.process = WorkerProcess(
             worker_kwargs=worker_kwargs,
             silence_logs=self.silence_logs,
-            on_exit=self._on_exit_sync,
+            on_exit=self._on_worker_exit_sync,
gjoseph92 (Collaborator, Author)

Drive-by: renamed for clarity.

@hendrikmakait (Member)

It looks like the added tests work now, but the error of test_quiet_close_process (#6582) has changed.

@hendrikmakait (Member)

Locally, test_quiet_close_process still flakes with:

sys:1: RuntimeWarning: coroutine 'InProc.write' was never awaited
Task was destroyed but it is pending!
task: <Task pending name='Task-70' coro=<InProc.write() running at /Users/hendrikmakait/projects/dask/distributed/distributed/comm/inproc.py:215> cb=[IOLoop.add_future.<locals>.<lambda>() at /opt/homebrew/Caskroom/mambaforge/base/envs/dask-distributed-py3.9/lib/python3.9/site-packages/tornado/ioloop.py:688]>

@hendrikmakait (Member)

@gjoseph92: I'd say we merge this in and monitor for changes in test_quiet_close_process? #6551 already aims at solving the local issue.

@gjoseph92 gjoseph92 marked this pull request as ready for review August 5, 2022 15:09
@gjoseph92 gjoseph92 merged commit caf5189 into dask:main Aug 5, 2022
@gjoseph92 (Collaborator, Author)

Thanks for finishing this up, @hendrikmakait!

@gjoseph92 gjoseph92 deleted the nanny-close-proc-on-start-failure branch August 5, 2022 15:14
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022

Successfully merging this pull request may close these issues.

Nanny restarts worker once if it fails to start / doesn't clean up processes on startup failure