
Ensure Nanny doesn't restart workers that fail to start, and joins subprocess #6427

Merged: gjoseph92 merged 31 commits into dask:main from nanny-close-proc-on-start-failure on Aug 5, 2022

Conversation

gjoseph92 (Collaborator)

Closes #6426

I think this could be split into multiple PRs if desired. I'd recommend reviewing commit by commit.

  • Tests added / passed
  • Passes pre-commit run --all-files

gjoseph92 added 8 commits May 23, 2022 16:40
This feels more reasonable, but may not actually matter since it's still not waiting for the process to be joined (just for the signal to be sent). Consider reverting.
`Server.start._close_on_failure` can set status to `failed` when `self.start` fails. If the process doesn't terminate until after this has happened, it may encounter this status.

A passlist here would be more sensible.
This is a big deal, because we might be leaking a process then.
We want to be able to log it.
@github-actions (Contributor)

Unit Test Results

12 files -3    12 suites -3    5h 50m 28s ⏱️ -1h 3m 58s
2 809 tests +1    2 727 passed ±0    79 skipped ±0    3 failed +1
17 649 runs -3 169    16 784 passed -3 098    859 skipped -74    6 failed +3

For more details on these failures, see this check.

Results for commit 4d61b38. ± Comparison against base commit 7665eaa.

@fjetter (Member) commented May 24, 2022

Only remotely related to this change but the name Nanny._on_exit confused me quite a bit. It took me way too long to realize that this is in fact a WorkerProcess on_exit handler and not a Nanny on_exit. I was wondering why we'd even restart the worker if the nanny is exiting...

Doesn't need to happen in this PR, but we should consider renaming these methods.

@gjoseph92 gjoseph92 self-assigned this Jun 7, 2022
gjoseph92 added 7 commits June 8, 2022 11:44
This is kinda superfluous with `raises_with_cause`, but I didn't want to refactor everywhere `raises_with_cause` is used since this syntax is a bit less readable.
I kinda like verifying that the error isn't spewed though. May revert.
@github-actions (Contributor) commented Jun 8, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0    15 suites ±0    6h 29m 23s ⏱️ -4m 44s
2 989 tests +1    2 900 passed +2    88 skipped ±0    1 failed -1
22 165 runs +8    21 121 passed +10    1 043 skipped -1    1 failed -1

For more details on these failures, see this check.

Results for commit 88abaf2. ± Comparison against base commit 4f6960a.

♻️ This comment has been updated with latest results.

f"Worker process still alive after {timeout} seconds, killing"
)
try:
await process.terminate()
gjoseph92 (Collaborator, Author)

Change: we used to first call Worker.close (via child_stop_q.put({"op": "stop"})), then send SIGTERM to the process if it didn't stop in time, then just return to the caller as soon as the terminate signal was sent, regardless of whether the process actually stopped.

Now, we still do Worker.close, but then send SIGKILL (which, unlike SIGTERM, can't be blocked), because Worker.close can already be considered our SIGTERM-like request for graceful close. We then wait for the process to actually terminate.
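
For readers skimming the thread, here is a minimal, self-contained sketch of that pattern using plain `asyncio` subprocesses on POSIX. The helper name `stop_then_kill`, the use of SIGINT as a stand-in for the `child_stop_q` close message, and the `sleep 60` child are all illustrative, not the actual Nanny/WorkerProcess code:

```python
import asyncio
import signal


async def stop_then_kill(proc: asyncio.subprocess.Process, grace: float = 5.0) -> int:
    """Gracefully ask a child process to stop, then force-kill it and wait.

    Mirrors the sequence described above: a graceful close request first
    (SIGINT here stands in for the Worker.close message sent over
    child_stop_q), SIGKILL if the child is still alive after the grace
    period, and finally a wait so the caller knows the process is gone.
    """
    proc.send_signal(signal.SIGINT)  # graceful request
    try:
        return await asyncio.wait_for(proc.wait(), timeout=grace)
    except asyncio.TimeoutError:
        pass

    # Still alive: SIGKILL cannot be blocked or handled by the child.
    proc.kill()
    return await proc.wait()


async def main() -> None:
    proc = await asyncio.create_subprocess_exec("sleep", "60")
    print("exited with return code", await stop_then_kill(proc, grace=1.0))


if __name__ == "__main__":
    asyncio.run(main())
```

The key difference from the old behaviour is the final `wait()`: the coroutine does not return until the child has actually exited.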

f"Worker process still alive after {wait_timeout} seconds, killing"
)
await process.kill()
await process.join(max(0, deadline - time()))
gjoseph92 (Collaborator, Author)

I'm a little wary of this deadline on the join. I could imagine the default 2s timeout not being long enough for the process to actually shut down in CI. kill previously didn't raise an error if the timeout expired; now it does. I could make it not raise, but I think it's really the caller's responsibility to decide what to do if the process doesn't shut down in time.

I think we should probably make the default timeout a bit longer.
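
To illustrate the "caller's responsibility" point, a hypothetical caller-side sketch. The `close_worker` helper, the `timeout=` keyword, and the assumption that `kill` raises `asyncio.TimeoutError` on expiry are stand-ins for the API being discussed, not the actual Nanny code:

```python
import asyncio


async def close_worker(worker_process, timeout: float = 10.0) -> None:
    """Hypothetical caller-side handling if kill() now raises on timeout."""
    try:
        # Stand-in for the WorkerProcess API discussed above.
        await worker_process.kill(timeout=timeout)
    except asyncio.TimeoutError:
        # The process did not terminate within the deadline.  The caller now
        # decides: log and escalate, retry with a longer timeout, or surface
        # the failure -- instead of the old behaviour of returning silently
        # while the process might still be alive.
        raise RuntimeError(
            f"Worker process did not shut down within {timeout}s"
        ) from None
```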

distributed/tests/test_nanny.py (resolved)
@@ -309,6 +309,7 @@ async def start_unsafe(self):
         await self.rpc.start()
         return self
 
+    @final
gjoseph92 (Collaborator, Author)

Drive-by: added `@final` so nobody can try to override this in the future.
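
For context, `typing.final` only has an effect for static type checkers; a toy sketch (the class names are illustrative, not the actual `distributed.core.Server`):

```python
from typing import final


class Server:
    @final
    async def start(self) -> "Server":
        """Subclasses should override start_unsafe, not start."""
        return self


class BadServer(Server):
    # Static type checkers such as mypy reject this override because
    # Server.start is marked @final; at runtime the decorator enforces nothing.
    async def start(self) -> "Server":  # error: cannot override a final method
        return self
```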

@@ -405,7 +405,7 @@ async def instantiate(self) -> Status:
         self.process = WorkerProcess(
             worker_kwargs=worker_kwargs,
             silence_logs=self.silence_logs,
-            on_exit=self._on_exit_sync,
+            on_exit=self._on_worker_exit_sync,
gjoseph92 (Collaborator, Author)

Drive-by: renamed for clarity.

@hendrikmakait (Member)

It looks like the added tests work now, but the error of test_quiet_close_process (#6582) has changed.

@hendrikmakait (Member)

Locally, test_quiet_close_process still flakes with:

sys:1: RuntimeWarning: coroutine 'InProc.write' was never awaited
Task was destroyed but it is pending!
task: <Task pending name='Task-70' coro=<InProc.write() running at /Users/hendrikmakait/projects/dask/distributed/distributed/comm/inproc.py:215> cb=[IOLoop.add_future.<locals>.<lambda>() at /opt/homebrew/Caskroom/mambaforge/base/envs/dask-distributed-py3.9/lib/python3.9/site-packages/tornado/ioloop.py:688]>

@hendrikmakait (Member)

@gjoseph92: I'd say we merge this in and monitor for changes in test_quiet_close_process? #6551 already aims at solving the local issue.

@gjoseph92 gjoseph92 marked this pull request as ready for review August 5, 2022 15:09
@gjoseph92 gjoseph92 merged commit caf5189 into dask:main Aug 5, 2022
@gjoseph92 (Collaborator, Author)

Thanks for finishing this up, @hendrikmakait!

@gjoseph92 gjoseph92 deleted the nanny-close-proc-on-start-failure branch August 5, 2022 15:14
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022

Successfully merging this pull request may close these issues.

Nanny restarts worker once if it fails to start / doesn't clean up processes on startup failure