-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure Nanny doesn't restart workers that fail to start, and joins subprocess #6427
Ensure Nanny doesn't restart workers that fail to start, and joins subprocess #6427
Conversation
This feels more reasonable, but may not actually matter since it's still not waiting for the process to be joined (just for the signal to be sent). Consider reverting.
`Server.start._close_on_failure` can set status to `failed` when `self.start` fails. If the process doesn't termiante until after this has happened, it may encounter this status. A passlist here would be more sensible.
This is a big deal, because we might be leaking a process then.
We want to be able to log it
Only remotely related to this change but the name Doesn't need to happen in this PR but we should consider renaming these methods |
This is kinda superfluous with `raises_with_cause`, but I didn't want to refactor everywhere `raises_with_cause` is used since this syntax is a bit less readable.
FIXME this crashes mypy
I kinda like verifying that the error isn't spewed though. May revert.
This reverts commit 00e277e.
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files ±0 15 suites ±0 6h 29m 23s ⏱️ - 4m 44s For more details on these failures, see this check. Results for commit 88abaf2. ± Comparison against base commit 4f6960a. ♻️ This comment has been updated with latest results. |
I'm suspicious that corruption of this somehow caused the repeated segfault?
f"Worker process still alive after {timeout} seconds, killing" | ||
) | ||
try: | ||
await process.terminate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change: we used to first call Worker.close
(via child_stop_q.put({"op": "stop"})
), then send SIGTERM to the process if it didn't stop in time, then just return to the caller as soon as the terminate signal was sent, regardless of whether the process actually stopped.
Now, we still do Worker.close
, but then send SIGKILL (which unlike SIGTERM, can't be blocked) because the Worker.close
can already be considered our SIGTERM-like request for graceful close. We wait for the process to actually terminate.
f"Worker process still alive after {wait_timeout} seconds, killing" | ||
) | ||
await process.kill() | ||
await process.join(max(0, deadline - time())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little wary of this deadline on the join. I could imagine the default 2s timeout not being long enough for the process to actually shut down in CI. kill
didn't used to raise an error if the timeout expired; now it will. I could make it not, but I think it's really the caller's responsibility to decide what to do if the process doesn't shut down in time.
I think we should probably make the default timeout a bit longer.
@@ -309,6 +309,7 @@ async def start_unsafe(self): | |||
await self.rpc.start() | |||
return self | |||
|
|||
@final |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
driveby: added so nobody in the future can try to override this
@@ -405,7 +405,7 @@ async def instantiate(self) -> Status: | |||
self.process = WorkerProcess( | |||
worker_kwargs=worker_kwargs, | |||
silence_logs=self.silence_logs, | |||
on_exit=self._on_exit_sync, | |||
on_exit=self._on_worker_exit_sync, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
driveby: renamed for clarity
Still segfaulting in CI, maybe mp objects aren't thread-safe.
It looks like the added tests work now, but the error of |
Locally,
|
@gjoseph92: I'd say we merge this in and monitor for changes in |
Thanks for finishing this up @hendrikmakait |
Closes #6426
I think this could be split into multiple PRs if desired. I'd recommend reviewing commit by commit.
pre-commit run --all-files