Restart workers when worker-ttl expires #8538
crusaderky commented on Feb 27, 2024 (edited)
- Closes worker-ttl timeout should attempt a nanny restart #8537
- Blocked by Refactor restart() and restart_workers() #8550
@@ -545,7 +545,7 @@ def __init__(
         self._memory_unmanaged_old = 0
         self._memory_unmanaged_history = deque()
         self.metrics = {}
-        self.last_seen = 0
+        self.last_seen = time()
Align behaviour to ClientState
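As a rough illustration of why the initial value of `last_seen` can matter for a TTL check, here is a minimal sketch. It is not the scheduler's actual code: `WorkerRecord` and `expired` are hypothetical names, and the comparison `now - last_seen > ttl` is an assumption about how such a check might be written.

```python
from dataclasses import dataclass, field
from time import time


@dataclass
class WorkerRecord:
    """Hypothetical stand-in for a scheduler-side worker record."""

    address: str
    # Default to "now", mirroring the change from 0 to time() in the diff.
    last_seen: float = field(default_factory=time)


def expired(workers, ttl, now=None):
    """Return addresses of workers not heard from for more than `ttl` seconds."""
    now = time() if now is None else now
    return [w.address for w in workers if now - w.last_seen > ttl]
```

Under these assumptions, a record created with `last_seen=0` would look expired on the very first check, whereas one initialised with `time()` would not.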
b1aaf4a to 7c20a11
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
29 files ±0, 29 suites ±0, 11h 19m 2s ⏱️ +2m 34s
For more details on these failures and errors, see this check.
Results for commit 2cee264. ± Comparison against base commit 6be418d.
♻️ This comment has been updated with latest results.
e24f463 to d1f6231
a4ce685 to b213688
Will this restart the worker immediately, or only once whatever is running releases the GIL again or finishes? What I want to know: if my program holds the GIL and the 300s timeout hits as a result, do we restart the worker before my program finishes?
Immediately. The nanny will first try a graceful shutdown by sending an {op: close} message to the worker and then, if there's no answer for 5 seconds, SIGKILL it.
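The close-then-kill sequence described above can be sketched roughly as follows. This is a minimal illustration, not the nanny's actual code: `stop_worker`, `request_close`, `closed`, and `kill` are hypothetical names, and the 5-second default simply mirrors the figure quoted in the comment.

```python
import asyncio


async def stop_worker(request_close, closed: asyncio.Event, kill, timeout: float = 5.0) -> str:
    """Ask for a graceful close; force-kill if no confirmation arrives in time.

    request_close: async callable that delivers the close request (hypothetical).
    closed: event the worker side sets once it has shut down (hypothetical).
    kill: plain callable that forcibly terminates the process (hypothetical).
    """
    await request_close()
    try:
        # Wait for the worker to confirm, but never longer than `timeout`.
        await asyncio.wait_for(closed.wait(), timeout)
        return "closed"
    except asyncio.TimeoutError:
        # No answer, e.g. because the worker is stuck holding the GIL.
        kill()
        return "killed"
```

In this sketch a worker that never acknowledges the request is killed after `timeout` seconds, regardless of whether its current task has finished.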
4fc3483 to 4a0373f
06b3e18 to 6d56c47
    await self.remove_worker(address=ws.address, stimulus_id=stimulus_id)

if to_restart:
    await self.restart_workers(
Could you briefly check that the tracking in https://github.com/coiled/platform/blob/4dbd6f449884464caaba09b470aa06394a22d024/analytics/preload_scripts/telemetry.py#L772 still works? I don't see a reason why it wouldn't, but I want to be sure.
I'm afraid I can no longer do that. Is there anything else I can do to get this PR over the finish line?
It should still work from what I can tell. Closing workers is a bit brittle, though, so it's not impossible that there is some kind of race condition where that condition would no longer hold. If that's the case, we can look into it later.
I merged main. Assuming CI is not horribly broken, we can merge.