Follow-up on removing `report` and `safe` from `Worker.close` #6423
Conversation
This isn't necessary; we could leave it. But I find `terminate` vs `close` confusingly named, and it's not really going to be used now. I think this might be better added as part of dask#6387? Could go either way.
```python
else:
    self.running.discard(ws)
```
Resolves https://github.com/dask/distributed/pull/6363/files#r878475185
This is equivalent to main before the PR:
`distributed/scheduler.py`, lines 4747 to 4777 in 41a54ee:

```python
def handle_worker_status_change(
    self, status: str, worker: str, stimulus_id: str
) -> None:
    ws = self.workers.get(worker)
    if not ws:
        return
    prev_status = ws.status
    ws.status = Status.lookup[status]  # type: ignore
    if ws.status == prev_status:
        return

    self.log_event(
        ws.address,
        {
            "action": "worker-status-change",
            "prev-status": prev_status.name,
            "status": status,
        },
    )

    if ws.status == Status.running:
        self.running.add(ws)
        recs = self.bulk_schedule_after_adding_worker(ws)
        if recs:
            client_msgs: dict = {}
            worker_msgs: dict = {}
            self._transitions(recs, client_msgs, worker_msgs, stimulus_id)
            self.send_all(client_msgs, worker_msgs)
    else:
        self.running.discard(ws)
```
I think this is required for some tests. https://github.com/dask/distributed/pull/6363/files#r879719934

I agree that an explicit test would be better, but I expect something to break. If CI is happy, I'm happy, though.
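A minimal sketch of the explicit test discussed here, written against the `handle_worker_status_change` snippet above (the test name and `stimulus_id` value are hypothetical; this is not a test from the PR):

```python
from distributed.utils_test import gen_cluster


@gen_cluster(nthreads=[("", 1)])
async def test_paused_worker_leaves_running_set(s, a):
    ws = s.workers[a.address]
    assert ws in s.running

    # A worker reporting any non-running status should be discarded
    # from the scheduler's running set...
    s.handle_worker_status_change("paused", a.address, "test-stimulus")
    assert ws not in s.running

    # ...and re-added once it reports running again.
    s.handle_worker_status_change("running", a.address, "test-stimulus")
    assert ws in s.running
```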
```python
@gen_cluster(Worker=Nanny, nthreads=[("", 1)], timeout=10)
async def test_scheduler_crash_doesnt_restart(s, a):
```
This fails on `main`.
```python
async def terminate(self, **kwargs):
    return await self.close(nanny=True, **kwargs)
```
I opted to remove `Worker.terminate` since it won't be used now that the `Worker.close` default is to also close the Nanny. I think we will want two different methods eventually, but found the naming of `terminate` vs `close` very confusing; it should be something like `terminate` vs `maybe_restart`?

If others would prefer to see `terminate` stick around, I'm happy to remove this commit and leave it.
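A hypothetical sketch of the renaming floated above, with a stub `close` standing in for the real worker (none of this is code from the PR):

```python
class WorkerSketch:
    """Stand-in for Worker, to illustrate the proposed method names."""

    async def close(self, nanny: bool = True, **kwargs) -> str:
        return "worker and nanny closed" if nanny else "worker closed"

    async def terminate(self, **kwargs) -> str:
        # Unambiguous: take the whole process tree down; nothing restarts.
        return await self.close(nanny=True, **kwargs)

    async def maybe_restart(self, **kwargs) -> str:
        # Close only the worker; a supervising nanny may start a fresh one.
        return await self.close(nanny=False, **kwargs)
```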
```diff
 @log_errors
 async def close(
     self,
     timeout=30,
     executor_wait=True,
-    nanny=False,
+    nanny=True,
```
By changing the default back to `True`, this resolves:
This reverts commit ae9a98f.
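What the restored default means for callers, as a short sketch (the `shut_down` helper and its `worker` argument are illustrative, not API from this PR):

```python
async def shut_down(worker) -> None:
    # With nanny=True as the default again, a bare close() also shuts
    # down the supervising nanny, so nothing restarts the worker:
    await worker.close()

    # Opting out keeps the previous behavior: the worker closes and the
    # nanny is free to start a replacement process.
    # await worker.close(nanny=False)
```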
I'm 100% fine with reverting the default for `nanny`.

Regarding the `remove_worker` on status change, I believe we require this. If this is blocking the PR, I would propose not to revert this to the original code and to keep my version. We may want to add a couple of tests, but I think we should merge the nanny toggle reversion regardless, and the tests should be a follow-up PR (i.e. not necessarily release blocking).
Unit Test Results

15 files +15, 15 suites +15, 6h 26m 7s ⏱️ +6h 26m 7s

For more details on these failures, see this check.

Results for commit e3ab594. ± Comparison against base commit d84485b.

♻️ This comment has been updated with latest results.
This reverts commit d6e513c.
This was breaking because:

* Worker failed to start
* `Server.start` then calls `self.close()`
* `Worker.close` was calling the `close_gracefully` RPC on the Nanny
* For some reason, the RPC on the Nanny was unresponsive. I can't find an explanation for this. The event loop was not blocked. The listener seemed to be running. This definitely needs more investigation, but I didn't want to look into it right now.
* The worker was waiting forever on the nanny to respond, and the nanny was waiting forever on the worker to close.
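The hang described above is a classic mutual wait: the worker blocks on the nanny's RPC while the nanny blocks on the worker closing. Purely as an illustration of breaking such a cycle (not the fix this PR takes), a generic asyncio guard might look like this; `nanny_rpc` and the helper name are hypothetical:

```python
import asyncio


async def close_gracefully_with_timeout(nanny_rpc, timeout: float = 5.0) -> bool:
    """Try the nanny RPC, but give up instead of deadlocking."""
    try:
        await asyncio.wait_for(nanny_rpc.close_gracefully(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # Proceed with closing the worker anyway rather than waiting
        # forever on an unresponsive nanny.
        return False
```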
Can confirm these now pass for me locally with Gabe's latest updates. @pentschev, do those Dask-CUDA tests also pass?
I'm still seeing this fail/hang on
Oops, forgot to pull latest changes. This passes, as does my testing of Dask-CUDA.
@pentschev is asleep, but also confirmed that this fixes the immediate issues in dask-cuda.
Co-authored-by: Florian Jetter <[email protected]>
Seems like this is good to go? Do the failures seem flaky to others? #6426 is a follow-up issue, but not a blocker.
Thanks @gjoseph92! Given the current time, let's give @fjetter a chance to take a look, and once this PR is in I'll push out the release (hopefully early morning CT).
```python
if self.status in (Status.closed, Status.closing, Status.failed):
    await self.finished()
    return

if self.status == Status.init:
    # If the worker is still in startup/init and is started by a nanny,
    # this means the nanny itself is not up yet. If the Nanny isn't up
    # yet, its server will not accept any incoming RPC requests and
    # will block until the startup is finished.
    # Therefore, this worker trying to communicate with the Nanny during
    # startup is not possible and we cannot close it.
    # In this case, the Nanny will automatically close after inspecting
    # the worker status.
    nanny = False
```
I added another commit to be more explicit about the statuses, and added a more thorough explanation of why this is happening.
cc @gjoseph92
I'm not sure if the test failure #6430 is related.
Thanks all!
Closes #6422
xref #6363
cc @fjetter @jrbourbeau
`pre-commit run --all-files`