Ensure client session is quiet after `cluster.close()` or `client.shutdown()` #7429

Conversation
```python
with suppress(CommClosedError):
    await self.scheduler.terminate()

await self._close()
```
Note: Closing the client is an extra cleanup step (i.e. not needed to make the user's client session quiet). However, it seemed strange that we didn't close it when the cluster was closed. Happy to roll back if it causes issues or folks would rather keep it open for some reason.
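As a quick illustration of the behaviour being described, a minimal sketch using the public `LocalCluster`/`Client` API; the `status` check at the end is just one way to verify it and is an assumption, not part of the diff:

```python
from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=1)
client = Client(cluster)

cluster.close()       # with this change, the associated client is closed as well
print(client.status)  # expected to report a closed client (assumption)
```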
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

22 files ±0   22 suites ±0   10h 4m 52s ⏱️ -18m 54s

For more details on these failures and errors, see this check.

Results for commit a8740c0. ± Comparison against base commit 35c07cb.
Seems sensible to me!
Everything here seems fine to me. I'm less aware of this code these days so it might be good to have someone else (@crusaderky ?) take a brief look. However, if you're seeing good benefits to this I'm also happy to just trust and merge.
```python
# Don't send heartbeat if scheduler comm or cluster are already closed
if (self.scheduler_comm and not self.scheduler_comm.comm.closed()) or (
    self.cluster and self.cluster.status not in (Status.closed, Status.closing)
):
```
I'm slightly confused by this. If `scheduler_comm.comm.closed()`, should we send a heartbeat if the cluster is not closing? If the client closes but the cluster is still alive then we should probably stop?
In particular, I think that this case might be rare but relevant:

```python
cluster = Cluster()
client1 = Client(cluster)
client2 = Client(cluster)

client1.close()
```
> If `scheduler_comm.comm.closed()`, should we send a heartbeat if the cluster is not closing?

I don't think we want to attempt to send a heartbeat if the `scheduler_comm` is closed. This `if` statement should include that case.

> If the client closes but the cluster is still alive then we should probably stop?

This should already be the case. If the client closes then the `scheduler_comm` will also be closed:
```
In [1]: from distributed import LocalCluster, Client

In [2]: cluster = LocalCluster()
   ...:
   ...: client1 = Client(cluster)
   ...: client2 = Client(cluster)
   ...:
   ...: client1.close()

In [3]: client1.scheduler_comm.comm.closed()
Out[3]: True

In [4]: client2.scheduler_comm.comm.closed()
Out[4]: False
```
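A hedged sketch of the stricter guard this exchange points towards (heartbeat only while the scheduler comm is open and any attached cluster is not shutting down), written as a standalone helper and assuming the same `Client` attributes used in the diff above; this is an illustration, not necessarily the exact condition that was merged:

```python
from distributed.core import Status


def should_heartbeat(client) -> bool:
    """Sketch: heartbeat only while the comm is open and any attached
    cluster is not shutting down (illustration, not the merged code)."""
    comm_open = (
        client.scheduler_comm is not None
        and not client.scheduler_comm.comm.closed()
    )
    cluster_alive = client.cluster is None or client.cluster.status not in (
        Status.closed,
        Status.closing,
    )
    return comm_open and cluster_alive
```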
```python
# Don't attempt to reconnect if the cluster is already closed.
# Instead, close down the client.
await self._close()
return
```
Seems sensible to me
> Everything here seems fine to me. I'm less aware of this code these days so it might be good to have someone else (@crusaderky ?) take a brief look. However, if you're seeing good benefits to this I'm also happy to just trust and merge.
I'm fairly confident in the changes here, and am seeing good benefits locally, but am also totally fine to wait for folks to review (possibly after coming back from the holidays).
@fjetter I've left this one to you and your team. Pinging so that it rises up in your queue.

@graingert can you review this please?

Thanks for reviewing @graingert!
Not immediately sure why, but it looks like this PR started causing dask-cuda's CI to hang (specifically …). Apologies for the timing relative to release, wish I could've found this a few hours ago 😅
Ah, thanks for surfacing @charlesbluca. Do you have a traceback I could look at? It looks like CI for the default branch is passing over in …
Yeah, here's an example of a hanging run - unfortunately doesn't seem like much contextualizing info there 😕 https://github.com/rapidsai/dask-cuda/actions/runs/3914939838/jobs/6692645588
Hmm yeah, unfortunately there doesn't appear to be much to work off of in the build output. Just to double check, we're sure this is the PR that leads to the hanging behavior (i.e. …)?
I started also taking a look at this and can confirm @charlesbluca's findings. So far I can reproduce the hang with Distributed main, but not anymore after reverting this commit. However, after reverting I see:

```
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/rn-230116/lib/python3.9/site-packages/distributed/comm/core.py", line 291, in connect
    comm = await asyncio.wait_for(
  File "/datasets/pentschev/miniconda3/envs/rn-230116/lib/python3.9/asyncio/tasks.py", line 479, in wait_for
    return fut.result()
  File "/datasets/pentschev/miniconda3/envs/rn-230116/lib/python3.9/site-packages/distributed/comm/tcp.py", line 511, in connect
    convert_stream_closed_error(self, e)
  File "/datasets/pentschev/miniconda3/envs/rn-230116/lib/python3.9/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x7f54d58b9c40>: ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/rn-230116/lib/python3.9/site-packages/distributed/utils.py", line 741, in wrapper
    return await func(*args, **kwargs)
  File "/datasets/pentschev/miniconda3/envs/rn-230116/lib/python3.9/site-packages/distributed/client.py", line 1301, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/datasets/pentschev/miniconda3/envs/rn-230116/lib/python3.9/site-packages/distributed/client.py", line 1331, in _ensure_connected
    comm = await connect(
  File "/datasets/pentschev/miniconda3/envs/rn-230116/lib/python3.9/site-packages/distributed/comm/core.py", line 315, in connect
    await asyncio.sleep(backoff)
  File "/datasets/pentschev/miniconda3/envs/rn-230116/lib/python3.9/asyncio/tasks.py", line 652, in sleep
    return await future
asyncio.exceptions.CancelledError
```

Haven't dug in deeper yet, will try to do more of that throughout the day.
After digging a bit more I found out that the hang is the result of two threads concurrently trying to acquire the UCX spinlock and the GIL. Currently UCX-Py is not thread-safe, which can cause this sort of problem. Normally we expect that ALL communication would occur only on the Distributed communications thread, and although I have not yet been able to determine the exact cause for this issue, I believe it is that something changed in the order of task execution that causes the client thread to execute some communication (probably when calling …).
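For readers unfamiliar with this failure mode, here is a simplified analogy of the deadlock described above. It is not dask/UCX code: two plain `threading.Lock` objects stand in for the UCX spinlock and the GIL, and two threads grab them in opposite order; timeouts are used so the script reports the deadlock instead of hanging like the CI job does.

```python
import threading

spinlock = threading.Lock()     # stand-in for the UCX spinlock
gil = threading.Lock()          # stand-in for the GIL
barrier = threading.Barrier(2)  # ensure both threads hold their first lock


def comm_thread():
    with spinlock:
        barrier.wait()
        if not gil.acquire(timeout=1):
            print("comm thread: stuck waiting for the 'GIL' while holding the 'spinlock'")
        else:
            gil.release()


def client_thread():
    with gil:
        barrier.wait()
        if not spinlock.acquire(timeout=1):
            print("client thread: stuck waiting for the 'spinlock' while holding the 'GIL'")
        else:
            spinlock.release()


t1 = threading.Thread(target=comm_thread)
t2 = threading.Thread(target=client_thread)
t1.start(); t2.start()
t1.join(); t2.join()
```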
It seems rapidsai/dask-cuda#1084 resolves the issue in the Dask-CUDA tests. In the description of that issue I wrote what I believe to be the cause, cross-posting here for completeness (quoted below):
Thanks @charlesbluca for tracking this down and @jrbourbeau for this fix and taking the time to respond to our comments!
> After dask/distributed#7429 was merged, some of those tests started hanging and I could confirm there were two threads concurrently attempting to take the UCX spinlock and the GIL, which led to such a deadlock. UCX-Py is currently not thread-safe, and indeed can cause problems like this should two or more threads attempt to call communication routines that require the UCX spinlock.
>
> My theory is that the synchronous cluster will indeed cause communication on the main thread (in this case, the `pytest` thread) upon attempting to shut down the cluster, instead of only within the Distributed communication thread, likely being the reason behind the test hanging. Asynchronous Distributed clusters seem not to cause any communication from the main thread, but only in the communication thread as expected, thus making the tests asynchronous suffices to resolve such issues. In practice, it's unlikely that people will use sync Distributed clusters from the same process (as pytest does), and thus it's improbable to happen in real use-cases.
>
> Authors:
> - Peter Andreas Entschev (https://github.com/pentschev)
>
> Approvers:
> - Mads R. B. Kristensen (https://github.com/madsbk)
>
> URL: #1084
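To make the sync → async distinction concrete, a minimal sketch of the kind of asynchronous test the cross-posted fix describes. The test name and body are hypothetical, and the `pytest.mark.asyncio` marker assumes `pytest-asyncio` is available; the point is only that with `asynchronous=True` the cluster and client are driven from the event loop rather than from the calling thread:

```python
import pytest
from distributed import Client, LocalCluster


@pytest.mark.asyncio  # assumption: pytest-asyncio (or an equivalent async test runner)
async def test_cluster_shutdown_is_quiet():
    async with LocalCluster(asynchronous=True, n_workers=1) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            result = await client.submit(sum, [1, 2, 3])
            assert result == 6
    # Exiting the async context managers shuts everything down on the event loop,
    # so no communication happens from the test's own thread.
```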
This is a follow-up on #7428. Frequently I see users do the following in a notebook:
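(The exact snippet isn't shown here; a plausible reconstruction of the pattern being described, using the standard `LocalCluster`/`Client` API:)

```python
from distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)

# ... do some work ...

cluster.close()  # a few seconds later, reconnect/heartbeat errors show up in the notebook
```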
And then after a few seconds an error is printed (highlighted in red) in their notebook
The same type of thing can happen when using `client.shutdown()` instead of `cluster.close()`.

This PR adds logic to determine if the scheduler/cluster is closing/has already been closed and, if so, don't attempt to reconnect the client to the scheduler, heartbeat, etc. The goal is to avoid scary errors in the user's client session if they're shutting things down in an expected way.
cc @ncclementi @shughes-uk @dchudz who have run into this before
EDIT: looks like we have some intentional tests around reconnecting if the connection between the client and scheduler is temporarily lost. This makes sense as we want to be resilient to transient network blips. I've restricted the "don't reconnect" logic to only be if there's a cluster manager associated with the client and the cluster is closing/closed (which seems safe to me).