Catch BaseException on UCX read error #6996

Merged

8 commits merged on Sep 30, 2022

Conversation

pentschev
Member

This change will ensure `CancelledError`s are caught upon shutting down the Dask cluster, which may otherwise raise various errors.

  • Tests added / passed
  • Passes pre-commit run --all-files

See also dask#6574.
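As a standalone illustration (not the distributed/comm/ucx.py code itself), the sketch below shows why `except Exception` is not enough here: asyncio.CancelledError derives from BaseException rather than Exception on Python 3.8+, so a read that is cancelled during shutdown sails straight past an Exception handler.

import asyncio


async def fake_read():
    # Stand-in for awaiting `self.ep.recv(frame)` inside UCX.read().
    try:
        await asyncio.sleep(3600)
    except Exception:
        # Never reached for a cancellation: CancelledError subclasses
        # BaseException, not Exception, on Python 3.8+.
        raise
    except BaseException as exc:
        # This is the branch the PR adds: abort the endpoint and raise
        # CommClosedError instead of letting the raw error escape unhandled.
        print(f"BaseException handler saw: {exc!r}")
        raise


async def main():
    task = asyncio.create_task(fake_read())
    await asyncio.sleep(0)  # let fake_read start and block on its await
    task.cancel()           # what shutdown does to a pending read
    try:
        await task
    except asyncio.CancelledError:
        print("read was cancelled")


asyncio.run(main())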
@pentschev marked this pull request as ready for review September 2, 2022 20:35
@github-actions
Contributor

github-actions bot commented Sep 2, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0    15 suites ±0    6h 14m 37s ⏱️ -8m 39s
3 146 tests ±0    3 061 ✔️ +2    85 💤 ±0    0 ❌ -2
23 286 runs ±0    22 377 ✔️ +2    909 💤 ±0    0 ❌ -2

Results for commit 8de1557. ± Comparison against base commit 68e5a6a.

♻️ This comment has been updated with latest results.

Contributor

@wence- left a comment


In addition to these comments, I think it is also necessary to wrap the same exception handling around write (in case the other end hung up in a strange way). Or is that not necessary?

# Depending on the UCP protocol selected, it may raise either
# `asyncio.TimeoutError` or `CommClosedError`, so validate either one.
with pytest.raises((asyncio.TimeoutError, CommClosedError)):
    await asyncio.wait_for(reader.read(), 0.01)
Contributor

OK, so here we're waiting for a read that will never be matched by a write, and so eventually we'll fail.

Member Author

That's right.
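For reference, here is a minimal, self-contained sketch of that test pattern using plain asyncio and pytest; the coroutine and ConnectionError merely stand in for distributed's reader.read() and CommClosedError.

import asyncio

import pytest


async def never_matched_read():
    # Stand-in for `reader.read()` when the peer never sends anything.
    await asyncio.Event().wait()


def test_unmatched_read_times_out():
    # pytest.raises accepts a tuple of exception types, so the test passes
    # whichever of the two the transport happens to raise; ConnectionError
    # stands in for distributed's CommClosedError here.
    with pytest.raises((asyncio.TimeoutError, ConnectionError)):
        asyncio.run(asyncio.wait_for(never_matched_read(), 0.01))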

distributed/comm/ucx.py (outdated)
        except BaseException:
            # In addition to UCX exceptions, may be CancelledError or another
            # "low-level" exception. The only safe thing to do is to abort.
            # (See also https://github.com/dask/distributed/pull/6574).
            self.abort()
            raise CommClosedError("Connection closed by writer")
        else:
Contributor

I think we also need to catch connection issues on line 354.

So perhaps lines 353 and 354 should be replaced by:

try:
    for frame in recv_frames:
        await self.ep.recv(frame)
except BaseException as e:
    raise CommClosedError(f"Connection closed by writer.\nInner exception: {e!r}")

I had thought that one might be able to reduce synchronisation a little bit by using:

await asyncio.gather(*map(self.ep.recv, recv_frames))

With a matching change in write of await asyncio.gather(*map(self.ep.send, send_frames)).

But I am unsure of the semantics of UCX wrt message overtaking. I think this could potentially result in the second (say) sent frame ending up in the first receive slot, which would be bad.

Member Author

I think we also need to catch connection issues on line 354.

So perhaps lines 353 and 354 should be replaced by:

try:
    for frame in recv_frames:
        await self.ep.recv(frame)
except BaseException as e:
    raise CommClosedError(f"Connection closed by writer.\nInner exception: {e!r}")

I'm not entirely sure we want that; maybe it never occurred in practice, or just raising the original exception may be fine. I'm mostly concerned about unforeseen side effects this may cause and would prefer not to touch it now, given it hasn't been a problem so far. WDYT?

I had thought that one might be able to reduce synchronisation a little bit by using:

await asyncio.gather(*map(self.ep.recv, recv_frames))

With a matching change in write of await asyncio.gather(*map(self.ep.send, send_frames)).

But I am unsure of the semantics of UCX wrt message overtaking. I think this could potentially result in the second (say) sent frame ending up in the first receive slot, which would be bad.

I would expect that as well; I had done it once and had to revert #5505 because that caused various issues, unfortunately. In any case, with the introduction of "multi-transfers" in the C++ UCX implementation this will anyway be reduced to a single future, so I will not try to improve this code in its current form.
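For illustration only, the two shapes being compared look roughly like the sketch below; FakeEndpoint is a hypothetical stand-in and makes no claim about real UCX matching semantics.

import asyncio


class FakeEndpoint:
    # Hypothetical stand-in for a UCX endpoint; not the real ucp API and no
    # claim about its message-matching semantics.
    async def recv(self, frame):
        await asyncio.sleep(0)  # pretend a matching send arrives and fills `frame`


async def recv_sequential(ep, recv_frames):
    # Current shape of the code: one outstanding recv at a time, so completion
    # order trivially matches posting order.
    for frame in recv_frames:
        await ep.recv(frame)


async def recv_gathered(ep, recv_frames):
    # The alternative discussed above: post all recvs at once and wait for all
    # of them. Whether a later send could land in an earlier receive slot is
    # exactly the open question about UCX semantics.
    await asyncio.gather(*map(ep.recv, recv_frames))


async def main():
    ep, frames = FakeEndpoint(), [bytearray(8) for _ in range(3)]
    await recv_sequential(ep, frames)
    await recv_gathered(ep, frames)


asyncio.run(main())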

Contributor

I'm not entirely sure we want that; maybe it never occurred in practice, or just raising the original exception may be fine. I'm mostly concerned about unforeseen side effects this may cause and would prefer not to touch it now, given it hasn't been a problem so far. WDYT?

What could happen (although my guess is that it would be low likelihood) is that we're receiving a bunch of frames, each await yields to the event loop, and in between awaits the remote endpoint is closed for some other reason.
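To make that scenario concrete, here is a toy sketch with a hypothetical endpoint (not the real ucp API): the first frame arrives, the peer then goes away while a later recv is still pending, and the error surfaces partway through the read loop.

import asyncio


class FlakyEndpoint:
    # Hypothetical endpoint: delivers the first frame, then the peer hangs up
    # while later recvs are still pending. Not the real ucp API.
    def __init__(self):
        self.delivered = False
        self.peer_gone = asyncio.Event()

    async def recv(self, frame):
        if not self.delivered:
            self.delivered = True  # first frame arrives normally
            return
        await self.peer_gone.wait()  # later frames never arrive...
        raise ConnectionResetError("peer hung up mid-message")


async def read_frames(ep, frames):
    for frame in frames:      # each await yields to the event loop, so the
        await ep.recv(frame)  # connection can die between two frames


async def main():
    ep = FlakyEndpoint()
    task = asyncio.create_task(read_frames(ep, [bytearray(4) for _ in range(3)]))
    await asyncio.sleep(0.01)  # first frame received, second recv pending
    ep.peer_gone.set()         # remote endpoint closes between awaits
    try:
        await task
    except ConnectionResetError as exc:
        print(f"read failed partway through: {exc!r}")


asyncio.run(main())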

Member Author

Yes, but I fear that by raising a different exception now we may end up in some different control path that we didn't expect. I'm hoping that this patch can make it into the next Distributed release so it can be included in RAPIDS 22.10. I would be fine trying that out afterwards, but I'm a bit nervous about breaking something close to release time.

Contributor

OK, thanks, makes sense.

@pentschev
Member Author

Thanks @wence-, I replied to your comments; please take another look when you have a chance.

@wence-
Contributor

wence- commented Sep 28, 2022

Thanks @wence-, I replied to your comments; please take another look when you have a chance.

I think we need to catch errors (other than UCXBaseException) on the write side as well as read (since those awaitables may also be cancelled).

Like this perhaps?

diff --git a/distributed/comm/ucx.py b/distributed/comm/ucx.py
index 16ce82d4..b3a595a5 100644
--- a/distributed/comm/ucx.py
+++ b/distributed/comm/ucx.py
@@ -254,27 +254,27 @@ class UCX(Comm):
     ) -> int:
         if self.closed():
             raise CommClosedError("Endpoint is closed -- unable to send message")
-        try:
-            if serializers is None:
-                serializers = ("cuda", "dask", "pickle", "error")
-            # msg can also be a list of dicts when sending batched messages
-            frames = await to_frames(
-                msg,
-                serializers=serializers,
-                on_error=on_error,
-                allow_offload=self.allow_offload,
-            )
-            nframes = len(frames)
-            cuda_frames = tuple(hasattr(f, "__cuda_array_interface__") for f in frames)
-            sizes = tuple(nbytes(f) for f in frames)
-            cuda_send_frames, send_frames = zip(
-                *(
-                    (is_cuda, each_frame)
-                    for is_cuda, each_frame in zip(cuda_frames, frames)
-                    if nbytes(each_frame) > 0
-                )
+        if serializers is None:
+            serializers = ("cuda", "dask", "pickle", "error")
+        # msg can also be a list of dicts when sending batched messages
+        frames = await to_frames(
+            msg,
+            serializers=serializers,
+            on_error=on_error,
+            allow_offload=self.allow_offload,
+        )
+        nframes = len(frames)
+        cuda_frames = tuple(hasattr(f, "__cuda_array_interface__") for f in frames)
+        sizes = tuple(nbytes(f) for f in frames)
+        cuda_send_frames, send_frames = zip(
+            *(
+                (is_cuda, each_frame)
+                for is_cuda, each_frame in zip(cuda_frames, frames)
+                if nbytes(each_frame) > 0
             )
+        )
 
+        try:
             # Send meta data
 
             # Send close flag and number of frames (_Bool, int64)
@@ -297,10 +297,11 @@ class UCX(Comm):
 
             for each_frame in send_frames:
                 await self.ep.send(each_frame)
-            return sum(sizes)
-        except (ucp.exceptions.UCXBaseException):
+        except BaseException as e:
             self.abort()
-            raise CommClosedError("While writing, the connection was closed")
+            raise CommClosedError("While writing, the connection was closed.\n"
+                                  f"Inner exception: {e!r}")
+        return sum(sizes)
 
     @log_errors
     async def read(self, deserializers=("cuda", "dask", "pickle", "error")):

I've moved the scope of the try/except block because we're now catching a broader range of exceptions.

distributed/comm/ucx.py (outdated)
@pentschev
Member Author

I think we need to catch errors (other than UCXBaseException) on the write side as well as read (since those awaitables may also be cancelled).

I've moved the scope of the try/except block because we're now catching a broader range of exceptions.

As with the try/except changes for read above, I'm a bit concerned about breaking something else right now. I propose we open a new PR to address those two changes and wait to merge them after the upcoming release. WDYT?

Co-authored-by: Lawrence Mitchell <[email protected]>
Comment on lines 333 to 334:

raise CommClosedError("Connection closed by writer.\n"
                      f"Inner exception: {e!r}")
Contributor

Suggested change:

-raise CommClosedError("Connection closed by writer.\n"
-                      f"Inner exception: {e!r}")
+raise CommClosedError(f"Connection closed by writer.\nInner exception: {e!r}")

To pacify the linter.

Contributor

@wence- left a comment


Modulo the lint pacification, this looks good, and we can revisit catching exceptions around write, etc. later.

Can confirm, too, that this fixes ugly-looking UCX errors on disconnect during cluster shutdown for me.

@quasiben
Member

rerun tests

@quasiben
Member

This failure is due to changes in strides for broadcast arrays that occurred in NumPy 1.23.
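For context, a broadcast view reports zero strides along its broadcast axes, which is the kind of metadata such checks look at; this snippet only shows how to inspect it and does not reproduce the NumPy 1.23 change itself.

import numpy as np

a = np.arange(3.0)               # float64, strides (8,)
b = np.broadcast_to(a, (4, 3))   # read-only view, no data copied

print(b.shape)    # (4, 3)
print(b.strides)  # (0, 8): the broadcast axis has stride 0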

@quasiben
Member

@gmarkall pointed me to numpy/numpy#21477 -- I'll fix the pinning in the Docker image to <1.22. Graham, is NumPy 1.23 compatibility something being tracked in Numba?

@gmarkall
Contributor

gmarkall commented Sep 29, 2022

Graham, is NumPy 1.23 compatibility something being tracked in Numba?

Numba 0.56.2 is compatible with NumPy 1.23. I need to determine whether the issue here is np.empty_like() not quite creating an empty array like its argument, or if the array compatibility check in Numba needs a fix for this case.

@gmarkall
Contributor

PR to work around (WAR) the issue: #7089

@quasiben
Member

Numba 0.56.2 is compatible with NumPy 1.23. I need to determine whether the issue here is np.empty_like() not quite creating an empty array like its argument, or if the array compatibility check in Numba needs a fix for this case.

Apologies for the mischaracterization!

@quasiben
Member

Merged with main (including #7089); when tests pass I'll merge.

@quasiben
Member

This is now passing -- merging in. Thanks @pentschev and @wence-!

@pentschev deleted the ucx-base-exception branch October 10, 2022 12:47
gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022