Catch BaseException on UCX read error #6996
@@ -325,12 +325,14 @@ async def read(self, deserializers=("cuda", "dask", "pickle", "error")):
                 await self.ep.recv(header)
                 header = struct.unpack(header_fmt, header)
                 cuda_frames, sizes = header[:nframes], header[nframes:]
-            except (
-                ucp.exceptions.UCXCloseError,
-                ucp.exceptions.UCXCanceled,
-            ) + (getattr(ucp.exceptions, "UCXConnectionReset", ()),):
+            except BaseException as e:
+                # In addition to UCX exceptions, may be CancelledError or another
+                # "low-level" exception. The only safe thing to do is to abort.
+                # (See also https://github.com/dask/distributed/pull/6574).
                 self.abort()
-                raise CommClosedError("Connection closed by writer")
+                raise CommClosedError(
+                    f"Connection closed by writer.\nInner exception: {e!r}"
+                )
             else:
I think we also need to catch connection issues on line 354. So perhaps lines 353 and 354 should be replaced by:

try:
    for frame in recv_frames:
        await self.ep.recv(frame)
except BaseException as e:
    raise CommClosedError(f"Connection closed by writer.\nInner exception: {e!r}")

I had thought that one might be able to reduce synchronisation a little bit by using:

With a matching change in

But I am unsure of the semantics of UCX wrt message overtaking. I think this could potentially result in the second (say) sent frame ending up in the first receive slot, which would be bad.

I'm not entirely sure we want that; maybe it never occurred in practice, or just raising the original exception may be fine. I'm mostly concerned with unforeseen side-effects this may cause and would prefer not to mess with it now, given it's not been a problem so far. WDYT?

I would expect that as well and had done it once and had to revert #5505 because that caused various issues, unfortunately. In any case, with the C++ UCX introduction of "multi-transfers" this will anyway be reduced to a single future, so I will not try to improve this code in its current form.

What could happen (although my guess is that it would be low likelihood) is that we're receiving a bunch of frames, each

Yes, but I fear that by raising a different exception now we may end up in some different control path that we didn't expect. I'm hoping that this patch can end up in the next Distributed release and it could be included in RAPIDS 22.10. I would be fine trying that out afterwards, but I'm a bit nervous of breaking something close to release time.

OK, thanks, makes sense.
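For reference, the suggestion discussed in this thread (wrapping the per-frame receive loop the same way as the header receive) can be sketched in isolation. This is only an illustration under stated assumptions, not the code that was merged: `FakeEndpoint` is a hypothetical stand-in for a UCX endpoint, `CommClosedError` is redefined locally instead of being imported from distributed, and the call to `abort()` mirrors the header path in the diff rather than the snippet proposed in the comment.

```python
import asyncio


class CommClosedError(IOError):
    """Local stand-in for distributed's CommClosedError (assumption)."""


class FakeEndpoint:
    """Hypothetical endpoint whose recv() fails after `fail_after` successful calls."""

    def __init__(self, fail_after, exc):
        self.calls = 0
        self.fail_after = fail_after
        self.exc = exc
        self.aborted = False

    async def recv(self, frame):
        await asyncio.sleep(0)  # yield to the event loop like a real receive would
        if self.calls >= self.fail_after:
            raise self.exc
        self.calls += 1

    def abort(self):
        self.aborted = True


async def recv_frames(ep, frames):
    """Receive every frame; on any failure abort and raise CommClosedError."""
    try:
        for frame in frames:
            await ep.recv(frame)
    except BaseException as e:
        # BaseException also covers asyncio.CancelledError, which does not
        # derive from Exception on Python 3.8+.
        ep.abort()
        raise CommClosedError(
            f"Connection closed by writer.\nInner exception: {e!r}"
        )
    return frames


def demo(exc):
    """Run one receive that fails on the second frame; return (aborted, message)."""
    ep = FakeEndpoint(fail_after=1, exc=exc)
    try:
        asyncio.run(recv_frames(ep, [bytearray(4), bytearray(4)]))
    except CommClosedError as err:
        return ep.aborted, str(err)
    return ep.aborted, None
```

Calling `demo(asyncio.CancelledError())` or `demo(ConnectionResetError("peer gone"))` reports the same outer exception type with the original error preserved in the message, which is the change in control path that the reply above is cautious about.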
                 # Recv frames
                 frames = [
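As background for the header handling in the hunk above, here is a minimal sketch of how a header of this shape can be packed and unpacked with `struct`. The layout (`nframes` booleans followed by `nframes` unsigned 64-bit sizes) is an assumption inferred from the slicing `header[:nframes], header[nframes:]`, and the helpers `pack_header` and `unpack_header` are hypothetical names that do not exist in distributed.

```python
import struct


def header_format(nframes):
    # Assumed layout: one bool per frame (is it a CUDA frame?),
    # then one uint64 per frame (its size in bytes).
    return nframes * "?" + nframes * "Q"


def pack_header(cuda_frames, sizes):
    """Pack the per-frame flags and sizes into a single bytes object."""
    nframes = len(sizes)
    return struct.pack(header_format(nframes), *cuda_frames, *sizes)


def unpack_header(buf, nframes):
    """Inverse of pack_header; raises struct.error if buf has the wrong length."""
    header = struct.unpack(header_format(nframes), buf)
    cuda_frames, sizes = header[:nframes], header[nframes:]
    return cuda_frames, sizes
```

A buffer of the wrong length makes `struct.unpack` raise `struct.error`, one example of an exception that is neither a UCX exception nor `CancelledError` and that a broad `except` clause would also convert.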
OK, so here we're waiting for a read that will never be matched by a write, and so eventually we'll fail.
That's right.
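The situation described in this exchange, a read that no write will ever match, can be imitated with plain asyncio. This is a sketch under assumptions: an `asyncio.Queue` that is never fed stands in for the endpoint, `task.cancel()` stands in for whatever eventually tears the connection down, and `CommClosedError` is a local stand-in for the class in distributed.

```python
import asyncio


class CommClosedError(IOError):
    """Local stand-in for distributed's CommClosedError (assumption)."""


async def read_one(queue):
    """Wait for one message; convert any failure into CommClosedError."""
    try:
        return await queue.get()  # nobody ever calls queue.put()
    except BaseException as e:
        raise CommClosedError(
            f"Connection closed by writer.\nInner exception: {e!r}"
        )


async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(read_one(queue))
    await asyncio.sleep(0)  # let the reader start waiting
    task.cancel()           # stand-in for the connection being torn down
    try:
        await task
    except CommClosedError as err:
        return str(err)
    return None


def run_demo():
    return asyncio.run(main())
```

With `except Exception` instead of `except BaseException`, the `CancelledError` would pass through unconverted on Python 3.8 and later.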