-
Notifications
You must be signed in to change notification settings - Fork 431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/CUDA_IPC: fix peer-access-map init #6360
UCT/CUDA_IPC: fix peer-access-map init #6360
Conversation
@yosefe This fixes the issue brought up by @pentschev with #5815 |
Can one of the admins verify this patch? |
ok to test |
The performance issue is fixed with this PR, only errors such as those below should probably be demoted to debug/trace.
|
On latest changes performance is still good, but I think log below was accidentally promoted to error in https://github.com/openucx/ucx/pull/6360/files#diff-561a7e4fa208245423bb161bc7a1cc05fbce0d65a10ef13c77e366f38c541826L308-R308 . Could you confirm this is accidental @Akshay-Venkatesh ? I now see those errors in my workflow:
|
@@ -305,7 +305,7 @@ UCS_PROFILE_FUNC(ucs_status_t, uct_cuda_ipc_map_memhandle, (key, mapped_addr), | |||
} | |||
} | |||
} else { | |||
ucs_debug("%s: failed to open ipc mem handle. addr:%p len:%lu", | |||
ucs_error("%s: failed to open ipc mem handle. addr:%p len:%lu", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like log level was "swapped" ith line 114..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed now. cc @pentschev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR addresses both issues I observed and LGTM now. Backporting #5815 along with this one to 1.10 is +1 from me now.
Thanks @Akshay-Venkatesh for fixing this so quickly!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls squash
da7c142
to
04e9f90
Compare
@yosefe is it ok to prepare a backport? |
Yes, let's have it along with the fix |
The UCX-Py endpoint reuse is not anymore necessary, so we also disable that for UCX 1.11+. The primary reason it was introduced was to circumvent an issue with CUDA IPC that was resolved by openucx/ucx#6360. Using the endpoint reuse class has also proven to be very slow, taking a long time to initialize for clusters with just a few dozen workers and pretty much unusable for a cluster in the order of 100 workers. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Benjamin Zaitlen (https://github.com/quasiben) URL: #620
What
Peer-accessibility map is a 2-d matrix and it was incorrectly initialized following restructuring. This PR fixes that.