UCX 1.10 issues #668
Comments
cc @randerzander for visibility
Just to update this thread, we have both pulled the problematic
We need the fixes from openucx/ucx#6001 and openucx/ucx#6157 backported to 1.10 for UCX-Py; that discussion is ongoing. We will also need TCP loopback support in the new UCX 1.10 transport, which the UCX devs are also checking. Apart from that, we will need to adjust the transports in https://github.com/dask/distributed/blob/9442d9b3f2847bf6d0252a8ed671d342a5379501/distributed/comm/ucx.py#L479-L480, which I'll do once there's a new 1.10 RC with all the patches we need. #667 won't be necessary, and I'll close it now.
Thanks for the update Peter and keeping track of all of these threads! 😀
This isn't relevant anymore. The resolution is that we should avoid UCX 1.10, but UCX 1.11 onwards will be fully supported. Closing.
After the UCX 1.10 package was created, we started seeing, and getting reports of, some issues. The first is caused by a change in the default to UCX_SOCKADDR_CM_ENABLE=y, which was disabled up to 1.9, causing:

If we revert that change (as done in #667) in an attempt to restore pre-1.10 UCX behavior, we see some segfaults:

After discussing offline with some UCX devs, I've been told that starting with UCX 1.10 we should move to the new tcp_sockcm. That involves some changes to the variables we use today, specifically removing sockcm, with the base set of variables now being: UCX_TLS=tcp,cuda_copy UCX_SOCKADDR_TLS_PRIORITY=tcp UCX_SOCKADDR_CM_ENABLE=y. We still need cuda_ipc and rc to enable NVLink and IB, respectively.

Moving to the new tcp_sockcm, we still see issues though, particularly:

Client to a LocalCUDACluster, unless we specify host to the latter to prevent it from binding to loopback, and this would break lots of user code today.
0,1,2,3, but not 0,1,2,4. This still seems like a bug in UCX.

With all the above said, using UCX 1.10 is not viable for UCX-Py at the time of writing. I'm working with @dmitrygx and @alinask to check whether these issues/limitations can be solved. However, the UCX 1.10 conda package is breaking things for our nightly build users, and they have to pin ucx=1.8 or they will experience segfaults. We can still create 1.9 packages, but that would still require us to either delete the current UCX package from Anaconda or pin 1.9 in our metapackages. Any ideas or preferences on which path we should follow, @quasiben @jakirkham?
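
For anyone who wants to try the tcp_sockcm variable set described above, here is a minimal sketch of how it could be assembled from Python. The helper name ucx_110_env and its nvlink/infiniband flags are hypothetical (they are not part of UCX or UCX-Py), and the order in which cuda_ipc and rc are appended to UCX_TLS is an assumption. Only the variable names and values come from this issue.

```python
import os


def ucx_110_env(nvlink=False, infiniband=False):
    """Build the UCX 1.10 variable set quoted in this issue.

    Hypothetical helper for illustration only; not a UCX-Py API.
    """
    # Base transports named in the issue for the new tcp_sockcm setup.
    tls = ["tcp", "cuda_copy"]
    if nvlink:
        tls.append("cuda_ipc")  # the issue says cuda_ipc is needed for NVLink
    if infiniband:
        tls.append("rc")  # the issue says rc is needed for IB
    return {
        "UCX_TLS": ",".join(tls),
        "UCX_SOCKADDR_TLS_PRIORITY": "tcp",
        "UCX_SOCKADDR_CM_ENABLE": "y",
    }


if __name__ == "__main__":
    env = ucx_110_env(nvlink=True, infiniband=True)
    # Assumption: the variables need to be in the environment before UCX
    # is initialized in this process, so set them before importing UCX-Py.
    os.environ.update(env)
    for key, value in env.items():
        print(f"{key}={value}")
```

Note that sockcm is intentionally absent from UCX_TLS here, matching the removal described above. This only builds the environment; it does not work around the loopback or device-list issues listed in this issue.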