UCX fails when trying to run training across 2 nodes #9908
If I set UCX_TLS=tcp,cuda,cuda_copy,cuda_ipc and run with "srun", it works fine. Using an individual transport is the problem. One more problem: with "UCX_TLS=tcp,cuda,cuda_copy,cuda_ipc" set and running in a container, I am getting these errors:
[1716920651.160128] [gpu1:2592419:0] ucp_worker.c:1783 UCX INFO ep_cfg[4]: tag(tcp/ib0 tcp/docker0)
The problem seems to be happening only with tcp/docker0, which is not part of the UCX transports. How do I avoid it?
#9475 should disable the docker interface. Can you please try UCX v1.17.0 or above?
I see RC1 and RC2 for 1.17.0. Is it compatible with the other components (Open MPI etc.)? I am building from source; do you think it is better to apply the patch? The changes are already there in my source files. How do I disable the docker interface with "UCX_TCP_BRIDGE_ENABLE"?
Yes, they are all backward compatible. Better to just take 1.17.0-rc2 and avoid the extra work of applying a manual patch.
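For reference, a minimal sketch of the fix discussed above, assuming the UCX_TCP_BRIDGE_ENABLE variable mentioned earlier (added by #9475) controls whether bridge devices such as docker0 are used by the TCP transport:
# assumption: UCX >= 1.17.0, or a source tree with #9475 applied
$ export UCX_TCP_BRIDGE_ENABLE=n
$ export UCX_TLS=tcp,cuda,cuda_copy,cuda_ipc
$ srun --mpi=pmix mpi_hello_world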
One more thing I want to understand a little further; I greatly appreciate the help. This combination throws an error: "select.c:630 UCX ERROR no active messages transport to : Unsupported operation". Once I add "rc" or "sm" to TLS, there are no more issues. Setting "UCX_NET_DEVICES=all" also resolves the issue. If I use only "UCX_NET_DEVICES=mlx5_0:1" for the container without the "UCX_TLS" environment variable, I am not getting docker-related issues, but I am not sure if I am compromising on throughput. I don't want to upgrade UCX if it is not necessary.
You restricted the available transports for host memory to tcp only. But you also specified that only the mlx5_0:1 network device can be used (which is an IB device, I guess). So you'd either need to add some tcp-capable device to …
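To illustrate the two directions suggested above, a hedged sketch; ens21f0 is taken from the debug log below and assumed to be the tcp-capable Ethernet device on this host:
# Option 1: keep tcp, but also give it a device it can actually use
$ export UCX_TLS=tcp
$ export UCX_NET_DEVICES=mlx5_0:1,ens21f0
# Option 2: keep only the IB device, but allow an IB transport to match it
$ export UCX_TLS=rc,tcp
$ export UCX_NET_DEVICES=mlx5_0:1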
Thank you! |
I'd also add …
What is the best way to test perf? I don't think "ucx_perftest" works because of authentication; I am using Slurm to authenticate when running MPI workloads.
To run perftest with mpirun, UCX needs to be configured with …
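The reply above is cut off in the thread; for context, a hedged sketch assuming the --with-mpi configure option, which builds ucx_perftest with MPI support so the two test processes can bootstrap through mpirun instead of their own out-of-band TCP handshake, sidestepping the authentication issue:
# assumption: rebuild UCX with MPI support enabled for perftest
$ ./contrib/configure-release --prefix=/opt/ml4sw/MPI/ucx-1.16.0 --with-mpi
$ make -j && make install
# then launch perftest as a normal two-rank MPI job
$ mpirun -np 2 ucx_perftest -t tag_bw -m cuda -n 100 -s 230700000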
I was expecting better bandwidth for the IB device.
$ ucx_perftest 192.168.1.121 -t tag_lat
Not much change with this flag:
$ UCX_PROTO_ENABLE=n ucx_perftest 192.168.1.121 -t tag_bw -m cuda -n 100 -s 230700000
With UCX_PROTO_INFO=y:
$ UCX_PROTO_INFO=y ucx_perftest 192.168.1.121 -t tag_bw -m cuda -n 100 -s 230700000
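For completeness, ucx_perftest runs as a client/server pair; the usual two-node pattern matching the commands above would be (a sketch, with 192.168.1.121 as the server node's address):
# on the server node: start perftest with no destination argument
$ ucx_perftest -t tag_bw -m cuda -n 100 -s 230700000
# on the client node: point at the server's address
$ ucx_perftest 192.168.1.121 -t tag_bw -m cuda -n 100 -s 230700000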
Describe the bug
UCX fails whenever UCX_TLS is set to anything other than "rc". Even changing UCX_NET_DEVICES from "all" to a particular device causes issues.
Steps to Reproduce
[gpu1:1581048] pml_ucx.c:419 Error: ucp_ep_create(proc=9) failed: Destination is unreachable
[gpu1:1581048] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 9
The only working configuration is "UCX_NET_DEVICES=all" with "UCX_TLS=rc". I made sure I can reach the other nodes (ping, netcat, etc.).
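For contrast, the working combination described above:
$ export UCX_NET_DEVICES=all
$ export UCX_TLS=rc
The failing combination and launch: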
$ export UCX_NET_DEVICES=all
$ export UCX_TLS=tcp
$ export UCX_LOG_LEVEL=debug
$ export OMPI_MCA_pml=ucx
$ srun --mpi=pmix mpi_hello_world
[1716918989.011368] [gpu1:2588768:0] debug.c:1155 UCX DEBUG using signal stack 0x152ef592e000 size 141824
[1716918989.031305] [gpu1:2588768:0] cpu.c:339 UCX DEBUG measured tsc frequency 1993.110 MHz after 0.30 ms
[1716918989.031323] [gpu1:2588768:0] init.c:121 UCX DEBUG /opt/ml4sw/MPI/ucx-1.16.0/lib/libucs.so.0 loaded at 0x152ef403e000
[1716918989.031343] [gpu1:2588768:0] init.c:122 UCX DEBUG cmd line: mpi_hello_world
[1716918989.031352] [gpu1:2588768:0] module.c:72 UCX DEBUG ucs library path: /opt/ml4sw/MPI/ucx-1.16.0/lib/libucs.so.0
[1716918989.031355] [gpu1:2588768:0] module.c:280 UCX DEBUG loading modules for ucs
[1716918990.407928] [gpu1:2588768:0] time.c:22 UCX DEBUG arch clock frequency: 1993110367.89 Hz
[1716918990.407988] [gpu1:2588768:0] ucp_context.c:2137 UCX INFO Version 1.16.0 (loaded from /opt/ml4sw/MPI/ucx-1.16.0/lib/libucp.so.0)
[1716918990.407994] [gpu1:2588768:0] ucp_context.c:1904 UCX DEBUG estimated number of endpoints is 1
[1716918990.407995] [gpu1:2588768:0] ucp_context.c:1911 UCX DEBUG estimated number of endpoints per node is 1
[1716918990.407998] [gpu1:2588768:0] ucp_context.c:1921 UCX DEBUG estimated bcopy bandwidth is 7340032000.000000
[1716918990.408011] [gpu1:2588768:0] ucp_context.c:1980 UCX DEBUG allocation method[0] is md 'sysv'
[1716918990.408012] [gpu1:2588768:0] ucp_context.c:1980 UCX DEBUG allocation method[1] is md 'posix'
[1716918990.408020] [gpu1:2588768:0] ucp_context.c:1992 UCX DEBUG allocation method[2] is 'thp'
[1716918990.408022] [gpu1:2588768:0] ucp_context.c:1980 UCX DEBUG allocation method[3] is md '*'
[1716918990.408023] [gpu1:2588768:0] ucp_context.c:1992 UCX DEBUG allocation method[4] is 'mmap'
[1716918990.408024] [gpu1:2588768:0] ucp_context.c:1992 UCX DEBUG allocation method[5] is 'heap'
[1716918990.408043] [gpu1:2588768:0] module.c:280 UCX DEBUG loading modules for uct
[1716918990.408490] [gpu1:2588768:0] module.c:280 UCX DEBUG loading modules for uct_cuda
[1716918990.408859] [gpu1:2588768:0] module.c:165 UCX DEBUG ignoring 'ucs_module_global_init' (0x152ee4b7eb10) from libuct_cuda.so.0 (0x152ee4b78000), expected in libuct_cuda_gdrcopy.so.0 (152ee4972000)
[1716918990.410964] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 0 for bus id 07:00.0
[1716918990.410968] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 1 for bus id 0b:00.0
[1716918990.410970] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 2 for bus id 48:00.0
[1716918990.410975] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 3 for bus id 4c:00.0
[1716918990.410977] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 4 for bus id 88:00.0
[1716918990.410979] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 5 for bus id 8b:00.0
[1716918990.410981] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 6 for bus id c9:00.0
[1716918990.410982] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 7 for bus id cc:00.0
[1716918990.411014] [gpu1:2588768:0] module.c:280 UCX DEBUG loading modules for uct_ib
[1716918990.411234] [gpu1:2588768:0] ucp_context.c:1562 UCX DEBUG closing md self because it has no selected transport resources
[1716918990.417610] [gpu1:2588768:0] tcp_iface.c:926 UCX DEBUG filtered out bridge device docker0
[1716918990.419518] [gpu1:2588768:0] topo.c:800 UCX DEBUG /sys/class/net/ens21f0: PF sysfs path is '/sys/devices/pci0000:a0/0000:a0:03.1/0000:a3:00.0/0000:a4:02.0/0000:b0:00.0'
[1716918990.419523] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 8 for bus id b0:00.0
[1716918990.419525] [gpu1:2588768:0] topo.c:475 UCX DEBUG ens21f0: bdf_name 0000:b0:00.0 sys_dev 8
[1716918990.432256] [gpu1:2588768:0] topo.c:800 UCX DEBUG /sys/class/net/ib0: PF sysfs path is '/sys/devices/pci0000:00/0000:00:01.1/0000:03:00.0/0000:04:04.0/0000:0e:00.0'
[1716918990.432260] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 9 for bus id 0e:00.0
[1716918990.432262] [gpu1:2588768:0] topo.c:475 UCX DEBUG ib0: bdf_name 0000:0e:00.0 sys_dev 9
[1716918990.437785] [gpu1:2588768:0] topo.c:795 UCX DEBUG /sys/class/net/lo: sysfs path undetected
[1716918990.437787] [gpu1:2588768:0] topo.c:479 UCX DEBUG lo: system device unknown
[1716918990.448699] [gpu1:2588768:0] ucp_context.c:1562 UCX DEBUG closing md sysv because it has no selected transport resources
[1716918990.448760] [gpu1:2588768:0] ucp_context.c:1562 UCX DEBUG closing md posix because it has no selected transport resources
[1716918990.448775] [gpu1:2588768:0] cuda_copy_md.c:95 UCX DEBUG dmabuf is not supported on cuda device 0
[1716918990.448799] [gpu1:2588768:0] ucp_context.c:1562 UCX DEBUG closing md cuda_cpy because it has no selected transport resources
[1716918990.448821] [gpu1:2588768:0] ucp_context.c:1562 UCX DEBUG closing md cuda_ipc because it has no selected transport resources
[1716918990.448853] [gpu1:2588768:0] ucp_context.c:1562 UCX DEBUG closing md gdr_copy because it has no selected transport resources
[1716918990.460163] [gpu1:2588768:0] topo.c:800 UCX DEBUG /sys/class/infiniband/mlx5_0: PF sysfs path is '/sys/devices/pci0000:00/0000:00:01.1/0000:03:00.0/0000:04:04.0/0000:0e:00.0'
[1716918990.460168] [gpu1:2588768:0] topo.c:475 UCX DEBUG mlx5_0: bdf_name 0000:0e:00.0 sys_dev 9
[1716918990.460197] [gpu1:2588768:0] ib_device.c:487 UCX DEBUG mlx5_0: vendor_id 0x15b3 device_id 4123
[1716918990.460692] [gpu1:2588768:0] ib_mlx5dv_md.c:1188 UCX DEBUG mlx5_0: crossing_vhca_mkey is not supported
[1716918990.460693] [gpu1:2588768:0] ib_mlx5dv_md.c:1204 UCX DEBUG mlx5_0: mkey_by_name_reserve is not supported
[1716918990.460830] [gpu1:2588768:0] ib_mlx5dv_md.c:1010 UCX DEBUG mlx5_0: ODP is disabled because version 1 is not supported for DevX QP
[1716918990.461010] [gpu1:2588768:0] async.c:232 UCX DEBUG added async handler 0xeceaf0 [id=89 ref 1] ???() to hash
[1716918990.461277] [gpu1:2588768:0] async.c:494 UCX DEBUG listening to async event fd 89 events 0x1 mode thread_spinlock
[1716918990.461282] [gpu1:2588768:0] ib_device.c:586 UCX DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1716918990.461294] [gpu1:2588768:0] ib_md.c:1128 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1716918990.461299] [gpu1:2588768:0] ib_md.c:1128 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1716918990.461305] [gpu1:2588768:0] ib_md.c:1149 UCX DEBUG mlx5_0: ibv_reg_dmabuf_mr(fd=-1) returned Protocol not supported, dmabuf is not supported
[1716918990.461308] [gpu1:2588768:0] mpool.c:138 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1716918990.461600] [gpu1:2588768:0] ib_mlx5dv_md.c:1696 UCX DEBUG mlx5_0: opened DEVX md log_max_qp=17
[1716918990.462574] [gpu1:2588768:0] ib_mlx5dv_md.c:94 UCX DEBUG mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0x45d3a4: Remote I/O error
[1716918990.462928] [gpu1:2588768:0] ib_md.c:1116 UCX DEBUG mlx5_0: relaxed order memory access is enabled
[1716918990.463247] [gpu1:2588768:0] ib_mlx5dv_md.c:1141 UCX DEBUG created indirect rkey 0x3b400 for remote flush
[1716918990.463249] [gpu1:2588768:0] ib_md.c:1067 UCX DEBUG mlx5_0: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1716918990.464745] [gpu1:2588768:0] ucp_context.c:1562 UCX DEBUG closing md mlx5_0 because it has no selected transport resources
[1716918990.464750] [gpu1:2588768:0] ib_mlx5dv_md.c:1755 UCX DEBUG mlx5_0: md=0xed3650 md->flags=0x3f1d7f flush_rkey=0x3b400
[1716918990.465038] [gpu1:2588768:0] mpool.c:194 UCX DEBUG mpool devx dbrec destroyed
[1716918990.465042] [gpu1:2588768:0] ib_device.c:605 UCX DEBUG destroying ib device mlx5_0
[1716918990.465046] [gpu1:2588768:0] async.c:157 UCX DEBUG removed async handler 0xeceaf0 [id=89 ref 1] ???() from hash
[1716918990.465047] [gpu1:2588768:0] async.c:547 UCX DEBUG removing async handler 0xeceaf0 [id=89 ref 1] ???()
[1716918990.465094] [gpu1:2588768:0] async.c:172 UCX DEBUG release async handler 0xeceaf0 [id=89 ref 0] ???()
[1716918990.487917] [gpu1:2588768:0] topo.c:800 UCX DEBUG /sys/class/infiniband/mlx5_1: PF sysfs path is '/sys/devices/pci0000:00/0000:00:01.1/0000:03:00.0/0000:04:04.0/0000:0e:00.1'
[1716918990.487922] [gpu1:2588768:0] topo.c:240 UCX DEBUG added sys_dev 10 for bus id 0e:00.1
[1716918990.487923] [gpu1:2588768:0] topo.c:475 UCX DEBUG mlx5_1: bdf_name 0000:0e:00.1 sys_dev 10
[1716918990.487949] [gpu1:2588768:0] ib_device.c:487 UCX DEBUG mlx5_1: vendor_id 0x15b3 device_id 4123
[1716918990.488421] [gpu1:2588768:0] ib_mlx5dv_md.c:1188 UCX DEBUG mlx5_1: crossing_vhca_mkey is not supported
[1716918990.488422] [gpu1:2588768:0] ib_mlx5dv_md.c:1204 UCX DEBUG mlx5_1: mkey_by_name_reserve is not supported
[1716918990.488556] [gpu1:2588768:0] ib_mlx5dv_md.c:1010 UCX DEBUG mlx5_1: ODP is disabled because version 1 is not supported for DevX QP
[1716918990.488715] [gpu1:2588768:0] async.c:232 UCX DEBUG added async handler 0xed4290 [id=89 ref 1] ???() to hash
[1716918990.488818] [gpu1:2588768:0] async.c:494 UCX DEBUG listening to async event fd 89 events 0x1 mode thread_spinlock
[1716918990.488820] [gpu1:2588768:0] ib_device.c:586 UCX DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1716918990.488826] [gpu1:2588768:0] ib_md.c:1128 UCX DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1716918990.488831] [gpu1:2588768:0] ib_md.c:1128 UCX DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1716918990.488835] [gpu1:2588768:0] ib_md.c:1149 UCX DEBUG mlx5_1: ibv_reg_dmabuf_mr(fd=-1) returned Protocol not supported, dmabuf is not supported
[1716918990.488837] [gpu1:2588768:0] mpool.c:138 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1716918990.489090] [gpu1:2588768:0] ib_mlx5dv_md.c:1696 UCX DEBUG mlx5_1: opened DEVX md log_max_qp=17
[1716918990.489984] [gpu1:2588768:0] ib_mlx5dv_md.c:94 UCX DEBUG mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0x45d3a4: Remote I/O error
[1716918990.490324] [gpu1:2588768:0] ib_md.c:1116 UCX DEBUG mlx5_1: relaxed order memory access is enabled
[1716918990.490631] [gpu1:2588768:0] ib_mlx5dv_md.c:1141 UCX DEBUG created indirect rkey 0x1bf000 for remote flush
[1716918990.490633] [gpu1:2588768:0] ib_md.c:1067 UCX DEBUG mlx5_1: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1716918990.490651] [gpu1:2588768:0] ib_device.c:1052 UCX DEBUG no compatible IB ports found for flags 0xc4
[1716918990.490654] [gpu1:2588768:0] uct_md.c:97 UCX DEBUG failed to query dc_mlx5 resources: No such device
[1716918990.492004] [gpu1:2588768:0] ib_device.c:1052 UCX DEBUG no compatible IB ports found for flags 0x0
[1716918990.492005] [gpu1:2588768:0] uct_md.c:97 UCX DEBUG failed to query rc_verbs resources: No such device
[1716918990.492007] [gpu1:2588768:0] ib_device.c:1052 UCX DEBUG no compatible IB ports found for flags 0x4
[1716918990.492008] [gpu1:2588768:0] uct_md.c:97 UCX DEBUG failed to query rc_mlx5 resources: No such device
[1716918990.492009] [gpu1:2588768:0] ib_device.c:1052 UCX DEBUG no compatible IB ports found for flags 0x0
[1716918990.492009] [gpu1:2588768:0] uct_md.c:97 UCX DEBUG failed to query ud_verbs resources: No such device
[1716918990.492010] [gpu1:2588768:0] ib_device.c:1052 UCX DEBUG no compatible IB ports found for flags 0x4
[1716918990.492011] [gpu1:2588768:0] uct_md.c:97 UCX DEBUG failed to query ud_mlx5 resources: No such device
[1716918990.492012] [gpu1:2588768:0] ucp_context.c:1117 UCX DEBUG No tl resources found for md mlx5_1
[1716918990.492013] [gpu1:2588768:0] ucp_context.c:1562 UCX DEBUG closing md mlx5_1 because it has no selected transport resources
[1716918990.492018] [gpu1:2588768:0] ib_mlx5dv_md.c:1755 UCX DEBUG mlx5_1: md=0xed5340 md->flags=0x3f1d7f flush_rkey=0x1bf000
[1716918990.492291] [gpu1:2588768:0] mpool.c:194 UCX DEBUG mpool devx dbrec destroyed
[1716918990.492292] [gpu1:2588768:0] ib_device.c:605 UCX DEBUG destroying ib device mlx5_1
[1716918990.492294] [gpu1:2588768:0] async.c:157 UCX DEBUG removed async handler 0xed4290 [id=89 ref 1] ???() from hash
[1716918990.492295] [gpu1:2588768:0] async.c:547 UCX DEBUG removing async handler 0xed4290 [id=89 ref 1] ???()
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 101522.0 ON gpu1 CANCELLED AT 2024-05-28T10:56:30 ***
[1716918990.492331] [gpu1:25887
srun: error: gpu1: task 0: Exited with exit code 1
$ ucx_info -v
Library version: 1.16.0
Library path: /opt/ml4sw/MPI/ucx-1.16.0/lib/libucs.so.0
API headers version: 1.16.0
Git branch '', revision e4bb802
Configured with: --prefix=/opt/ml4sw/MPI/ucx-1.16.0 --with-cuda=/usr/local/cuda --with-gdrcopy=/usr
Setup and versions
Slurm - 23.11.5
OpenMPI - 5.0.3
Pmix - 5.0.2
Enroot - 3.4.1-1
UCX - 1.16.0
cat /etc/issue or cat /etc/redhat-release + uname -a
Red Hat Enterprise Linux release 8.9 (Ootpa) + Linux gpu1 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Mar 14 14:20:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/mlnx-release (the string identifies software and firmware setup)
rpm -q rdma-core or rpm -q libibverbs
ofed_info -s
ibstat or ibv_devinfo -vv
$ ibv_devinfo -vv
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
$ lsmod|grep gdrdrv
gdrdrv 24576 0
nvidia 54001664 1361 nvidia_uvm,gdrdrv,nvidia_modeset
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX
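For example, a quick way to check which transports and devices UCX recognizes (the grep pattern is illustrative):
$ ucx_info -d | grep -E 'Transport|Device'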