ucx panic in func rdma_get_cm_event #8905
Comments
@HehuaTang the crash seems to be coming from rdma_get_cm_event in librdmacm. Can you try running a simple rdmacm test in the k8s container?
I tried it, and it failed to run on the client.
mlx5_19 1 0 fe80:0000:0000:0000:f816:05ff:fe26:942c v1 eth0
Client:
[root@wjw-roce-test220-m /]# show_gids
mlx5_14 1 0 fe80:0000:0000:0000:f816:92ff:fe5d:9dff v1 eth0
[root@wjw-roce-test220-m /]# ib_send_lat -R 10.38.155.116 -x 3 -d mlx5_14
When the device is mounted into the container, all the SR-IOV devices are visible, but only one of them works, because the container's network is isolated.
PS: if I remove the -R option on both the client and the server, the ib_send_lat test runs successfully.
dmesg:
So it seems to be an issue in rdma-core and not in UCX.
Thank you for helping me find the root cause.
Closing this, as it is not a UCX issue but a librdmacm issue.
Describe the bug
Client stack backtrace:
I0224 08:23:21.570152 67335 /orpc/orpc-dep/orpc/src/brpc/ucp_ctx.cpp:75] Running with ucp library version: 1.14.0
[wjw-roce-test231-m:67322:a:67355] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 67355) ====
0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7fcd121fefd4]
1 /usr/local/ucx/lib/libucs.so.0(+0x2e2fc) [0x7fcd121ff2fc]
2 /usr/local/ucx/lib/libucs.so.0(+0x2e56b) [0x7fcd121ff56b]
3 /lib64/libpthread.so.0(+0xf630) [0x7fcd11dc2630]
4 /lib64/librdmacm.so.1(rdma_get_cm_event+0x39e) [0x7fcce66ff18e]
5 /usr/local/ucx/lib/ucx/libuct_rdmacm.so.0(+0x5a7f) [0x7fcce6915a7f]
6 /usr/local/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x160) [0x7fcd121e8710]
7 /usr/local/ucx/lib/libucs.so.0(+0x1a1c3) [0x7fcd121eb1c3]
8 /usr/local/ucx/lib/libucs.so.0(ucs_event_set_wait+0xa9) [0x7fcd12207fe9]
9 /usr/local/ucx/lib/libucs.so.0(+0x1a4fc) [0x7fcd121eb4fc]
10 /lib64/libpthread.so.0(+0x7ea5) [0x7fcd11dbaea5]
11 /lib64/libc.so.6(clone+0x6d) [0x7fcd100c8b0d]
Segmentation fault
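To make the reading of this trace explicit, here is a minimal sketch in Python that picks out the frame where the fault most likely occurred. It assumes the leading libucs frames are the UCX error handler and that the libpthread frame after them is the signal trampoline, so the first frame past them is the one of interest. The names `parse_frames` and `faulting_frame` are hypothetical helpers, not part of UCX.

```python
import os
import re

# A few frames copied from the client backtrace above.
SAMPLE = """\
0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7fcd121fefd4]
1 /usr/local/ucx/lib/libucs.so.0(+0x2e2fc) [0x7fcd121ff2fc]
2 /usr/local/ucx/lib/libucs.so.0(+0x2e56b) [0x7fcd121ff56b]
3 /lib64/libpthread.so.0(+0xf630) [0x7fcd11dc2630]
4 /lib64/librdmacm.so.1(rdma_get_cm_event+0x39e) [0x7fcce66ff18e]
5 /usr/local/ucx/lib/ucx/libuct_rdmacm.so.0(+0x5a7f) [0x7fcce6915a7f]
"""

# Matches "<index> <library>(<symbol>+<offset>) [<address>]"; the symbol may be empty.
FRAME_RE = re.compile(
    r"^\s*(?P<idx>\d+)\s+(?P<lib>\S+?)\((?P<sym>[^+)]*)\+(?P<off>0x[0-9a-f]+)\)"
    r"\s+\[(?P<addr>0x[0-9a-f]+)\]"
)


def parse_frames(text):
    """Parse backtrace lines into dicts; lines that do not look like frames are skipped."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append({
                "idx": int(m.group("idx")),
                "lib": m.group("lib"),
                "sym": m.group("sym") or None,  # None when the symbol is stripped
                "off": m.group("off"),
            })
    return frames


def faulting_frame(frames):
    """Return the first frame that is not in libucs or libpthread (assumed handler frames)."""
    for frame in frames:
        name = os.path.basename(frame["lib"])
        if name.startswith("libucs.") or name.startswith("libpthread."):
            continue
        return frame
    return None


if __name__ == "__main__":
    frame = faulting_frame(parse_frames(SAMPLE))
    print(frame["idx"], frame["lib"], frame["sym"])
```

On the client trace this selects frame 4, the `rdma_get_cm_event` call inside librdmacm. The skip rule is a heuristic for traces of this shape, not a general unwinder.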
Server backtrace:
The same as the client stack backtrace.
[root@wjw-roce-test231-m bu]# cp ../*.pem ./
[root@wjw-roce-test231-m bu]# UCX_TLS=^tcp UCX_IB_GID_INDEX=3 UCX_NET_DEVICES=mlx5_19:1 ./multi_threaded_echo_server
I0224 11:20:01.514996 104434 /orpc/orpc-dep/orpc/src/brpc/ucp_ctx.cpp:75] Running with ucp library version: 1.14.0
I0224 11:20:02.259763 104434 /orpc/orpc-dep/orpc/src/brpc/ucp_acceptor.cpp:323] Ucp server is listening on IP 0.0.0.0 port 13339, idle connection check interval: -1s
I0224 11:20:02.259822 104434 /orpc/orpc-dep/orpc/src/brpc/server.cpp:1133] Server[example::EchoServiceImpl] is serving on port=8002.
I0224 11:20:02.260134 104434 /orpc/orpc-dep/orpc/src/brpc/server.cpp:1136] Check out http://wjw-roce-test231-m:8002 in web browser.
[wjw-roce-test231-m:104434:a:104538] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 104538) ====
0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7f653a9c8374]
1 /usr/local/ucx/lib/libucs.so.0(+0x2e69c) [0x7f653a9c869c]
2 /usr/local/ucx/lib/libucs.so.0(+0x2e90b) [0x7f653a9c890b]
3 /lib64/libpthread.so.0(+0xf630) [0x7f653a58b630]
4 /usr/local/ucx/lib/ucx/libuct_rdmacm.so.0(+0x5c1b) [0x7f64ec3d1c1b]
5 /usr/local/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x160) [0x7f653a9b18d0]
6 /usr/local/ucx/lib/libucs.so.0(+0x1a563) [0x7f653a9b4563]
7 /usr/local/ucx/lib/libucs.so.0(ucs_event_set_wait+0xa9) [0x7f653a9d1389]
8 /usr/local/ucx/lib/libucs.so.0(+0x1a89c) [0x7f653a9b489c]
9 /lib64/libpthread.so.0(+0x7ea5) [0x7f653a583ea5]
10 /lib64/libc.so.6(clone+0x6d) [0x7f6538891b0d]
Segmentation fault
Steps to Reproduce
UCX_TLS=^tcp UCX_IB_GID_INDEX=3 UCX_NET_DEVICES=mlx5_19:1 ./multi_threaded_echo_client -server=x.x.x.x:13339 --use_ucp=true --thread_num=1 --brpc_ucp_worker_busy_poll=true --attachment_size=2048
UCX version (can be checked with ucx_info -v): the ucx-1.14.x code was downloaded from the ucx-1.14.x branch tag on GitHub on 2/23/2023.
ucx_info -v
Library version: 1.14.0
Library path: /usr/lib64/libucs.so.0
API headers version: 1.14.0
Git branch '', revision f8877c5
Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.7
Any UCX environment variables used
Setup and versions
cat /etc/issue or cat /etc/redhat-release + uname -a
[root@wjw-roce-test231-m bu]# cat /etc/issue
\S
Kernel \r on an \m
[root@wjw-roce-test231-m bu]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[root@wjw-roce-test231-m bu]# uname -a
Linux wjw-roce-test231-m 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/mlnx-release (the string identifies software and firmware setup)
Driver version:
rpm -q rdma-core or rpm -q libibverbs
ofed_info -s
[root@wjw-roce-test231-m bu]# rpm -q rmda-core
package rmda-core is not installed
[root@wjw-roce-test231-m bu]# rpm -qa | grep rdma
rdma-core-devel-58mlnx43-1.58112.x86_64
librdmacm-utils-58mlnx43-1.58112.x86_64
rdma-core-58mlnx43-1.58112.x86_64
librdmacm-58mlnx43-1.58112.x86_64
ucx-rdmacm-1.14.0-1.58112.x86_64
[root@wjw-roce-test231-m bu]# rpm -qa |grep libibverbs
libibverbs-58mlnx43-1.58112.x86_64
libibverbs-utils-58mlnx43-1.58112.x86_64
[root@wjw-roce-test231-m bu]# ofed_info -s
MLNX_OFED_LINUX-5.8-1.1.2.1:
HW information from the ibstat or ibv_devinfo -vv command:
[root@wjw-roce-test231-m bu]# show_gids
DEV      PORT  INDEX  GID                                      IPv4           VER  DEV
mlx5_19  1     0      fe80:0000:0000:0000:f816:05ff:fe26:942c                 v1   eth0
mlx5_19  1     1      fe80:0000:0000:0000:f816:05ff:fe26:942c                 v2   eth0
mlx5_19  1     2      0000:0000:0000:0000:0000:ffff:0a26:9b74  10.38.155.116  v1   eth0
mlx5_19  1     3      0000:0000:0000:0000:0000:ffff:0a26:9b74  10.38.155.116  v2   eth0
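The GIDs at index 2 and 3 are IPv4-mapped IPv6 addresses, which is why show_gids prints an IPv4 column for them. As a small illustration, the sketch below decodes such a GID with Python's standard `ipaddress` module. The helper name `gid_to_ipv4` is hypothetical.

```python
import ipaddress


def gid_to_ipv4(gid):
    """Return the IPv4 address embedded in an IPv4-mapped GID, or None for other GIDs."""
    addr = ipaddress.IPv6Address(gid)
    # ipv4_mapped is None unless the address has the ::ffff:a.b.c.d form.
    return addr.ipv4_mapped


if __name__ == "__main__":
    # GID at index 2 and 3 of mlx5_19 in the table above.
    print(gid_to_ipv4("0000:0000:0000:0000:0000:ffff:0a26:9b74"))
    # Link-local GID at index 0 and 1: carries no IPv4 address.
    print(gid_to_ipv4("fe80:0000:0000:0000:f816:05ff:fe26:942c"))
```

The first call yields 10.38.155.116, the address used as the ib_send_lat target earlier in this thread. The link-local GID yields None.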
3773- LMC: 0
3782- SM lid: 0
3794- Capability mask: 0x00010000
3824- Port GUID: 0x0000000000000000
3856- Link layer: Ethernet
3879:CA 'mlx5_19'
3892- CA type: MT4120
3909- Number of ports: 1
3929- Firmware version: 16.31.2006
3959- Hardware version: 0
3980- Node GUID: 0x0000000000000000
[root@wjw-roce-test231-m bu]# ibv_devinfo -vv | grep 5_19 -a5 -b5
58540- GID[ 0]: fe80:0000:0000:0000:f816:92ff:fec2:f696, RoCE v1
58603- GID[ 1]: fe80::f816:92ff:fec2:f696, RoCE v2
58652- GID[ 2]: 0000:0000:0000:0000:0000:ffff:0a26:9849, RoCE v1
58715- GID[ 3]: ::ffff:10.38.152.73, RoCE v2
58758-
58759:hca_id: mlx5_19
58775- transport: InfiniBand (0)
58804- fw_ver: 16.31.2006
58827- node_guid: 0000:0000:0000:0000
58861- sys_image_guid: b8ce:f603:000c:29c8
58900- vendor_id: 0x02c9
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
kubectl version
Client Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-csp.10.9", GitCommit:"367a19c21ce71e1b0b6e99fc2dff3929b9f13bc8", GitTreeState:"clean", BuildDate:"2022-02-10T03:28:52Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-csp.10.8", GitCommit:"ab4fd8847d5880ffcffaec92a73e4e7130ee49ca", GitTreeState:"clean", BuildDate:"2021-09-09T02:30:36Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX
Issue description:
I run two RDMA apps, a client and a server, in pods. Each runs in a namespace-separated pod created by k8s. The RDMA device is an SR-IOV NIC. The crash happens as described above. If I run the same apps in two containers created with docker run, they work fine.
So I
docker run --net=host --cap-add SYS_PTRACE --shm-size=8g --device=/dev/infiniband:/dev/infiniband:rw --name orpc-test-2 hub.xyz.com.orpc-rdma/orpc-rdma-depy:v1.0.0
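One visible difference between the two setups is that the docker command uses the host network (`--net=host`), while each k8s pod has its own network namespace. A rough way to confirm which case a process is in is to compare network-namespace inodes from /proc. The sketch below is Linux only and assumes /proc is mounted. The helper names are hypothetical, and reading another process's entry (for example PID 1) may need root.

```python
import os
import re


def parse_ns_link(link):
    """Extract the inode number from a namespace link such as 'net:[4026531992]'."""
    m = re.fullmatch(r"net:\[(\d+)\]", link)
    if m is None:
        raise ValueError("unexpected namespace link: %r" % link)
    return int(m.group(1))


def netns_inode(pid="self"):
    """Return the network-namespace inode of a process, read from /proc."""
    return parse_ns_link(os.readlink("/proc/%s/ns/net" % pid))


def same_netns(pid_a, pid_b):
    """True when both processes share a network namespace."""
    return netns_inode(pid_a) == netns_inode(pid_b)


if __name__ == "__main__":
    # Inside a container started with --net=host this matches the host's value;
    # inside a pod with its own network namespace it differs.
    print(netns_inode())
```

Running it on the host and then inside each container gives values that can be compared directly. Equal inodes mean a shared network namespace.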