ucx panic in func rdma_get_cm_event #8905

Closed
HehuaTang opened this issue Feb 24, 2023 · 5 comments

@HehuaTang

Describe the bug

Client stack backtrace:
I0224 08:23:21.570152 67335 /orpc/orpc-dep/orpc/src/brpc/ucp_ctx.cpp:75] Running with ucp library version: 1.14.0
[wjw-roce-test231-m:67322:a:67355] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 67355) ====
0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7fcd121fefd4]
1 /usr/local/ucx/lib/libucs.so.0(+0x2e2fc) [0x7fcd121ff2fc]
2 /usr/local/ucx/lib/libucs.so.0(+0x2e56b) [0x7fcd121ff56b]
3 /lib64/libpthread.so.0(+0xf630) [0x7fcd11dc2630]
4 /lib64/librdmacm.so.1(rdma_get_cm_event+0x39e) [0x7fcce66ff18e]
5 /usr/local/ucx/lib/ucx/libuct_rdmacm.so.0(+0x5a7f) [0x7fcce6915a7f]
6 /usr/local/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x160) [0x7fcd121e8710]
7 /usr/local/ucx/lib/libucs.so.0(+0x1a1c3) [0x7fcd121eb1c3]
8 /usr/local/ucx/lib/libucs.so.0(ucs_event_set_wait+0xa9) [0x7fcd12207fe9]
9 /usr/local/ucx/lib/libucs.so.0(+0x1a4fc) [0x7fcd121eb4fc]
10 /lib64/libpthread.so.0(+0x7ea5) [0x7fcd11dbaea5]
11 /lib64/libc.so.6(clone+0x6d) [0x7fcd100c8b0d]

Segmentation fault
Server backtrace (the same pattern as the client stack backtrace):
[root@wjw-roce-test231-m bu]# cp ../*.pem ./
[root@wjw-roce-test231-m bu]# UCX_TLS=^tcp UCX_IB_GID_INDEX=3 UCX_NET_DEVICES=mlx5_19:1 ./multi_threaded_echo_server
I0224 11:20:01.514996 104434 /orpc/orpc-dep/orpc/src/brpc/ucp_ctx.cpp:75] Running with ucp library version: 1.14.0
I0224 11:20:02.259763 104434 /orpc/orpc-dep/orpc/src/brpc/ucp_acceptor.cpp:323] Ucp server is listening on IP 0.0.0.0 port 13339, idle connection check interval: -1s
I0224 11:20:02.259822 104434 /orpc/orpc-dep/orpc/src/brpc/server.cpp:1133] Server[example::EchoServiceImpl] is serving on port=8002.
I0224 11:20:02.260134 104434 /orpc/orpc-dep/orpc/src/brpc/server.cpp:1136] Check out http://wjw-roce-test231-m:8002 in web browser.

[wjw-roce-test231-m:104434:a:104538] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 104538) ====
0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7f653a9c8374]
1 /usr/local/ucx/lib/libucs.so.0(+0x2e69c) [0x7f653a9c869c]
2 /usr/local/ucx/lib/libucs.so.0(+0x2e90b) [0x7f653a9c890b]
3 /lib64/libpthread.so.0(+0xf630) [0x7f653a58b630]
4 /usr/local/ucx/lib/ucx/libuct_rdmacm.so.0(+0x5c1b) [0x7f64ec3d1c1b]
5 /usr/local/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x160) [0x7f653a9b18d0]
6 /usr/local/ucx/lib/libucs.so.0(+0x1a563) [0x7f653a9b4563]
7 /usr/local/ucx/lib/libucs.so.0(ucs_event_set_wait+0xa9) [0x7f653a9d1389]
8 /usr/local/ucx/lib/libucs.so.0(+0x1a89c) [0x7f653a9b489c]
9 /lib64/libpthread.so.0(+0x7ea5) [0x7f653a583ea5]
10 /lib64/libc.so.6(clone+0x6d) [0x7f6538891b0d]

Segmentation fault

Steps to Reproduce

  • Command line
    UCX_TLS=^tcp UCX_IB_GID_INDEX=3 UCX_NET_DEVICES=mlx5_19:1 ./multi_threaded_echo_client -server=x.x.x.x:13339 --use_ucp=true --thread_num=1 --brpc_ucp_worker_busy_poll=true --attachment_size=2048
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
    ucx-1.14.x code was downloaded from the ucx-1.14.x branch tag on GitHub on 2023-02-23

ucx_info -v

Library version: 1.14.0

Library path: /usr/lib64/libucs.so.0

API headers version: 1.14.0

Git branch '', revision f8877c5

Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.7

Any UCX environment variables used: UCX_TLS=^tcp UCX_IB_GID_INDEX=3 UCX_NET_DEVICES=mlx5_19:1 (as shown in the command lines above)

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • cat /etc/issue or cat /etc/redhat-release + uname -a
      [root@wjw-roce-test231-m bu]# cat /etc/issue
      \S
      Kernel \r on an \m
      [root@wjw-roce-test231-m bu]# cat /etc/redhat-release
      CentOS Linux release 7.6.1810 (Core)
      [root@wjw-roce-test231-m bu]# uname -a
      Linux wjw-roce-test231-m 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
    • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
  • For RDMA/IB/RoCE related issues:
    • Driver version:

      • rpm -q rdma-core or rpm -q libibverbs
      • or: MLNX_OFED version ofed_info -s
        [root@wjw-roce-test231-m bu]# rpm -q rmda-core
        package rmda-core is not installed
        [root@wjw-roce-test231-m bu]# rpm -qa | grep rdma
        rdma-core-devel-58mlnx43-1.58112.x86_64
        librdmacm-utils-58mlnx43-1.58112.x86_64
        rdma-core-58mlnx43-1.58112.x86_64
        librdmacm-58mlnx43-1.58112.x86_64
        ucx-rdmacm-1.14.0-1.58112.x86_64
        [root@wjw-roce-test231-m bu]# rpm -qa |grep libibverbs
        libibverbs-58mlnx43-1.58112.x86_64
        libibverbs-utils-58mlnx43-1.58112.x86_64
        [root@wjw-roce-test231-m bu]# ofed_info -s
        MLNX_OFED_LINUX-5.8-1.1.2.1:
    • HW information from ibstat or ibv_devinfo -vv command
      [root@wjw-roce-test231-m bu]# show_gids
      DEV PORT INDEX GID IPv4 VER DEV


mlx5_19 1 0 fe80:0000:0000:0000:f816:05ff:fe26:942c v1 eth0
mlx5_19 1 1 fe80:0000:0000:0000:f816:05ff:fe26:942c v2 eth0
mlx5_19 1 2 0000:0000:0000:0000:0000:ffff:0a26:9b74 10.38.155.116 v1 eth0
mlx5_19 1 3 0000:0000:0000:0000:0000:ffff:0a26:9b74 10.38.155.116 v2 eth0

3773- LMC: 0
3782- SM lid: 0
3794- Capability mask: 0x00010000
3824- Port GUID: 0x0000000000000000
3856- Link layer: Ethernet
3879:CA 'mlx5_19'
3892- CA type: MT4120
3909- Number of ports: 1
3929- Firmware version: 16.31.2006
3959- Hardware version: 0
3980- Node GUID: 0x0000000000000000
[root@wjw-roce-test231-m bu]# ibv_devinfo -vv | grep 5_19 -a5 -b5
58540- GID[ 0]: fe80:0000:0000:0000:f816:92ff:fec2:f696, RoCE v1
58603- GID[ 1]: fe80::f816:92ff:fec2:f696, RoCE v2
58652- GID[ 2]: 0000:0000:0000:0000:0000:ffff:0a26:9849, RoCE v1
58715- GID[ 3]: ::ffff:10.38.152.73, RoCE v2
58758-
58759:hca_id: mlx5_19
58775- transport: InfiniBand (0)
58804- fw_ver: 16.31.2006
58827- node_guid: 0000:0000:0000:0000
58861- sys_image_guid: b8ce:f603:000c:29c8
58900- vendor_id: 0x02c9

  • For GPU related issues:
    • GPU type
    • Cuda:
      • Drivers version
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
        kubectl version
        Client Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-csp.10.9", GitCommit:"367a19c21ce71e1b0b6e99fc2dff3929b9f13bc8", GitTreeState:"clean", BuildDate:"2022-02-10T03:28:52Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
        Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-csp.10.8", GitCommit:"ab4fd8847d5880ffcffaec92a73e4e7130ee49ca", GitTreeState:"clean", BuildDate:"2021-09-09T02:30:36Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Additional information (depending on the issue)

  • OpenMPI version
  • Output of ucx_info -d to show transports and devices recognized by UCX
  • Configure result - config.log
  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"

Issue description:
I run two RDMA apps, a client and a server, in pods. They run in separate pods in the same namespace, created by k8s, and the RDMA device is an SR-IOV NIC. The crash happened as described above. If I run the same apps in two containers created by docker run, they work fine.
The working containers were started with:
docker run --net=host --cap-add SYS_PTRACE --shm-size=8g --device=/dev/infiniband:/dev/infiniband:rw --name orpc-test-2 hub.xyz.com.orpc-rdma/orpc-rdma-depy:v1.0.0
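
For reference only (not part of the issue), a minimal sketch of such a check using the stock libibverbs/librdmacm API: it lists the devices visible inside the pod and tries to bind an RDMA CM id to the pod's IP, which is roughly what a rdmacm-based transport has to do before any CM events can be delivered. The address 10.38.155.21 is a hypothetical example (it matches the client's show_gids output further down), and error handling is abbreviated.

/* rdma_check.c - hedged sketch: verify that RDMA CM can map the pod IP
 * to one of the SR-IOV devices visible inside the container. */
#include <stdio.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    /* 1. List the ibverbs devices visible in this container. */
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    for (int i = 0; i < num; i++)
        printf("visible device: %s\n", ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);

    /* 2. Try to bind an RDMA CM id to the pod's IP address. */
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id = NULL;
    if (ch == NULL || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP) != 0) {
        perror("rdma_create_id");
        return 1;
    }

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    inet_pton(AF_INET, "10.38.155.21", &addr.sin_addr);  /* example pod IP */

    if (rdma_bind_addr(id, (struct sockaddr *)&addr) != 0)
        perror("rdma_bind_addr");   /* CM cannot map this IP to an RDMA device */
    else
        printf("bound via %s\n",
               id->verbs ? ibv_get_device_name(id->verbs->device) : "(no device)");

    rdma_destroy_id(id);
    rdma_destroy_event_channel(ch);
    return 0;
}

Built with something like: gcc rdma_check.c -o rdma_check -libverbs -lrdmacm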

@HehuaTang HehuaTang added the Bug label Feb 24, 2023
@yosefe
Contributor

yosefe commented Feb 26, 2023

@HehuaTang the crash seems to be coming from rdma_get_cm_event in librdmacm. Can you try running a simple rdmacm test in the k8s containers?
server: ib_send_lat -R -x 3
client: ib_send_lat -R <server-rdma-ip> -x 3
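
For context, frame 4 of both backtraces is the standard librdmacm event loop that UCX's uct_rdmacm component (and ib_send_lat -R) drives. A minimal, self-contained sketch of that loop, using only the stock librdmacm API (not taken from UCX or from the reporter's code), looks roughly like this:

/* cm_event_loop.c - hedged sketch of the call shown in the backtrace. */
#include <stdio.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    /* The event channel that the async thread polls. */
    struct rdma_event_channel *ch = rdma_create_event_channel();
    if (ch == NULL) {
        perror("rdma_create_event_channel");
        return 1;
    }

    struct rdma_cm_id *id = NULL;
    if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP) != 0) {
        perror("rdma_create_id");
        return 1;
    }

    /* rdma_get_cm_event() is frame 4 in the crash: it blocks until a
     * connection-management event (address resolved, connect request,
     * established, ...) arrives on the channel. */
    struct rdma_cm_event *event = NULL;
    if (rdma_get_cm_event(ch, &event) == 0) {
        printf("CM event: %s\n", rdma_event_str(event->event));
        rdma_ack_cm_event(event);
    }

    rdma_destroy_id(id);
    rdma_destroy_event_channel(ch);
    return 0;
}

A segmentation fault at address (nil) inside rdma_get_cm_event() itself points at librdmacm's event handling rather than at the UCX code that calls it, which is what the ib_send_lat -R experiment below confirms.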

@HehuaTang
Author

HehuaTang commented Feb 27, 2023

I have tried it and it failed to run on the client.
Server:
[root@wjw-roce-test231-m /]# ib_send_lat -R -x 3
Port number 1 state is Down
Couldn't set the link layer
Couldn't get context for the device
[root@wjw-roce-test231-m /]#
[root@wjw-roce-test231-m /]# show_gids
DEV PORT INDEX GID IPv4 VER DEV


mlx5_19 1 0 fe80:0000:0000:0000:f816:05ff:fe26:942c v1 eth0
mlx5_19 1 1 fe80:0000:0000:0000:f816:05ff:fe26:942c v2 eth0
mlx5_19 1 2 0000:0000:0000:0000:0000:ffff:0a26:9b74 10.38.155.116 v1 eth0
mlx5_19 1 3 0000:0000:0000:0000:0000:ffff:0a26:9b74 10.38.155.116 v2 eth0
n_gids_found=4
[root@wjw-roce-test231-m /]# ib_send_lat -d mlx5_19 -X 3
Events must be enabled to select a completion vector
[root@wjw-roce-test231-m /]# ib_send_lat -d mlx5_19 -x 3


* Waiting for client to connect... *

Client:

[root@wjw-roce-test220-m /]# show_gids
DEV PORT INDEX GID IPv4 VER DEV


mlx5_14 1 0 fe80:0000:0000:0000:f816:92ff:fe5d:9dff v1 eth0
mlx5_14 1 1 fe80:0000:0000:0000:f816:92ff:fe5d:9dff v2 eth0
mlx5_14 1 2 0000:0000:0000:0000:0000:ffff:0a26:9b15 10.38.155.21 v1 eth0
mlx5_14 1 3 0000:0000:0000:0000:0000:ffff:0a26:9b15 10.38.155.21 v2 eth0
n_gids_found=4

[root@wjw-roce-test220-m /]# ib_send_lat -R 10.38.155.116 -x 3 -d mlx5_14
Segmentation fault

When the device is mounted into the container, all of the SR-IOV devices are visible, but only one of them can actually work because of the container's isolated network.
"Devices": [
    {
        "PathOnHost": "/dev/infiniband",
        "PathInContainer": "/dev/infiniband",
        "CgroupPermissions": "rwm"
    }
],

PS: if I remove the -R option on both the client and the server, the ib_send_lat test runs successfully.

dmesg:
[416451.853362] ib_send_lat[95655]: segfault at 0 ip 00007f3b6ecc518e sp 00007fff234faa60 error 4 in librdmacm.so.1.3.43.0[7f3b6ecbc000+18000]

@yosefe
Contributor

yosefe commented Feb 27, 2023

So it seems to be an issue in rdma-core and not in UCX.

@HehuaTang
Author

HehuaTang commented Feb 27, 2023

Thank you for helping me find the root cause.

@HehuaTang
Author

Closing this, as it is not a UCX issue but a librdmacm issue.
