UCX Crashing with > 160 cores when using Intel MPI #9071
Comments
Seems similar to #8620
@ebolandrtx can you please check if
Hi, I ended up setting UCX_TLS=rc,ud,sm,self and that worked. Is there a difference between ud_v and ud that I should be aware of, and should I be using dc instead of rc?
Seeing two new errors now (associated): sys.c:314 UCX WARN could not find address of current library: (null)
Are you using static linking?
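For reference, a minimal sketch of how the UCX_TLS workaround mentioned above can be applied when launching an Intel MPI job. UCX_TLS and UCX_LOG_LEVEL are standard UCX environment variables; the application name and rank count are placeholders, not taken from this report.

  # Restrict UCX to the RC, UD, shared-memory and self transports (the reported workaround).
  export UCX_TLS=rc,ud,sm,self
  # Optionally raise the UCX log level to see which transports are actually selected.
  export UCX_LOG_LEVEL=info
  # Placeholder launch line; use a rank count above the ~160-core threshold to reproduce.
  mpirun -n 192 ./my_app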
Hi
We are transitioning to the Red Hat-provided UCX library and have noticed that it crashes with Intel MPI once the core count exceeds a certain threshold (above roughly 160 cores). We are seeing the following error:
[HOST:1151067:0:1151067] ud_ep.c:888 Assertion `ctl->type == UCT_UD_PACKET_CREP' failed
==== backtrace (tid:1151067) ====
0 /usr/lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7fba21010edc]
1 /usr/lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7fba2100dd41]
2 /usr/lib64/libucs.so.0(ucs_fatal_error_format+0x10f) [0x7fba2100de5f]
3 /usr/lib64/ucx/libuct_ib.so.0(+0x5b890) [0x7fba1f052890]
4 /usr/lib64/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x316) [0x7fba1f052d96]
5 /usr/lib64/ucx/libuct_ib.so.0(+0x6470d) [0x7fba1f05b70d]
6 /usr/lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7fba214bdada]
7 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0xa7a1) [0x7fba217457a1]
8 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22b0d) [0x7fba2175db0d]
9 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22a97) [0x7fba2175da97]
10 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x62b3fe) [0x7fbb704043fe]
11 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1fa7a1) [0x7fbb6ffd37a1]
12 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x78eb7e) [0x7fbb70567b7e]
13 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x371f43) [0x7fbb7014af43]
14 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x263faa) [0x7fbb7003cfaa]
15 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x16a7a2) [0x7fbb6ff437a2]
16 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x19e9cd) [0x7fbb6ff779cd]
17 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1ba1c7) [0x7fbb6ff931c7]
18 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x185bd4) [0x7fbb6ff5ebd4]
19 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x165780) [0x7fbb6ff3e780]
20 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x2674fd) [0x7fbb700404fd]
21 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(MPI_Bcast+0x51f) [0x7fbb6ff26a8f]
We are using the native RHEL8 UCX:
Version 1.13.0
Git branch '', revision 6765970
Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --without-cm --without-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni
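To confirm which UCX build and transports are actually available at run time, the ucx_info utility shipped with UCX can be queried. This is only a sketch; it assumes the ucx_info binary from the same /usr installation is on the PATH.

  # Print the library version, git revision and configure flags (should match the output above).
  ucx_info -v
  # List the detected devices and the transports UCX can use on them.
  ucx_info -d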
Our infiniband device info is:
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.24.1000
Based on some of the issues I've read, I suspect this is related to RC vs. DC communication at higher core counts, and I would appreciate some discussion on how to fix it.
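As a sketch of how one might test that hypothesis: DC is only offered on mlx5-class HCAs (such as the mlx5_0 device above), so first check that a DC transport is reported, then restrict UCX_TLS to it. The transport names below are standard UCX aliases; whether DC actually resolves the crash at >160 cores is exactly the open question here.

  # Check which transports (e.g. dc_mlx5) are available on this node.
  ucx_info -d | grep -i 'Transport:'
  # If DC is listed, try the dynamically-connected transport instead of RC for the large runs.
  export UCX_TLS=dc,ud,sm,self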