
UCX Crashing with > 160 cores when using Intel MPI #9071

Open
ebolandrtx opened this issue May 11, 2023 · 5 comments

ebolandrtx commented May 11, 2023

Hi

We are transitioning to the Red Hat-provided UCX library and have noticed that it crashes under Intel MPI once the core count exceeds a certain threshold (above 160 cores). We are seeing the following error:

[HOST:1151067:0:1151067] ud_ep.c:888 Assertion `ctl->type == UCT_UD_PACKET_CREP' failed
==== backtrace (tid:1151067) ====
0 /usr/lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7fba21010edc]
1 /usr/lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7fba2100dd41]
2 /usr/lib64/libucs.so.0(ucs_fatal_error_format+0x10f) [0x7fba2100de5f]
3 /usr/lib64/ucx/libuct_ib.so.0(+0x5b890) [0x7fba1f052890]
4 /usr/lib64/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x316) [0x7fba1f052d96]
5 /usr/lib64/ucx/libuct_ib.so.0(+0x6470d) [0x7fba1f05b70d]
6 /usr/lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7fba214bdada]
7 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0xa7a1) [0x7fba217457a1]
8 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22b0d) [0x7fba2175db0d]
9 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22a97) [0x7fba2175da97]
10 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x62b3fe) [0x7fbb704043fe]
11 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1fa7a1) [0x7fbb6ffd37a1]
12 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x78eb7e) [0x7fbb70567b7e]
13 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x371f43) [0x7fbb7014af43]
14 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x263faa) [0x7fbb7003cfaa]
15 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x16a7a2) [0x7fbb6ff437a2]
16 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x19e9cd) [0x7fbb6ff779cd]
17 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1ba1c7) [0x7fbb6ff931c7]
18 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x185bd4) [0x7fbb6ff5ebd4]
19 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x165780) [0x7fbb6ff3e780]
20 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x2674fd) [0x7fbb700404fd]
21 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(MPI_Bcast+0x51f) [0x7fbb6ff26a8f]

We are using the native RHEL8 UCX:

Version 1.13.0
Git branch '', revision 6765970
Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --without-cm --without-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni

Our InfiniBand device info is:

hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.24.1000

I suspect this is related to rc vs. dc transport selection at higher core counts, based on some of the issues I've read, and would appreciate some discussion of how to fix it.
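
For reference, a sketch of how I can gather more detail on what UCX is picking up (the rank count and ./app binary are placeholders for our actual job):

# Show the version and configure line of the UCX library being used
ucx_info -v

# List the devices and transports UCX can use on this node (rc, ud, dc, ...)
ucx_info -d | grep -iE 'transport|device'

# Raise the UCX log level to see which transports get selected at runtime
mpirun -genv UCX_LOG_LEVEL info -np 192 ./app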

yosefe (Contributor) commented May 14, 2023

Seems similar to #8620

yosefe (Contributor) commented May 14, 2023

@ebolandrtx can you please check if UCX_TLS=self,sm,ud_v or UCX_TLS=self,sm,dc makes the issue go away?
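
For example (a sketch; the rank count and ./app binary are placeholders, and -genv is just Intel MPI's way of exporting a variable to all ranks):

# Verbs-based UD only
mpirun -genv UCX_TLS self,sm,ud_v -np 192 ./app

# DC only
mpirun -genv UCX_TLS self,sm,dc -np 192 ./app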

ebolandrtx (Author) commented May 15, 2023

Hi, I ended up setting UCX_TLS=rc,ud,sm,self and that worked. Is there a difference between ud_v and ud that I should be aware of, and should I be using dc instead of rc?

ebolandrtx (Author) commented:

Seeing two new errors now (associated):

sys.c:314 UCX WARN could not find address of current library: (null)
module.c:68 UCX ERROR dladdr failed: (null)

yosefe (Contributor) commented May 16, 2023

> Hi, I ended up setting UCX_TLS=rc,ud,sm,self and that worked. Is there a difference between ud_v and ud that I should be aware of, and should I be using dc instead of rc?

UCX_TLS=rc,ud,sm,self will use RC transport. However, it's recommended to use DC with UCX_TLS=dc,self,sm.
BTW, UCX would prefer using DC when possible, but AFAIR Intel MPI is setting UCX_TLS to use ud.
"ud_v" is a different (and slightly less performant) implementation, if it works it can help narrow down the issue.

> Seeing two new errors now (associated):
>
> sys.c:314 UCX WARN could not find address of current library: (null)
> module.c:68 UCX ERROR dladdr failed: (null)

Are you using static linking?
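
A quick way to check, assuming ./app is the application binary:

# A dynamically linked binary lists its shared libraries here (look for libucp/libucs/libmpi);
# a fully static binary prints "not a dynamic executable" instead
ldd ./app

# "statically linked" in this output would also confirm a static build
file ./app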
