
UCX Crashing with > 160 cores when using Intel MPI #9071

Open
ebolandrtx opened this issue May 11, 2023 · 5 comments

ebolandrtx commented May 11, 2023

Hi

We are transitioning to the Red Hat-provided UCX library and have noticed that it crashes under Intel MPI once the core count exceeds a certain threshold (above 160 cores). We are seeing the following error:

[HOST:1151067:0:1151067] ud_ep.c:888 Assertion `ctl->type == UCT_UD_PACKET_CREP' failed
==== backtrace (tid:1151067) ====
0 /usr/lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7fba21010edc]
1 /usr/lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7fba2100dd41]
2 /usr/lib64/libucs.so.0(ucs_fatal_error_format+0x10f) [0x7fba2100de5f]
3 /usr/lib64/ucx/libuct_ib.so.0(+0x5b890) [0x7fba1f052890]
4 /usr/lib64/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x316) [0x7fba1f052d96]
5 /usr/lib64/ucx/libuct_ib.so.0(+0x6470d) [0x7fba1f05b70d]
6 /usr/lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7fba214bdada]
7 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0xa7a1) [0x7fba217457a1]
8 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22b0d) [0x7fba2175db0d]
9 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22a97) [0x7fba2175da97]
10 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x62b3fe) [0x7fbb704043fe]
11 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1fa7a1) [0x7fbb6ffd37a1]
12 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x78eb7e) [0x7fbb70567b7e]
13 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x371f43) [0x7fbb7014af43]
14 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x263faa) [0x7fbb7003cfaa]
15 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x16a7a2) [0x7fbb6ff437a2]
16 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x19e9cd) [0x7fbb6ff779cd]
17 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1ba1c7) [0x7fbb6ff931c7]
18 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x185bd4) [0x7fbb6ff5ebd4]
19 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x165780) [0x7fbb6ff3e780]
20 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x2674fd) [0x7fbb700404fd]
21 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(MPI_Bcast+0x51f) [0x7fbb6ff26a8f]

We are using the native RHEL8 UCX:

Version 1.13.0
Git branch '', revision 6765970
Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --without-cm --without-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni

Our InfiniBand device info is:

hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.24.1000

I suspect this is related to rc vs. dc transport selection at higher core counts, based on some of the issues I've read, and would appreciate some discussion of how to fix it.
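
For reference, a sketch of how I can gather more detail on what UCX is picking up (the rank count and ./app binary are placeholders for our actual job):

# Show the version and configure line of the UCX library being used
ucx_info -v

# List the devices and transports UCX can use on this node (rc, ud, dc, ...)
ucx_info -d | grep -iE 'transport|device'

# Raise the UCX log level to see which transports get selected at runtime
mpirun -genv UCX_LOG_LEVEL info -np 192 ./app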

yosefe (Contributor) commented May 14, 2023

Seems similar to #8620

yosefe (Contributor) commented May 14, 2023

@ebolandrtx can you please check if UCX_TLS=self,sm,ud_v or UCX_TLS=self,sm,dc makes the issue go away?
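
For example (a sketch; the rank count and ./app binary are placeholders, and -genv is just Intel MPI's way of exporting a variable to all ranks):

# Verbs-based UD only
mpirun -genv UCX_TLS self,sm,ud_v -np 192 ./app

# DC only
mpirun -genv UCX_TLS self,sm,dc -np 192 ./app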

ebolandrtx (Author) commented May 15, 2023

Hi, I ended up setting UCX_TLS=rc,ud,sm,self and that worked. Is there a difference between ud_v and ud that I should be aware of, and should I be using dc instead of rc?

ebolandrtx (Author) commented:

Seeing two new errors now (associated):

sys.c:314 UCX WARN could not find address of current library: (null)
module.c:68 UCX ERROR dladdr failed: (null)

yosefe (Contributor) commented May 16, 2023

> Hi, I ended up setting UCX_TLS=rc,ud,sm,self and that worked. Is there a difference between ud_v and ud that I should be aware of, and should I be using dc instead of rc?

UCX_TLS=rc,ud,sm,self will use RC transport. However, it's recommended to use DC with UCX_TLS=dc,self,sm.
BTW, UCX would prefer using DC when possible, but AFAIR Intel MPI is setting UCX_TLS to use ud.
"ud_v" is a different (and slightly less performant) implementation, if it works it can help narrow down the issue.

> Seeing two new errors now (associated):
>
> sys.c:314 UCX WARN could not find address of current library: (null)
> module.c:68 UCX ERROR dladdr failed: (null)

Are you using static linking?
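
A quick way to check, assuming ./app is the application binary:

# A dynamically linked binary lists its shared libraries here (look for libucp/libucs/libmpi);
# a fully static binary prints "not a dynamic executable" instead
ldd ./app

# "statically linked" in this output would also confirm a static build
file ./app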
