I got the following error message using OpenMPI 4.1.1 with UCX 1.10, and also using OpenMPI 4.0.3 with UCX 1.10:
[n07299:125295:0:125295] ib_mlx5_log.c:145 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[n07299:125295:0:125295] ib_mlx5_log.c:145 DCI QP 0x13014 wqe[399]: NOP --- [rqpn 0x0 rlid 0]
I've searched and seen similar issues reported on openucx. As a first try, I installed an older version of UCX, i.e., UCX 1.9. However, when I build OpenMPI 4.1.1 or OpenMPI 4.0.3 with UCX 1.9, I get the following message:
Making all in mca/common/ucx
make[2]: Entering directory `/packages/openmpi-4.0.3_ucx1_9/openmpi-4.0.3/opal/mca/common/ucx'
CC libmca_common_ucx_la-common_ucx.lo
CCLD libmca_common_ucx.la
ld: cannot find -lnuma
make[2]: *** [libmca_common_ucx.la] Error 1
make[2]: Leaving directory `/packages/openmpi-4.0.3_ucx1_9/openmpi-4.0.3/opal/mca/common/ucx'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/packages/openmpi-4.0.3_ucx1_9/openmpi-4.0.3/opal'
make: *** [all-recursive] Error 1
I'm sure I have the NUMA library (libnuma.so.1) installed; it is located in /usr/lib64, and I added this path to LD_LIBRARY_PATH and LIBRARY_PATH. I have no idea why the linker cannot find it! The commands I used to install OpenMPI are:
ucx_libs()
{
    builddir=/es01/software/ucx19
    libs=""
    for x in s m t p
    do
        libs="${libs}:${builddir}/lib/libuc${x}.so"
    done
    echo ${libs}
}
export LD_PRELOAD=$(ucx_libs ${builddir})
echo $LD_PRELOAD
tar zxvf openmpi-4.0.3.tar.gz
cd openmpi-4.0.3
./configure --prefix=/user/local/openmpi_4_0_3_intel2018_ucx_1_9 CC=icc CXX=icpc FC=ifort --with-ucx=/es01/software/ucx19
make
make install
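One thing that may be worth checking for the `ld: cannot find -lnuma` failure: at link time, `-lnuma` is resolved against an unversioned `libnuma.so` (or a static `libnuma.a`), not against the versioned runtime file `libnuma.so.1`, and LD_LIBRARY_PATH is consulted by the runtime loader rather than for `-l` lookup. The sketch below is only an illustration under that assumption; `has_link_lib` is a hypothetical helper name, and the directories passed to it are guesses for a typical 64-bit Linux layout.

```shell
# Hypothetical helper: succeed if lib<name>.so or lib<name>.a exists in one of
# the given directories, i.e. if `-l<name>` could plausibly be resolved there.
# Prints the first directory that matches.
has_link_lib() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -e "${dir}/lib${name}.so" ] || [ -e "${dir}/lib${name}.a" ]; then
            echo "${dir}"
            return 0
        fi
    done
    return 1
}

# Assumed search directories; adjust to the actual system.
if dir=$(has_link_lib numa /usr/lib64 /usr/lib); then
    echo "link-time libnuma found in ${dir}"
else
    echo "no unversioned libnuma.so or libnuma.a found; libnuma.so.1 alone does not satisfy -lnuma"
fi
```

If the check fails, the unversioned symlink is usually provided by the distribution's NUMA development package (for example numactl-devel on RHEL-like systems or libnuma-dev on Debian-like systems).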
Yes, I want to solve issue 1, but I ran into another issue. Both of them confuse me!
Below is some information about my platform:
@yosefe Sorry for the delay. Issue #1 was solved by following your suggestion. After the maintenance of our platform, however, I no longer encounter issue #2. I'm not sure whether this has something to do with the OpenMPI mpirun parameters: in the new run I added "-x LD_LIBRARY_PATH -x UCX_NET_DEVICES=mlx5_0:1 -mca pml ucx -mca btl ^vader,tcp,openib,uct" after mpirun, while in the old run I used the defaults, i.e., I added no parameters myself.
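For reference, the new run described in the comment above could be written out roughly as follows. This is a sketch only: the flags are the ones quoted in the comment, while the helper name `mpirun_cmd`, the rank count, and the application path `./my_mpi_app` are hypothetical placeholders. The function prints the command instead of launching it, so it can be inspected first.

```shell
# Hypothetical helper: print the mpirun command line for the "new run".
# $1 = number of ranks (placeholder), $2 = application path (placeholder).
mpirun_cmd() {
    np=$1
    app=$2
    echo mpirun -np "${np}" \
        -x LD_LIBRARY_PATH \
        -x UCX_NET_DEVICES=mlx5_0:1 \
        -mca pml ucx \
        -mca btl '^vader,tcp,openib,uct' \
        "${app}"
}

# Print the command; run the printed line (or drop the echo) to launch.
mpirun_cmd 4 ./my_mpi_app
```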