SIGSEGV one MPI_Isend between ranks on a single GPU #4756
Please set cuda_copy in UCX_TLS (e.g. UCX_TLS=rc,shm,self,cuda_copy).
Same result after adding cuda_copy. Here is a trace using a debug build of UCX:
#0 0x00002aaaad8e35fd in pause () from /lib64/libpthread.so.0
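(For reference: the release build described later in this report was configured with --disable-debug, --disable-assertions and --disable-logging, so a debug build of UCX roughly inverts those switches. The line below is an illustration of such a configuration, not the one actually used here:)
./configure --enable-debug --enable-assertions --enable-logging --with-cuda=/nasa/cuda/10.2 --prefix=$HOME/ucx-1.7.0-debug
make -j install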
OK, how is the CUDA library linked with your application? Statically or dynamically?
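(For reference, one quick way to tell is to list the binary's dynamic dependencies; if no CUDA runtime library shows up, it is most likely linked statically. The binary name below is simply the a.out built later in this report:)
ldd ./a.out | grep -i cuda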
Ahhh, setting UCX_MEMTYPE_CACHE=n does allow it to run. I just noticed that this issue is mentioned in the "Known issues" section. Many thanks!
@dkokron you're welcome! Let us know if you see any other issues |
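(For reference, the workaround amounts to exporting UCX_MEMTYPE_CACHE=n to the MPI processes, for example via Open MPI's -x option; the launch line below is an illustration, not taken verbatim from the thread:)
mpiexec -x UCX_MEMTYPE_CACHE=n -np 1 ./a.out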
I get the following stack trace from the attached reproducer.
pgi_ucx_reproduce.f.gz
(gdb) where
#0 0x00002aaaae889ae7 in _memcpy_avx_unaligned () from /lib64/libc.so.6
#1 0x00002aaabb8b884e in uct_am_short_fill_data (length=, payload=, header=, buffer=) at /nobackupnfs2/dkokron/play/Mellanox/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse12.4-x86_64/sources/ucx-1.7.0.gnu/src/uct/base/uct_iface.h:725
#2 uct_self_ep_am_short (tl_ep=, id=, header=, payload=, length=288) at sm/self/self.c:259
#3 0x00002aaabb681453 in uct_ep_am_short (length=, payload=0x2aaaf7afa880, header=0, id=2 '\002', ep=) at /nobackupnfs2/dkokron/play/Mellanox/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse12.4-x86_64/sources/ucx-1.7.0.gnu/src/uct/api/uct.h:2406
#4 ucp_tag_send_inline (tag=0, datatype=64, count=36, buffer=0x2aaaf7afa880, ep=0x2aaaca60f000) at tag/tag_send.c:173
#5 ucp_tag_send_nb (ep=0x2aaaca60f000, buffer=0x2aaaf7afa880, count=36, datatype=64, tag=0, cb=0x2aaab9c3f0d0 <mca_pml_ucx_send_completion>) at tag/tag_send.c:208
#6 0x00002aaab9c3d1e1 in mca_pml_ucx_isend (buf=0x2aaaf7afa880, count=, datatype=0x2aaaab6556a0 <ompi_mpi_real8>, dst=, tag=, mode=MCA_PML_BASE_SEND_STANDARD, comm=0x2aaaab665c40 <ompi_mpi_comm_world>, request=0x7fffffffb408) at pml_ucx.c:743
#7 0x00002aaaab3db572 in PMPI_Isend (buf=0xfaef48, count=, type=0x120, dest=-139482720, tag=16445256, comm=, request=0x7fffffffb408) at pisend.c:95
#8 0x00002aaaab155b55 in ompi_isend_f (buf=, count=0x7fffffffb518, datatype=, dest=0x64704c <mpi_stuff_0+12>, tag=0x636a24 <.STATICS5+4>, comm=0x4053a4 <.C283_seam_vvec>, request=0x6469f8 <.BSS5+24>, ierr=0x7fffffffb4fc) at pisend_f.c:82
#9 0x0000000000404f0e in seam_vvec (v=...) at pgi_ucx_reproduce.f:155
#10 0x00000000004034f5 in test () at pgi_ucx_reproduce.f:96
#11 0x0000000000401b66 in main ()
#12 0x00002aaaae77d765 in __libc_start_main () from /lib64/libc.so.6
#13 0x0000000000401a59 in _start () at ../sysdeps/x86_64/start.S:118
I am using hpcx-2.5.0 with the latest ucx-1.7.0 (not the ucx that comes with hpcx-2.5.0).
ucx_info -v
UCT version=1.7.0 revision 9d06c3a
configured with: --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-knem --without-java --prefix=/nobackupnfs2/$USER/play/Mellanox/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse12.4-x86_64/sources/ucx-1.7.0.gnu/install --with-cuda=/nasa/cuda/10.2 --with-mlx5-dv --with-rc --with-ud --with-dc --with-ib-hw-tm --with-dm --with-cm --enable-mt
OpenMPI was configured with
./configure CC=pgcc CXX=pgc++ F77=pgf90 FC=pgf90 --prefix=/nobackupnfs2/$USER/play/Mellanox/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse12.4-x86_64/ompi-pgi --with-hcoll=/nobackupnfs2/$USER/play/Mellanox/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse12.4-x86_64/hcoll --with-ucx=/nobackupnfs2/$USER/play/Mellanox/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse12.4-x86_64/sources/ucx-1.7.0.gnu/install --enable-mca-no-build=btl-uct --with-libevent=internal --enable-mpi1-compatibility --with-pmix=internal --with-tm=/PBS --without-slurm --with-lustre --with-platform=/nobackupnfs2/$USER/play/Mellanox/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse12.4-x86_64/sources/openmpi-gitclone/contrib/platform/mellanox/optimized --with-cuda=/nasa/cuda/10.2
ucx was compiled using gnu-4.8.5, while OpenMPI was compiled with pgi-19.5; both on a system running SLES12sp4 (the Pleiades system at NASA/Ames).
mpif90 -acc -ta=tesla pgi_ucx_reproduce.f
UCX_IB_SUBNET_PREFIX=fec0:0000:0000:0004
UCX_MAX_EAGER_LANES=1
UCX_MAX_RNDV_LANES=1
UCX_TLS=rc,shm,self
OMP_NUM_THREADS=1
OMPI_HOME=/nobackupnfs2/$USER/play/Mellanox/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse12.4-x86_64/ompi-pgi
OMPI_MCA_btl=^openib
OMPI_MCA_coll_fca_enable=0
OMPI_MCA_coll_hcoll_enable=1
OMPI_MCA_coll_hcoll_np=0
OMPI_MCA_pml=ucx
mpiexec -np 1 ./a.out
uname -a
Linux r101i0n14 4.12.14-95.40.1.20191112-nasa #1 SMP Tue Nov 5 10:25:27 UTC 2019 (621d94a) x86_64 x86_64 x86_64 GNU/Linux
ofed_info -s
OFED-internal-4.7-1.0.0:
CUDA Driver Version: 10020
NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.33.01 Wed Nov 13 00:00:22 UTC 2019
Device Number: 0
Device Name: Tesla V100-SXM2-32GB
lsmod|grep nv_peer_mem
nv_peer_mem 16384 0
ib_core 389120 13 ib_cm,rdma_cm,ib_umad,nv_peer_mem,ib_uverbs,ib_ipoib,iw_cm,mlx5_ib,ib_ucm,beegfs,rdma_ucm,ko2iblnd,mlx4_ib
nvidia 19927040 3 nv_peer_mem,nvidia_modeset,nvidia_uvm
GDRCopy is not loaded
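(For reference, GDRCopy's kernel module is gdrdrv, so its presence can be checked the same way as nv_peer_mem above:)
lsmod|grep gdrdrv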
output from ucx_info -d is attached
ucx.dev.txt
output from ibv_devinfo is attached
ibdev.txt