This issue is intended to track the OpenMPI seg fault problem discussed last week. When running SNAP with OpenMPI on KNL using 1024 ranks the application seg faults during initialization. This problem does not occur when running with MPICH instead.
Stock nightly libfabric and OpenMPI libraries are used from Sung's install directory. More details can be supplied if desired.
Core was generated by `/cray/css/u19/c17581/snap/nersc/SNAPJune13/small/../../../SNAP/src/gsnap 1024ta'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20 <ompi_mtl_ofi>,
comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, src=119, tag=-12, convertor=0x80f710,
mtl_request=0x80f820) at mtl_ofi.h:537
537 remote_addr = endpoint->peer_fiaddr;
(gdb) where
#0 0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20 <ompi_mtl_ofi>, comm=
0x7ffff6a59b20 <ompi_mpi_comm_world>, src=119, tag=-12, convertor=0x80f710, mtl_request=0x80f820)
at mtl_ofi.h:537
#1 0x00007ffff6774cb1 in mca_pml_cm_irecv (addr=0x930530, count=1, datatype=
0x7ffff6a45040 <ompi_mpi_int>, src=119, tag=-12, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, request=
0x7fffffff6b50) at pml_cm.h:119
#2 0x00007ffff66635a2 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fffffff6d24, rbuf=
Changed the title to reflect that this is a general problem for mini apps at high rank counts, not limited to SNAP.

tenbrugg changed the title from "Running SNAP with 1k ranks and OpenMPI causes seg fault" to "Running mini apps with 1k ranks and OpenMPI causes seg fault" on Aug 3, 2016.
The command used to launch the job:

srun -n 1024 -N 16 --cpu_bind=none --hint=nomultithread --exclusive ../../../SNAP/src/gsnap 1024tasksSTlibfab.input
(gdb) where
#0 0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20 <ompi_mtl_ofi>, comm=
#1 0x00007ffff6774cb1 in mca_pml_cm_irecv (addr=0x930530, count=1, datatype=
#2 0x00007ffff66635a2 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fffffff6d24, rbuf=
#3 0x00007ffff666d03a in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7fffffff6d24, rbuf=
#4 0x00007ffff65a6f62 in ompi_comm_allreduce_intra (inbuf=0x7fffffff6d24, outbuf=0x7fffffff6d28, count=
#5 0x00007ffff65a5963 in ompi_comm_nextcid (newcomm=0x932490, comm=
#6 0x00007ffff65a2875 in ompi_comm_dup_with_info (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, info=0x0,
#7 0x00007ffff65a2760 in ompi_comm_dup (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, newcomm=
#8 0x00007ffff65f031c in PMPI_Comm_dup (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, newcomm=
#9 0x00007ffff6ab86e0 in ompi_comm_dup_f (comm=0x43a858, newcomm=
#10 0x0000000000404410 in plib_module_MOD_pinit ()
#11 0x00000000004023bc in MAIN ()
#12 0x00000000004021fd in main ()