Running mini apps with 1k ranks and OpenMPI causes seg fault #883

Open
tenbrugg opened this issue Jun 27, 2016 · 3 comments

This issue is intended to track the OpenMPI seg fault problem discussed last week. When running SNAP with OpenMPI on KNL using 1024 ranks, the application seg faults during initialization. The problem does not occur when running with MPICH instead.

srun -n 1024 -N 16 --cpu_bind=none --hint=nomultithread --exclusive ../../../SNAP/src/gsnap 1024tasksSTlibfab.input

Stock nightly libfabric and OpenMPI libraries are used from Sung's install directory. More details can be supplied if desired.
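
If more detail is useful, one thing that could be captured on a compute node is which providers the nightly libfabric actually reports, for example with the fi_info utility that ships with libfabric (the path below is a placeholder for the install directory, not the actual location):

/path/to/libfabric-install/bin/fi_info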

Core was generated by `/cray/css/u19/c17581/snap/nersc/SNAPJune13/small/../../../SNAP/src/gsnap 1024ta'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20 <ompi_mtl_ofi>,
    comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, src=119, tag=-12, convertor=0x80f710,
    mtl_request=0x80f820) at mtl_ofi.h:537
537         remote_addr = endpoint->peer_fiaddr;

(gdb) where
#0  0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20 <ompi_mtl_ofi>,
    comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, src=119, tag=-12, convertor=0x80f710,
    mtl_request=0x80f820) at mtl_ofi.h:537
#1  0x00007ffff6774cb1 in mca_pml_cm_irecv (addr=0x930530, count=1,
    datatype=0x7ffff6a45040 <ompi_mpi_int>, src=119, tag=-12,
    comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, request=0x7fffffff6b50) at pml_cm.h:119
#2  0x00007ffff66635a2 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fffffff6d24,
    rbuf=0x7fffffff6d28, count=1, dtype=0x7ffff6a45040 <ompi_mpi_int>,
    op=0x7ffff6a64720 <ompi_mpi_op_max>, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    module=0x818740) at base/coll_base_allreduce.c:221
#3  0x00007ffff666d03a in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7fffffff6d24,
    rbuf=0x7fffffff6d28, count=1, dtype=0x7ffff6a45040 <ompi_mpi_int>,
    op=0x7ffff6a64720 <ompi_mpi_op_max>, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    module=0x818740) at coll_tuned_decision_fixed.c:66
#4  0x00007ffff65a6f62 in ompi_comm_allreduce_intra (inbuf=0x7fffffff6d24, outbuf=0x7fffffff6d28,
    count=1, op=0x7ffff6a64720 <ompi_mpi_op_max>, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    bridgecomm=0x0, local_leader=0x0, remote_leader=0x0, send_first=-1,
    tag=0x7ffff6798b5e "nextcid", iter=0) at communicator/comm_cid.c:878
#5  0x00007ffff65a5963 in ompi_comm_nextcid (newcomm=0x932490,
    comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, bridgecomm=0x0, local_leader=0x0,
    remote_leader=0x0, mode=32, send_first=-1) at communicator/comm_cid.c:221
#6  0x00007ffff65a2875 in ompi_comm_dup_with_info (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    info=0x0, newcomm=0x7fffffff6e48) at communicator/comm.c:1037
#7  0x00007ffff65a2760 in ompi_comm_dup (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    newcomm=0x7fffffff6e48) at communicator/comm.c:998
#8  0x00007ffff65f031c in PMPI_Comm_dup (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    newcomm=0x7fffffff6e48) at pcomm_dup.c:63
#9  0x00007ffff6ab86e0 in ompi_comm_dup_f (comm=0x43a858,
    newcomm=0x63f46c <__plib_module_MOD_comm_snap>, ierr=0x7fffffff6e78) at pcomm_dup_f.c:76
#10 0x0000000000404410 in __plib_module_MOD_pinit ()
#11 0x00000000004023bc in MAIN__ ()
#12 0x00000000004021fd in main ()

tenbrugg commented Aug 3, 2016

Changed the title to reflect that this is a general problem for mini apps at high rank count, not just limited to SNAP.

@tenbrugg tenbrugg changed the title Running SNAP with 1k ranks and OpenMPI causes seg fault Running mini apps with 1k ranks and OpenMPI causes seg fault Aug 3, 2016
@hppritcha hppritcha self-assigned this Aug 8, 2016
hppritcha (Member) commented

@sungeunchoi what configure options are you using to build Open MPI for these mini-app builds?

sungeunchoi commented

@hppritcha Other than --prefix and --with-libfabric, I use --enable-mpi-thread-multiple --disable-dlopen --with-verbs=no.
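
For reference, that corresponds to a configure invocation roughly like the following (install paths are placeholders, not the actual directories):

./configure --prefix=/path/to/openmpi-install \
            --with-libfabric=/path/to/libfabric-install \
            --enable-mpi-thread-multiple --disable-dlopen --with-verbs=no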
