
Very slow MPI_Finalize possibly due to GNI datagram resource bottleneck #932

Open
tenbrugg opened this issue Aug 17, 2016 · 3 comments

@tenbrugg

At times I have observed very slow, or possibly hung, job termination on avalon with 8000 or more ranks. This is with MPICH, since OMPI frequently segfaults at this scale (#883) and I typically don't use it. To give a sense of the timing: sometime last week I noticed that the SNAP computation took ~5 min at this scale, and successful termination in that case took ~20 min.

Yesterday, Aug 16, SNAP failed during setup due to user error, and then did not terminate within the 30 min job time limit. A file will be attached which shows relevant script output, analysis, and partial traces. When I switched to OpenMPI, the app behaved as expected, failing and terminating quickly. This could be an MPICH issue or a lower-level issue triggered by MPICH's use of libfabric-GNI. Howard's analysis in the next comment points towards the latter.

@tenbrugg
Author

Summary of Howard's initial analysis:

I looked at the MPICH ofi netmod code and do see what could be causing this issue, namely that all VCs (netmod VCs, not GNI provider VCs) are marked active. That has the effect of requiring an all-to-all short-message pattern within the MPI_Finalize code. I can see how this could be particularly problematic if the app hadn't actually exchanged any other messages with a lot of the other ranks in the job before MPI_Finalize is called. That would result in the GNI provider having to do a lot of connection setups just to exchange this one "mpich shutdown" message.
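(Not from Howard's analysis; this is just a hypothetical sketch, with invented identifiers that do not match the actual MPICH ofi netmod source, of the finalize-time pattern he describes.)

```c
#include <stdio.h>

/* Hypothetical sketch only. The point: if every netmod VC is marked active,
 * MPI_Finalize ends up sending one short "shutdown" message to every other
 * rank, including ranks this process never talked to, and each such send can
 * force the GNI provider to set up a fresh connection via its
 * datagram/session mechanism. */
enum vc_state { VC_STATE_INACTIVE, VC_STATE_ACTIVE };

struct netmod_vc {
    enum vc_state state;
    int peer_rank;
};

/* stub standing in for the netmod's short-message send */
static void send_shutdown_message(struct netmod_vc *vc)
{
    printf("shutdown -> rank %d (may trigger GNI connection setup)\n",
           vc->peer_rank);
}

static void finalize_shutdown_sketch(struct netmod_vc *vcs, int nranks,
                                     int my_rank)
{
    for (int r = 0; r < nranks; r++) {
        if (r == my_rank || vcs[r].state != VC_STATE_ACTIVE)
            continue;
        /* one short message per peer => O(nranks) connection setups */
        send_shutdown_message(&vcs[r]);
    }
}
```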

I think this problem should be reproducible using a simple MPI hello-world program.
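For what it's worth, this is the kind of reproducer I have in mind (my own sketch, untested at scale):

```c
/* Minimal MPI hello-world reproducer: no rank-to-rank traffic before
 * MPI_Finalize, so any finalize-time all-to-all shutdown exchange has to set
 * up all of its GNI connections from scratch. Build with mpicc and run at
 * ~8000 ranks, then look at how long MPI_Finalize takes. */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("hello from %d ranks\n", size);

    time_t t0 = time(NULL);
    MPI_Finalize();
    if (rank == 0)
        printf("MPI_Finalize took about %ld s\n", (long)(time(NULL) - t0));
    return 0;
}
```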

Debugging and fixing this will require a large system and the ability to look at and interpret things like dmesg output from kgni.

What may be happening is that the GNI datagram mechanism is being overloaded (not enough TX entries posted to kgni's session ring), resulting in lots of retries of datagram sends within kgni.
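To make the shape of that bottleneck concrete, here is a toy model (slot and completion counts are assumptions, and this is not kgni code) of why a fixed-size datagram ring plus thousands of simultaneous connection setups at finalize would produce a long stream of retries:

```c
#include <stdio.h>

/* Toy model only: TX_SLOTS and COMPLETIONS_PER_ROUND are made-up numbers,
 * not kgni internals. It just shows how a fixed-size posted-datagram ring
 * turns ~8000 simultaneous connection setups into many retried allocations. */
#define TX_SLOTS              128   /* assumed datagram ring capacity        */
#define COMPLETIONS_PER_ROUND 16    /* assumed completions per progress pass */
#define NEW_CONNS             8000  /* one connection setup per peer rank    */

int main(void)
{
    int pending = NEW_CONNS, in_flight = 0;
    long retries = 0, rounds = 0;

    while (pending > 0 || in_flight > 0) {
        int free_slots = TX_SLOTS - in_flight;
        int posted = pending < free_slots ? pending : free_slots;

        retries += pending - posted;   /* allocations that saw "try again"   */
        in_flight += posted;
        pending -= posted;

        int done = in_flight < COMPLETIONS_PER_ROUND ? in_flight
                                                     : COMPLETIONS_PER_ROUND;
        in_flight -= done;             /* completions free ring slots        */
        rounds++;
    }
    printf("%d setups, %ld rounds, %ld retried allocations\n",
           NEW_CONNS, rounds, retries);
    return 0;
}
```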

@tenbrugg
Author

A couple more data points:

  • Even for apps which run successfully at this scale, MPICH termination takes much longer than OpenMPI termination. For instance, XSBench with OpenMPI runs in ~2 min, whereas XSBench with MPICH takes ~21 min. SNAP + OpenMPI finishes in 10 min; SNAP + MPICH was killed at the 30 min limit, having spent a large portion of that time in termination after printing that SNAP was "Done".
  • Yesterday I ran out of dedicated time before I could run with warnings enabled. However, when I first started looking at SNAP's slow termination recently, I did run with warnings enabled and the console was flooded with messages like the ones below (there is a short note on what this error means after the excerpt). To get the app to finish then (Aug 10 or 11), I had to turn the warnings off again.
1536: libfabric:gni:ep_ctrl:_gnix_cm_nic_send():364<warn> _gnix_dgram_alloc returned Resource temporarily unavailable
1536: libfabric:gni:ep_ctrl:_gnix_cm_nic_send():364<warn> _gnix_dgram_alloc returned R
1280: ily unavailable
1280: libfabric:gni:ep_ctrl:_gnix_cm_nic_send():364<warn> _gnix_dgram_alloc returned Resource temporarily unavailable
1280: libfabric:gni:ep_ctrl:_gnix_cm_nic_send():364<warn> _gnix_dgram_alloc returned Resource temporarily unavailable
1280: libfabric:gni:ep_ctrl:_gnix_cm_nic_send():364<warn> _gnix_dgram_alloc returned Resource temporarily unavailable
1280: libfabric:gni:ep_ctrl:_gnix_cm_nic_send():364<warn> _gnix_dgram_alloc returned Resource temporarily unavailable
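For reference (my addition, not output from the run above): "Resource temporarily unavailable" is libfabric's -FI_EAGAIN. At the API level a caller normally handles it by driving progress and retrying, roughly as sketched below. The warnings here come from inside the GNI provider's connection-management path, so the application can't intervene directly; the sketch just shows what the error code means.

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Illustrative helper, not provider code: retry an fi_send that returns
 * -FI_EAGAIN ("Resource temporarily unavailable") after reading the CQ to
 * drive progress and free up resources. */
static ssize_t send_with_retry(struct fid_ep *ep, struct fid_cq *cq,
                               const void *buf, size_t len,
                               fi_addr_t dest, void *context)
{
    ssize_t rc;

    do {
        rc = fi_send(ep, buf, len, NULL, dest, context);
        if (rc == -FI_EAGAIN) {
            struct fi_cq_entry comp;
            (void) fi_cq_read(cq, &comp, 1);   /* progress, then retry */
        }
    } while (rc == -FI_EAGAIN);

    return rc;
}
```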


tenbrugg changed the title from "Very slow MPI_Finalize due to GNI datagram resource bottleneck" to "Very slow MPI_Finalize possibly due to GNI datagram resource bottleneck" on Aug 18, 2016
hppritcha added this to the future milestone on Sep 15, 2016