Very slow MPI_Finalize possibly due to GNI datagram resource bottleneck #932
Summary of Howard's initial analysis: I looked at the MPICH ofi netmod code and do see what could be causing this issue. I think this problem should be reproducible using a simple MPI hello world program. Debugging and fixing this will require a large system and the ability to look at and interpret things like dmesg output from kgni. What may be happening is that the GNI datagram mechanism is being overloaded (not enough …
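As a minimal sketch of the kind of reproducer the analysis suggests (not part of the original report), a plain MPI hello world that only initializes and finalizes should be enough, since the suspected bottleneck is hit during shutdown; the launch scale (8000+ ranks) is what matters, not the program body.

```c
/* Hypothetical reproducer sketch: a bare MPI hello world.
 * The interesting behavior, per the analysis above, is expected
 * inside MPI_Finalize when run at large rank counts. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("hello from %d ranks\n", size);

    /* Suspected GNI datagram overload would surface here. */
    MPI_Finalize();
    return 0;
}
```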
A couple more data points:
At times I have observed very slow, or possibly hung, job termination on avalon with 8000 or more ranks. This is with MPICH, since OMPI frequently segfaults at this scale (#883) and I typically don't use it. To give a sense of the timing: sometime last week I noticed that the SNAP computation took ~5 min at this scale, while successful termination in that case took ~20 min.
Yesterday, Aug 16, SNAP failed during setup due to user error and then did not terminate within the 30 min job time limit. A file will be attached which shows the relevant script output, analysis, and partial traces. When I switched to OpenMPI, the app behaved as expected, failing and terminating quickly. This could be an MPICH issue or a lower-level issue triggered by MPICH's use of libfabric-GNI; Howard's analysis in the next comment points toward the latter.
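A small sketch (not from the report) of how one might put a number on the shutdown time: each rank times MPI_Finalize with a local clock, since no MPI collectives are available once finalization has started. The timer helper and the choice to print only from rank 0 are illustrative.

```c
/* Hypothetical timing harness for the slow-MPI_Finalize symptom. */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

/* Local wall clock; usable before and after MPI_Finalize. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Line all ranks up so the measured interval is mostly MPI_Finalize. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = now_sec();

    MPI_Finalize();

    /* Only local work is permitted after finalize, so each rank reports
     * its own shutdown time; rank 0 alone prints to keep output small. */
    if (rank == 0)
        printf("rank 0: MPI_Finalize took %.1f s\n", now_sec() - t0);
    return 0;
}
```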