Use FI_MR_CACHE_MONITOR=kdreg2 for all NERSC machines #6687
Merged
ndkeen added the Machine Files, BFB (PR leaves answers BFB), pm-gpu (Perlmutter machine at NERSC (GPU nodes)), and pm-cpu (Perlmutter at NERSC (CPU-only nodes)) labels on Oct 15, 2024
It now looks like this will also fix #6516, which would be great because it would allow me to update the GCC compiler version. Confirmed that using |
ambrad approved these changes on Oct 19, 2024
ndkeen added a commit that referenced this pull request on Oct 19, 2024
With the new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, there were some hangs in init for certain cases at higher node counts. Setting the environment variable FI_MR_CACHE_MONITOR=kdreg2 has avoided these issues so far. kdreg2 is another option for memory cache monitoring: it is a Linux kernel module with open-source licensing. It comes with the HPE Slingshot host software distribution (optionally installed) and may one day be the default.

Regarding performance, it seems about the same. For one HR F-case at 256 nodes, using kdreg2 was about 1% slower.

Fixes #6655

I also found some older issues (some with lower node counts) that this fixes:
Fixes #6516
Fixes #6451
Fixes #6521

[bfb]
Merged to next
rljacob approved these changes on Oct 20, 2024
ndkeen added a commit that referenced this pull request on Oct 21, 2024
…' into next (PR #6702)

On pm-cpu we were using an updated Intel compiler and other compatible module versions, but had not yet updated the others due to #6516. After #6687, I think those issues are resolved and we can now update. The main change here is updating the gcc compiler, but other module versions are updated at the same time. This also tries to clean up the machine config settings across all NERSC machines to be more consistent.

| module | current machine defaults | (in this PR) |
|---|---|---|
| gcc | 12.2.0 | 12.3 (gcc-native) |
| PrgEnv-gnu | 8.3.3 | 8.5.0 |
| cray-libsci | 23.02.1.1 | 23.12.5 |
| cray-mpich | 8.1.25 | 8.1.28 |
| craype | 2.7.20 | 2.7.30 |
| cray-hdf5-parallel | 1.12.2.3 | 1.12.2.9 |
| cray-netcdf-hdf5parallel | 4.9.0.3 | 4.9.0.9 |
| cray-parallel-netcdf | 1.12.3.3 | 1.12.3.9 |

On muller-cpu, I had already tried updating the compiler versions and was testing a work-around with special compiler flags for GNU. With this PR, that work-around can be removed.

Removing FI_CXI_RX_MATCH_MODE=software for machines other than primary pm-cpu. Will remove this in another PR as the default FI_CXI_RX_MATCH_MODE=hybrid now seems fine.

We might see some cases that are not BFB using the GNU compiler. For the e3sm-developer tests (what we test nightly), the only test that did not pass baseline compare was ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu
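As a rough illustration of the match-mode change described in this commit message, the per-machine environment could be modeled as below. This is only a sketch, not the actual E3SM machine configuration: the function name `machine_env` and the dictionary layout are hypothetical, and only the variable names, values, and machine names come from the text above.

```python
def machine_env(machine, primary="pm-cpu"):
    """Return the Slingshot-related environment overrides for a machine.

    Hypothetical helper for illustration only; the real settings live in
    the E3SM machine files.
    """
    # From #6687: use the kdreg2 memory-cache monitor on all NERSC machines.
    env = {"FI_MR_CACHE_MONITOR": "kdreg2"}
    # Only the primary pm-cpu machine keeps the explicit software match mode.
    # Other machines leave FI_CXI_RX_MATCH_MODE unset, so the default
    # (hybrid, per the commit message) applies.
    if machine == primary:
        env["FI_CXI_RX_MATCH_MODE"] = "software"
    return env


if __name__ == "__main__":
    for name in ("pm-cpu", "pm-gpu", "muller-cpu"):
        print(name, machine_env(name))
```

Leaving the variable unset, rather than setting it to `hybrid` explicitly, mirrors the stated intent of relying on the default.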
With the new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, there were some hangs in init for certain cases at higher node counts. Setting the environment variable FI_MR_CACHE_MONITOR=kdreg2 has avoided these issues so far. kdreg2 is another option for memory cache monitoring: it is a Linux kernel module with open-source licensing. It comes with the HPE Slingshot host software distribution (optionally installed) and may one day be the default.

Regarding performance, it seems about the same. For one HR F-case at 256 nodes, using kdreg2 was about 1% slower.

Fixes #6655

I also found some older issues (some with lower node counts) that this fixes:
Fixes #6516
Fixes #6451
Fixes #6521

[bfb]
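For anyone wanting to try the same setting outside the machine files, a minimal sketch is to set the variable in the environment of the process that launches the job. The helper name `with_kdreg2_monitor` is hypothetical; the variable name and value are the ones from this PR. Whether kdreg2 is actually available depends on the kernel module being installed on the system, which this sketch does not check.

```python
import os


def with_kdreg2_monitor(base_env=None):
    """Return a copy of an environment with the kdreg2 monitor selected.

    Hypothetical helper for illustration; it does not verify that the
    kdreg2 kernel module is installed.
    """
    # Copy so the caller's environment (or os.environ) is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["FI_MR_CACHE_MONITOR"] = "kdreg2"
    return env


if __name__ == "__main__":
    env = with_kdreg2_monitor({"PATH": "/usr/bin"})
    print(env["FI_MR_CACHE_MONITOR"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the job launcher.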