Use `FI_MR_CACHE_MONITOR=kdreg2` for all nersc machines #6687

ndkeen · 2024-10-15T17:22:13Z

With the new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, some cases hung in init at higher node counts. Setting the environment variable `FI_MR_CACHE_MONITOR=kdreg2` has avoided those hangs so far.
kdreg2 is another option for memory cache monitoring; it is a Linux kernel module released under an open-source license.
It ships with the HPE Slingshot host software distribution (as an optional install) and may one day be the default.
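
For anyone who wants to try the setting outside the machine files (for example, to reproduce a hang with and without it), a minimal sketch follows. The helper name `env_with_kdreg2` and the commented-out launch command are hypothetical and not part of this PR; the only thing taken from the PR is the variable name and value.

```python
import os


def env_with_kdreg2(base_env=None):
    """Return a copy of an environment with FI_MR_CACHE_MONITOR=kdreg2.

    Hypothetical helper for illustration only; the PR itself applies the
    setting through the NERSC machine configuration.
    """
    # Copy so the caller's mapping (or os.environ) is not modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["FI_MR_CACHE_MONITOR"] = "kdreg2"
    return env


if __name__ == "__main__":
    env = env_with_kdreg2()
    print("FI_MR_CACHE_MONITOR =", env["FI_MR_CACHE_MONITOR"])
    # A launcher could then be started with this environment, e.g.
    # subprocess.run([...], env=env, check=True)  # command is an assumption
```

Returning a copy rather than mutating `os.environ` keeps the change scoped to the process that is launched with it, which makes side-by-side comparisons easier.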

Regarding performance, it seems about the same. For one HR F-case at 256 nodes, using kdreg2 was about 1% slower.

Fixes #6655

I also found some older issues (some with lower node-count) that this fixes:
Fixes #6516
Fixes #6451
Fixes #6521

[bfb]

github-actions · 2024-10-15T17:24:21Z

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6687/
on branch `gh-pages` at 2024-10-15 17:24 UTC

ndkeen · 2024-10-15T21:20:44Z

It now looks like this will also fix #6516, which would be great because it would let me update the GCC compiler version.

Confirmed: with kdreg2, the tests that were hanging with newer GCC now run. I verified that I can update GCC to the latest version, and all testing seems OK so far. I would like to see this PR go in first, then follow with a PR to update module versions.

ndkeen · 2024-10-17T04:17:20Z

Fixes #6451
Fixes #6521


ndkeen · 2024-10-19T22:42:14Z

Merged to next

…' into next (PR #6702)

On pm-cpu we were using updated Intel compiler and other module versions that were compatible, but had not yet updated the others due to #6516. After #6687, I think they are resolved and we can now update. The main change here is updating the gcc compiler, but other module versions are also updated at the same time. Also, try to clean up the machine config settings across all NERSC machines to be more consistent.

| module | current | machine defaults (in this PR) |
| --- | --- | --- |
| gcc | 12.2.0 | 12.3 (gcc-native) |
| PrgEnv-gnu | 8.3.3 | 8.5.0 |
| cray-libsci | 23.02.1.1 | 23.12.5 |
| cray-mpich | 8.1.25 | 8.1.28 |
| craype | 2.7.20 | 2.7.30 |
| cray-hdf5-parallel | 1.12.2.3 | 1.12.2.9 |
| cray-netcdf-hdf5parallel | 4.9.0.3 | 4.9.0.9 |
| cray-parallel-netcdf | 1.12.3.3 | 1.12.3.9 |

On muller-cpu, already tried updating the compiler versions and was testing a work-around with special compiler flags for GNU. With this PR, can remove that work-around.

Removing FI_CXI_RX_MATCH_MODE=software for machines other than primary pm-cpu. Will remove this in another PR as the default FI_CXI_RX_MATCH_MODE=hybrid now seems fine.

We might see some cases not BFB using GNU compiler. For the e3sm-developer tests (what we test nightly), the only test that did not pass baseline compare was ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu

use FI_MR_CACHE_MONITOR=kdreg2 for all nersc machines

e1dff58

ndkeen self-assigned this Oct 15, 2024

ndkeen added the Machine Files, BFB (PR leaves answers BFB), pm-gpu (Perlmutter machine at NERSC, GPU nodes), and pm-cpu (Perlmutter at NERSC, CPU-only nodes) labels Oct 15, 2024

ndkeen requested a review from rljacob October 15, 2024 18:11

ndkeen mentioned this pull request Oct 15, 2024

Trying to update GNU compilers on pm-cpu, encounter hang/FPE with certain tests #6516

Closed

ndkeen mentioned this pull request Oct 17, 2024

Hang for test using nvidia compiler only for certain smaller MPI counts ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst #6521

Closed

ambrad approved these changes Oct 19, 2024


rljacob approved these changes Oct 20, 2024


ndkeen merged commit 812c88c into master Oct 20, 2024
9 checks passed

ndkeen deleted the ndk/machinefiles/nersc-use-kdreg2 branch October 20, 2024 21:47

ndkeen mentioned this pull request Oct 21, 2024

For pm-cpu, increase compiler version for gcc,nvidia,amd (and other modules to be consistent across all NERSC machines) #6702

Merged

ndkeen mentioned this pull request Oct 22, 2024

Data partitioning mechanism used in E3SM #6701

Closed
