Use FI_MR_CACHE_MONITOR=kdreg2 for all nersc machines #6687

Merged
ndkeen merged 1 commit into master from ndk/machinefiles/nersc-use-kdreg2
Oct 20, 2024

@ndkeen ndkeen commented Oct 15, 2024

With the new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, there were some hangs in init for certain cases at higher node counts. Setting the environment variable FI_MR_CACHE_MONITOR=kdreg2 has avoided these hangs so far.
kdreg2 is another option for memory cache monitoring -- it is a Linux kernel module using open-source licensing.
It comes with HPE Slingshot host software distribution (optionally installed) and may one day be the default.
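
As a rough illustration of the setting described above (not the actual machine-file change in this PR), a launcher script could select the kdreg2 monitor by adding the variable to the environment it passes to the job. The helper name `with_kdreg2` and the `launch_cmd` placeholder are assumptions made for this sketch.

```python
import os


def with_kdreg2(base_env=None):
    """Return a copy of an environment with the kdreg2 monitor selected.

    `with_kdreg2` is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Same variable and value this PR sets for the NERSC machines.
    env["FI_MR_CACHE_MONITOR"] = "kdreg2"
    return env


if __name__ == "__main__":
    env = with_kdreg2()
    print("FI_MR_CACHE_MONITOR =", env["FI_MR_CACHE_MONITOR"])
    # A launcher would then pass the environment on, for example
    # subprocess.run(launch_cmd, env=env), where launch_cmd is a
    # placeholder for the site's job launch command.
```

The input mapping is copied rather than modified, so the caller's own environment stays as it was.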

Regarding performance, it seems about the same. For one HR F-case at 256 nodes, using kdreg2 was about 1% slower.

Fixes #6655

I also found some older issues (some with lower node-count) that this fixes:
Fixes #6516
Fixes #6451
Fixes #6521

[bfb]

ndkeen self-assigned this Oct 15, 2024
ndkeen added the labels Machine Files, BFB (PR leaves answers BFB), pm-gpu (Perlmutter machine at NERSC, GPU nodes), and pm-cpu (Perlmutter at NERSC, CPU-only nodes) Oct 15, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6687/
on branch gh-pages at 2024-10-15 17:24 UTC

ndkeen commented Oct 15, 2024

It now looks like this will also fix #6516, which would be great, as it will allow me to update the GCC compiler version.

Confirmed that using kdreg2 is allowing those hanging tests with newer GCC to now work. Verified that I can update GCC to latest version and all testing seems OK so far. Would like to see this PR in first, then follow with PR to update module versions.

ndkeen commented Oct 17, 2024

Fixes #6451
Fixes #6521

ndkeen added a commit that referenced this pull request Oct 19, 2024

ndkeen commented Oct 19, 2024

Merged to next

@ndkeen ndkeen merged commit 812c88c into master Oct 20, 2024
9 checks passed
@ndkeen ndkeen deleted the ndk/machinefiles/nersc-use-kdreg2 branch October 20, 2024 21:47
ndkeen added a commit that referenced this pull request Oct 21, 2024
…' into next (PR #6702)

On pm-cpu we were using updated Intel compiler and other module versions that were compatible, but had not yet updated the others due to #6516. After #6687, I think they are resolved and we can now update.

The main change here is updating gcc compiler, but other module versions are also updated at the same time.
Also, try to clean up the machine config settings across all NERSC machines to be more consistent.

module                      current        machine defaults (in this PR)
gcc                         12.2.0         12.3 (gcc-native)
PrgEnv-gnu                  8.3.3          8.5.0
cray-libsci                 23.02.1.1      23.12.5
cray-mpich                  8.1.25         8.1.28
craype                      2.7.20         2.7.30
cray-hdf5-parallel          1.12.2.3       1.12.2.9
cray-netcdf-hdf5parallel    4.9.0.3        4.9.0.9
cray-parallel-netcdf        1.12.3.3       1.12.3.9

On muller-cpu, already tried updating the compiler versions and was testing a work-around with special compiler flags for GNU. With this PR, can remove that work-around.
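
A small sanity check on the version table above: the sketch below copies the two version columns and confirms that every module moves to a newer version. The dotted-number comparison is an assumption about how these version strings should be ordered, and the "(gcc-native)" note on the gcc entry is kept only as a comment.

```python
# (current, in this PR) pairs copied from the table above.
VERSIONS = {
    "gcc": ("12.2.0", "12.3"),  # new entry is listed as "12.3 (gcc-native)"
    "PrgEnv-gnu": ("8.3.3", "8.5.0"),
    "cray-libsci": ("23.02.1.1", "23.12.5"),
    "cray-mpich": ("8.1.25", "8.1.28"),
    "craype": ("2.7.20", "2.7.30"),
    "cray-hdf5-parallel": ("1.12.2.3", "1.12.2.9"),
    "cray-netcdf-hdf5parallel": ("4.9.0.3", "4.9.0.9"),
    "cray-parallel-netcdf": ("1.12.3.3", "1.12.3.9"),
}


def parse(version):
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def is_upgrade(old, new):
    """True if `new` sorts after `old` when compared field by field."""
    return parse(new) > parse(old)


if __name__ == "__main__":
    for module, (old, new) in VERSIONS.items():
        print(f"{module}: {old} -> {new} upgrade={is_upgrade(old, new)}")
```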

Removing FI_CXI_RX_MATCH_MODE=software for machines other than primary pm-cpu. Will remove it for pm-cpu in another PR, as the default FI_CXI_RX_MATCH_MODE=hybrid now seems fine.
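
To make the intent of that cleanup concrete, here is a minimal sketch, assuming the per-machine settings are held in a plain dictionary (the real settings live in the machine config files, which are not shown here). The function name `drop_software_match_mode` and the sample values are hypothetical; only the machine names and variable names come from this conversation.

```python
def drop_software_match_mode(machine_env, keep_on=("pm-cpu",)):
    """Return a copy of the settings with FI_CXI_RX_MATCH_MODE removed
    for every machine except those named in `keep_on`.

    With the explicit override gone, the machine falls back to whatever
    default the system provides.
    """
    cleaned = {}
    for machine, env in machine_env.items():
        env = dict(env)  # copy so the input is left untouched
        if machine not in keep_on:
            env.pop("FI_CXI_RX_MATCH_MODE", None)
        cleaned[machine] = env
    return cleaned


if __name__ == "__main__":
    # Illustrative input only, not the actual machine files.
    settings = {
        "pm-cpu": {"FI_CXI_RX_MATCH_MODE": "software"},
        "pm-gpu": {"FI_CXI_RX_MATCH_MODE": "software"},
        "muller-cpu": {"FI_CXI_RX_MATCH_MODE": "software"},
    }
    for machine, env in drop_software_match_mode(settings).items():
        print(machine, env)
```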

We might see some cases not BFB using GNU compiler. For the e3sm-developer tests (what we test nightly), the only test that did not pass baseline compare was ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu