Turn off xpmem in OFED 5.8 on Chrysalis #6359

Merged
rljacob merged 1 commit into master from rljacob/chrysalis/no-xpmem on May 1, 2024


rljacob
Member

@rljacob rljacob commented Apr 19, 2024

Add env var to chrysalis to turn off xpmem when using the new OFED 5.8 network drivers.
A bug in xpmem can leave nodes stuck in an unkillable state after a model crash.

[BFB]

@rljacob rljacob requested a review from amametjanov April 19, 2024 00:14
@rljacob rljacob self-assigned this Apr 19, 2024
@rljacob
Member Author

rljacob commented Apr 19, 2024

@amametjanov Can you check that this doesn't slow runs down? It did not on my test with an ne30 production coupled case.


github-actions bot commented Apr 19, 2024

PR Preview Action v1.4.7
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6359/
on branch gh-pages at 2024-04-23 03:07 UTC

Member

@amametjanov amametjanov left a comment


Checked with 2 tests in

create_test e3sm_prod_bench

PFS.ne30pg2_r05_IcoswISC30E3r5.F2010.chrysalis_intel.bench-noio:

2024-04-18 22:53:17: MEMCOMP: Memory usage highwater changed by -0.16%: baseline=6373.210 MB, tolerance=5%, current=6362.930 MB
 ---------------------------------------------------
2024-04-18 22:53:17: TPUTCOMP: Throughput changed by 0.50%: baseline=1.791 sypd, tolerance=5%, current=1.782 sypd

PFS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.bench-noio:

2024-04-18 22:52:00: MEMCOMP: Memory usage highwater changed by -0.45%: baseline=4902.090 MB, tolerance=5%, current=4879.850 MB
 ---------------------------------------------------
2024-04-18 22:52:00: TPUTCOMP: Throughput changed by 0.90%: baseline=1.997 sypd, tolerance=5%, current=1.979 sypd

Less than 1% throughput loss in exchange for less than 1% lower memory use.
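The percentages in these log lines can be re-derived from the baseline and current values. The Python sketch below is illustrative only: the regular expression and the `recompute` helper are hypothetical, and the sign conventions (memory as current minus baseline, throughput as baseline minus current, so a positive throughput figure means a slowdown) are inferred from the numbers quoted above, not taken from the test framework's source.

```python
import re

# Hypothetical pattern for the MEMCOMP/TPUTCOMP lines quoted in this review.
LINE_RE = re.compile(
    r"(?P<kind>MEMCOMP|TPUTCOMP): .* changed by (?P<reported>-?\d+\.\d+)%: "
    r"baseline=(?P<baseline>\d+\.\d+) \S+, tolerance=(?P<tol>\d+)%, "
    r"current=(?P<current>\d+\.\d+)"
)


def recompute(line):
    """Return (kind, reported_pct, recomputed_pct) for one log line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("not a MEMCOMP/TPUTCOMP line: " + line)
    baseline = float(match["baseline"])
    current = float(match["current"])
    if match["kind"] == "TPUTCOMP":
        # Assumed convention: positive means throughput dropped.
        change = (baseline - current) / baseline * 100.0
    else:
        # Assumed convention: positive means memory use grew.
        change = (current - baseline) / baseline * 100.0
    return match["kind"], float(match["reported"]), round(change, 2)


F2010_LINES = [
    "2024-04-18 22:53:17: MEMCOMP: Memory usage highwater changed by -0.16%: "
    "baseline=6373.210 MB, tolerance=5%, current=6362.930 MB",
    "2024-04-18 22:53:17: TPUTCOMP: Throughput changed by 0.50%: "
    "baseline=1.791 sypd, tolerance=5%, current=1.782 sypd",
]

for log_line in F2010_LINES:
    print(recompute(log_line))
```

Under these assumed conventions the recomputed values agree with the reported ones for both tests, which is why the positive throughput figures are read here as small slowdowns.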

Turn off xpmem in OpenMPI

Add env var to turn off xpmem when using OpenMPI.
Avoids leaving nodes in unkillable state.
@rljacob rljacob force-pushed the rljacob/chrysalis/no-xpmem branch from b757987 to 71db923 Compare April 19, 2024 21:36
@rljacob
Member Author

rljacob commented Apr 19, 2024

@amametjanov please try again with this new version that doesn't have the typo.

@amametjanov
Member

PFS.ne30pg2_r05_IcoswISC30E3r5.F2010.chrysalis_intel.bench-noio:

2024-04-19 20:09:14: MEMCOMP: Memory usage highwater changed by -3.54%: baseline=6373.210 MB, tolerance=5%, current=6147.640 MB
 ---------------------------------------------------
2024-04-19 20:09:14: TPUTCOMP: Throughput changed by 0.22%: baseline=1.791 sypd, tolerance=5%, current=1.787 sypd

PFS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.bench-noio:

2024-04-19 20:03:46: MEMCOMP: Memory usage highwater changed by -4.62%: baseline=4902.090 MB, tolerance=5%, current=4675.830 MB
 ---------------------------------------------------
2024-04-19 20:03:46: TPUTCOMP: Throughput changed by 0.60%: baseline=1.997 sypd, tolerance=5%, current=1.985 sypd
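For the rerun, a quick check that no metric regresses beyond the 5% tolerance can be sketched as follows. This assumes a regression means higher memory use or lower throughput; the `regression_pct` and `within_tolerance` helpers and the `RERUN` table are hypothetical names, with values copied from the log lines above.

```python
TOLERANCE_PCT = 5.0  # tolerance value shown in the log lines

# (case, metric, baseline, current) copied from the rerun logs above.
RERUN = [
    ("F2010", "memory", 6373.210, 6147.640),
    ("F2010", "throughput", 1.791, 1.787),
    ("WCYCL1850", "memory", 4902.090, 4675.830),
    ("WCYCL1850", "throughput", 1.997, 1.985),
]


def regression_pct(metric, baseline, current):
    """Percent change in the unfavorable direction (assumed convention)."""
    if metric == "memory":
        # More memory than baseline counts as a regression.
        return (current - baseline) / baseline * 100.0
    if metric == "throughput":
        # Lower throughput than baseline counts as a regression.
        return (baseline - current) / baseline * 100.0
    raise ValueError("unknown metric: " + metric)


def within_tolerance(metric, baseline, current, tolerance=TOLERANCE_PCT):
    """True when the regression does not exceed the tolerance."""
    return regression_pct(metric, baseline, current) <= tolerance


for case, metric, baseline, current in RERUN:
    pct = regression_pct(metric, baseline, current)
    status = "ok" if within_tolerance(metric, baseline, current) else "FAIL"
    print(f"{case:10s} {metric:10s} {pct:+.2f}% {status}")
```

With this reading, the memory figures are improvements (negative regression) and the throughput losses stay well under the tolerance.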

@rljacob
Member Author

rljacob commented Apr 23, 2024

This is now set in the openmpi module by default, so we don't need to add it here.

@rljacob rljacob closed this Apr 23, 2024
@rljacob
Member Author

rljacob commented Apr 23, 2024

It has been removed from the module. It was in place from 2pm to 10pm on April 22.

@rljacob rljacob reopened this Apr 23, 2024
rljacob added a commit that referenced this pull request Apr 30, 2024
Add env var to chrysalis to turn off xpmem when using OpenMPI.
Avoids leaving nodes in unkillable state.
Workaround for bug in xpmem.

[BFB]
@rljacob rljacob changed the title from "Turn off xpmem in OpenMPI on Chrysalis" to "Turn off xpmem in OFED 5.8 on Chrysalis" May 1, 2024
@rljacob
Member Author

rljacob commented May 1, 2024

Revised the title and comment because this variable is needed all the time, not just with OpenMPI.

@rljacob rljacob merged commit 8141b60 into master May 1, 2024
40 checks passed
@rljacob rljacob deleted the rljacob/chrysalis/no-xpmem branch May 1, 2024 16:34
2 participants