For NCEP regtests, update to spack-stack 1.6, add option for gnu compiler and add Hercules #1145
Conversation
Thanks @JessicaMeixner-NOAA, I'll start the testing.
Fyi, @JessicaMeixner-NOAA on
I get a crash:
I'll re-run this test to see if it goes away, though given it's SCOTCH on orion I'm a little concerned it could be an undiagnosed issue.
@MatthewMasarik-NOAA, for orion this PR only changed the location of the METIS install, so this PR shouldn't be the source of the issue. If the crash is reproducible, though, it's definitely something we should look into. I didn't see it in my tests of this PR (/work2/noaa/marine/jmeixner/PR_WW3/GNUPR/regtests/matrix13.out), but I'll be interested to see how the re-runs go on your end.
regtests/bin/matrix_cmake_ncep
Outdated
echo 'export KMP_STACKSIZE=2G' >> matrix.head
echo 'export FI_OFI_RXM_BUFFER_SIZE=128000' >> matrix.head
echo 'export FI_OFI_RXM_RX_SIZE=64000' >> matrix.head
@JessicaMeixner-NOAA these are retained for orion, but not present for hercules. Also, their position for orion has been moved above the module loads. Are both of these intentional?
I actually don't see these being used in the ufs-weather-model submit scripts. I'm not sure why they are here, although the ulimit -s unlimited is usually needed.
Those three were originally recommended on WCOSS2 when I was testing with SCOTCH a while ago. They were then also added to orion's job card at a previous time when there were issues updating modules, and the issues seemed to go away. I suspect we may need to include them on WCOSS2 when that update happens. We have had all the exports together after the module loads. I don't know whether ordering matters between modules and exports, but I'd vote to choose one or the other to keep all job cards consistent between platforms.
I find it convenient to have the cd to the regtests directory right after the #SBATCH lines. Could we keep that location?
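One way to keep the ordering identical across platforms, if these settings are retained, is to emit them from a single helper. The following is only a minimal sketch: the function name `append_env_settings` is hypothetical and does not exist in matrix_cmake_ncep, and the values are simply the ones quoted in the diff above.

```shell
# Hypothetical helper (not part of matrix_cmake_ncep): append the shared
# environment settings to a job-card header in one fixed order, so every
# platform's job card places them identically.
append_env_settings() {
  local head_file="$1"   # e.g. matrix.head
  {
    echo 'ulimit -s unlimited'
    echo 'export KMP_STACKSIZE=2G'
    echo 'export FI_OFI_RXM_BUFFER_SIZE=128000'
    echo 'export FI_OFI_RXM_RX_SIZE=64000'
  } >> "$head_file"
}
```

Each platform branch would then call `append_env_settings matrix.head` at the same point, for example right after the module loads, instead of repeating the individual echo lines.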
I'll look into this tomorrow.
@MatthewMasarik-NOAA in my comparisons of the develop branch on orion and this branch, the location of the #SBATCH lines and the "cd regtests" is the same. Is this different behavior for you?
Since the hercules tests are working fine without extra environment variables beyond the ulimit -s, I prefer not to add anything. We could add the environment variable to remove the one warning, but I'm okay leaving things as is. I'll re-run a test on hercules to see if there are any system changes causing errors for me after my successful tests earlier this week/late last week.
@JessicaMeixner-NOAA, to tie up a loose end regarding the "cd regtests": checking the job card for hera now, I see it has neither the extra exports nor those ulimit lines, so the "cd regtests" does come directly after the Slurm directives, as I had in mind. However, for orion, these lines are currently above the "cd regtests", so I was wrong on that. Since nothing changed, no fix is needed.
Thanks @MatthewMasarik-NOAA
I have tested that we can remove the extra orion variables for our regression tests. Orion is a bit unstable right now so we should re-test that when we have a resolution to #1152 and are ready to re-test this PR.
Update
|
@MatthewMasarik-NOAA can you share more information offline about the Hercules errors? The other updates are as I intended, but I'll double-check the hercules environment variables.
During testing, @MatthewMasarik-NOAA discovered that we have an issue with omplace commands. While an error message is seen on hercules+gnu, further inspection shows that this is likely an issue in develop as well. See #1152 for additional details. This will be addressed before merging this PR. Marking the PR as a draft until changes are made as requested by @MatthewMasarik-NOAA and it's ready for re-review and testing.
@JessicaMeixner-NOAA and I did discuss this at the time. I have some resolution on the hercules issues to share with everyone. @JessicaMeixner-NOAA found that the errors I had in hercules/intel were due to not downloading the regtest data from ftp, so that was resolved. The hercules/gnu case was interesting. I ran a second set in which both runs crashed like the first. It seems that invoking the conda setup in my .bashrc was the issue. After disabling it, both runs went through OK.
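A pre-flight check along the following lines could flag this situation before a matrix is submitted. This is only a sketch under an assumption: the thread does not say how the conda setup caused the crashes, so the check merely tests whether a conda environment appears active through the `CONDA_PREFIX` and `CONDA_DEFAULT_ENV` variables that conda sets on activation. The function name `check_no_conda` is hypothetical.

```shell
# Hypothetical pre-flight check (not part of the regtest scripts).
# Assumption: an active conda environment is a usable proxy for
# "conda setup was invoked in .bashrc"; the thread does not confirm
# the actual failure mechanism.
check_no_conda() {
  if [ -n "${CONDA_PREFIX:-}" ] || [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
    echo "warning: a conda environment appears active:" \
         "${CONDA_DEFAULT_ENV:-${CONDA_PREFIX:-}}" >&2
    return 1
  fi
  return 0
}
```

Calling `check_no_conda || exit 1` before submitting the matrix would stop early and leave it to the user to disable the conda setup.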
@MatthewMasarik-NOAA thanks for letting me know! Once we have a resolution to #1152 I'll re-test this PR and reopen it for review.
@MatthewMasarik-NOAA I've updated the title to reflect the addition of spack-stack 1.6.0. While we still get some output from #1152, we confirmed that we are getting the expected OMP performance and think that we're doing okay as of now. Issue 1152 will remain open, but we will move forward with this PR, especially as we need to move to the hera Rocky-8 nodes to continue testing other PRs. My final set of test results will be posted to this PR later this afternoon, but I think it's okay if you start to test this now, as needing Rocky-8 on hera will hold up other PRs until we have that.
@MatthewMasarik-NOAA all logs are now posted.
Thanks @JessicaMeixner-NOAA
The follow-up issue to track gnu issues can be found here: #1146
Quick update to keep everyone informed. The review for this PR is being finalized and will be posted shortly this morning.
Code review
- Pass
Testing
- Pass
hera
intel: develop v. fix
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (15 files differ)
mww3_test_03/./work_PR1_MPI_d2 (11 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (18 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2 (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (12 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (7 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_tp2.6/./work_ST4_ASCII (0 files differ)
ww3_ufs1.3/./work_a (3 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
hera.intel.dev.fix.matrixCompSummary.txt
hera.intel.dev.fix.matrixCompFull.txt
hera.intel.dev.fix.matrixDiff.txt
intel: fix v. fix
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (16 files differ)
mww3_test_03/./work_PR1_MPI_d2 (11 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (12 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (17 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (18 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2 (16 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (16 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (7 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_tp2.6/./work_ST4_ASCII (0 files differ)
ww3_ufs1.3/./work_a (3 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
hera.intel.fix.fix.matrixCompSummary.txt
hera.intel.fix.fix.matrixCompFull.txt
hera.intel.fix.fix.matrixDiff.txt
gnu: fix v. fix
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UNO_MPI_d2 (17 files differ)
mww3_test_03/./work_PR1_MPI_d2 (13 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (12 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (14 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2 (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (15 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (7 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_tp2.17/./work_ma (1 files differ)
ww3_tp2.17/./work_a (1 files differ)
ww3_tp2.17/./work_mc1 (1 files differ)
ww3_tp2.17/./work_mb (1 files differ)
ww3_tp2.17/./work_mc (1 files differ)
ww3_tp2.17/./work_ma1 (1 files differ)
ww3_tp2.17/./work_c (1 files differ)
ww3_tp2.17/./work_b (1 files differ)
ww3_tp2.6/./work_ST4_ASCII (0 files differ)
ww3_ts1/./work_ST4_WRT (1 files differ)
ww3_ufs1.1/./work_c_npl (1 files differ)
ww3_ufs1.1/./work_d (1 files differ)
ww3_ufs1.1/./work_c_nth (1 files differ)
ww3_ufs1.1/./work_c (1 files differ)
ww3_ufs1.2/./work_a (4 files differ)
ww3_ufs1.2/./work_l (1 files differ)
ww3_ufs1.2/./work_c (3 files differ)
ww3_ufs1.2/./work_b (4 files differ)
ww3_ufs1.3/./work_a (29 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
hera.gnu.fix.fix.matrixCompSummary.txt
hera.gnu.fix.fix.matrixCompFull.txt
hera.gnu.fix.fix.matrixDiff.txt
orion
intel: develop v. fix
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UQ_MPI_d2 (14 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (16 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (16 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (15 files differ)
mww3_test_03/./work_PR1_MPI_d2 (8 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (18 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (6 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_tp2.6/./work_ST4_ASCII (0 files differ)
ww3_ufs1.3/./work_a (3 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
orion.intel.dev.fix.matrixCompSummary.txt
orion.intel.dev.fix.matrixCompFull.txt
orion.intel.dev.fix.matrixDiff.txt
intel: fix v. fix
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UQ_MPI_d2 (15 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (18 files differ)
mww3_test_03/./work_PR1_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (14 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (16 files differ)
mww3_test_03/./work_PR1_MPI_d2 (6 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (10 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (7 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_tp2.6/./work_ST4_ASCII (0 files differ)
ww3_ufs1.3/./work_a (3 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
orion.intel.fix.fix.matrixCompFull.txt
orion.intel.fix.fix.matrixCompSummary.txt
orion.intel.fix.fix.matrixDiff.txt
hercules
intel: fix v. fix
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UQ_MPI_d2 (15 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (17 files differ)
mww3_test_03/./work_PR1_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (17 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (13 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (15 files differ)
mww3_test_03/./work_PR1_MPI_d2 (12 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (14 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (5 files differ)
ww3_tp2.6/./work_ST4_ASCII (0 files differ)
ww3_ufs1.3/./work_a (3 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
hercules.intel.fix.fix.matrixCompSummary.txt
hercules.intel.fix.fix.matrixCompFull.txt
hercules.intel.fix.fix.matrixDiff.txt
gnu: fix v. fix
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UQ_MPI_d2 (15 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (12 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (12 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (15 files differ)
mww3_test_03/./work_PR1_MPI_d2 (13 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (15 files differ)
mww3_test_07/./work_PR3_UQ (1 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tic1.4/./work_IC0IS2_1000 (1 files differ)
ww3_tic1.4/./work_IC1IS2_1000 (1 files differ)
ww3_tp2.21/./work_ma (1 files differ)
ww3_tp2.21/./work_b_metis (1 files differ)
ww3_tp2.21/./work_a (1 files differ)
ww3_tp2.21/./work_b (1 files differ)
ww3_tp2.6/./work_ST0 (1 files differ)
ww3_tp2.6/./work_pdlib (1 files differ)
ww3_tp2.6/./work_ST4 (1 files differ)
ww3_tp2.6/./work_ST4_ASCII (2 files differ)
ww3_tp2.7/./work_ST0 (1 files differ)
ww3_ufs1.1/./work_unstr_c (1 files differ)
ww3_ufs1.1/./work_unstr_b (1 files differ)
ww3_ufs1.1/./work_unstr_a (1 files differ)
ww3_ufs1.3/./work_a (3 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
hercules.gnu.fix.fix.matrixCompSummary.txt
hercules.gnu.fix.fix.matrixCompFull.txt
matrixDiff.txt was too big to include (~25GB!)
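The summaries above can be condensed per test when comparing platforms. A minimal sketch, assuming the input lines have the `test/./work_case (N files differ)` form shown in the listings; the function name `summarize_diffs` is hypothetical and is not part of the regtest tooling.

```shell
# Hypothetical helper: read a matrixCompSummary-style listing (file
# arguments or stdin) and print, per test, the number of non-identical
# cases and the total count of differing files.
summarize_diffs() {
  awk '
    /files differ\)/ {
      split($1, parts, "/")        # parts[1] is the test name
      n = $2
      sub(/^\(/, "", n)            # "(15" -> "15"
      cases[parts[1]]++
      total[parts[1]] += n
    }
    END {
      for (t in total) printf "%s %d %d\n", t, cases[t], total[t]
    }
  ' "$@" | sort
}
```

For example, `summarize_diffs hera.gnu.fix.fix.matrixCompSummary.txt` would print one line per test with its case count and file total. Banner lines and the attached file names are ignored because they do not match the pattern.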
Approved.
Thanks @JessicaMeixner-NOAA, adding a new platform and the gnu compiler gives us more options for validating PRs at NCEP.
…ting/pdlib-restarts * origin/testing/pdlib-restarts: Fix compiler remarks for ST6 and GMD (NOAA-EMC#1206) For NCEP regtests, add option for gnu compiler and new machine Hercules (NOAA-EMC#1145)
Pull Request Summary
This PR adds the option to run the regression tests with either the intel or gnu compiler on hera and adds the new machine hercules as an option for regression testing. Moreover, this PR updates hera to the Rocky-8 nodes and hera, orion and hercules to spack-stack 1.6.0.
Description
For regtests/bin/matrix_cmake_ncep
There are now two inputs: the first is the path to the model directory and the second is the compiler for the regtests, intel (default) or gnu.
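The handling of the second input can be sketched as follows. This is an illustration only, not the code in matrix_cmake_ncep: the function name `select_compiler` and the error message are hypothetical.

```shell
# Hypothetical sketch of resolving the compiler argument: intel when the
# second input is omitted, gnu as the alternative, anything else rejected.
select_compiler() {
  local cmplr="${1:-intel}"    # default when no compiler is given
  case "$cmplr" in
    intel|gnu)
      echo "$cmplr"
      ;;
    *)
      echo "unsupported compiler: $cmplr (expected intel or gnu)" >&2
      return 1
      ;;
  esac
}

# Usage sketch inside a script: $1 is the model directory, $2 the compiler.
#   main_dir="$1"
#   cmplr="$(select_compiler "$2")" || exit 1
```

Under this sketch, passing only the model directory selects intel, while adding gnu as the second argument selects the gnu build.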
@MatthewMasarik-NOAA created the installs for parMETIS on the new machines and updated the paths for consistency on the existing machines including moving us to spack-stack 1.6.0 and the rocky-8 nodes on hera.
The transition to the Rocky-8 nodes exposed a very subtle uninitialized variable, a fix for which has been added here. No answer changes are expected with this PR.
Please also include the following information:
Issue(s) addressed
Commit Message
For NCEP regtests, add option for gnu compiler and new machine Hercules
Check list
Testing
** hera - intel (compare with develop) & gnu (compare with itself, no develop compare option)
** orion - intel (compare with develop)
** Hercules - intel (compare with itself, no develop compare option) & gnu (compare with itself, no develop compare option)
Updated output for comparisons for 3/27/24:
Hera compared with develop:
heravDev.matrixCompFull.txt
heravDev.matrixCompSummary.txt
heravDev.matrixDiff.txt
Hera intel vs itself: (Only expected non-b4b)
hera.intel.matrixDiff.txt
hera.intel.matrixCompFull.txt
hera.intel.matrixCompSummary.txt
Hera gnu vs itself (no existing baseline for comparison) - this has more diffs than "expected" and will be addressed in subsequent PRs to resolve outstanding issues:
hera.gnu.matrixCompFull.txt
hera.gnu.matrixCompSummary.txt
hera.gnu.matrixDiff.txt
Orion vs develop intel (only expected non b4b):
orionvdev.matrixCompFull.txt
orionvdev.matrixCompSummary.txt
orionvdev.matrixDiff.txt
Hercules intel vs itself (no existing baseline for comparison, only expected non b4b):
hercules.intel.matrixCompFull.txt
hercules.intel.matrixCompSummary.txt
hercules.intel.matrixDiff.txt
Hercules gnu vs itself (no existing baselines, some additional non b4b to be addressed in follow-up PRs):
hercules.gnu.matrixCompFull.txt
hercules.gnu.matrixCompSummary.txt
(Diff file too big to post).