Update modulefiles to use spack-stack unified environment #1707

Merged on Aug 24, 2023 (85 commits)

Conversation

ulmononian
Collaborator

@ulmononian ulmononian commented Apr 12, 2023

Description

Now that spack-stack/1.4.1 has been released and the spack-stack Unified Environment (UE) installations are underway on all supported platforms (those pertinent to the WM are Acorn, Cheyenne, Gaea, Hera, Jet, Orion, NOAA Cloud/Parallelworks, and S4), modulefiles for these machines should be updated to use spack-stack in place of hpc-stack. Further, ufs_common will need to be updated to use the module versions included in the UE (which are at least as up-to-date as, or newer than, the current ufs_common modules). UE installation and (some) testing information can be found here and here. More background info on the UE within the context of the WM can be found in #1651.

Preliminary testing of the WM against the official UE installations has been performed on Hera, Orion, Cheyenne, Jet, Gaea, NOAA Cloud (Parallelworks), S4, and Acorn.

Modulefiles to be updated through this PR include: Acorn, Cheyenne, Gaea, Hera, Jet, Linux, MacOSX, Orion, NOAA Cloud (Parallelworks), and S4.

Some additional modifications may be required for certain platforms outside of the modulefiles alone (e.g., Cheyenne's fv3_conf files to address switch from mpt to impi, etc.).

While spack-stack is available and currently being tested on Hercules and Gaea C5, these machines are being addressed in separate PRs (#1733 and #1784, respectively).

This work is in collaboration with @AlexanderRichert-NOAA, @climbfuji, @mark-a-potts, and @srherbener.

Testing Progress (cpld_control_p8):

spack-stack 1.4.1:

Top of commit queue on: TBD

Input data additions/changes

  • No changes are expected to input data.
  • There will be new input data.
  • Input data will be updated.

Anticipated changes to regression tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:
    Most RTs are expected to change because this is a fundamental stack change. New baselines will most likely be required.

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved
  • Confirm reviews completed in sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

Depends on #1745 (currently GOCART submodule hash update and two .rc file changes cherry-picked from this PR)

#1651

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@jkbk2004
Collaborator

@ulmononian @Hang-Lei-NOAA What about status of spack stack installation on wcoss2?

@Hang-Lei-NOAA

@jkbk2004 The v1.3.0 has been installed on acorn for UFS testing:
Acorn: /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.3.0/envs/unified-env-compute

@ulmononian
Collaborator Author

@ulmononian @Hang-Lei-NOAA What about status of spack stack installation on wcoss2?

following on what @Hang-Lei-NOAA mentioned: @AlexanderRichert-NOAA is currently conducting tests w/ this acorn installation, and we will merge his updates to the ufs_acorn.intel.lua modulefile soon.

@jkbk2004
Collaborator

@ulmononian hopefully, wcoss2 modulefiles as well to make PR fully ready. BTW, as you expect baseline changes, can we work on quantifying the effect of the changes? I can work on it with you. let me know.

@ulmononian
Collaborator Author

@ulmononian hopefully, wcoss2 modulefiles as well to make PR fully ready. BTW, as you expect baseline changes, can we work on quantifying the effect of the changes? I can work on it with you. let me know.

it is out of my wheelhouse to update wcoss2...@Hang-Lei-NOAA would you be willing to make any necessary updates to the wcoss2 modulefile to reflect implementation of spack-stack 1.3.0 unified environment? you could submit a PR directly to the branch associated w/ this PR, if that would work for you.

@jkbk2004 yes -- let's touch base about quantifying the changes.

@ulmononian
Collaborator Author

@ulmononian hopefully, wcoss2 modulefiles as well to make PR fully ready. BTW, as you expect baseline changes, can we work on quantifying the effect of the changes? I can work on it with you. let me know.

update: the acorn modulefile updates from @AlexanderRichert-NOAA were just merged.

@ulmononian
Collaborator Author

@DusanJovic-NOAA by chance, did you ever have success running cpld_control_p8 on hera using gnu/9.2 w/ openmpi/3.1.4 (the hera system versions)? asking in reference to #1465 (comment), because there seem to be issues with some of the gnu RTs on hera with this compiler/mpi combination in our spack-stack testing.

@DeniseWorthen
Collaborator

@ulmononian Will the SCOTCH library (required for the unstructured WW3) be available in this unified spack-stack? I am working on committing that capability to UFS and will need it available.

I was able to compile on cheyenne.intel (NOAA-EMC/hpc-stack#501 (comment)) using your install there.

@DusanJovic-NOAA
Collaborator

@DusanJovic-NOAA by chance, did you ever have success running cpld_control_p8 on hera using gnu/9.2 w/ openmpi/3.1.4 (the hera system versions)? asking in reference to #1465 (comment), because there seem to be issues with some of the gnu RTs on hera with this compiler/mpi combination in our spack-stack testing.

No, I did not run any gnu/openmpi tests on Hera recently.

@ulmononian
Collaborator Author

@ulmononian Will the SCOTCH library (required for the unstructured WW3) be available in this unified spack-stack? I am working on committing that capability to UFS and will need it available.

I was able to compile on cheyenne.intel (NOAA-EMC/hpc-stack#501 (comment)) using your install there.

looking into getting this added now. apologies for the delay. what version would be ideal?

@DeniseWorthen
Collaborator

@ulmononian I know they are debugging issues w/ the SCOTCH library and are expecting to eventually need to rebuild once the issue is found. But for the low task counts in the RTs, the issue they're debugging shouldn't be a problem. So I think the 7.0.3 (which is what I tested on Cheyenne) would be fine, unless you have a better idea.

@JessicaMeixner-NOAA
Collaborator

Matching what you have on cheyenne should be fine for the SCOTCH version. There is a known bug in scotch that the developers are working to solve. Once that fix is available (still in the debugging process) I will let you know and we'll need the new version.

@ulmononian
Collaborator Author

ulmononian commented Apr 19, 2023

@DeniseWorthen @JessicaMeixner-NOAA thanks for this information. i'll try to add 7.0.3 for now. we should then be able to add it to the unified-environment installations on some platforms pretty quickly, i think. would orion, hera, and cheyenne work to start?

also, any eta on that bugfix that will necessitate the version change?

@MatthewMasarik-NOAA
Collaborator

also, any eta on that bugfix that will necessitate the version change?

Hi @ulmononian, I'm working with the SCOTCH developer to resolve this, so I'll try to give an ETA. My best estimate is ~1-3 weeks. We are narrowing down where the issue is, though it is within the SCOTCH code base, so it is a little hard to be more exact as an outside developer. Hopefully we are near the end!

@DeniseWorthen
Collaborator

@JessicaMeixner-NOAA Can better answer the timeline question.

The other option would be to add the METIS library to spack-stack for now (not on wcoss2, obviously); it would also work for my purposes. Once SCOTCH is fixed, the tests could be switched to using that.

I would like to be able to commit both a gnu and intel test for the unstructured mesh in UFS.

@uturuncoglu
Collaborator

I wonder how multiple versions of the same package are handled with spack-stack. Some development work might need to use a newer version of a library, which is then required for the PR at the end. For example, I am working on land component development which requires the ESMF 8.5.0 beta release. Once I bring these changes to my fork, it will break my development unless I still keep using hpc-stack. Any idea, suggestion?

@ulmononian
Collaborator Author

also, any eta on that bugfix that will necessitate the version change?

Hi @ulmononian, I'm working with the SCOTCH developer to resolve this, so I'll try to give an ETA. My best estimate is ~1-3 weeks. We are narrowing down where the issue is, though it is within the SCOTCH code base, so it is a little hard to be more exact as an outside developer. Hopefully we are near the end!

thanks for this estimate, @MatthewMasarik-NOAA!

@DeniseWorthen while i see the functional purpose of temporarily using METIS, my feeling is that it would be prudent to go directly to scotch in spack-stack given the wcoss2 rejection of METIS and overall workload. on another note: i definitely understand you wanting to include gnu/intel tests. the issue isn't adding scotch actually (that should be simple), rather, sorting out the GNU/openmpi configuration of the spack-stack UE on hera and intel/cheyenne. until this is sorted, this modulefile PR will not be ready for review/merge. depending on your timeline to get the scotch/ww3 changes into develop, it may be best to utilize the hpc-stack installations.

@ulmononian
Collaborator Author

ulmononian commented Apr 19, 2023

I wonder how multiple versions of the same package are handled with spack-stack. Some development work might need to use a newer version of a library, which is then required for the PR at the end. For example, I am working on land component development which requires the ESMF 8.5.0 beta release. Once I bring these changes to my fork, it will break my development unless I still keep using hpc-stack. Any idea, suggestion?

@uturuncoglu spack-stack is able to support multiple versions within the same environment (e.g. the current unified environment does support multiple esmf and mapl versions). i believe @climbfuji or @AlexanderRichert-NOAA could provide more details regarding the paradigm/process for adding package versions needed for ufs-wm development (e.g. esmf 8.5.0 beta as you mentioned), though i expect the process to be similar to how hpc-stack updates are handled now.

@MatthewMasarik-NOAA
Collaborator

Fyi, if it's helpful SCOTCH should work fine for the range of use by the UFS RT's (MPI tasks < ~2K, would be OK).

At this time I think it's most likely that the re-build for scotch that will be needed will be due to a change in the source code. It's possible, but unlikely, that we will need to alter the build process in any significant way.

@DeniseWorthen
Collaborator

@ulmononian Your comments on using METIS temporarily are understood.

Is SCOTCH in the hpc-stack? I couldn't find it.

@ulmononian
Collaborator Author

@ulmononian Your comments on using METIS temporarily are understood.

Is SCOTCH in the hpc-stack? I couldn't find it.

i am going to transfer this conversation to NOAA-EMC/hpc-stack#501.

@DeniseWorthen
Collaborator

DeniseWorthen commented Apr 20, 2023

@ulmononian Actually I think scotch is available in the stack, at least on hera for both intel and gnu. Many apologies for the confusion.

module use ../modulefiles/
module load ufs_hera.intel

module spider scotch

----------------------------------------------------------------------------------------------------------------------------------
  scotch: scotch/7.0.3
----------------------------------------------------------------------------------------------------------------------------------
    Description:
      scotch library


    You will need to load all module(s) on any one of the lines below before the "scotch/7.0.3" module is available to load.

      ufs_hera.intel  hpc/1.2.0  hpc-intel/2022.1.2  hpc-impi/2022.1.2

@ulmononian
Collaborator Author

ulmononian commented Apr 20, 2023

@DeniseWorthen the confusion was my fault because scotch/7.0.3 was NOT available until you mentioned it yesterday. i just did not notify in time that i ran the installations last night for the orion (intel) and hera (intel/gnu) stacks currently used by WM develop. let me know if you have any issues, please.

further, scotch/7.0.3 should be ready for testing within the spack-stack unified environment by the end of today. i will post confirmation when done. any testing of that stack is appreciated!

@ulmononian
Collaborator Author

ulmononian commented Apr 20, 2023

@DeniseWorthen @JessicaMeixner-NOAA @MatthewMasarik-NOAA i installed scotch/7.0.3 in the spack-stack UE on hera (both intel and gnu). it's not currently in my fork branch ufs_common (i expect scotch will need to be added to this file in develop soon?), so please load https://github.com/ulmononian/ufs-weather-model/blob/feature/spack_stack_ue/modulefiles/ufs_hera.intel.lua and then do module load scotch/7.0.3. the spack-stack UE GNU installation is not fully functional yet on hera (most rt_gnu.conf RTs pass, but we are debugging cpld_control_p8), so proceed with caution if you run any GNU tests yet.

@DeniseWorthen
Collaborator

DeniseWorthen commented Apr 20, 2023

@ulmononian Thanks. I've been able to compile and run my unstructured feature branch with SCOTCH (using hpc-stack) on hera by adding

diff --git a/modulefiles/ufs_common.lua b/modulefiles/ufs_common.lua
index ae8b8e6c..cad3ed7e 100644
--- a/modulefiles/ufs_common.lua
+++ b/modulefiles/ufs_common.lua
@@ -53,4 +53,7 @@ load(pathJoin("gftl-shared", gftl_shared_ver))
 mapl_ver=os.getenv("mapl_ver") or "2.22.0-esmf-8.3.0b09"
 load(pathJoin("mapl", mapl_ver))

+scotch_ver=os.getenv("scotch_ver") or "7.0.3"
+load(pathJoin("scotch", scotch_ver))

Is that the correct method I should be using?

@ulmononian
Collaborator Author

yes -- that looks right to me. very awesome that it is working for you!
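For readers unfamiliar with the pattern in the diff: `os.getenv("scotch_ver") or "7.0.3"` uses the `scotch_ver` environment variable when it is set and falls back to 7.0.3 otherwise. A minimal shell sketch of the same fallback logic follows, for illustration only. The `resolve_scotch_ver` helper and the `0.0.0-example` version string are hypothetical and are not part of the modulefiles or of this PR.

```shell
#!/bin/sh
# Illustration only: mirrors `os.getenv("scotch_ver") or "7.0.3"` from the diff.
# resolve_scotch_ver is a hypothetical helper, not part of ufs_common.lua.
resolve_scotch_ver() {
    # ${var-default} falls back only when the variable is unset, which matches
    # Lua, where os.getenv returns nil for an unset variable.
    printf '%s/%s\n' "scotch" "${scotch_ver-7.0.3}"
}

# No override in the environment: the default version is used.
unset scotch_ver
resolve_scotch_ver

# Hypothetical override; this version string is a placeholder, not a real release.
scotch_ver="0.0.0-example"
export scotch_ver
resolve_scotch_ver
```

Under this reading, exporting `scotch_ver` before loading the modulefile would presumably select a different SCOTCH module without editing ufs_common.lua, provided that module is actually installed in the stack.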

@DeniseWorthen
Collaborator

@ulmononian I'm having trouble coordinating all the places this issue is being discussed. But as a heads-up, turnaround on hera is abysmal right now. I know scotch/hpc-stack/intel compiles and runs. I'm trying gnu now. I will test your spack-ue branch also, but it will be slow.

On Cheyenne, I can also build and run w/ scotch/hpc-stack/intel. Scotch is not in the gnu hpc-stack there.

Is everything available to try cheyenne w/ scotch for the spack-stack? Turnaround is much faster for me there.

@ulmononian
Collaborator Author

@ulmononian I'm having trouble coordinating all the places this issue is being discussed. But as a heads-up, turnaround on hera is abysmal right now. I know scotch/hpc-stack/intel compiles and runs. I'm trying gnu now. I will test your spack-ue branch also, but it will be slow.

On Cheyenne, I can also build and run w/ scotch/hpc-stack/intel. Scotch is not in the gnu hpc-stack there.

Is everything available to try cheyenne w/ scotch for the spack-stack? Turnaround is much faster for me there.

i understand. this is sort of an hpc-stack, spack-stack, and WM issue all at once so i'm not sure the best place for it.

no rush for hera GNU (hpc-stack) or hera spack-stack UE tests.

i can add it to cheyenne gnu hpc-stack and cheyenne spack-stack -- might be a few hours.

@jkbk2004
Collaborator

@FernandoAndrade-NOAA can you run the develop branch only to build compile_s2swa_faster_intel on jet? That way, we can tell whether the issue comes from the current jet system situation or from the spack-stack side.

@DeniseWorthen
Collaborator

Using the current commit-1, on hera these were the compile times reported:

Compile s2swa_faster_intel elapsed time 1043 seconds. -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 -DFASTER=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DMOM6SOLO=ON
Compile s2swa_intel elapsed time 743 seconds. -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DMOM6SOLO=ON

For jet, they were:

Compile s2swa_faster_intel elapsed time 5467 seconds. -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 -DFASTER=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DSIMDMULTIARCH=ON -DMOM6SOLO=ON
Compile s2swa_intel elapsed time 2020 seconds. -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DSIMDMULTIARCH=ON -DMOM6SOLO=ON
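To make the comparison easier to read, the elapsed times quoted above can be converted to HH:MM:SS and expressed as Jet-to-Hera ratios. The sketch below is for illustration only; `hms` and `ratio` are hypothetical helper names, and the inputs are just the four numbers from this comment.

```shell
#!/bin/sh
# Illustration only: the elapsed times are the ones quoted in the comment above;
# hms and ratio are hypothetical helper names.

# Convert a number of seconds to HH:MM:SS using integer arithmetic.
hms() {
    printf '%02d:%02d:%02d\n' "$(( $1 / 3600 ))" "$(( $1 % 3600 / 60 ))" "$(( $1 % 60 ))"
}

# Ratio of two elapsed times to two decimal places (awk handles the floating point).
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "hera s2swa_faster_intel: $(hms 1043)"
echo "hera s2swa_intel:        $(hms 743)"
echo "jet  s2swa_faster_intel: $(hms 5467)"
echo "jet  s2swa_intel:        $(hms 2020)"
echo "jet/hera, faster build:  $(ratio 5467 1043)x"
echo "jet/hera, regular build: $(ratio 2020 743)x"
```

With these numbers, the Jet builds take roughly 5.2 and 2.7 times as long as the corresponding Hera builds.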

@jkbk2004
Collaborator

So, a 01:36:02 compile time for compile_s2swa_faster_intel looks like normal behavior on jet. At least it has nothing to do with this pr.

@DavidHuber-NOAA
Collaborator

I believe I know the cause of this on Jet. The applications are built with two sets of instructions -axSSE4.2,CORE-AVX2. This slows the compilation time considerably. I have noticed the same issue when compiling the GSI, which uses the same instructions, including -O3. An alternative approach could be to compile based on the instructions for the lowest-supported architecture, perhaps -march=ivybridge for vjet or -march=haswell for xjet. However, doing this will disable CORE-AVX2 instructions, which would have a deleterious effect on kjet run times.
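The options mentioned in this comment can be written down as a small lookup. This is a sketch for illustration only: `jet_arch_flags` and the `multi` label are hypothetical names, not part of the UFS build system, and the flag strings are copied from the comment rather than from the actual CMake configuration.

```shell
#!/bin/sh
# Hypothetical helper: maps a build target label to the Intel compiler
# architecture flags discussed in the comment above.
jet_arch_flags() {
    case "$1" in
        # Current approach per the comment: two instruction sets in one binary,
        # at the cost of longer compile times.
        multi) echo "-axSSE4.2,CORE-AVX2" ;;
        # Single-architecture alternatives suggested in the comment.
        vjet)  echo "-march=ivybridge" ;;
        xjet)  echo "-march=haswell" ;;
        *)     echo "unknown target: $1" >&2; return 1 ;;
    esac
}

for target in multi vjet xjet; do
    echo "$target: $(jet_arch_flags "$target")"
done
```

As the comment notes, a single lowest-common-denominator flag would shorten compilation but could slow runs on the newer partitions, so the lookup only illustrates the choice and does not recommend one.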

@jkbk2004
Collaborator

I believe I know the cause of this on Jet. The applications are built with two sets of instructions -axSSE4.2,CORE-AVX2. This slows the compilation time considerably. I have noticed the same issue when compiling the GSI, which uses the same instructions, including -O3. An alternative approach could be to compile based on the instructions for the lowest-supported architecture, perhaps -march=ivybridge for vjet or -march=haswell for xjet. However, doing this will disable CORE-AVX2 instructions, which would have a deleterious effect on kjet run times.

@DavidHuber-NOAA Thanks for the note.

@zach1221
Collaborator

Regression testing is complete. I've sent review requests.

@zach1221
Collaborator

Oh @ulmononian , can you please resolve the 4 conversations above?

@BrianCurtis-NOAA
Collaborator

@zach1221 Dusan is out this week, but I read through those and they can be addressed post PR, so they can be resolved. @junwang-noaa needs to comment on her question if she's OK addressing that outside of the PR. Otherwise we need to wait for those questions to be answered.

@junwang-noaa
Collaborator

@ulmononian The PR said that the results will change, have you looked at the differences to confirm the changes are expected?

@junwang-noaa
Collaborator

@mark-a-potts have you built the spack-stack on cloud? Would you please list the stack locations so that other developers can use them? Thanks a lot!

@mark-a-potts
Contributor

Yes. The unified environment is installed under /contrib/EPIC/space-stack/spack-stack-1.4.1/envs/unified-dev

@ulmononian
Collaborator Author

@ulmononian The PR said that the results will change, have you looked at the differences to confirm the changes are expected?

@jkbk2004 ran butterfly tests and compared against hpc-stack results. everything looked reasonable.

@DeniseWorthen
Collaborator

I have no interest in delaying this PR, but are the butterfly test results shown anywhere? I couldn't find anything in the associated issue (#1651)

@ulmononian
Collaborator Author

@DeniseWorthen
Collaborator

@ulmononian Thanks. I've linked the comment to the 1651 issue.

Labels
  • Baseline Updates: Current baselines will be updated.
  • jenkins-ci: Jenkins CI: ORT build/test on docker container
  • Ready for Commit Queue: The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.
Projects
Status: Done