Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactors the Omega build to collect build configurations from the E3SM case #85

Merged
merged 3 commits into from
Jun 11, 2024

Conversation

grnydawn
Copy link

This PR collects the Omega build configuration from the E3SM case instead of using the Omega-internal Python script.
To achieve this, the Omega build system creates a generic E3SM case at the beginning of the Omega build.
By using the E3SM case, the Omega build can utilize dynamically generated parameters, making its configuration
closer to that of E3SM, which Omega will eventually be combined with.

This PR is tested by running Omega unit tests on Frontier(CPU & GPU), Perlmutter(CPU & GPU) and Chrysalis(CPU).
But it is not tested within Polaris test environment.

CIME compiler configuration files for Frontier (crayclanggpu_frontier.cmake, amdclanggpu_frontier.cmake, and gnugpu_frontier.cmake) are modified, which is out of Omega-project scope. The modifications include setting MPICXX compiler to Cray CC instead of "hipcc" and removing MPI-related compiler flags accordingly.

Checklist

  • Documentation:
    • Design document has been generated and added to the docs
    • User's Guide has been updated
    • Developer's Guide has been updated
    • Documentation has been built locally and changes look as expected
  • Testing
    • A comment in the PR documents testing used to verify the changes including any tests that are added/modified/impacted.
    • CTest unit tests for new features have been added per the approved design.
    • Polaris tests for new features have been added per the approved design (and included in a test suite)
    • Unit tests have passed. Please provide a relevant CDash build entry for verification.
    • Polaris test suite has passed
    • Performance related PRs: Please include a relevant PACE experiment link documenting performance before and after.
  • Stealth Features
    • If any stealth features are included in the PR, please confirm that they have been documented.

@grnydawn grnydawn added Omega CMake CMake-related issues labels May 21, 2024
@grnydawn grnydawn requested review from philipwjones and cbegeman May 21, 2024 18:08
@philipwjones
Copy link

@grnydawn This works fine on Chrysalis, but on Frontier, the Cmake step fails with several errors. This was using the usual cmake invocation (tried both crayclang and crayclanggpu):
cmake -DOMEGA_CIME_MACHINE=frontier -DOMEGA_CIME_COMPILER=crayclanggpu -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} -DOMEGA_BUILD_TEST=ON ~/MyOmegaSrc/components/omega

Here's the output - at least one error is that the path to env_mach_specific has an extra (.) in it:

Deprecated "arg" node detected in /lustre/orion/scratch/pwjones/cli115/omegagpu/e3smcase/env_batch.xml, check files /autofs/nccs-svm1_home1/pwjones/OmegaBuild/cime_config/machines/config_batch.xml
Deprecated "arg" node detected in /lustre/orion/scratch/pwjones/cli115/omegagpu/e3smcase/env_batch.xml, check files /autofs/nccs-svm1_home1/pwjones/OmegaBuild/cime_config/machines/config_batch.xml
Deprecated "arg" node detected in /lustre/orion/scratch/pwjones/cli115/omegagpu/e3smcase/env_batch.xml, check files /autofs/nccs-svm1_home1/pwjones/OmegaBuild/cime_config/machines/config_batch.xml
ERROR: Command: '/autofs/nccs-svm1_home1/pwjones/OmegaBuild/components/mpas-albany-landice/bld/build-namelist -infile /lustre/orion/scratch/pwjones/cli115/omegagpu/e3smcase/Buildconf/maliconf/cesm_namelist -caseroot /lustre/orion/scratch/pwjones/cli115/omegagpu/e3smcase -casebuild /lustre/orion/scratch/pwjones/cli115/omegagpu/e3smcase/Buildconf -cimeroot /autofs/nccs-svm1_home1/pwjones/OmegaBuild/cime -inst_string '' -decomp_prefix 'mpasli.graph.info.' -decomp_date_stamp '150910' -glc_grid 'mpas.ais20km' -ntasks_glc '64' -ninst_glc '1' -mali_dynamic 'True'' failed with error 'Can't locate Switch.pm in @inc (you may need to install the Switch module) (@inc contains: /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.26.1 /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at /autofs/nccs-svm1_home1/pwjones/OmegaBuild/components/mpas-albany-landice/bld/build-namelist line 1080.
BEGIN failed--compilation aborted at /autofs/nccs-svm1_home1/pwjones/OmegaBuild/components/mpas-albany-landice/bld/build-namelist line 1080.' from dir '/lustre/orion/scratch/pwjones/cli115/omegagpu/e3smcase/Buildconf/maliconf'
bash: /lustre/orion/scratch/pwjones/cli115/omegagpu/e3smcase/.env_mach_specific.sh: No such file or directory
CMake Error at OmegaBuild.cmake:77 (string):
string sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
OmegaBuild.cmake:150 (read_cime_config)
CMakeLists.txt:39 (init_standalone_build)

-- The C compiler identification is Clang 15.0.6
-- The CXX compiler identification is Clang 15.0.6
-- Cray Programming Environment 2.7.20 C
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/cray/pe/craype/2.7.20/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Cray Programming Environment 2.7.20 CXX
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.20/bin/CC - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- OMEGA_LINK_OPTIONS =
CMake Error at OmegaBuild.cmake:553 (message):
MPI test failure: No such file or directory
Call Stack (most recent call first):
CMakeLists.txt:75 (update_variables)

-- Configuring incomplete, errors occurred!
See also "/lustre/orion/scratch/pwjones/cli115/omegagpu/CMakeFiles/CMakeOutput.log".

@grnydawn
Copy link
Author

@philipwjones , thanks for reporting the issue. It is new issue to me. Could you allow me to access to "/lustre/orion/scratch/pwjones/cli115/omegagpu/" directory and its subdirectories so that I can see the ouput logs?

@grnydawn
Copy link
Author

@philipwjones , the issue seems to be caused by a missing Perl module (Switch) on Frontier. Could you install the Switch module using something like CPAN and try again? I have created a ticket for installing the Perl module on Frontier, but it may take some time.

@philipwjones
Copy link

@grnydawn I tried to install, but cpan messed up my default perl environment and I couldn't load any of my other modules so clearly didn't configure correctly and I'm not familiar enough with perl to figure it out. Where is the build using perl? and maybe it should use given/when rather than an extra switch module...

@rljacob
Copy link
Member

rljacob commented May 30, 2024

I thought Perl was only needed in the ELM/EAM configure scripts (which are written in perl). And, I think, we can build EAM/ELM just fine on Frontier. Possibly because of the code in cime/utils/perl5lib.

@grnydawn
Copy link
Author

@philipwjones, the script that caused this failure is "E3SM/components/mpas-albany-landice/bld/build-namelist". Please see the of "use Switch;" near line 1073. I wonder why this script is used, apparently I did not use this in my case.

@rljacob , I will check the perl utilities. Thanks.

@philipwjones
Copy link

@grnydawn I'm not sure either but that explains one of the other errors pointing to a mangled path to some land ice stuff. I'll try and figure out why my setup is going there...

@rljacob
Copy link
Member

rljacob commented May 30, 2024

I was wrong then. Other components are using perl-based build-namelist scripts. They probably patterned them after the EAM/ELM ones before we encouraged python in any new scripts.

@philipwjones
Copy link

Ok, it appears the errors above resulted from user error (same mistake back-to-back for both cpu, gpu build directories). I can build frontier gpu now but for the frontier cpu-only build, it appears the back end flag is not being set:
error: "None of OMEGA_ENABLE_CUDA, OMEGA_ENABLE_HIP, and OMEGA_ENABLE_OPENMP is defined."

@philipwjones
Copy link

To be more specific crayclanggpu target builds fine. crayclang does not set the backend.

@grnydawn
Copy link
Author

@philipwjones , great that it works now with crayclanggpu. The issue with cpu build is a bug at a part of Kokkos setting in Omega code. It should be fixed easily. Thanks!

Copy link

@philipwjones philipwjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be a reasonable approach for extracting some of the job launch and other configuration info from E3SM/CIME. Successfully builds and runs the unit tests on Chrysalis (Intel) and Frontier (cpu/gpu)

Sorry about sending you down a rabbit hole with the initial errors @grnydawn

@xylar xylar self-requested a review May 31, 2024 13:15
Comment on lines 60 to 61
--res T62_oQU120_ais20 \
--compset MPAS_LISIO_TEST \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunate timing. We made some fairly big changes to this test in E3SM: E3SM-Project#6437

This test is also maybe a little more complicated than we need, since it has active MALI, MPAS-Ocean and MPAS-Seaice. I would suggest using a C-case test instead, maybe:

Suggested change
--res T62_oQU120_ais20 \
--compset MPAS_LISIO_TEST \
--res T62_oQU120 \
--compset CMPASO-NYF \

This should be safe because it's in the e3sm_developer suite:
https://github.com/E3SM-Project/E3SM/blob/ef925d118e756f4f9379423ee4022581dd5aafad/cime_config/tests.py#L302

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. It will be included in the next commit.

Comment on lines -9 to +13
string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc70,cc60 -Minfo=accel")
string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
set(SCC "cc")
set(SCXX "CC")
set(SFC "ftn")
string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc70,cc60 -Minfo=accel")
string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed in our Slack huddle, let's make this its own commit and submit it to E3SM as a bug fix.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a new commit only for this change. Thanks!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grnydawn, can you make a PR to E3SM as well with this commit?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xylar I am not familiar with how to use this commit to create a PR to E3SM. Could you explain the general steps to do that?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can help with that. Your fork of Omega (https://github.com/grnydawn/Omega) is also a fork of https://github.com/E3SM-Project/E3SM. You can add it as a remote to wherever you usually develop E3SM:

git remote add grnydawn/Omega [email protected]:grnydawn/Omega.git
git fetch grnydawn/Omega -p

Then, make a new branch from E3SM master however you would normally do that, e.g.:

git fetch origin -p
git reset --hard origin/master
git checkout -b ykim/machinefiles/fix-nvidiagpu-pm-gpu

They, you just need to cherry-pick the commit from here to the branch:

git cherry-pick 294a5ac4d453c61999dee1e170597d80f73822ed

where I got the hash of the commit from the copy icon next to the commit listed here:
https://github.com/E3SM-Project/Omega/pull/85/commits

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From there, just push the branch to origin as usual and make a PR to E3SM as usual.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we will revert the commit here on Omega before merging master into Omega's develop so it's clear that the E3SM version of the commit, rather than this one, is considered to be the "right" one.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @xylar for detail instructions. Before proceeding to create a PR, I have one question on the changes in Subproject commits. From running git diff HEAD~1 at "ykim/machinefiles/fix-nvidiagpu-pm-gpu" I got following Subproject commit changes. I wonder if I could ignore the git log messages:

diff --git a/components/elm/src/external_models/fates b/components/elm/src/external_models/fates
index 42d804ba54..698a8df848 160000
--- a/components/elm/src/external_models/fates
+++ b/components/elm/src/external_models/fates
@@ -1 +1 @@
-Subproject commit 42d804ba54d0cf013a9737018ff9920e0c9808ea
+Subproject commit 698a8df848ecdb81aa72fee6c86be2c41b2545e9
diff --git a/externals/ekat b/externals/ekat
index fb4babcf3a..2ff5853316 160000
--- a/externals/ekat
+++ b/externals/ekat
@@ -1 +1 @@
-Subproject commit fb4babcf3a24c14e9e94d71df152a7f097805c14
+Subproject commit 2ff5853316e15d4e8004c21890329fd257fa7459
diff --git a/externals/haero b/externals/haero
index 77c4a45584..574787f9b3 160000
--- a/externals/haero
+++ b/externals/haero
@@ -1 +1 @@
-Subproject commit 77c4a45584f2bd3bf38a18ea5bc24ef9ce3776a2
+Subproject commit 574787f9b31cc77b7902f07e73e10dcd18cdf29d
diff --git a/externals/mam4xx b/externals/mam4xx
index a19bc75891..8f6af492c8 160000
--- a/externals/mam4xx
+++ b/externals/mam4xx
@@ -1 +1 @@
-Subproject commit a19bc75891e235071426840742a2c530bbea9e52
+Subproject commit 8f6af492c8627ebe5ec7904c3185ab8294e49a12
diff --git a/externals/scorpio b/externals/scorpio
index cdd541e0cd..de0b1ca220 160000
--- a/externals/scorpio
+++ b/externals/scorpio
@@ -1 +1 @@
-Subproject commit cdd541e0cd708bece13b1f3ee42f66dfd6440aa7
+Subproject commit de0b1ca2200f62c6eb5e3fd40147965409e97123

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can probably ignore those. It will be clear as soon as you make a PR whether your commits contain any changes to these submodules. I think the more likely thing is simply that they have been changed between different branches you checked out and you would just need to do a new recursive update to make your submodules match what they are supposed to be on the current branch.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @xylar. I think I will check if my commit has older submodules because I have not updated submodules recently, and then proceed to create a PR.

@grnydawn
Copy link
Author

grnydawn commented Jun 3, 2024

Thank you, @philipwjones and @xylar, for the reviews. I have applied all the suggestions and updated the commit history as suggested by Xylar.

I created three new CIME compiler configurations(GPU) for Omega on Frontier. Please use the following compiler names for OMEGA_CIME_COMPILER CMake macros:

amdomegagpu, crayomegagpu, and gnuomegagpu

@grnydawn grnydawn force-pushed the ykim/omega/cmakebuild branch from 4eed153 to 2c541a9 Compare June 3, 2024 01:39
@xylar
Copy link

xylar commented Jun 4, 2024

I need to make some fixes to the Omega CTest utility in Polaris but they are unrelated to this PR. The build worked great for me with the Polaris utility without any modification. The issues are related to needing 3 Omega meshes, not just one.

I think the changes look good. We will need to figure out how to handle the special compiler names for Frontier in Polaris but I'm not going to worry too much about that just yet since we don't yet support the GPU compilers anyway.

@grnydawn grnydawn removed the request for review from cbegeman June 10, 2024 16:55
@philipwjones philipwjones merged commit b9645f1 into E3SM-Project:develop Jun 11, 2024
2 checks passed
@grnydawn grnydawn deleted the ykim/omega/cmakebuild branch September 27, 2024 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake-related issues Omega
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants