-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactors the Omega build to collect build configurations from the E3SM case #85
Refactors the Omega build to collect build configurations from the E3SM case #85
Conversation
@grnydawn This works fine on Chrysalis, but on Frontier, the Cmake step fails with several errors. This was using the usual cmake invocation (tried both crayclang and crayclanggpu): Here's the output - at least one error is that the path to env_mach_specific has an extra (.) in it: Deprecated "arg" node detected in /lustre/orion/scratch/pwjones/cli115/omegagpu/e3smcase/env_batch.xml, check files /autofs/nccs-svm1_home1/pwjones/OmegaBuild/cime_config/machines/config_batch.xml -- The C compiler identification is Clang 15.0.6 -- Configuring incomplete, errors occurred! |
@philipwjones , thanks for reporting the issue. It is new issue to me. Could you allow me to access to "/lustre/orion/scratch/pwjones/cli115/omegagpu/" directory and its subdirectories so that I can see the ouput logs? |
@philipwjones , the issue seems to be caused by a missing Perl module (Switch) on Frontier. Could you install the Switch module using something like CPAN and try again? I have created a ticket for installing the Perl module on Frontier, but it may take some time. |
@grnydawn I tried to install, but cpan messed up my default perl environment and I couldn't load any of my other modules so clearly didn't configure correctly and I'm not familiar enough with perl to figure it out. Where is the build using perl? and maybe it should use given/when rather than an extra switch module... |
I thought Perl was only needed in the ELM/EAM configure scripts (which are written in perl). And, I think, we can build EAM/ELM just fine on Frontier. Possibly because of the code in cime/utils/perl5lib. |
@philipwjones, the script that caused this failure is "E3SM/components/mpas-albany-landice/bld/build-namelist". Please see the of "use Switch;" near line 1073. I wonder why this script is used, apparently I did not use this in my case. @rljacob , I will check the perl utilities. Thanks. |
@grnydawn I'm not sure either but that explains one of the other errors pointing to a mangled path to some land ice stuff. I'll try and figure out why my setup is going there... |
I was wrong then. Other components are using perl-based build-namelist scripts. They probably patterned them after the EAM/ELM ones before we encouraged python in any new scripts. |
Ok, it appears the errors above resulted from user error (same mistake back-to-back for both cpu, gpu build directories). I can build frontier gpu now but for the frontier cpu-only build, it appears the back end flag is not being set: |
To be more specific crayclanggpu target builds fine. crayclang does not set the backend. |
@philipwjones , great that it works now with crayclanggpu. The issue with cpu build is a bug at a part of Kokkos setting in Omega code. It should be fixed easily. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be a reasonable approach for extracting some of the job launch and other configuration info from E3SM/CIME. Successfully builds and runs the unit tests on Chrysalis (Intel) and Frontier (cpu/gpu)
Sorry about sending you down a rabbit hole with the initial errors @grnydawn
components/omega/OmegaBuild.cmake
Outdated
--res T62_oQU120_ais20 \ | ||
--compset MPAS_LISIO_TEST \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunate timing. We made some fairly big changes to this test in E3SM: E3SM-Project#6437
This test is also maybe a little more complicated than we need, since it has active MALI, MPAS-Ocean and MPAS-Seaice. I would suggest using a C-case test instead, maybe:
--res T62_oQU120_ais20 \ | |
--compset MPAS_LISIO_TEST \ | |
--res T62_oQU120 \ | |
--compset CMPASO-NYF \ |
This should be safe because it's in the e3sm_developer
suite:
https://github.com/E3SM-Project/E3SM/blob/ef925d118e756f4f9379423ee4022581dd5aafad/cime_config/tests.py#L302
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. It will be included in the next commit.
string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc70,cc60 -Minfo=accel") | ||
string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel") | ||
set(SCC "cc") | ||
set(SCXX "CC") | ||
set(SFC "ftn") | ||
string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc70,cc60 -Minfo=accel") | ||
string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed in our Slack huddle, let's make this its own commit and submit it to E3SM as a bug fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a new commit only for this change. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@grnydawn, can you make a PR to E3SM as well with this commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xylar I am not familiar with how to use this commit to create a PR to E3SM. Could you explain the general steps to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can help with that. Your fork of Omega (https://github.com/grnydawn/Omega) is also a fork of https://github.com/E3SM-Project/E3SM. You can add it as a remote to wherever you usually develop E3SM:
git remote add grnydawn/Omega [email protected]:grnydawn/Omega.git
git fetch grnydawn/Omega -p
Then, make a new branch from E3SM master however you would normally do that, e.g.:
git fetch origin -p
git reset --hard origin/master
git checkout -b ykim/machinefiles/fix-nvidiagpu-pm-gpu
They, you just need to cherry-pick the commit from here to the branch:
git cherry-pick 294a5ac4d453c61999dee1e170597d80f73822ed
where I got the hash of the commit from the copy icon next to the commit listed here:
https://github.com/E3SM-Project/Omega/pull/85/commits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From there, just push the branch to origin
as usual and make a PR to E3SM as usual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we will revert the commit here on Omega before merging master
into Omega's develop
so it's clear that the E3SM version of the commit, rather than this one, is considered to be the "right" one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @xylar for detail instructions. Before proceeding to create a PR, I have one question on the changes in Subproject commits. From running git diff HEAD~1 at "ykim/machinefiles/fix-nvidiagpu-pm-gpu" I got following Subproject commit changes. I wonder if I could ignore the git log messages:
diff --git a/components/elm/src/external_models/fates b/components/elm/src/external_models/fates
index 42d804ba54..698a8df848 160000
--- a/components/elm/src/external_models/fates
+++ b/components/elm/src/external_models/fates
@@ -1 +1 @@
-Subproject commit 42d804ba54d0cf013a9737018ff9920e0c9808ea
+Subproject commit 698a8df848ecdb81aa72fee6c86be2c41b2545e9
diff --git a/externals/ekat b/externals/ekat
index fb4babcf3a..2ff5853316 160000
--- a/externals/ekat
+++ b/externals/ekat
@@ -1 +1 @@
-Subproject commit fb4babcf3a24c14e9e94d71df152a7f097805c14
+Subproject commit 2ff5853316e15d4e8004c21890329fd257fa7459
diff --git a/externals/haero b/externals/haero
index 77c4a45584..574787f9b3 160000
--- a/externals/haero
+++ b/externals/haero
@@ -1 +1 @@
-Subproject commit 77c4a45584f2bd3bf38a18ea5bc24ef9ce3776a2
+Subproject commit 574787f9b31cc77b7902f07e73e10dcd18cdf29d
diff --git a/externals/mam4xx b/externals/mam4xx
index a19bc75891..8f6af492c8 160000
--- a/externals/mam4xx
+++ b/externals/mam4xx
@@ -1 +1 @@
-Subproject commit a19bc75891e235071426840742a2c530bbea9e52
+Subproject commit 8f6af492c8627ebe5ec7904c3185ab8294e49a12
diff --git a/externals/scorpio b/externals/scorpio
index cdd541e0cd..de0b1ca220 160000
--- a/externals/scorpio
+++ b/externals/scorpio
@@ -1 +1 @@
-Subproject commit cdd541e0cd708bece13b1f3ee42f66dfd6440aa7
+Subproject commit de0b1ca2200f62c6eb5e3fd40147965409e97123
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can probably ignore those. It will be clear as soon as you make a PR whether your commits contain any changes to these submodules. I think the more likely thing is simply that they have been changed between different branches you checked out and you would just need to do a new recursive update to make your submodules match what they are supposed to be on the current branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @xylar. I think I will check if my commit has older submodules because I have not updated submodules recently, and then proceed to create a PR.
Thank you, @philipwjones and @xylar, for the reviews. I have applied all the suggestions and updated the commit history as suggested by Xylar. I created three new CIME compiler configurations(GPU) for Omega on Frontier. Please use the following compiler names for
|
4eed153
to
2c541a9
Compare
I need to make some fixes to the Omega CTest utility in Polaris but they are unrelated to this PR. The build worked great for me with the Polaris utility without any modification. The issues are related to needing 3 Omega meshes, not just one. I think the changes look good. We will need to figure out how to handle the special compiler names for Frontier in Polaris but I'm not going to worry too much about that just yet since we don't yet support the GPU compilers anyway. |
This PR collects the Omega build configuration from the E3SM case instead of using the Omega-internal Python script.
To achieve this, the Omega build system creates a generic E3SM case at the beginning of the Omega build.
By using the E3SM case, the Omega build can utilize dynamically generated parameters, making its configuration
closer to that of E3SM, which Omega will eventually be combined with.
This PR is tested by running Omega unit tests on Frontier(CPU & GPU), Perlmutter(CPU & GPU) and Chrysalis(CPU).
But it is not tested within Polaris test environment.
CIME compiler configuration files for Frontier (crayclanggpu_frontier.cmake, amdclanggpu_frontier.cmake, and gnugpu_frontier.cmake) are modified, which is out of Omega-project scope. The modifications include setting MPICXX compiler to Cray CC instead of "hipcc" and removing MPI-related compiler flags accordingly.
Checklist