
Port UFS-WM to Ursa #2471

Draft
wants to merge 31 commits into base: develop

Conversation

@ulmononian (Collaborator) commented Oct 15, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

Enable the UFS-WM to run on Ursa (Hera's follow-on machine). Currently, only the pre-TDS (accessed via Niagara) is in use, so configurations will likely need to be updated as the machine approaches full implementation.

A spack-stack installation to support UFS applications using the Intel LLVM compilers is in progress; see JCSDA/spack-stack#1297.

UFS-WM_RT data will be staged on the shared Niagara/Ursa disk space for now, until a dedicated Ursa filesystem is made available; once the stack installation is finished, RTs will be run using the 8 available nodes (1 service, 7 compute).

Commit Message:

* UFSWM - Port UFS-WM to Ursa

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • PR Adds New Tests/Baselines.

Input data Changes:

  • None (just staging on Ursa)

Library Changes/Upgrades:

  • Required

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@ulmononian (Collaborator, Author) commented Nov 5, 2024

Quick update: this work is on hold while we wait for Ursa's oso system, which will have internet connectivity and allow the spack-stack/1.8.0 installation to proceed. See JCSDA/spack-stack#1297 (comment).

@ulmononian (Collaborator, Author) commented:

We are still waiting for internet connectivity to be enabled on Ursa so that the spack-stack build can be completed. @RaghuReddy-NOAA, do you have any update on this?

@RaghuReddy-NOAA commented:

@ulmononian The login node nfe91 is now able to access the external network. Please note that it is still behind a firewall, and any site that is reachable by Hera/Niagara should be reachable from nfe91 too.

@ulmononian (Collaborator, Author) commented Jan 16, 2025

Currently testing the UFS-WM with spack-stack/1.8.0 on Ursa.

There are some issues with CMake locating the MPI libraries right now, e.g., when trying to run CMake for the ATM-only model:

cmake -DAPP=ATM -DCCPP_SUITES=FV3_GFS_v16,FV3_GFS_v16_flake,FV3_GFS_v17_p8,FV3_GFS_v17_p8_rrtmgp,FV3_GFS_v15_thompson_mynn_lam3km,FV3_WoFS_v0,FV3_GFS_v17_p8_mynn,FV3_GFS_v17_p8_ugwpv1 -D32BIT=ON ..

I get:

-- Could NOT find MPI_C (missing: MPI_C_WORKS)
CMake Error at /contrib/spack-stack/envs/1.8.0/ue-oneapi-ifort-2024.2.1/install/oneapi/2024.2.1/cmake-3.27.9-zfrh7no/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_C_FOUND) (found version "3.1")
Call Stack (most recent call first):
  /contrib/spack-stack/envs/1.8.0/ue-oneapi-ifort-2024.2.1/install/oneapi/2024.2.1/cmake-3.27.9-zfrh7no/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /contrib/spack-stack/envs/1.8.0/ue-oneapi-ifort-2024.2.1/install/oneapi/2024.2.1/cmake-3.27.9-zfrh7no/share/cmake-3.27/Modules/FindMPI.cmake:1837 (find_package_handle_standard_args)
  CMakeLists.txt:148 (find_package)

This corresponds to an error generated by the find_package(MPI) call in the top-level UFS-WM CMakeLists.txt. Below is the module show output for the stack-intel-oneapi-mpi/2021.13 modulefile that is loaded at compile time:

[Cameron.Book@nfe91 build]$ module show  stack-intel-oneapi-mpi/2021.13
--------------------------------------------------------------------------------------------------------------------------------------------------------
   /contrib/spack-stack/envs/1.8.0/ue-oneapi-ifort-2024.2.1/install/modulefiles/oneapi/2024.2.1/stack-intel-oneapi-mpi/2021.13.lua:
--------------------------------------------------------------------------------------------------------------------------------------------------------
help([[]])
family("MetaMPI")
conflict("stack-intel-mpi")
conflict("stack-intel-oneapi-mpi")
conflict("stack-cray-mpich")
conflict("stack-mpich")
conflict("stack-mpt")
load("intel-oneapi-mpi/2021.13.1")
prereq("intel-oneapi-mpi/2021.13.1")
prepend_path("MODULEPATH","/contrib/spack-stack/envs/1.8.0/ue-oneapi-ifort-2024.2.1/install/modulefiles/intel-oneapi-mpi/2021.13-eaajhcw/oneapi/2024.2.1")
setenv("MPICC","mpiicx")
setenv("MPICXX","mpiicpx")
setenv("MPIF77","mpiifort")
setenv("MPIF90","mpiifort")
setenv("MPI_CC","mpiicx")
setenv("MPI_CXX","mpiicpx")
setenv("MPI_F77","mpiifort")
setenv("MPI_F90","mpiifort")
setenv("I_MPI_CC","mpiicc")
setenv("I_MPI_CXX","mpiicpc")
setenv("I_MPI_F77","/apps/spack-2024-12/linux-rocky9-x86_64/gcc-11.4.1/intel-oneapi-compilers-2024.2.1-oqhstbmawnrsdw472p4pjsopj547o6xs/compiler/2024.2/bin/ifort")
setenv("I_MPI_F90","mpiifort")
setenv("I_MPI_FC","mpiifort")
setenv("intel_oneapi_mpi_ROOT","/apps/spack-2024-12/linux-rocky9-x86_64/oneapi-2024.2.1/intel-oneapi-mpi-2021.13.1-ss72gbndvat3oz22sa6lhmlbjkeabrn4")
whatis("Name: stack-intel-oneapi-mpi")
whatis("Version: 2021.13")
whatis("Category: library")
whatis("Description: stack-intel-oneapi-mpi mpi library and module access")

I've also tried changing the environment variables set in the ufs_ursa modulefile (mirroring some of the approaches in the hercules/hera/gaeac6 LLVM Lua files), to no avail so far. Note that we are using icx, icpx, and ifort for now, though we do have an ifx-based stack as well. I've also tried with the system CMake, but no difference.
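
A minimal standalone check (hypothetical, not part of the WM build) that may help separate a broken wrapper/environment from a CMake problem: try compiling a trivial MPI C program with the same wrapper the stack module exports (MPICC=mpiicx above). If this fails, the MPI_C_WORKS try-compile inside FindMPI will fail in the same way.

# hypothetical wrapper sanity check; mpiicx is the MPICC set by the stack module
cat > mpi_hello.c << 'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("hello from rank %d\n", rank);
  MPI_Finalize();
  return 0;
}
EOF
mpiicx mpi_hello.c -o mpi_hello && echo "MPI C wrapper OK"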

@ulmononian (Collaborator, Author) commented Jan 24, 2025

Update: testing with the newly built spack-stack/1.6.0/fms-2024.01 environment.

control_c48 ran successfully with icx, icpx, and ifx: /collab1/data/Cameron.Book/RT_RUNDIRS/Cameron.Book/FV3_RT/rt_1260965

The baseline comparison failed because we don't have the newest baseline data on Ursa yet; we should just run with -c, since baselines need to be created for Ursa anyway.
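
For reference, a baseline-creation invocation along the lines of the rt.sh command used later in this thread (the account is a placeholder, and the flag meanings are as I understand them, so treat this as a sketch):

# sketch: generate new baselines on Ursa instead of comparing against existing ones
cd tests/
./rt.sh -a <account> -c -r -k -l rt.conf
# -c : create new baselines rather than verifying against current ones
# -r : run the suite through rocoto
# -k : keep the run directories
# -l : test list / configuration file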

@ulmononian (Collaborator, Author) commented Jan 24, 2025

I tried to build the WM in S2SWA mode using the same spack-stack/1.6.0/fms-2024.01 environment as in my previous comment. At first, doing this via rt.sh, it failed in the CMake step for MOM6 (/collab1/data/Cameron.Book/RT_RUNDIRS/Cameron.Book/FV3_RT/rt_1265768):

CMake Error at MOM6-interface/CMakeLists.txt:35 (add_library):
  Cannot find source file:

    MOM6/src/diagnostics/MOM_diagnose_MLD.F90

I checked some things, purged modules, and ran again, only to see the same MPI_C error we hit using spack-stack/1.8.0 for the ATM-only model (/collab1/data/Cameron.Book/RT_RUNDIRS/Cameron.Book/FV3_RT/rt_1270036; see #2471 (comment)):

CMake Error at /contrib/spack-stack/envs/1.6.0/ue-intel-2023.2.0/install/intel/2021.10.0/cmake-3.23.1-zjdo26m/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_C_FOUND) (found version "3.1")
Call Stack (most recent call first):
  /contrib/spack-stack/envs/1.6.0/ue-intel-2023.2.0/install/intel/2021.10.0/cmake-3.23.1-zjdo26m/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /contrib/spack-stack/envs/1.6.0/ue-intel-2023.2.0/install/intel/2021.10.0/cmake-3.23.1-zjdo26m/share/cmake-3.23/Modules/FindMPI.cmake:1830 (find_package_handle_standard_args)

We are testing different stacks and different compilers and hitting the same CMake error, once in an ATM build and once in a coupled build. I am not sure what is going on.

@rickgrubin-noaa @RatkoVasic-NOAA @RaghuReddy-NOAA @DusanJovic-NOAA @climbfuji fyi

@ulmononian (Collaborator, Author) commented Jan 27, 2025

Now seeing this when trying to compile S2SWA (it finds MPI_C, but I'm not sure it is finding these correctly):

-- Found MPI_C: /apps/spack-2024-12/linux-rocky9-x86_64/oneapi-2023.2.0/intel-oneapi-mpi-2021.13.1-66y2dsdqzran6wjlk7oglxlr5za3dbh6/mpi/2021.13/lib/libmpifort.so (found version "3.1")
-- Found MPI_CXX: /apps/spack-2024-12/linux-rocky9-x86_64/oneapi-2023.2.0/intel-oneapi-mpi-2021.13.1-66y2dsdqzran6wjlk7oglxlr5za3dbh6/mpi/2021.13/lib/libmpicxx.so (found version "3.1")
-- Found MPI_Fortran: /apps/spack-2024-12/linux-rocky9-x86_64/oneapi-2023.2.0/intel-oneapi-mpi-2021.13.1-66y2dsdqzran6wjlk7oglxlr5za3dbh6/mpi/2021.13/lib/libmpifort.so (found version "3.1")

CMake now fails here:

Found Python: /contrib/spack-stack/envs/1.6.0/ue-intel-2023.2.0/install/intel/2021.10.0/python-3.10.13-vbszmcu/bin/python3.10
-- Compile stochastic_physics with 64-bit precision to match CCPP slow physics.
Calling CCPP code generator (ccpp_prebuild.py) for suites --suites=FV3_GFS_v17_coupled_p8,FV3_GFS_v17_coupled_p8_ugwpv ...
CMake Error at FV3/ccpp/CMakeLists.txt:40 (message):
  An error occured while running ccpp_prebuild.py, check
  /collab1/data/Cameron.Book/port_wm/build/cpl/FV3/ccpp/ccpp_prebuild.{out,err}


-- Configuring incomplete, errors occurred!

Seems like something weird is going on here...

The ccpp_prebuild.err file shows:

INFO: Logging level set to INFO
INFO: Found TYPEDEFS_NEW_METADATA dictionary in config, assume at least some data is in new metadata format
INFO: Parsing suite definition files ...
INFO: Parsing suite definition file suites/suite_FV3_GFS_v17_coupled_p8.xml ...
INFO: Parsing suite definition file suites/suite_FV3_GFS_v17_coupled_p8_ugwpv.xml ...
CRITICAL: Suite definition file suites/suite_FV3_GFS_v17_coupled_p8_ugwpv.xml not found.
ERROR: Parsing suite definition file suite_FV3_GFS_v17_coupled_p8_ugwpv.xml failed.
Traceback (most recent call last):
  File "/collab1/data/Cameron.Book/port_wm/FV3/ccpp/framework/scripts/ccpp_prebuild.py", line 829, in <module>
    main()
  File "/collab1/data/Cameron.Book/port_wm/FV3/ccpp/framework/scripts/ccpp_prebuild.py", line 739, in main
    raise Exception('Parsing suite definition files failed.')
Exception: Parsing suite definition files failed.

@climbfuji (Collaborator) commented:

From the log:

  An error occured while running ccpp_prebuild.py, check
  /collab1/data/Cameron.Book/port_wm/build/cpl/FV3/ccpp/ccpp_prebuild.{out,err}

@ulmononian (Collaborator, Author) commented:

> From the log:
>
>   An error occured while running ccpp_prebuild.py, check
>   /collab1/data/Cameron.Book/port_wm/build/cpl/FV3/ccpp/ccpp_prebuild.{out,err}

lol, I was wondering why it was trying to parse a suite that did not exist / was not being used in my CMake command. Classic case of a typo (I had entered FV3_GFS_v17_coupled_p8_ugwpv instead of FV3_GFS_v17_coupled_p8_ugwpv1).

Now just back to the same MOM6 issue:

CMake Error at MOM6-interface/CMakeLists.txt:35 (add_library):
  Cannot find source file:

    MOM6/src/diagnostics/MOM_diagnose_MLD.F90

  Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h
  .hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .hip .ispc


CMake Error at MOM6-interface/CMakeLists.txt:55 (add_library):
  Cannot find source file:

    MOM6/config_src/drivers/unit_tests/test_MOM_remapping.F90

  Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h
  .hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .hip .ispc


CMake Error at MOM6-interface/CMakeLists.txt:35 (add_library):
  No SOURCES given to target: mom6_obj


CMake Error at MOM6-interface/CMakeLists.txt:55 (add_library):
  No SOURCES given to target: mom6_nuopc_obj


CMake Error at MOM6-interface/CMakeLists.txt:75 (add_library):
  No SOURCES given to target: mom6


CMake Generate step failed.  Build files cannot be regenerated correctly.

@BrianCurtis-NOAA (Collaborator) commented:

Have you recursed all submodules in your git clone?

@ulmononian (Collaborator, Author) commented:

> Have you recursed all submodules in your git clone?

Seems like my recent pull from develop was not successful and messed up the MOM6 source...

Cloned fresh and it gets past this. Thanks @BrianCurtis-NOAA.
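
For reference, the generic git commands for a clean checkout with all nested submodules populated (assuming the standard ufs-community repository URL):

# fresh clone with all submodules (FV3, MOM6, WW3, CMEPS, ...) checked out
git clone --recurse-submodules https://github.com/ufs-community/ufs-weather-model.git
cd ufs-weather-model
# or, to repair an existing checkout whose submodules are stale after a pull:
git submodule update --init --recursive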

@ulmononian (Collaborator, Author) commented:

Using ifx, make fails in WW3:

[ 42%] Building Fortran object WW3/model/src/CMakeFiles/ww3_lib.dir/w3wavemd.F90.o
/collab1/data/Cameron.Book/0127/WW3/model/src/w3wavemd.F90(414): remark #6536: All symbols from this module are already visible due to another USE; the ONLY clause will have no effect. Rename clauses, if any, will be honored.   [W3ODATMD]
    USE W3ODATMD
--------^
/collab1/data/Cameron.Book/0127/WW3/model/src/w3wavemd.F90(458): remark #6536: All symbols from this module are already visible due to another USE; the ONLY clause will have no effect. Rename clauses, if any, will be honored.   [W3TIMEMD]
    USE W3TIMEMD
--------^
[ 42%] Building Fortran object WW3/model/src/CMakeFiles/ww3_lib.dir/wmwavemd.F90.o
[ 42%] Building Fortran object WW3/model/src/CMakeFiles/ww3_lib.dir/wav_comp_nuopc.F90.o
[ 43%] Linking Fortran static library ../../../lib/libww3.a
[ 43%] Built target ww3_lib
make: *** [Makefile:136: all] Error 2

@BrianCurtis-NOAA (Collaborator) commented:

Those are remarks and I don't believe they cause the make to fail. Look earlier for errors.

@ulmononian (Collaborator, Author) commented:

> Those are remarks and I don't believe they cause the make to fail. Look earlier for errors.

You're right. It was some issue in CMEPS. This was with ifx... which I'm going to bypass for 1.6.0 testing for now.

@uturuncoglu (Collaborator) commented:

@DeniseWorthen I have no access to that platform. If you want, I could run cpld_control_p8_lnd_intel vs. cpld_control_p8_intel and check the timings to see the extra overhead from the land component.

@DeniseWorthen (Collaborator) commented:

@uturuncoglu Sorry for not being clear. It seems we're having this same issue on other platforms. I brought it up here because extending the wall clock is not the solution to this sort of problem.

Could you maybe check Derecho and/or Hercules? Are they running close to the wall-clock limit there?

@uturuncoglu (Collaborator) commented:

@DeniseWorthen Okay. Let me check on Hercules. I'll update you soon.

@uturuncoglu (Collaborator) commented:

@DeniseWorthen If I remember correctly, I did a couple of tests before, and reducing the output interval of the land component improved its performance. Here are the results of my previous tests, which I did for land DA:

ESMF_Profile.summary.r01 - 44/47/42 = ~44 sec
  original configuration
  12 hours of simulation
  144+144 cores (mediator is on the first 144)

ESMF_Profile.summary.r02 - 35/30/30 = ~31 sec (~25% gain relative to r01)
  changed restart_n (in ufs.configure) from 1 to 12, since the mediator/CDEPS was spending time writing restart files. BTW, it will only write a single restart at the end of the simulation.

ESMF_Profile.summary.r03 - 16/11/15 = ~14 sec (~50% gain relative to r02)
  changed output_freq (in ufs.configure) from 3600 (hourly) to 86400 (daily), so the land component will not produce any output.

ESMF_Profile.summary.r04 - 17/17/15 = ~16 sec (no improvement)
  24+24 cores for atm and land (mediator on all 48)
  Land C96 layout = 2:2

Of course, this is for the configuration coupled with DATM. I'll do a similar test with control_p8_atmlnd_intel and cpld_control_p8_lnd_intel.

@uturuncoglu (Collaborator) commented:

@DeniseWorthen I ran control_p8 with and without the land component. The standalone atmosphere case (original control_p8) took 223 sec in total, and the one with the external land component took 286 sec. Both fluctuate a little, but not too much, due to load on the system. In the land-coupled case, FV3 writes output every hour, which is not the case for the standalone run, so I/O probably plays a role in the timing difference. In any case, the run finishes in around 5 min, so I am not sure why it's timing out. Maybe comparing files is taking time; not sure. Anyway, I think the best way to speed up the case (if you want more performance) is to use the same output interval as control_p8, so I made the following modification to the test.

diff --git a/tests/parm/ufs.configure.atm_lnd.IN b/tests/parm/ufs.configure.atm_lnd.IN
index c00fd9ea..b3c3c1d8 100644
--- a/tests/parm/ufs.configure.atm_lnd.IN
+++ b/tests/parm/ufs.configure.atm_lnd.IN
@@ -69,7 +69,7 @@ LND_attributes::
   surface_evap_resistance_option = 1 # not used, it is fixed to 4 in sfc_noahmp_drv.F90
   glacier_option = 1
   surface_thermal_roughness_option = 2
-  output_freq = 3600
+  output_freq = 21600
   restart_freq = -1
   calc_snet = @[CALC_SNET]
   initial_albedo = @[initial_albedo]
diff --git a/tests/tests/control_p8_atmlnd b/tests/tests/control_p8_atmlnd
index e9bc04d2..a7714b01 100644
--- a/tests/tests/control_p8_atmlnd
+++ b/tests/tests/control_p8_atmlnd
@@ -127,7 +127,7 @@ export precip_partition_option=4
 export initial_albedo=0.2
 export WRITE_DOPOST=.false.
 export OUTPUT_GRID=cubed_sphere_grid
-export OUTPUT_FH="1 -1"
+export OUTPUT_FH="0 12 24"
 if [[ "$ATMRES" = "$LNDRES" ]]; then
   export lnd_input_dir="INPUT/"
   export mosaic_file="INPUT/${LNDRES}_mosaic.nc"

So, this minimizes the I/O footprint of control_p8_atmlnd and also ensures the required files are generated. I am not sure whether the changes in ufs.configure.atm_lnd.IN affect other tests at this point. I also tried disabling the mediator restart and history output for this run, and it finished in 238 sec, which is very close to the standalone atmosphere. At this point I am not seeing any performance issue, at least in this case. Anyway, I could also do the same for cpld_control_p8 and cpld_control_p8_lnd. Let me know what you think.

@ulmononian (Collaborator, Author) commented:

Thanks for this testing, @uturuncoglu! @DeniseWorthen @jkbk2004, any preference here for these land tests on Ursa?

@ulmononian (Collaborator, Author) commented:

Just to note: control_p8_atmlnd_intel passes if I change TPN to 128 instead of 192 in the test file (forcing the run job to use 3 nodes on Ursa). @uturuncoglu @DeniseWorthen

@BrianCurtis-NOAA (Collaborator) commented:

> Just to note: control_p8_atmlnd_intel passes if I change TPN to 128 instead of 192 in the test file (forcing the run job to use 3 nodes on Ursa). @uturuncoglu @DeniseWorthen

Is there some memory issue? Does a TPN of 192 make sense for Ursa's hardware?

@DeniseWorthen (Collaborator) commented:

@uturuncoglu Thanks for your testing. I don't see that there should be any issue with that sort of timing (~300 s), so I don't think there is any cause to change your tests.

@uturuncoglu (Collaborator) commented:

@DeniseWorthen Okay. Let me know if you need help. We could make them more efficient in the future. I am not sure at this point who will maintain land-coupling-related issues, since we are in the process of finalizing the JTTI project.

@ulmononian (Collaborator, Author) commented Feb 4, 2025

> Just to note: control_p8_atmlnd_intel passes if I change TPN to 128 instead of 192 in the test file (forcing the run job to use 3 nodes on Ursa). @uturuncoglu @DeniseWorthen

> Is there some memory issue? Does a TPN of 192 make sense for Ursa's hardware?

@BrianCurtis-NOAA scontrol show node on nfe91 (the available Ursa pre-TDS login node) shows CPUTot=192, so I went with that. It works for most tests, except the four I reported earlier. Dropping to 128 for these specific tests to allocate more nodes is just a suggested fix (as done with Gaea C6).
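
A quick way to check whether 192 corresponds to physical cores or hyperthreads, and how much memory the node has (a sketch; the fields are standard Slurm output, but I have not verified them on Ursa):

# hypothetical node inspection on the pre-TDS login node
scontrol show node nfe91 | grep -E "CPUTot|Sockets|CoresPerSocket|ThreadsPerCore|RealMemory"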

@ulmononian (Collaborator, Author) commented:

@BrianCurtis-NOAA Moving to 128 TPN resolves the issues for all the lnd RTs, but not for the regional_atmaq_debug test. It only gets about 0.2 hours into the forecast when I try either TPN=96 (4 nodes) or TPN=128 (3 nodes). I am seeing this in the err file, but it is only a warning, so I assume it is not causing the slowness:

175: WARNING: O3TOTCOL:Requested date is beyond available data on OMI file:  <0:00:00   Jan. 8, 2016

The wallclock time is still at 30 minutes; it times out every time. @chan-hoo mentioned you maintain this test, so I wanted to check if you had any insight.
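
In case it helps the triage, a sketch of how I'd check where the forecast stalls in the run directory (paths follow the RT_RUNDIRS layout quoted earlier in this thread; the rt id and test directory name are placeholders):

# hypothetical inspection of the stalled regional_atmaq_debug run directory
cd /collab1/data/$USER/RT_RUNDIRS/$USER/FV3_RT/rt_XXXXXX/regional_atmaq_debug_intel
tail -n 40 err out                                   # last messages before the stall
grep -il "error\|abort\|forrtl" PET*.ESMF_LogFile    # component-level errors, if ESMF PET logs are present
ls -lt *.nc | head                                   # timestamps show when output last advanced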

@ulmononian (Collaborator, Author) commented Feb 10, 2025

Just a note (in case anyone knows a fix): rocoto will hang indefinitely when running rt.sh for multiple tests (i.e., ./rt.sh -a <account> -k -r -c -l rt.conf) if any tests are failing (in compile/run). It will not exit and just keeps showing: rt_utils.sh: Running one iteration of rocotorun and rocotostat.... When this happens, no regression test log is saved, so it is not easy to check which tests have passed/failed in a text file (you have to do rocotostat -d rocoto_workflow.db -w rocoto_workflow.xml). If ALL tests pass, rocoto will successfully exit and save the regression test logs. @RatkoVasic-NOAA reported something similar for Hera.
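
In the meantime, the status can be checked by hand while rt.sh spins (the rocotostat command is the one from this comment; the grep filter is just a convenience):

# manual status check while rt.sh loops on rocotorun/rocotostat
cd tests/
rocotostat -d rocoto_workflow.db -w rocoto_workflow.xml | grep -E "DEAD|RUNNING|QUEUED"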

@ulmononian (Collaborator, Author) commented:

Compile failures in the following using gcc/12.4.0 & openmpi/4.1.6: compile_datm_cdeps, compile_s2s, compile_s2swa, compile_s2swa_debug, compile_s2sw_pdlib, compile_s2sw_pdlib_debug. The error for each seems to be the same, i.e.:

/collab1/data/Cameron.Book/test_wm_gnu/CMEPS-interface/CMEPS/mediator/med_utils_mod.F90:37:9:

   37 |     use mpi , only : MPI_ERROR_STRING, MPI_MAX_ERROR_STRING, MPI_SUCCESS
      |         1
Fatal Error: Cannot open module file 'mpi.mod' for reading at (1): No such file or directory
compilation terminated.
make[2]: *** [CMEPS-interface/CMakeFiles/cmeps.dir/build.make:569: CMEPS-interface/CMakeFiles/cmeps.dir/CMEPS/mediator/med_utils_mod.F90.o] Error 1
make[2]: *** Waiting for unfinished jobs....

All other GNU configurations & tests that are turned on for both Hera/Hercules succeed on Ursa, i.e.:

[Cameron.Book@nfe91 tests]$ rocotostat -d rocoto_workflow.db -w rocoto_workflow.xml
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
197001010000         compile_atm_gnu                      503697           SUCCEEDED                   0         1         168.0
197001010000         control_c48_gnu                      503708           SUCCEEDED                   0         1         359.0
197001010000      control_stochy_gnu                      503709           SUCCEEDED                   0         1         125.0
197001010000         control_ras_gnu                      503710           SUCCEEDED                   0         1         196.0
197001010000          control_p8_gnu                      503711           SUCCEEDED                   0         1         197.0
197001010000    control_p8_ugwpv1_gnu                      503712           SUCCEEDED                   0         1         192.0
197001010000       control_flake_gnu                      503713           SUCCEEDED                   0         1         250.0
197001010000        compile_rrfs_gnu                      503698           SUCCEEDED                   0         1         166.0
197001010000         rap_control_gnu                      503714           SUCCEEDED                   0         1         384.0
197001010000         rap_sfcdiff_gnu                      503715           SUCCEEDED                   0         1         384.0
197001010000        hrrr_control_gnu                      503716           SUCCEEDED                   0         1         211.0
197001010000         rrfs_v1beta_gnu                      503717           SUCCEEDED                   0         1         382.0
197001010000      compile_csawmg_gnu                      503699           SUCCEEDED                   0         1         158.0
197001010000      control_csawmg_gnu                      503718           SUCCEEDED                   0         1         352.0
197001010000    compile_atm_dyn32_debug_gnu                      503700           SUCCEEDED                   0         1         277.0
197001010000    control_diag_debug_gnu                      503719           SUCCEEDED                   0         1          91.0
197001010000      regional_debug_gnu                      503720           SUCCEEDED                   0         1         336.0
197001010000    rap_control_debug_gnu                      503721           SUCCEEDED                   0         1         113.0
197001010000    hrrr_control_debug_gnu                      503722           SUCCEEDED                   0         1         109.0
197001010000       hrrr_gf_debug_gnu                      503723           SUCCEEDED                   0         1         112.0
197001010000       hrrr_c3_debug_gnu                      503724           SUCCEEDED                   0         1         116.0
197001010000      rap_diag_debug_gnu                      503725           SUCCEEDED                   0         1         128.0
197001010000    rap_noah_sfcdiff_cires_ugwp_debug_gnu                      503726           SUCCEEDED                   0         1         171.0
197001010000    rap_progcld_thompson_debug_gnu                      503727           SUCCEEDED                   0         1         111.0
197001010000    rrfs_v1beta_debug_gnu                      503728           SUCCEEDED                   0         1         114.0
197001010000    control_ras_debug_gnu                      503729           SUCCEEDED                   0         1          70.0
197001010000    control_stochy_debug_gnu                      503730           SUCCEEDED                   0         1          75.0
197001010000    control_debug_p8_gnu                      503731           SUCCEEDED                   0         1          93.0
197001010000     rap_flake_debug_gnu                      503732           SUCCEEDED                   0         1         113.0
197001010000    rap_clm_lake_debug_gnu                      503733           SUCCEEDED                   0         1         112.0
197001010000    gnv1_c96_no_nest_debug_gnu                      503734           SUCCEEDED                   0         1         188.0
197001010000    compile_wam_debug_gnu                      503701           SUCCEEDED                   0         1         100.0
197001010000    compile_atm_debug_dyn32_gnu                      503702           SUCCEEDED                   0         1         167.0
197001010000    control_csawmg_debug_gnu                      503735           SUCCEEDED                   0         1         117.0
197001010000    compile_rrfs_dyn32_phy32_gnu                      503703           SUCCEEDED                   0         1         164.0
197001010000    rap_control_dyn32_phy32_gnu                      503736           SUCCEEDED                   0         1         359.0
197001010000    hrrr_control_dyn32_phy32_gnu                      503737           SUCCEEDED                   0         1         197.0
197001010000    conus13km_control_gnu                      503738           SUCCEEDED                   0         1         149.0
197001010000    conus13km_restart_mismatch_gnu                      503745           SUCCEEDED                   0         1         101.0
197001010000    compile_atm_dyn64_phy32_gnu                      503704           SUCCEEDED                   0         1         383.0
197001010000    rap_control_dyn64_phy32_gnu                      503739           SUCCEEDED                   0         1         228.0
197001010000    compile_atm_dyn32_phy32_debug_gnu                      503705           SUCCEEDED                   0         1         278.0
197001010000    rap_control_debug_dyn32_phy32_gnu                      503740           SUCCEEDED                   0         1         111.0
197001010000    hrrr_control_debug_dyn32_phy32_gnu                      503741           SUCCEEDED                   0         1         111.0
197001010000     conus13km_debug_gnu                      503742           SUCCEEDED                   0         1         294.0
197001010000    conus13km_radar_tten_debug_gnu                      503743           SUCCEEDED                   0         1         289.0
197001010000    compile_atm_dyn64_phy32_debug_gnu                      503706           SUCCEEDED                   0         1         276.0
197001010000    rap_control_dyn64_phy32_debug_gnu                      503744           SUCCEEDED                   0         1         114.0

@ulmononian (Collaborator, Author) commented:

The error I reported above seems to be resolved if I include setenv("FC", "mpifort") in the ufs_ursa.gnu.lua file (replacing gfortran with mpifort to address the mpi.mod problem). However, if I include setenv("CC", "mpicc") or setenv("CXX", "mpic++"), CMake fails. So the ufs_ursa.gnu.lua file contains:

setenv("MPI_CC", "mpicc")
setenv("MPI_CXX", "mpic++")
setenv("MPI_FC", "mpifort")

setenv("FC", "mpifort")

@ulmononian (Collaborator, Author) commented:

All compile jobs pass with FC=mpifort added to the GNU modulefile. Associated tests pass, with the exception of cpld_control_p8_gnu and cpld_debug_p8_gnu (both of these tests are turned OFF for Hercules/Hera).

[Cameron.Book@nfe91 tests]$ rocotostat -d rocoto_workflow.db -w rocoto_workflow.xml
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
197001010000       compile_s2swa_gnu                      503767           SUCCEEDED                   0         1         634.0
197001010000     cpld_control_p8_gnu                      503783                DEAD                   1         3          47.0
197001010000         compile_s2s_gnu                      503768           SUCCEEDED                   0         1         610.0
197001010000    cpld_control_nowave_noaero_p8_gnu                      503776           SUCCEEDED                   0         1         429.0
197001010000    compile_s2swa_debug_gnu                      503769           SUCCEEDED                   0         1         137.0
197001010000       cpld_debug_p8_gnu                      503779                DEAD                   1         3          74.0
197001010000    compile_s2sw_pdlib_gnu                      503770           SUCCEEDED                   0         1         636.0
197001010000    cpld_control_pdlib_p8_gnu                      503780           SUCCEEDED                   0         1         625.0
197001010000    compile_s2sw_pdlib_debug_gnu                      503771           SUCCEEDED                   0         1         127.0
197001010000    cpld_debug_pdlib_p8_gnu                      503774           SUCCEEDED                   0         1         506.0
197001010000    compile_datm_cdeps_gnu                      503772           SUCCEEDED                   0         1         607.0
197001010000    datm_cdeps_control_cfsr_gnu                      503777           SUCCEEDED                   0         1         110.0

Development

Successfully merging this pull request may close this issue: Porting to Ursa (Hera replacement)