Updates to the NCO mode #443

danielabdi-noaa · 2022-10-29T10:32:22Z

DESCRIPTION OF CHANGES:

This PR is an update of NCO mode with the following changes:

Currently there is no way to install only the binaries in HOMEdir, because dependencies like GSI do not have an option to install, for example, include in a separate directory from bin. A temporary solution, thanks to @MatthewPyle-NOAA, is to install in the build directory and move exec back to $HOMEdir, so that the rest (share, lib, lib64, include) do not pollute HOMEdir.
```
./devbuild.sh -p=hera --install-dir=$PWD/build --move
```
If you want some time to inspect the build before moving exec to HOMEdir, make this a two-step process:
```
./devbuild.sh -p=hera --install-dir=$PWD/build
./devbuild.sh -p=hera --install-dir=$PWD/build --continue --move
```
Fix bug in calling setpdy without COMROOT set.
Fix bug in running post simultaneously with the forecast.
Remove use of the DATA_SHARED directory; each task now uses its own temp directory.
Make sure KEEPDATA=FALSE works properly. With KEEPDATA=FALSE, the $DATA directory is now left clean.
Fixes issue "community_ensemble_008mems tests fail in NCO mode" (#442) by undoing the for_ICS/LBCS addition to NCO mode. The COMIN directory still has $cyc attached to it, so it should be fine for AQM?
Add templated data source paths that use compath.py for wcoss2 (needs testing). The paths are set like this in wcoss2.yaml:
```
FV3GFS: compath.py ${envir}/gfs/${gfs_ver}/gfs.${PDY}
```
I am hoping the template variable mechanism will work this way.
First attempt at issue "Optimize fundamental test suite coverage" (#445). About 40 test cases are now covered compared to the previous 9, and the tests should finish in about the same time. This is comparable to a comprehensive test IMO. Note that hera is running entirely in nco mode, so there are some repeat test cases. Let me know if you have suggestions on how to optimize this better.
Potentially solves issue "get_from_HPSS* test cases fail" (#349). The get_extrn_ics/lbcs tasks are assigned 1 core but have no minimum memory requirement. Sometimes these tasks fail with a state DEAD (OUT_OF_MEMORY) message, which I am hoping will be solved by adding a <memory>4G</memory> requirement to the rocoto xml file. Comparing the second and third jet runs, it looks like this issue may be solved.
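
As a rough illustration of the memory requirement mentioned in the last item, here is a minimal Python sketch that adds a `<memory>` element to selected tasks of a rocoto-style workflow. The XML string is a hypothetical, stripped-down stand-in (a real workflow file has many more elements), and `add_memory` is an illustrative helper, not code from this PR.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal stand-in for a rocoto workflow definition.
WORKFLOW = """<workflow>
  <task name="get_extrn_ics"><cores>1</cores></task>
  <task name="get_extrn_lbcs"><cores>1</cores></task>
  <task name="run_fcst"><nodes>2:ppn=24</nodes></task>
</workflow>"""


def add_memory(root, task_names, memory="4G"):
    """Add <memory> to the named tasks unless they already request memory."""
    for task in root.iter("task"):
        if task.get("name") in task_names and task.find("memory") is None:
            ET.SubElement(task, "memory").text = memory
    return root


root = add_memory(ET.fromstring(WORKFLOW), {"get_extrn_ics", "get_extrn_lbcs"})
print(ET.tostring(root, encoding="unicode"))
```

Only the serial get_extrn_ics/lbcs tasks receive the request here; the forecast task is left untouched because it is allocated by nodes rather than by a single core.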

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

TESTS CONDUCTED:

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

My code follows the style guidelines in the Contributor's Guide
I have performed a self-review of my own code using the Code Reviewer's Guide
I have commented my code, particularly in hard-to-understand areas
My changes need updates to the documentation. I have made corresponding changes to the documentation
My changes do not require updates to the documentation (explain).
My changes generate no new warnings
New and existing tests pass with my changes
Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

CONTRIBUTORS (optional):

@MatthewPyle-NOAA

venitahagerty · 2022-10-29T15:36:39Z

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1103924331/20221029152015/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
All experiments completed

venitahagerty · 2022-10-29T15:54:11Z

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1103924331/20221029152014/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
All experiments completed

MatthewPyle-NOAA · 2022-11-04T16:59:07Z

@danielabdi-noaa Noticed an issue on WCOSS2 in machine/wcoss2.yaml.

FV3GFS: compath.py ${envir}/gfs/${fv3gfs_ver}/gfs.${PDY}

fv3gfs_ver isn't defined in the WCOSS2 versions file. Could you change fv3gfs_ver to gfs_ver here?
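
A quick way to catch this kind of mismatch is to compare the variables a template references against the ones actually defined. The sketch below uses Python's `string.Template` purely as an illustration; it is not the template mechanism the workflow uses, `undefined_vars` is a hypothetical helper, and the values in `versions` are made up.

```python
from string import Template


def undefined_vars(template, defined):
    """Return the ${...} names used in `template` that are missing from `defined`."""
    names = set()
    for match in Template.pattern.finditer(template):
        name = match.group("named") or match.group("braced")
        if name:
            names.add(name)
    return sorted(names - set(defined))


# Hypothetical values standing in for the WCOSS2 versions file and job environment.
versions = {"envir": "prod", "gfs_ver": "v16.3", "PDY": "20221104"}

before = "compath.py ${envir}/gfs/${fv3gfs_ver}/gfs.${PDY}"
after = "compath.py ${envir}/gfs/${gfs_ver}/gfs.${PDY}"

print(undefined_vars(before, versions))  # ['fv3gfs_ver']
print(undefined_vars(after, versions))   # []
print(Template(after).substitute(versions))
```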

A longer term goal might be to change all FV3GFS references to GFS. This distinction might have made sense shortly after the GFS began running with the FV3 model, but shouldn't be necessary now.

MichaelLueken · 2022-11-04T17:16:01Z

@danielabdi-noaa The Jenkins CI tests on Gaea are now passing. I wish I knew why the machine wasn't accepting values for --mem=x. I submitted several tests yesterday afternoon and this morning. The only value that worked was using 0 (including 0G). This removed the --mem option from the launcher.

I will launch the Orion tests, then the Cheyenne tests.

danielabdi-noaa · 2022-11-04T17:37:24Z

@MichaelLueken Yes, I think 0G allocates all available memory on the node to the serial task, but most likely shared rather than exclusive. The machine has these limits, though, so it may not be the entire node's memory:

DefMemPerCPU=2048 MaxMemPerCPU=4096

christinaholtNOAA

@danielabdi-noaa I think the code looks good.

I left a few minor comments below, along with a couple of clarification questions. Nothing that should hold up progress, though.

christinaholtNOAA · 2022-11-04T17:58:54Z

scripts/exregional_run_fcst.sh

-fi
-#
-#-----------------------------------------------------------------------
-#


This is the part that means we are leaving these files in place instead of putting them in COM, right?

Yes. Forecast outputs remain in DATAROOT/run_fcst.unique_id but are accessible since this directory is shared. The RESTART file is also in there and is also accessible if it is needed by later cycles.

christinaholtNOAA · 2022-11-04T18:00:36Z

tests/WE2E/machine_suites/fundamental.cheyenne

+grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
+pregen_grid_orog_sfc_climo
+custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
+custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_TRUE


These added machine_suite files seem outside the scope of NCO compliance. What am I missing?

ush/generate_FV3LAM_wflow.py

ush/setup.py

christinaholtNOAA · 2022-11-04T18:16:08Z

ush/job_preamble.sh

@@ -104,9 +114,28 @@ export -f POST_STEP
 if [ "${RUN_ENVIR}" = "nco" ]; then
    export COMIN="${COMIN_BASEDIR}/${RUN}.${PDY}/${cyc}"
    export COMOUT="${COMOUT_BASEDIR}/${RUN}.${PDY}/${cyc}"
+    export COMINaws="${AWSROOT}/${RUN}.${PDY}/${cyc}"


I'm curious what AWSROOT is? Is it already documented somewhere? Is this AWS different from Amazon Web Services?

It is being used as a holding spot for input files being brought into the system. It has to do with some of the recent discussion/chatter about the GFS files not belonging in RRFS com space.
Maybe COMINexternal/EXTERNALROOT would be more general and less likely to cause confusion?

I will rename them to COMINext, EXTROOT and OPSROOT/ext, if that is OK? When we use wcoss2 sources (COMINgfs, COMINhrrr, etc.), EXTROOT will just symlink the files; it serves as the place to communicate the ICS/LBCS data to the make_ics/lbcs tasks.

That sounds good to me, @danielabdi-noaa

I think that is a nicer alternative given that we would certainly have external sources that are not aws. Thanks @danielabdi-noaa and @MatthewPyle-NOAA

MatthewPyle-NOAA · 2022-11-04T18:19:06Z

More changes are needed for WCOSS2 - use of compath.py throws an error since prod_util isn't being loaded. Can you:

Add export prod_util_ver=2.0.14 to the versions/run.ver.wcoss2 file?

And use it in the modulefiles/tasks/wcoss2/*.lua definitions?

And another item that seems to have been an on again, off again issue is that GRIB output isn't going to com as it is generated.

danielabdi-noaa · 2022-11-04T20:27:47Z

@MichaelLueken Orion is currently under maintenance and I think the tests are stuck, so it is best to stop them. In any case, those 5 test cases are ones that ran successfully on Orion before, so they should work once the Orion maintenance is over.

MichaelLueken · 2022-11-04T20:33:44Z

@danielabdi-noaa I've been testing nco mode on Hera manually. Once the tests there pass, I will approve (they had been failing due to COMINaws remnants in the preamble).

MichaelLueken

@danielabdi-noaa The manual tests on Hera successfully passed. I am now approving this PR.

danielabdi-noaa · 2022-11-05T00:05:24Z

@MichaelLueken I have run the comprehensive test cases (83 of them) on Hera in NCO mode. All have succeeded except 4, which are known failures. @MatthewPyle-NOAA The grib2 files are all in COMIN, but I do recall having the issue you mentioned when post-processing failed. Here is the OPSROOT directory on Hera for the new run:

/scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT-nco-2

Let me know if that still happens. I am going to merge this now.

mkavulich · 2022-12-03T01:52:33Z

tests/WE2E/run_WE2E_tests.sh

  if [ -n "${test_type}" ] ; then
-    # Check for a pre-defined set. It could be machine dependent or not.
-    user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}
+    # Check for a pre-defined set. It could be machine dependent or has the mode
+    # (community or nco), or default
+    user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}.nco
    if [ ! -f ${user_spec_tests_fp} ]; then
-        user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}
+        user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}.com
+        if [ ! -f ${user_spec_tests_fp} ]; then
+            user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}.${compiler}
+            if [ ! -f ${user_spec_tests_fp} ]; then
+                user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}
+                if [ ! -f ${user_spec_tests_fp} ]; then
+                    user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}
+                fi
+            fi
+        else
+            run_envir=${run_envir:-"community"}
+        fi
+    else
+        run_envir=${run_envir:-"nco"}


@danielabdi-noaa I hadn't checked on this PR before it was merged because I thought it would only change NCO mode running, but this seems to have introduced the additional change that the fundamental test suite on Hera will always be run in NCO mode. Was this an intentional change? If so, what was the reasoning?

@mkavulich Yes, it is intentional. The goal for the fundamental test suites has shifted with this PR; now it is more about expanded test coverage and variety. I chose Hera for NCO mode so that it would be tested with the GitHub Actions CI as well, which runs only on Hera/Jet. The fundamental test suite is now running different test cases on each platform and for each compiler (gnu/intel). Last time I checked, it was close to being "comprehensive" with 45 test cases or so.

The logic in this code is kind of convoluted and if you find a better way to rename the files while keeping the flexibility, that would be great!
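
One possible way to flatten the nested checks in the diff above is an ordered candidate list, sketched here in Python rather than in the bash of run_WE2E_tests.sh. `find_suite_file` and its arguments are hypothetical names; the lookup order and the default run_envir values are taken from the diff, and nothing here is code from this PR.

```python
from pathlib import Path


def find_suite_file(suites_dir, test_type, machine, compiler, run_envir=None):
    """Return (suite file path, run_envir) using the lookup order from the diff.

    A caller-supplied run_envir is kept; otherwise it defaults to "nco" or
    "community" when the matching .nco or .com suite file is the one found.
    """
    # (file name, run_envir default implied by that file), most specific first.
    candidates = [
        (f"{test_type}.{machine}.nco", "nco"),
        (f"{test_type}.{machine}.com", "community"),
        (f"{test_type}.{machine}.{compiler}", None),
        (f"{test_type}.{machine}", None),
    ]
    for name, default_envir in candidates:
        path = Path(suites_dir) / name
        if path.is_file():
            return path, run_envir or default_envir
    # Final fallback: the machine-independent list, returned without checking
    # that it exists, as in the innermost branch of the diff.
    return Path(suites_dir) / test_type, run_envir
```

For example, `find_suite_file("machine_suites", "fundamental", "hera", "intel")` would pick `fundamental.hera.nco` if that file exists and report `nco` as the run environment. Adding a new suffix then means adding one entry to the list instead of another level of nesting.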

Thanks for getting back to me so quickly. I may try to bring this up for discussion at the next code management meeting, because this feels like it may be confusing to developers testing their changes on a specific machine. Although perhaps a better topic for discussion would be the broader "which WE2E tests should developers be running to test their code prior to opening a PR", since this is not yet covered in the contributors guide, and clearly I didn't even understand what tests were being run when I ran the script before now!

danielabdi-noaa changed the title ~~Feature/nco part 2~~ Updates to the NCO mode Oct 29, 2022

danielabdi-noaa force-pushed the feature/nco_part_2 branch 4 times, most recently from 06543eb to c089041 Compare October 29, 2022 15:07

danielabdi-noaa added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 29, 2022

venitahagerty removed ci-jet-intel-WE Kicks off automated workflow test on jet with intel ci-hera-intel-WE Kicks off automated workflow test on hera with intel labels Oct 29, 2022

danielabdi-noaa marked this pull request as ready for review October 29, 2022 17:23

danielabdi-noaa added 4 commits November 3, 2022 20:03

Fix calculate_cost.

e45182b

Fix gaea test list.

dc2b784

Remove a test case from cheyenne that is taking extremely long.

35dfd6e

Reduce memory requirement for serial job to 1G.

2202842

danielabdi-noaa force-pushed the feature/nco_part_2 branch from 4999244 to 2202842 Compare November 3, 2022 20:03

danielabdi-noaa added 6 commits November 3, 2022 22:58

Create COMINaws ics/lbcs staging directory.

b38aa18

Also symlink gfs ics/lbcs if on disk and in community mode.

d3dc8be

Remove failing tests on orion.

1a890d4

Exclude gaea from --mem specification.

0c61ce2

Add a run_vx modulefile for orion.

05741e2

Add separate fundamental list for cheyenne GNU runs.

270fe53

Bug fix for wcoss2 GFS version

9e21472

christinaholtNOAA approved these changes Nov 4, 2022

Load prod_util on wcoss2.

42ffaea

danielabdi-noaa force-pushed the feature/nco_part_2 branch from ab24f71 to 42ffaea Compare November 4, 2022 18:33

danielabdi-noaa added 2 commits November 4, 2022 13:45

Rename AWSROOT to EXTROOT.

1dce0de

Remove leftover COMINaws.

0fad99c

danielabdi-noaa force-pushed the feature/nco_part_2 branch from 8ae97e8 to 0fad99c Compare November 4, 2022 19:39

MichaelLueken approved these changes Nov 4, 2022

danielabdi-noaa merged commit 474ab7d into ufs-community:develop Nov 5, 2022

mkavulich reviewed Dec 3, 2022

EdwardSnyder-NOAA mentioned this pull request Dec 5, 2022

[develop] Remove Slurm Memory Option for noaacloud #506

Merged

37 tasks

MichaelLueken mentioned this pull request Sep 11, 2023

[develop] Component hash updates, FV3_HRRR namelist changes #906

Merged

25 tasks
