Updates to the NCO mode #443

Merged

Conversation

danielabdi-noaa
Collaborator

@danielabdi-noaa danielabdi-noaa commented Oct 29, 2022

DESCRIPTION OF CHANGES:

This PR is an update of NCO mode with the following changes

  • Currently there is no way to install only the binaries in HOMEdir, because dependencies like GSI have no option to install, for example, include in a directory separate from bin. A temporary solution, thanks to @MatthewPyle-NOAA, is to install in the build directory and move exec back to $HOMEdir so that the rest (share, lib, lib64, include) do not pollute HOMEdir.
    ./devbuild.sh -p=hera --install-dir=$PWD/build --move
    
    If you want time to inspect the build before moving exec to HOMEdir, make this a two-step process:
    ./devbuild.sh -p=hera --install-dir=$PWD/build
    ./devbuild.sh -p=hera --install-dir=$PWD/build --continue --move
    
  • Fix bug in calling setpdy without COMROOT set
  • Fix bug in simultaneous run of post with forecast
  • Remove use of DATA_SHARED directory and make all tasks use one temp directory each
  • Make sure KEEPDATA=FALSE works properly. $DATA directory is now clean with the flag on.
  • Fixes issue "community_ensemble_008mems tests fail in NCO mode" (#442) by undoing the for_ICS/LBCS addition to NCO mode. The COMIN directory still has $cyc attached to it, so it should be fine for AQM?
  • Add templated data source paths that use compath.py for wcoss2 (needs testing). The paths are set like this in wcoss2.yaml
    FV3GFS: compath.py ${envir}/gfs/${gfs_ver}/gfs.${PDY}
    
    I am hoping the template variable mechanism will work this way.
  • First attempt at issue "Optimize fundamental test suite coverage" (#445). About 40 test cases are now covered compared to the previous 9, and the tests should finish in about the same time. This is comparable to a comprehensive test IMO. Note that Hera is running entirely in NCO mode, so there are some repeated test cases. Let me know if you have suggestions on how to optimize this better.
  • Potentially solves issue "get_from_HPSS* test cases fail" (#349). The get_extrn_ics/lbcs tasks are assigned 1 core but have no minimum memory requirement. Sometimes these tasks fail with a state DEAD (OUT_OF_MEMORY) message, which I am hoping will be solved by adding a <memory>4G</memory> requirement to the rocoto xml file. Comparing the second and third Jet runs, it looks like this issue may be solved.
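
The install-then-move workaround from the first bullet can be sketched roughly as follows. This is only an illustration of the idea, not the code in devbuild.sh: the function name `move_exec_to_home` is hypothetical, and it assumes the binaries end up in an `exec` subdirectory of the install directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from devbuild.sh): after installing into a
# separate directory, move only exec/ into the home directory so that
# share, lib, lib64 and include stay behind in the install directory.
move_exec_to_home() {
  local install_dir="$1"   # e.g. $PWD/build
  local home_dir="$2"      # e.g. $HOMEdir
  if [ ! -d "${install_dir}/exec" ]; then
    echo "no exec directory under ${install_dir}" >&2
    return 1
  fi
  # Drop any stale exec directory, then move the freshly built one in place.
  rm -rf "${home_dir:?}/exec"
  mv "${install_dir}/exec" "${home_dir}/exec"
}
```

Calling `move_exec_to_home "$PWD/build" "$HOMEdir"` after an inspection step would correspond to the second command of the two-step process.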

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel (comprehensive test)
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted
    Since this PR also changes Jenkins (increases the number of tests from 9 to 40), I need help running Jenkins once to get information
    on load balancing and on test cases that fail (or are not reliable) so that they can be removed from the list

CONTRIBUTORS (optional):

@MatthewPyle-NOAA

@danielabdi-noaa changed the title from "Feature/nco part 2" to "Updates to the NCO mode" Oct 29, 2022
@danielabdi-noaa danielabdi-noaa force-pushed the feature/nco_part_2 branch 4 times, most recently from 06543eb to c089041 Compare October 29, 2022 15:07
@danielabdi-noaa danielabdi-noaa added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 29, 2022
@venitahagerty venitahagerty removed ci-jet-intel-WE Kicks off automated workflow test on jet with intel ci-hera-intel-WE Kicks off automated workflow test on hera with intel labels Oct 29, 2022
@venitahagerty
Collaborator

venitahagerty commented Oct 29, 2022

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1103924331/20221029152015/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
All experiments completed

@venitahagerty
Collaborator

venitahagerty commented Oct 29, 2022

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1103924331/20221029152014/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
All experiments completed

@MatthewPyle-NOAA
Collaborator

@danielabdi-noaa Noticed an issue on WCOSS2 in machine/wcoss2.yaml.

FV3GFS: compath.py ${envir}/gfs/${fv3gfs_ver}/gfs.${PDY}

fv3gfs_ver isn't defined in the WCOSS2 versions file. Could you change fv3gfs_ver to gfs_ver here?

A longer term goal might be to change all FV3GFS references to GFS. This distinction might have made sense shortly after the GFS began running with the FV3 model, but shouldn't be necessary now.
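
The undefined-variable problem above is the kind of thing a guarded expansion can catch early. A minimal sketch, assuming bash: `build_gfs_path` is a hypothetical helper that only assembles the path fragment from the wcoss2.yaml template and does not call compath.py itself.

```shell
# Hypothetical helper: builds the path fragment used in the template.
# ${var:?message} makes the shell stop with an error when the variable is
# unset or empty, instead of silently expanding it to an empty string.
build_gfs_path() {
  echo "${envir:?envir is not set}/gfs/${gfs_ver:?gfs_ver is not set}/gfs.${PDY:?PDY is not set}"
}
```

With made-up values such as `envir=prod`, `gfs_ver=v16.3` and `PDY=20221029`, the function prints `prod/gfs/v16.3/gfs.20221029`. If `gfs_ver` is missing from the versions file, it fails with "gfs_ver is not set" rather than producing a path with an empty component.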

@MichaelLueken
Collaborator

@danielabdi-noaa The Jenkins CI tests on Gaea are now passing. I wish I knew why the machine wasn't accepting values for --mem=x. I submitted several tests yesterday afternoon and this morning. The only value that worked was using 0 (including 0G). This removed the --mem option from the launcher.

I will launch the Orion tests, then the Cheyenne tests.

@danielabdi-noaa
Collaborator Author

@MichaelLueken Yes, I think 0G allocates all available memory on the node to the serial task, but most likely shared rather than exclusive. The machine has these limits, though, so it may not be the entire node's memory.

DefMemPerCPU=2048 MaxMemPerCPU=4096

Collaborator

@christinaholtNOAA christinaholtNOAA left a comment

@danielabdi-noaa I think the code looks good.

I left a few minor comments below. And a couple of clarification questions. Nothing that should hold up progress, though.

fi
#
#-----------------------------------------------------------------------
#
Collaborator

This is the part that means we are leaving these files in place instead of putting them in COM, right?

Collaborator Author

@danielabdi-noaa danielabdi-noaa Nov 4, 2022

Yes. Forecast outputs remain in DATAROOT/run_fcst.unique_id but are accessible since this directory is shared. The RESTART file is also in there and is also accessible if it is needed by later cycles.

grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
pregen_grid_orog_sfc_climo
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_TRUE
Collaborator

These added machine_suite files seem outside the scope of NCO compliance. What am I missing?

ush/generate_FV3LAM_wflow.py (resolved)
ush/setup.py (outdated, resolved)
@@ -104,9 +114,28 @@ export -f POST_STEP
if [ "${RUN_ENVIR}" = "nco" ]; then
export COMIN="${COMIN_BASEDIR}/${RUN}.${PDY}/${cyc}"
export COMOUT="${COMOUT_BASEDIR}/${RUN}.${PDY}/${cyc}"
export COMINaws="${AWSROOT}/${RUN}.${PDY}/${cyc}"
Collaborator

I'm curious what AWSROOT is? Is it already documented somewhere? Is this AWS different from Amazon Web Services?

Collaborator

@MatthewPyle-NOAA MatthewPyle-NOAA Nov 4, 2022

> I'm curious what AWSROOT is? Is it already documented somewhere? Is this AWS different from Amazon Web Services?

It is being used as a holding spot for input files being brought into the system. Has to do with some of the discussion/chatter recently about the GFS files not belonging in RRFS com space.
Maybe COMINexternal/EXTERNALROOT would be more general, and less likely to cause confusion?

Collaborator Author

I will rename them to COMINext, EXTROOT and OPSROOT/ext, if that is OK? When we use the wcoss2 sources (COMINgfs, COMINhrrr, etc.), EXTROOT will just symlink the files; it serves as the place where the ICS/LBCS data are handed over to the make_ics/make_lbcs tasks.

Collaborator

That sounds good to me, @danielabdi-noaa

Collaborator

I think that is a nicer alternative given that we would certainly have external sources that are not aws. Thanks @danielabdi-noaa and @MatthewPyle-NOAA
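
For reference, the preamble after the agreed rename might look roughly like the sketch below. It assumes that COMINext keeps the same `${RUN}.${PDY}/${cyc}` layout that COMINaws had and that EXTROOT defaults to `${OPSROOT}/ext`; the function name `set_com_dirs` is hypothetical and the merged code may differ.

```shell
# Sketch only: the directory layout and the EXTROOT default are assumptions
# based on the discussion above, not the merged implementation.
set_com_dirs() {
  # EXTROOT replaces AWSROOT as the holding area for external ICS/LBCS inputs.
  EXTROOT="${EXTROOT:-${OPSROOT}/ext}"
  export COMIN="${COMIN_BASEDIR}/${RUN}.${PDY}/${cyc}"
  export COMOUT="${COMOUT_BASEDIR}/${RUN}.${PDY}/${cyc}"
  export COMINext="${EXTROOT}/${RUN}.${PDY}/${cyc}"
}
```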

@MatthewPyle-NOAA
Collaborator

MatthewPyle-NOAA commented Nov 4, 2022

More changes are needed for WCOSS2 - use of compath.py throws an error since prod_util isn't being loaded. Can you:

Add export prod_util_ver=2.0.14 to the versions/run.ver.wcoss2 file?

And use it in the modulefiles/tasks/wcoss2/*.lua definitions?

And another item that seems to have been an on again, off again issue is that GRIB output isn't going to com as it is generated.

@danielabdi-noaa
Collaborator Author

@MichaelLueken Orion is currently under maintenance and I think the tests are stuck, so it is best to stop them. In any case, those 5 test cases are ones that ran successfully on Orion before, so they should work once the Orion maintenance is over.

@MichaelLueken
Collaborator

@danielabdi-noaa I've been testing nco mode on Hera manually. Once the tests there pass, I will approve (they had been failing due to COMINaws remnants in the preamble).

Collaborator

@MichaelLueken MichaelLueken left a comment

@danielabdi-noaa The manual tests on Hera successfully passed. I am now approving this PR.

@danielabdi-noaa
Collaborator Author

danielabdi-noaa commented Nov 5, 2022

@MichaelLueken I have run the comprehensive test cases (83 of them) on Hera in NCO mode. All have succeeded except 4, which are known failures. @MatthewPyle-NOAA The grib2 files are all in COMIN, but I do recall having the issue you mentioned when post-processing failed. Here is the OPSROOT directory on Hera for the new run

/scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT-nco-2

Let me know if that still happens. I am going to merge this now.

@danielabdi-noaa danielabdi-noaa merged commit 474ab7d into ufs-community:develop Nov 5, 2022
Comment on lines 432 to +450
if [ -n "${test_type}" ] ; then
# Check for a pre-defined set. It could be machine dependent or not.
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}
# Check for a pre-defined set. It could be machine dependent or has the mode
# (community or nco), or default
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}.nco
if [ ! -f ${user_spec_tests_fp} ]; then
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}.com
if [ ! -f ${user_spec_tests_fp} ]; then
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}.${compiler}
if [ ! -f ${user_spec_tests_fp} ]; then
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}
if [ ! -f ${user_spec_tests_fp} ]; then
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}
fi
fi
else
run_envir=${run_envir:-"community"}
fi
else
run_envir=${run_envir:-"nco"}
Collaborator

@danielabdi-noaa I hadn't checked on this PR before it was merged because I thought it would only change NCO mode running, but this seems to have introduced the additional change that the fundamental test suite on Hera will always be run in NCO mode. Was this an intentional change? If so, what was the reasoning?

Collaborator Author

@danielabdi-noaa danielabdi-noaa Dec 3, 2022

@mkavulich Yes, it is intentional. The goal for the fundamental test suites has shifted with this PR; it is now more about expanded test coverage and variety. I chose Hera for NCO mode so that it would also be tested with the GitHub Actions CI, which runs only on Hera/Jet. The fundamental test suite now runs different test cases on each platform and for each compiler (gnu/intel). Last time I checked, it was close to being "comprehensive", with 45 test cases or so.

The logic in this code is kind of convoluted and if you find a better way to rename the files while keeping the flexibility, that would be great!
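
One possible way to flatten the nested fallback is to loop over the candidate file names in priority order. The sketch below is not the code in the repository: `find_suite_file` and `suite_default_mode` are hypothetical names, and the candidate order is simply the one shown in the diff above.

```shell
# Hypothetical rewrite of the lookup: print the first existing suite file.
find_suite_file() {
  local dir="$1" test_type="$2" machine="$3" compiler="$4"
  local candidate
  for candidate in \
      "${test_type}.${machine}.nco" \
      "${test_type}.${machine}.com" \
      "${test_type}.${machine}.${compiler}" \
      "${test_type}.${machine}" \
      "${test_type}"; do
    if [ -f "${dir}/${candidate}" ]; then
      echo "${dir}/${candidate}"
      return 0
    fi
  done
  return 1
}

# Default run mode implied by the file suffix; empty when the suffix carries no mode.
suite_default_mode() {
  case "$1" in
    *.nco) echo "nco" ;;
    *.com) echo "community" ;;
    *)     echo "" ;;
  esac
}
```

Unlike the nested version, this sketch returns a non-zero status when no candidate exists instead of falling through to the plain `${test_type}` name, so the caller has to decide how to handle a missing suite file.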

Collaborator

Thanks for getting back to me so quickly. I may try to bring this up for discussion at the next code management meeting, because this feels like it may be confusing to developers testing their changes on a specific machine. Although perhaps a better topic for discussion would be the broader "which WE2E tests should developers be running to test their code prior to opening a PR", since this is not yet covered in the contributors guide, and clearly I didn't even understand what tests were being run when I ran the script before now!

Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
10 participants