-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to the NCO mode #443
Updates to the NCO mode #443
Conversation
06543eb
to
c089041
Compare
Machine: hera |
Machine: jet |
4999244
to
2202842
Compare
@danielabdi-noaa Noticed an issue on WCOSS2 in machine/wcoss2.yaml.
fv3gfs_ver isn't defined in the WCOSS2 versions file. Could you change fv3gfs_ver to gfs_ver here? A longer term goal might be to change all FV3GFS references to GFS. This distinction might have made sense shortly after the GFS began running with the FV3 model, but shouldn't be necessary now. |
@danielabdi-noaa The Jenkins CI tests on Gaea are now passing. I wish I knew why the machine wasn't accepting values for I will launch the Orion tests, then the Cheyenne tests. |
@MichaelLueken Yes, I think
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielabdi-noaa I think the code looks good.
I left a few minor comments below. And a couple of clarification questions. Nothing that should hold up progress, though.
fi | ||
# | ||
#----------------------------------------------------------------------- | ||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the part that means we are leaving these files in place instead of putting them in COM, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Forecast outputs remain in DATAROOT/run_fcst.unique_id
but are accessible since this directory is shared. The RESTART file is also in there and is also accessible if it is needed by later cycles.
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta | ||
pregen_grid_orog_sfc_climo | ||
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE | ||
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_TRUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These added machine_suite files seem outside the scope of NCO compliance. What am I missing?
ush/job_preamble.sh
Outdated
@@ -104,9 +114,28 @@ export -f POST_STEP | |||
if [ "${RUN_ENVIR}" = "nco" ]; then | |||
export COMIN="${COMIN_BASEDIR}/${RUN}.${PDY}/${cyc}" | |||
export COMOUT="${COMOUT_BASEDIR}/${RUN}.${PDY}/${cyc}" | |||
export COMINaws="${AWSROOT}/${RUN}.${PDY}/${cyc}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what AWSROOT is? Is it already documented somewhere? Is this AWS different from Amazon Web Services?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what AWSROOT is? Is it already documented somewhere? Is this AWS different from Amazon Web Services?
It is being used as a holding spot for input files being brought into the system. Has to do with some of the discussion/chatter recently about the GFS files not belonging in RRFS com space.
Maybe COMINexternal/EXTERNALROOT would be more general, and less likely to cause confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will rename them to COMINext, EXTROOT and OPSROOT/ext if that is Ok? When we use wcoss2 sources COMINgfs, COMINhrrr etc, EXTROOT will just symlink the files and is used as a place to communicate the ICS/LBCS data to the make_ics/lbcs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me, @danielabdi-noaa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is a nicer alternative given that we would certainly have external sources that are not aws. Thanks @danielabdi-noaa and @MatthewPyle-NOAA
More changes are needed for WCOSS2 - use of compath.py throws an error since prod_util isn't being loaded. Can you: Add And use it in in the modulefiles/tasks/wcoss2/*.lua definitions? And another item that seems to have been an on again, off again issue is that GRIB output isn't going to com as it is generated. |
ab24f71
to
42ffaea
Compare
8ae97e8
to
0fad99c
Compare
@MichaelLueken Orion is currently under maintenance and I think the tests are stuck so best to stop them. In any case, those 5 test cases are ones who successfully run on orion before, so they should work once orion maintenance is over. |
@danielabdi-noaa I've been testing nco mode on Hera manually. Once the tests there pass, I will approve (they had been failing due to COMINaws remnants in the preamble). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielabdi-noaa The manual tests on Hera successfully passed. I am now approving this PR.
@MichaelLueken I have run the comprhensive test case (83 of them) on Hera in NCO mode. All have exceeded except 4 which are known failures. @MatthewPyle-NOAA The grib2 files are all in COMIN but I do recall having the issue you mentioned when post processing failed. Here is the OPSROOT directory on Hera for the new run
Let me know if that still happens. I am going to merge this now. |
if [ -n "${test_type}" ] ; then | ||
# Check for a pre-defined set. It could be machine dependent or not. | ||
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine} | ||
# Check for a pre-defined set. It could be machine dependent or has the mode | ||
# (community or nco), or default | ||
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}.nco | ||
if [ ! -f ${user_spec_tests_fp} ]; then | ||
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type} | ||
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}.com | ||
if [ ! -f ${user_spec_tests_fp} ]; then | ||
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine}.${compiler} | ||
if [ ! -f ${user_spec_tests_fp} ]; then | ||
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type}.${machine} | ||
if [ ! -f ${user_spec_tests_fp} ]; then | ||
user_spec_tests_fp=${scrfunc_dir}/machine_suites/${test_type} | ||
fi | ||
fi | ||
else | ||
run_envir=${run_envir:-"community"} | ||
fi | ||
else | ||
run_envir=${run_envir:-"nco"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielabdi-noaa I hadn't checked on this PR before it was merged because I thought it would only change NCO mode running, but this seems to have introduced the additional change that the fundamental test suite on Hera will always be run in NCO mode. Was this an intentional change? If so, what was the reasoning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkavulich Yes, it is intentional. The goal for the fundamental test suites has shifted with this PR, now it is more about expanded test coverage and variety. I chose Hera for NCO mode so that it would be tested with the Github Actions CI as well, which runs only on Hera/Jet. The fundamental test suite is now running different test cases on each platform, each compiler gnu/intel. Last time I check it was close to being "comprehensive" with 45 test cases or so.
The logic in this code is kind of convoluted and if you find a better way to rename the files while keeping the flexibility, that would be great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting back to me so quickly. I may try to bring this up for discussion at the next code management meeting, because this feels like it may be confusing to developers testing their changes on a specific machine. Although perhaps a better topic for discussion would be the broader "which WE2E tests should developers be running to test their code prior to opening a PR", since this is not yet covered in the contributors guide, and clearly I didn't even understand what tests were being run when I ran the script before now!
DESCRIPTION OF CHANGES:
This PR is an update of NCO mode with the following changes
HOMEdir
because dependencies like GSI do not have the option to install for exampleinclude
in a separate directory frombin
. A temporary solution thanks to @MatthewPyle-NOAA is to install inbuild
directory and move theexec
back to$HOMEdir
so that the restshare
,lib
,lib64
,include
do not pollute theHOMEdir
.exec
toHOMEdir
, make this a two step-process byCOMROOT
setDATA_SHARED
directory and make all tasks use one temp directory eachKEEPDATA=FALSE
works properly.$DATA
directory is now clean with the flag on.community_ensemble_008mems
tests fail in NCO mode. #442 by undoing thefor_ICS/LBCS
addition to NCO mode. TheCOMIN
directory still has$cyc
attached to it so should be fine for AQM?wcoss2.yaml
get_from_HPSS*
test cases fail. #349 .get_extrn_ics/lbcs
tasks are assigned 1 core but there is no minimum memory requirement. Sometimes these tasks fail withstate DEAD (OUT_OF_MEMORY)
message, which I am hoping will be solved by adding a<memory>4G</memory>
requirement to rocoto xml file. Comparing the second and third jet runs looks like this issue maybe solved.Type of change
TESTS CONDUCTED:
DEPENDENCIES:
DOCUMENTATION:
ISSUE:
CHECKLIST
LABELS (optional):
A Code Manager needs to add the following labels to this PR:
Since this PR also changes Jenkins (increases number of tests from 9 to 40), I need help in running Jenkins once to get info
on load balancing and test cases that fail (or not reliable) so that they can be removed from the list
CONTRIBUTORS (optional):
@MatthewPyle-NOAA