
Release/rrfsv1.0 #497

Conversation

@MarcelCaron-NOAA (Contributor) commented Jul 11, 2024

Pull Request Testing

This PR concerns changes in the EVS that are needed to run RRFS (det) verification.

NOTE: As it stands, no planned cam-specific EVSv2 changes have been applied to the new or edited scripts in this PR (e.g., adding restart capabilities to firewxnest jobs, Bugzilla fixes). However, top-level EVSv2 changes are included (e.g., version numbers, imports).

  • Describe testing already performed for this Pull Request:

The jobs listed below have been running in the cron for three weeks. Output logs have been checked for the "check" keywords listed in the "(4) Verifying jobs" section below, and hits were investigated and corrected if needed. Output statistics were compared with ops output or rrfs-para output. The currently running jobs are clean.

  • Recommend testing for the reviewer(s) to perform, including the location of input datasets, and any additional instructions:

Since EVS-RRFS will likely be implemented after EVSv2 is implemented, we only need to test the single EVSv2 configuration of each job.

(1) Set up jobs

• symlink the EVS_fix directory locally as "fix"

In all driver scripts...
• ... point all exports of "HOMEevs" to your EVS test directory
• ... change COMIN to my test directory (/lfs/h2/emc/vpppg/noscrub/marcel.caron/test/evs_release_rrfsv1.0/$NET/$evs_ver_2d)
• ... change COMOUT to your test COMOUT directory
• ... change DATAROOT to your test DATA root directory
• ... add the following line to developer settings:
export COMINrrfs=/lfs/h2/emc/ptmp/emc.lam/rrfs/${rrfs_ver}/prod

In stats and plots driver scripts...
• ... set SENDMAIL to "NO" (or adjust MAILTO)
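For reference, here is a minimal sketch of these setup steps consolidated in one place (every /path/to/your/... value is a placeholder for your own test directories; the COMIN and COMINrrfs paths are taken from the instructions above):

ln -s /path/to/EVS_fix fix                                   # one-time: symlink the EVS fix directory as "fix" in your EVS test directory (source path is a placeholder)
export HOMEevs=/path/to/your/EVS_test_dir                    # your EVS test directory
export COMIN=/lfs/h2/emc/vpppg/noscrub/marcel.caron/test/evs_release_rrfsv1.0/$NET/$evs_ver_2d
export COMOUT=/path/to/your/test/comout                      # your test COMOUT directory
export DATAROOT=/path/to/your/test/dataroot                  # your test DATA root directory
export COMINrrfs=/lfs/h2/emc/ptmp/emc.lam/rrfs/${rrfs_ver}/prod
export SENDMAIL=NO                                           # stats and plots drivers only (or adjust MAILTO)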

(2) I recommend testing the following jobs:

jevs_cam_rrfs_precip_prep
jevs_cam_rrfs_severe_prep
jevs_cam_severe_prep
jevs_cam_rrfs_firewxnest_grid2obs_stats
jevs_cam_nam_firewxnest_grid2obs_stats
jevs_cam_rrfs_grid2obs_stats
jevs_cam_rrfs_precip_stats
jevs_cam_rrfs_radar_stats
jevs_cam_rrfs_severe_stats
jevs_cam_rrfs_snowfall_stats
jevs_cam_hireswarw_snowfall_stats
jevs_cam_hireswarwmem2_snowfall_stats
jevs_cam_hireswfv3_snowfall_stats
jevs_cam_hrrr_snowfall_stats
jevs_cam_namnest_snowfall_stats
jevs_cam_grid2obs_plots
jevs_cam_precip_plots
jevs_cam_snowfall_plots
jevs_cam_firewxnest_grid2obs_plots

The following jobs exist in develop; however, I don't think they are being run operationally. Nevertheless, they have been updated for RRFS. I'll defer to code managers on whether they should be tested and included in this PR.

jevs_cam_radar_nbrcnt_plots_last31days
jevs_cam_radar_nbrcnt_plots_last90days
jevs_cam_radar_nbrctc_plots_last31days
jevs_cam_radar_nbrctc_plots_last90days
jevs_cam_severe_nbrcnt_plots_last31days
jevs_cam_severe_nbrcnt_plots_last90days
jevs_cam_severe_nbrctc_plots_last31days
jevs_cam_severe_nbrctc_plots_last90days

[Total: 3 prep jobs; 12 stats jobs; 12 plots jobs]

(3) Running jobs

Most jobs need to be fed "vhr".
• Apart from the exceptions listed below, all driver scripts can be submitted using qsub -v vhr=00 $driver
• For jevs_cam_severe_prep, use qsub -v vhr=07 $driver
• For jevs_cam_rrfs_severe_stats, use qsub -v vhr=08 $driver
• For jevs_cam_rrfs_grid2obs_stats, use qsub -v vhr=02 $driver
• For jevs_cam_rrfs_precip_stats, use qsub -v vhr=21,VDATE=$PDYm3 $driver where $PDYm3 is set to three days before present
• For jevs_cam_rrfs_precip_plots, use qsub -v vhr=00,VDATE=$PDYm3 $driver where $PDYm3 is set to three days before present
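For example, a sketch of the precip stats submission with the valid date set to three days before present (the driver path and file name are placeholders, and this assumes GNU date is available on the system):

PDYm3=$(date -d "3 days ago" +%Y%m%d)
qsub -v vhr=21,VDATE=$PDYm3 /path/to/your/drivers/jevs_cam_rrfs_precip_stats_driver.sh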

When to submit jobs:
• all prep jobs: on the valid day, anytime after the "vhr" hour fed to the job scheduler
• severe stats jobs: on the valid day, anytime after the "vhr" hour fed to the job scheduler
• all other jobs: anytime

(4) Verifying jobs

Log files should be checked by the developer for the following keywords:

check="FATAL\|WARNING\|error\|Killed\|Cgroup\|argument expected\|No such file\|cannot\|failed\|unexpected\|exceeded"
grep "$check" $logfile

Additionally, output statistics, plot tar balls, and other major files should be compared to relevant archives (EVS parallel, RRFS routine verification, etc.).

  • Has the code been checked to ensure that no errors occur during the execution? [Yes (see note)]

Note: Some jobs encounter a METplus WARNING while computing VL1L2 statistics. Such instances are normal, and the issue is currently being addressed by the DTC (dtcenter/MET#2395) by changing the message type from WARNING to DEBUG.

  • Do these updates/additions include sufficient testing updates? [Yes]

  • Please complete this pull request review by: [see NOTE below].

    NOTE: Due to uncertainties in the RRFS and REFS timelines, I'll defer to code managers about the prioritization of this PR.

Pull Request Checklist

  • Review the source issue metadata (required labels, projects, and milestone).

  • Complete the PR description above.

  • Ensure the PR title matches the feature branch name.

  • Check the following:

  • Instructions provided on how to run

  • Developer's name is replaced by ${user} where necessary throughout the code

  • Check that the ecf file has all the proper definitions of variables

  • Check that the jobs file has all the proper settings of COMIN and COMOUT and other input variables

  • Check to see that the output directory structure is followed

  • Be sure that you are not using MET utilities outside the METplus wrapper structure

  • After submitting the PR, select Development issue with the original issue number.

  • After the PR is approved, merge your changes. If permissions do not allow this, request that the reviewer do the merge.

  • Close the linked issue.

@MarcelCaron-NOAA (Contributor Author)

I'm planning to add/update the ECF scripts tomorrow. I figured this could be submitted in advance since they are not involved in testing.

@ShelleyMelchior-NOAA (Contributor)

Hi Marcel! I have cloned your fork and checked out branch release/rrfsv1.0.

@MarcelCaron-NOAA (Contributor Author) commented Aug 6, 2024

@ShelleyMelchior-NOAA My fault: I used lower-case letters rather than upper-case to define EVAL_PERIOD. The fix I just pushed corrects that. Should be good to go

@ShelleyMelchior-NOAA (Contributor)

> @ShelleyMelchior-NOAA Looks like we are sync'd already 👍

Ah! OK! I was looking at it compared to develop. My mistake.

@ShelleyMelchior-NOAA (Contributor)

> @ShelleyMelchior-NOAA My fault: I used lower-case letters rather than upper-case to define EVAL_PERIOD. The fix I just pushed corrects that. Should be good to go

I see corrections to the severe drivers. Do the radar drivers also need this correction?

@MarcelCaron-NOAA (Contributor Author)

> @ShelleyMelchior-NOAA My fault: I used lower-case letters rather than upper-case to define EVAL_PERIOD. The fix I just pushed corrects that. Should be good to go
>
> I see corrections to the severe drivers. Do the radar drivers also need this correction?

Yup, they do. 🤦‍♂️
I just pushed the same fix for radar. Both radar and severe jobs (all 8) should be good to go.

@ShelleyMelchior-NOAA (Contributor)

Great! Things look much better. I'll let you expand the tarballs for review.

COMOUT=/lfs/h2/emc/vpppg/noscrub/shelley.melchior/evs/v2.0/plots/cam/atmos.20240730 (severe)
COMOUT=/lfs/h2/emc/vpppg/noscrub/shelley.melchior/evs/v2.0/plots/cam/atmos.20240804 (radar)
logfiles: /lfs/h2/emc/ptmp/shelley.melchior/EVS_out/jevs_cam_*_nbrc*_plots_last*days_00.o147064*

❗❗Go ahead and make the typo fix to the ecf scripts, too.

@MarcelCaron-NOAA (Contributor Author) commented Aug 6, 2024

@ShelleyMelchior-NOAA The following jobs completed cleanly and output are as expected:
✔️ jevs_cam_severe_nbrcnt_plots_last31days
✔️ jevs_cam_severe_nbrcnt_plots_last90days
✔️ jevs_cam_severe_nbrctc_plots_last31days
✔️ jevs_cam_severe_nbrctc_plots_last90days
✔️ jevs_cam_radar_nbrcnt_plots_last31days

The following are almost there, but I needed to bump up the ncpu:
🏁 jevs_cam_radar_nbrcnt_plots_last90days
🏁 jevs_cam_radar_nbrctc_plots_last31days
🏁 jevs_cam_radar_nbrctc_plots_last90days

@AndrewBenjamin-NOAA left a comment

I think this looks good; great work and conversation by all getting this PR ready. I had some questions, but I don't think there is anything that will hold up the PR. The comments are mostly for my own clarification, as someone new to EVS.

In particular, I was curious about the differences in the firewx MODEL/MODELNAME nomenclature throughout the workflow.

@@ -138,7 +139,7 @@ fi
####################################
if [ $VERIF_CASE = radar ] || [ $VERIF_CASE = severe ]; then
$HOMEevs/scripts/${STEP}/${COMPONENT}/exevs_${COMPONENT}_${VERIF_CASE}_${STEP}.sh
-elif [ $MODELNAME = nam_firewxnest ]; then
+elif [ $MODELNAME = nam_firewxnest ] || [ $MODELNAME = rrfs_firewxnest ]; then


The driver scripts and ecf scripts have the MODELNAME changed to firewxnest. Where do the nam_firewxnest and rrfs_firewxnest overwrite this in the jobcard, for example in JEVS_CAM_STATS?

@MarcelCaron-NOAA (Contributor Author)

Yes, MODELNAME is changed to "firewxnest", but only for the firewxnest plots job. For the two firewxnest stats jobs (nam_firewxnest and rrfs_firewxnest), the MODELNAME is set to "nam_firewxnest" and "rrfs_firewxnest" respectively. That's why in this JEVS_CAM_STATS J-job script, the MODELNAME is able to match with either name when firewx stats jobs are run.

In this PR, the nam_firewxnest plots job was expanded to include rrfs_firewxnest, so the MODELNAME was generalized to just "firewxnest". The nam_firewxnest stats job was not expanded, but an additional "rrfs_firewxnest" stats job was added.
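To illustrate, a sketch based on this thread (the export lines paraphrase the driver/ecf settings described above and are not verbatim; only the elif comes from the diff):

# stats drivers/ecf scripts each set a single model, e.g.
#   export MODELNAME=nam_firewxnest    (nam firewx stats job)
#   export MODELNAME=rrfs_firewxnest   (rrfs firewx stats job)
# while the plots driver/ecf script sets the broader name:
#   export MODELNAME=firewxnest
# so the JEVS_CAM_STATS J-job can branch on either stats name:
if [ $VERIF_CASE = radar ] || [ $VERIF_CASE = severe ]; then
    $HOMEevs/scripts/${STEP}/${COMPONENT}/exevs_${COMPONENT}_${VERIF_CASE}_${STEP}.sh
elif [ $MODELNAME = nam_firewxnest ] || [ $MODELNAME = rrfs_firewxnest ]; then
    : # the firewx-specific ex-script is invoked here in the actual J-job
fi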

@@ -26,7 +26,7 @@ export RUN_GRID2OBS_PLOTS="NO"
#model_list: model names
#model_stat_dir_list: directory path to model .stat files
#model_file_format_list: file format of model files
export model_list="NAM_FIREWXNEST"
export model_list="NAM_FIREWXNEST, RRFS_FIREWXNEST"


Similar question as before, can you provide context as to why the MODELNAME is changed in the high level scripts to remove specific models, then added in other areas of the code?

@MarcelCaron-NOAA (Contributor Author)

Here, plots was expanded to include rrfs_firewxnest so that graphics include both nam_firewxnest and rrfs_firewxnest lines. Thus, "MODELNAME" was broadened from "nam_firewxnest" to "firewxnest", and then the list of models to plot was expanded to include rrfs_firewxnest.

@@ -77,7 +77,7 @@ set -x
# Case type settings are used to check if settings are allowed. **
export VERIF_CASE="grid2obs"
export VERIF_TYPE="conus_sfc"
export MODELS="nam_firewxnest"
export MODELS="nam_firewxnest, rrfs_firewxnest"


Similar question as above; just wondering about the context here and the differences between the model strings used at different parts of the EVS workflow.

@MarcelCaron-NOAA (Contributor Author)

I'll add here re: the different model strings used at different parts of the EVS workflow: I'm not sure that "MODELNAME" is an NCO standard, but I believe it's an EVS standard; each job has a MODELNAME defined in the driver and ecf script. For plots jobs, this is often set to something like "cam" because multiple models are being plotted, whereas for stats it will usually be set to a single model (e.g., "hrrr"). Elsewhere in the code, other model lists like the one linked here specify things like what to plot, or what stats to look for before plotting.
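A paraphrased sketch of that convention (illustrative values only, not taken from any particular EVS script):

# a stats driver names a single model:
export MODELNAME=hrrr
# a plots driver names the component, and a separate config list names the models to plot:
export MODELNAME=cam
export model_list="HRRR, NAMNEST"   # hypothetical list, following the format shown in the diff above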

@ShelleyMelchior-NOAA (Contributor)

> The following are almost there, but I needed to bump up the ncpu: 🏁 jevs_cam_radar_nbrcnt_plots_last90days 🏁 jevs_cam_radar_nbrctc_plots_last31days 🏁 jevs_cam_radar_nbrctc_plots_last90days

I will re-run w/ the updated ncpu.

@MarcelCaron-NOAA (Contributor Author)

@ShelleyMelchior-NOAA The final three jobs completed cleanly and output look normal:
✔️ jevs_cam_radar_nbrcnt_plots_last90days
✔️ jevs_cam_radar_nbrctc_plots_last31days
✔️ jevs_cam_radar_nbrctc_plots_last90days

👍 No other concerns about this PR from me!

@ShelleyMelchior-NOAA (Contributor) left a comment

This PR looks good and is approved for merge.

@malloryprow (Contributor) commented Aug 7, 2024

nproc was increased to 128 in the job specifications, but not further down in the driver script. In the log file for this job, is the mpi/cfp command using only 64? If so, we shouldn't request resources we aren't using. This seems to be common across the drivers, so please check the others that don't match too.
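As an illustration of the mismatch being described (PBS directive and values are hypothetical, not quoted from any driver):

#PBS -l select=1:ncpus=128:mem=500GB     # cores requested from the scheduler
export nproc=64                          # cores actually used by the mpi/cfp command further down
# these two should agree; otherwise the job reserves cores it never uses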

Contributor

The header info here is outdated.

@malloryprow (Contributor) commented Aug 7, 2024

I see the usages of cpreq here. I thought in one of the EE2 reviews we removed usages of that. I'm not sure I remember the reason why. Am I remembering wrong? @AliciaBentley-NOAA @PerryShafran-NOAA @ShelleyMelchior-NOAA

Contributor

Header info here is outdated.

@malloryprow (Contributor)

Lots of changes here, but I'll summarize:

  1. Concerns about requesting more resources than are being used in some drivers/ecf scripts. The resources being requested and the nproc setting differ in some. When @MarcelCaron-NOAA returns, I would like him to confirm that the resources being requested and the resources being used match and that we are not requesting more than we are actually using.
  2. I noticed the usage of cpreq. I thought I remembered that in one of the EE2 reviews we had to switch all usages of cpreq to just cp. I may be remembering wrong because I can't recall the reason we were told to switch. If we aren't supposed to be using it, I'd like cpreq usage across cam to be reviewed.
  3. Some brand-new ex-scripts had very outdated information, like Logan being the author, which is not true. I want that to be reviewed and corrected.

These can all be corrected and new PRs submitted if necessary, so I will approve this.

@malloryprow (Contributor) left a comment

Approved but please see my comments here and in the PR conversation and submit new PRs if necessary to address concerns.

@ShelleyMelchior-NOAA ShelleyMelchior-NOAA merged commit 3ef7e72 into NOAA-EMC:feature/rrfs_refs_v1 Aug 7, 2024
@malloryprow malloryprow added this to the EVS v2.1.0 milestone Aug 30, 2024
@MarcelCaron-NOAA MarcelCaron-NOAA deleted the release/rrfsv1.0 branch December 2, 2024 21:38