
[develop] Enable deterministic verification to be run from staged forecast files #566

Merged (13 commits) on Feb 7, 2023

Conversation

@gsketefian (Collaborator) commented Jan 30, 2023

DESCRIPTION OF CHANGES:

This PR enables running of only the SRW App's deterministic verification (vx) tasks on staged forecast files from previous runs of the App. It partially resolves Issue #565 (it resolves the issue for deterministic vx but not ensemble vx).

Specific changes:

  • Update the Lua module file for the vx tasks to suppress "Logging error" messages in the vx task log files.
  • Rename the experiment variable MODEL to VX_FCST_MODEL_NAME to clarify that it is the name of the forecast model in the context of verification (this name also appears in the vx output files). This requires updates to most (all?) of the METplus configuration files and the verification ex-scripts.
  • Create the new variable VX_FCST_INPUT_BASEDIR to allow the user to specify a directory in which to look for staged forecast output (instead of running a forecast).
  • Modify the Rocoto XML template (FV3LAM_wflow.xml) so that the vx tasks' dependencies on post-processing tasks appear only when the post tasks are enabled.
  • Add a new WE2E test category subdirectory named verification in which to group all vx tests (since more vx tasks will be coming in future PRs). Move the two existing tests MET_verification and MET_ensemble_verification from wflow_features to verification, and add a new test named MET_verification_only_vx to test the capability that this PR introduces.
    Note: The new WE2E test MET_verification_only_vx requires new data, specifically post-processed forecast output from the SRW App. This data needs to be staged on each platform; currently, it is located in a personal directory on Hera.
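As a sketch, the two new variables might appear in a user's config.yaml roughly as follows (the values shown are hypothetical placeholders; per the commit notes, both variables live under a new "verification" mapping in config_defaults.yaml):

```yaml
verification:
  # Name of the forecast model in the context of verification; this name
  # also appears in the vx output files (renamed from MODEL in this PR).
  VX_FCST_MODEL_NAME: 'FV3_GFS_v16'
  # Base directory containing staged (post-processed) forecast output to
  # verify, used instead of running a new forecast.
  VX_FCST_INPUT_BASEDIR: '/path/to/staged/forecast/output'
```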

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

All tests were run on hera.intel only.

The following fundamental WE2E tests (not involving vx) were run:

  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
  • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
  • nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR

In addition, the following WE2E verification tests were run:

  • MET_ensemble_verification
  • MET_verification
  • MET_verification_only_vx (this is the new test that verifies that the vx tasks can be run with staged forecast files)

All tests were successful.

DOCUMENTATION:

This PR does require some documentation changes, but I would prefer to do that after also merging a follow-up PR for enabling vx from staged files for ensemble verification.

ISSUE:

Partially fixes Issue #565. It enables deterministic verification from staged forecast files but not yet ensemble verification.

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@michelleharrold, @willmayfield, @JeffBeck-NOAA, @mkavulich

Commits (truncated summaries from the PR timeline):

  • …isting MET vx tests to it; add new vx test that only calls the vx tasks.
  • … in dependencies only if it is activated in the workflow.
  • …UT_BASEDIR. Details below:
    1) Rename MODEL to VX_FCST_MODEL_NAME to clarify that this is the name of the forecast model in the context of verification.
    2) Create the new variable VX_FCST_INPUT_BASEDIR to allow the user to specify a directory in which to look for staged forecast output (instead of running a forecast).
    3) Place both VX_FCST_MODEL_NAME and VX_FCST_INPUT_BASEDIR in a new mapping named "verification" in config_defaults.yaml. Other vx-related workflow variables will be placed under "verification" in later PRs.
  • …ll tasks so that new external model data is not needed.
@MichaelLueken (Collaborator) left a comment:

@gsketefian Overall, the changes in this PR look good! I have provided some minor feedback on some of the changes. I was also able to run the new MET_verification_only_vx test, and it ran successfully.

(Resolved review threads on parm/FV3LAM_wflow.xml and tests/WE2E/run_WE2E_tests.sh)
# the GET_OBS_... tasks and instead specify the obs staging directories.
#
RUN_TASK_GET_OBS_CCPA: false
CCPA_OBS_DIR: '/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/DTC_ensemble_task/staged/obs/ccpa/proc'
@MichaelLueken (Collaborator):

Are there plans to make the staged obs available on other machines? Or is it planned that this test will only be able to be run on Hera?

@gsketefian (Collaborator, Author):

@MichaelLueken I was working under the impression that yes, the obs (and staged forecasts) need to be made available on other machines. I think at least Jet and Cheyenne but likely also Orion and NOAAcloud. This work is partly for the development of a community framework for testing RRFS prototypes that needs to be available to the general user, so I would think Cheyenne would be especially important. I'll also let @JeffBeck-NOAA, @michelleharrold, @willmayfield, and @mkavulich chime in.

Btw, in my next vx PR, I'm going to need to stage forecast data for an ensemble. It currently has 9 members, and I don't think it's a lot of data, but we can reduce the number of members if that becomes an issue.

@MichaelLueken (Collaborator):

Thank you very much for the explanation, @gsketefian. Shall I add this as a topic for an upcoming SRW App CM meeting, so that AUS can assist with getting this data to the NOAAcloud and other machines?

@gsketefian (Collaborator, Author):

@MichaelLueken Yes, good idea! I don't know what their procedure is, but if it's a significant effort, it might be worthwhile to first get all the data on Hera (including the ensemble vx data) and then do the transfer to other platforms in one go. That would mean I first have to get a PR into develop for ensemble vx. We can discuss on Thursday.

@mkavulich (Collaborator):

Seems like it would be easy enough to include obs data in the static data that's already staged. There are already observations staged under UFS_SRW_App/develop/obs_data/ on all platforms.

@gsketefian (Collaborator, Author) commented Feb 1, 2023:

@mkavulich Yeah, as you say, there's already a place for obs in the data directories. We need to decide where to put staged forecast data from the SRW App itself (not from other external models), including ensemble output (which will be needed for a future PR). I'm open to suggestions.

@MichaelLueken (Collaborator):

I am curious why staged forecast data would be needed. Why can't we have a test that runs a forecast and then runs VX on that forecast?

@gsketefian (Collaborator, Author):

@MichaelLueken Sorry, I just saw this. The reason is that someone else may have run the forecasts beforehand, and the user only needs to run vx on that output. For example, for the DTC Ensemble Design task that I'm part of, the RRFS development group at GSL has already run a set of ensemble forecasts, and we want to run only the vx on them. In addition, one may want to run vx on forecasts not generated by the SRW App; e.g., another (I think non-DTC) task is using the version of the SRW App I've put together in my fork (which I'm now bringing into the develop branch) to run vx on HREF output.

@MichaelLueken (Collaborator) left a comment:

@gsketefian Thank you very much for addressing my concerns! I will go ahead and approve these changes now and submit the Jenkins tests.

@MichaelLueken added the run_we2e_coverage_tests label (Run the coverage set of SRW end-to-end tests) on Jan 31, 2023

@MichaelLueken (Collaborator):

@gsketefian If possible, could you try running the MET_verification test on either Hera or Cheyenne using the GNU compiler? The Jenkins test on Hera GNU for MET_verification is failing. A manual test on Hera using the GNU compiler shows the following error:

/contrib/met/10.1.1/bin/grid_stat: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory

Interestingly, the Hera Intel MET_ensemble_verification test is running without issue. There appears to be an issue only with Hera GNU (and I'm working on testing Cheyenne GNU as well).

Are you aware of why MET_verification would fail on Hera with the GNU compiler?

@MichaelLueken (Collaborator):

@gsketefian I'm also seeing a similar error on Cheyenne with the GNU compilers while running the MET_verification test:

/glade/p/ral/jntp/MET/MET_releases/10.1.1/bin/grid_stat: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

@gsketefian (Collaborator, Author) commented Jan 31, 2023:

@MichaelLueken Ok, I'll give it a shot. Some questions:

  1. Do we know whether MET_verification using the develop branch (i.e. before this PR) was succeeding with the gnu compiler? I just want to be sure it's not an inherited failure.
  2. Is the only difference between your runs with intel vs gnu setting COMPILER to "gnu" in config.yaml (and rebuilding with ./devbuild.sh --platform=hera --compiler=gnu before running)? I want to know exactly the steps I need to take for my gnu run.
  3. Is it possible to run the new test MET_verification_only_vx on Jenkins?
  4. Do you mind sending me your manual test's experiment directory(ies) on Hera? It might help to look at the log files.

Thanks.

@MichaelLueken (Collaborator) commented Jan 31, 2023:

@MichaelLueken Ok, I'll give it a shot. Some questions:

  1. Do we know whether MET_verification using the develop branch (i.e. before this PR) was succeeding with the gnu compiler? I just want to be sure it's not an inherited failure.

Yes, the develop branch was succeeding with the gnu compiler. The Jenkins tests have all passed up to this point.

  2. Is the only difference between your runs with intel vs gnu setting COMPILER to "gnu" in config.yaml (and rebuilding with ./devbuild.sh --platform=hera --compiler=gnu before running)? I want to know exactly the steps I need to take for my gnu run.

Correct, the only difference is adding --compiler=gnu (-c=gnu) to the ./devbuild.sh call.

  3. Is it possible to run the new test MET_verification_only_vx on Jenkins?

If you add the test to tests/WE2E/machine_suites/fundamental.hera.intel.nco or fundamental.hera.gnu.com and push the updated file to your branch, then it can be tested on Jenkins.

  4. Do you mind sending me your manual test's experiment directory(ies) on Hera? It might help to look at the log files.

Unfortunately, I made changes to my working copy that made the tests pass (I added back the changes that you had removed in modulefiles/tasks/hera/run_vx.local.lua). The Jenkins test logs can be found at /scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-566/expt_dirs/MET_verification/log

For my manual tests that currently pass, see /scratch1/NCEPDEV/nems/Michael.Lueken/VX/expt_dirs/MET_verification/log

Please let me know if there is any additional information you need. Thanks!

@gsketefian (Collaborator, Author) commented Jan 31, 2023:

@MichaelLueken Thanks for the answers. I will add MET_verification_only_vx to fundamental.hera.gnu.com and push, then work on the gnu test.

Related question:
I assume fundamental.hera.intel.nco lists the Jenkins tests that get run in NCO mode while fundamental.hera.gnu.com lists the ones that get run in community mode. But in fundamental.hera.intel.nco there are tests whose configuration files specify that they should be run in community mode (i.e. they have RUN_ENVIR: community), e.g. the tests MET_ensemble_verification and grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 (and probably many more, if not all the tests!). Does Jenkins reset RUN_ENVIR for these tests to nco before running them?

Commit: …WE2E tests that gets run on Jenkins for PR testing purposes.
@MichaelLueken (Collaborator):

Related question: I assume fundamental.hera.intel.nco lists the Jenkins tests that get run in NCO mode while fundamental.hera.gnu.com lists the ones that get run in community mode. But in fundamental.hera.intel.nco there are tests whose configuration files specify that they should be run in community mode (i.e. they have RUN_ENVIR: community), e.g. the tests MET_ensemble_verification and grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 (and probably many more, if not all the tests!). Does Jenkins reset RUN_ENVIR for these tests to nco before running them?

@gsketefian Yes, @danielabdi-noaa added logic to run_WE2E_tests.sh (please see lines 442-461) that will overwrite the RUN_ENVIR from the config*.yaml file. So, fundamental.hera.gnu.com will force all tests to be run in community mode, while fundamental.hera.intel.nco will force all tests to be run in nco mode.
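A minimal sketch of that override (hypothetical; the actual implementation lives around lines 442-461 of run_WE2E_tests.sh): the machine-suite filename's extension determines the RUN_ENVIR that is forced onto every test in the suite.

```shell
# Hypothetical sketch of the RUN_ENVIR override keyed off the machine-suite
# filename extension (.nco forces nco mode, .com forces community mode).
suite_file="fundamental.hera.intel.nco"
case "${suite_file##*.}" in
  nco) RUN_ENVIR="nco" ;;
  com) RUN_ENVIR="community" ;;
  *)   RUN_ENVIR="community" ;;  # default when no suite override applies
esac
echo "RUN_ENVIR=${RUN_ENVIR}"
```

With the suite file above, this prints RUN_ENVIR=nco regardless of what the individual test's config.yaml specifies.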

@gsketefian (Collaborator, Author) commented Feb 1, 2023:

@gsketefian Yes, @danielabdi-noaa added logic to the run_WE2e_tests.sh (Please see lines 442-461) that will overwrite the RUN_ENVIR from the config*.yaml file. So, fundamental.hera.gnu.com will force all tests to be run in community mode, while the fundamental.hera.intel.nco will force all tests to be run in nco mode.

@MichaelLueken Ok, thanks, I hadn't noticed that new code.

@gsketefian (Collaborator, Author):

@MichaelLueken I think I found the problem. In modifying the task module file ufs-srweather-app/modulefiles/tasks/hera/run_vx.local.lua to get rid of the "Logging error" message appearing in the vx tasks' log files, I replaced all its contents with the line

load("miniconda_regional_workflow")

Turns out I should have kept this line from the original file:

load(pathJoin("intel", os.getenv("intel_ver") or "18.0.5.274"))

Now the vx tasks can find that libimf.so library (and there are no "Logging error" messages). I am retesting all 3 vx WE2E tests with both intel and gnu:

  • MET_verification
  • MET_verification_only_vx
  • MET_ensemble_verification

Once I verify that they are successful, I'll push my latest changes.
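Putting those two pieces together, the fixed Hera vx task modulefile would look roughly like this (a sketch assembled from the two lines quoted above; this is an Lmod Lua modulefile, so load and pathJoin are Lmod functions):

```lua
-- Sketch of modulefiles/tasks/hera/run_vx.local.lua after the fix:
-- keep the intel runtime libraries (MET/METplus is built with intel, so
-- e.g. libimf.so must be findable) alongside the conda environment.
load(pathJoin("intel", os.getenv("intel_ver") or "18.0.5.274"))
load("miniconda_regional_workflow")
```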

Question:
Should I go ahead and make the same changes to the vx module file on the other platforms? Assuming it doesn't cause the vx tasks to crash, we would also have to check that the changes get rid of the "Logging error" messages in the vx log files. I tried looking at the Jenkins output yesterday to see whether I could read the log files, but apparently I can't (I don't have a proper Jenkins account). Is that something you can check? Probably checking just one vx log file from one of the vx tests is sufficient for each platform.

@MichaelLueken (Collaborator):

@gsketefian I think the issue with GNU is that METplus was likely built on the various machines using Intel compilers, so I think this is an issue (at least currently) only for Hera and Cheyenne. As for the "Logging error" messages, I can certainly check the log files for those. Having said that, MET_verification, MET_ensemble_verification, and MET_verification_only_vx are only run on Hera (Intel and GNU). No other platforms currently run verification tests.

@gsketefian (Collaborator, Author):

@gsketefian I think the issue with GNU is that metplus was likely built on the various machines using Intel compilers, so I think this is an issue (at least currently) only for Hera and Cheyenne. As for checking the "Logging error" messages, I can certainly check the log files for those. Having said that, MET_verification, MET_ensemble_verification, and MET_verification_only_vx are only run on Hera (Intel and GNU). No other platform currently run verification tests.

@MichaelLueken Yes, good point about METplus being built only with intel; we have to keep specifying the path to the intel library in the task module file.

Since the vx tests are currently only run on Hera, if it's ok with you, to simplify/modularize the work, I'd like to get all the vx changes working on Hera and worry about porting to other platforms in a separate set of PRs (since that will likely require installation or updating of MET/METplus and thus admin help).

@MichaelLueken (Collaborator):

@gsketefian That sounds good to me. I think it would be fine to add:
load(pathJoin("intel", os.getenv("intel_ver") or "18.0.5.274"))
to modulefiles/tasks/hera/run_vx.local.lua and worry about the other machines in subsequent PRs.

Once your tests on Hera with Intel and GNU compilers are complete and you have pushed your changes, I will resubmit the Jenkins Hera tests to make sure that those pass as well.

Commit: … running with GNU compiler, the intel libraries are still needed because MET/METplus is built with intel only).
@gsketefian (Collaborator, Author):

@MichaelLueken I just made the change to the Hera module file for vx tasks. All 3 vx tests were successful (on Hera) with both intel and gnu, so this is ready for the Jenkins tests now. Thanks.

@MichaelLueken (Collaborator):

After relaunching the Jenkins tests on Hera, it looks like all of the Hera GNU tests failed with the following:

       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
201907010000               make_grid                    41656609           SUCCEEDED                   0         1          24.0
201907010000               make_orog                    41656698           SUCCEEDED                   0         1          68.0
201907010000          make_sfc_climo                    41656928                DEAD                 256         2           4.0
201907010000           get_extrn_ics                    41656610           SUCCEEDED                   0         1          18.0
201907010000          get_extrn_lbcs                    41656611           SUCCEEDED                   0         1          16.0
+ load_modules_run_task.sh[111]: module load build_hera_gnu
++ bash[82]: /apps/lmod/8.5.2/libexec/lmod bash load build_hera_gnu
Lmod has detected the following error: Cannot load module "hpc-gnu/9.2"
without these module(s) loaded:
   gnu/9.2

While processing the following module(s):
    Module fullname  Module Filename
    ---------------  ---------------
    hpc-gnu/9.2      /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/modulefiles/core/hpc-gnu/9.2.lua
    build_hera_gnu   /scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-566/modulefiles/build_hera_gnu.lua

The experiment directories for the Hera GNU tests can be found at:
/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-566/expt_dirs

It is unclear to me why load_modules_run_task.sh isn't using the make_sfc_climo.local.lua modulefile. I'll try to look into this.

@gsketefian (Collaborator, Author):

@MichaelLueken Thanks for looking into it. Hopefully the MET_verification_only_vx test passed, since it doesn't run the make_sfc_climo task.

@mkavulich (Collaborator) left a comment:

Looks good to me, just a couple of requested changes.

{%- endif %}
{%- endif %}
</and>
@mkavulich (Collaborator):

It seems like this logic can be simplified to the following:

      <and>
{#- Redundant dependency to simplify jinja code. #}
        <streq><left>TRUE</left><right>TRUE</right></streq>
      {%- if run_task_get_obs_ccpa %}
        <taskdep task="&GET_OBS_CCPA_TN;"/>
      {%- endif %}
      {%- if write_dopost %}
        <taskdep task="&RUN_FCST_TN;{{ uscore_ensmem_name }}"/>
      {%- elif run_task_run_post %}
        <metataskdep metatask="&RUN_POST_TN;{{ uscore_ensmem_name }}"/>
      {%- endif %}
      </and>

The nested "and"s are superfluous, and if run_task_get_obs_ccpa does not require an "else" statement.

The same logic should apply to the subsequent changes for run_task_get_obs_mrms and run_task_get_obs_ndas as well.

@gsketefian (Collaborator, Author):

@mkavulich Yes, I noticed that too, but I had decided to make only the minimal change necessary (to keep the diff small), since all of this will get rewritten pretty soon anyway. I'll put in your suggestion and rerun.

@gsketefian (Collaborator, Author):

@mkavulich I put in the new simpler logic and reran the vx tests (with intel compiler only) and all 3 succeeded. The other (fundamental) tests aren't affected by this, so I didn't rerun them. I'll let the Jenkins testing handle those.


#----------------------------
# verification parameters
#
# VX_FCST_MODEL_NAME:
Collaborator:

Glad to see a more descriptive variable name here 👍

@mkavulich (Collaborator) left a comment:

Thanks for the changes, looks good.

@MichaelLueken MichaelLueken merged commit 373d1aa into ufs-community:develop Feb 7, 2023
MichaelLueken added a commit that referenced this pull request Feb 7, 2023
MichaelLueken pushed a commit that referenced this pull request Feb 7, 2023
PR #566 changed the variable "MODEL" to a more descriptive name but failed to make this change in config.community.yaml. The unit tests for generate_FV3LAM_wflow.py use this file as an input config.yaml, so they are now failing due to the incorrect variable name. This wasn't caught because, prior to #558, the unit tests were broken for a different reason.

This change simply makes the appropriate rename, which should fix the failing unit test. It also adds an f-string that was missed in a setup.py error message.
@gsketefian (Collaborator, Author):

@MichaelLueken Thanks for merging this PR. One thing that still needs to be done is to move the staged forecast data that's in my personal directory (on Hera) to the official SRW data directory. I don't think we decided on an exact location within that directory. We can do that (along with @mkavulich) after Hera is back up this afternoon or tomorrow. For other platforms, I'd like to wait until a couple more of the vx PRs are in, since they will have more data to stage. Thanks.

Labels: enhancement (New feature or request), run_we2e_coverage_tests (Run the coverage set of SRW end-to-end tests)