Overhaul and consolidate WE2E tests, identify needed additional tests #587

Closed
mkavulich opened this issue Feb 8, 2023 · 17 comments · Fixed by #871

@mkavulich
Collaborator

mkavulich commented Feb 8, 2023

Description

It has been a while since we re-visited the suite of WE2E tests. In that time, a lot of development has happened, and the way various test suites are used has changed. We should assess the current state of the WE2E tests, and discuss ways in which they can be improved.

This issue will be modified and expanded as the effort continues.

Examples to justify further work

Duplication

There are many examples of this, but to pick just one: these nine tests differ only in the specific output grid being used:

  • config.grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
  • config.grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
  • config.grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
  • config.grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
  • config.grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
  • config.grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
  • config.grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
  • config.grid_RRFS_SUBCONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
  • config.grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml

These tests could be repurposed to test other capabilities, or be rolled into existing tests that cover other capabilities.
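
One way to surface this kind of duplication is to group test names by everything except the grid. The sketch below is only an illustration: it assumes the `config.grid_<GRID>_ics_<ICS>_lbcs_<LBCS>_suite_<SUITE>.yaml` naming pattern seen in the list above, and the helper `group_by_non_grid_fields` is hypothetical, not part of the WE2E scripts.

```python
import re
from collections import defaultdict

# Naming pattern inferred from the test names listed above (an assumption,
# not a documented convention).
NAME_RE = re.compile(
    r"^config\.grid_(?P<grid>.+?)_ics_(?P<ics>.+?)_lbcs_(?P<lbcs>.+?)"
    r"_suite_(?P<suite>.+)\.yaml$"
)


def group_by_non_grid_fields(names):
    """Map (ics, lbcs, suite) to the sorted list of grids used with it."""
    groups = defaultdict(list)
    for name in names:
        match = NAME_RE.match(name)
        if match is None:
            continue  # name does not follow the assumed pattern
        key = (match["ics"], match["lbcs"], match["suite"])
        groups[key].append(match["grid"])
    return {key: sorted(grids) for key, grids in groups.items()}


names = [
    "config.grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
    "config.grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
    "config.grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
    "config.grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
    "config.grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
    "config.grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
    "config.grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
    "config.grid_RRFS_SUBCONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
    "config.grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
]

groups = group_by_non_grid_fields(names)
for key, grids in groups.items():
    if len(grids) > 1:
        print(key, "->", len(grids), "grids")
```

With the nine names above, all tests fall into a single group, which is the signal that they differ only by grid.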

Missing capabilities

There are important capabilities either already present or being introduced soon that are not currently covered by any WE2E test:

Some others may or may not be appropriate for WE2E testing:

  • Plotting scripts
  • New CMAQ capabilities

Other changes

  • Recently the scope of "Fundamental" tests has changed to run a different set of tests on every machine. This is not ideal, as it leaves us without a lightweight suite of tests that will be the same across all platforms, smoke-testing the "most important" parts of the system
  • NCO mode may no longer need dedicated tests, as this is a command-line option to the run_WE2E_tests script.

Solution

In this issue I will describe a strategy for consolidating the WE2E tests into (ideally) fewer tests with wider coverage. This will involve a more in-depth assessment of the issues mentioned above, proposing changes to the set of tests, and strategizing the best way to maintain the WE2E tests going forward. This issue will eventually contain a proposal for changes to be made. @mkavulich will take the lead on this effort, but others should feel free to add to this discussion and make modifications.

Related issues/PRs

Issues

PRs

@MichaelLueken
Collaborator

@mkavulich @EdwardSnyder-NOAA -
I have created two new WE2E tests for @EdwardSnyder-NOAA's PR #526 on Jet:

config.get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2mems.yaml
config.get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2mems.yaml

These tests can be found in: /mnt/lfs4/HFIP/hfv3gfs/Michael.Lueken/test_configs

If you would like, I can commit these two new tests to @EdwardSnyder-NOAA's feature/rrfs-ensemble-bc branch so that they will be available once this work has been merged.

@EdwardSnyder-NOAA
Collaborator

EdwardSnyder-NOAA commented Feb 8, 2023

I'm fine with adding those tests with my #526 PR. @MichaelLueken

@mkavulich
Collaborator Author

I'll be posting mildly unstructured thoughts here in the comments so they can hopefully be rolled into a more coherent proposal later. This one is about

Domains

There are several points I'd like to make about testing domains:

Test all domains?

There are 23 "pre-defined" domains in ush/predef_grid_params.yaml. However, we only test 14 of these:

  • CONUS_25km_GFDLgrid
  • CONUS_3km_GFDLgrid
  • RRFS_AK_13km
  • RRFS_AK_3km
  • RRFS_CONUS_13km
  • RRFS_CONUS_25km
  • RRFS_CONUS_3km
  • RRFS_CONUScompact_13km
  • RRFS_CONUScompact_25km
  • RRFS_CONUScompact_3km
  • RRFS_NA_13km
  • RRFS_NA_3km
  • RRFS_SUBCONUS_3km
  • SUBCONUS_Ind_3km

The other 9 are untested:

  • WoFS_3km
  • EMC_AK
  • EMC_HI
  • EMC_PR
  • EMC_GU
  • GSL_HAFSV0.A_25km
  • GSL_HAFSV0.A_13km
  • GSL_HAFSV0.A_3km
  • GSD_HRRR_AK_50km

Should we test all of these or remove them? I don't think there's any middle ground: either they are important enough to be kept (and so should be tested), or they are not, and should be removed.
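
The tested/untested split above could be recomputed automatically rather than by hand. A minimal sketch, assuming the grid can be read from a test name of the form `grid_<GRID>_ics_...` (tests that set the grid only inside the YAML file would be missed); `untested_grids` is a hypothetical helper:

```python
import re

# Assumed convention: the grid name sits between "grid_" and "_ics_".
GRID_IN_NAME = re.compile(r"grid_(.+?)_ics_")


def untested_grids(predefined, test_names):
    """Return predefined grids that no test name refers to, in input order."""
    tested = set()
    for name in test_names:
        match = GRID_IN_NAME.search(name)
        if match:
            tested.add(match.group(1))
    return [grid for grid in predefined if grid not in tested]


# Small illustrative subset of the lists above.
predefined = ["RRFS_CONUS_25km", "RRFS_AK_3km", "WoFS_3km", "EMC_HI"]
test_names = [
    "config.grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
    "config.grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml",
]

missing = untested_grids(predefined, test_names)
print(missing)
```

Run against the full contents of ush/predef_grid_params.yaml and the test directory, a check like this would flag any newly added grid that arrives without a test.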

Custom domains?

We currently only have a single custom domain for the ESG grid. I wonder if there are a variety of parameters we'd like to test with more custom grids? At the very least I would recommend that we add a custom domain for the southern hemisphere, since there are many opportunities for hemisphere-specific errors.

Removing domains

Aside from my above recommendations (test or delete untested domains), is there any reason we are still including "GFDLgrid" as a valid grid type? From my understanding, this is an obsolete grid that we do not support and should not waste resources on testing.

That would remove the following tests:

  • config.custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE.yaml
  • config.custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_TRUE.yaml
  • config.custom_GFDLgrid.yaml
  • config.grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
  • config.grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml

And the following pre-defined grids:

  • CONUS_25km_GFDLgrid
  • CONUS_3km_GFDLgrid

@christinaholtNOAA
Collaborator

Is anyone in the Community using the SUBCONUS or compact domains? Should they be supported?

@mkavulich
Collaborator Author

> Is anyone in the Community using the SUBCONUS or compact domains? Should they be supported?

I believe SUBCONUS grids are used for tutorials, as they are very small domains and can run quickly even at high resolution.

The "compact" domains were made in order to fit inside the HRRR CONUS domain, so that they can use HRRR as input BCs. I can see some utility in keeping them, though I'm open to other opinions.

@mkavulich
Collaborator Author

Another easy test improvement

There are two tests that take almost twice as long to finish as the next longest: grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 and grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16. We should find a way to reduce these times, either by increasing the number of cores used, decreasing the forecast length, or some other change.

@mkavulich
Collaborator Author

Testing "suites"

When originally discussing testing strategies (see #277), the decision was made to maintain two "suites" of tests, with the following intended purposes:

  • comprehensive is a list of tests that are known to succeed, achieving maximum test coverage when run on a specific platform
  • fundamental is a smaller list of tests that achieves maximum coverage of the most important features of the SRW Application

The idea was that, in addition to targeted tests for their changes, a developer could run the fundamental suite of tests as a quick, cheap "smoke test" of the App on their platform of choice, ensuring their changes did not unexpectedly break any important App functionality downstream. For more intrusive changes, and/or as a regular sanity check, the comprehensive suite of tests could be run to achieve wider coverage of the system's capabilities at the cost of more time and resources.

The logic behind the fundamental test suite was changed as detailed in #445, so that "fundamental" tests no longer serve their originally intended function. The automated CI tests are meant to run on each platform, and running the fundamental suite on each platform was pointed out (correctly, I might add) as a waste of resources. So instead of running the same set of tests on each platform, a small, unique subset of tests is run on each specific platform, achieving maximum test coverage on all platforms.

While I do see the advantages of this test strategy, in my opinion it degrades the original use case of the fundamental suite as a standardized but simple list of tests that can be run on any platform, since different developers have different platforms where they perform their development. For example, I do much of my development on Hera, and now the fundamental test suite (when functioning as intended, see #571) only runs nco-mode, which has the potential to hide errors specific to the community mode (which is my main concern). This renders the fundamental test suite less useful for my purposes, and I am sure similar problems will affect others.

As a solution to this problem, I would like to propose the following list of suites going forward, returning fundamental to its original use and adding a new coverage suite to cover the use case mentioned in #445. To lay out all the descriptions in one place:

  • fundamental is a small list of tests that achieves maximum coverage of the most important features of the SRW Application. For use by developers as a "smoke test" prior to opening a PR, run on their platform of choice.
  • comprehensive is a list of tests that are known to succeed, achieving maximum test coverage when run on a specific platform. For use by developers introducing large and especially invasive changes to ensure all cases are covered, and also on an infrequent basis to check the state of the Application (for example, prior to a code release). Specifically excludes tests known to fail on a given platform.
  • coverage is a platform-specific list of tests, achieving full coverage of comprehensive tests and running on all platforms without the expense of running the entire comprehensive suite on each platform. To be run by CI or other automated tests on a regular basis or in response to individual PRs.
  • all is the list of all WE2E tests, whether they are known to fail on a specific platform or not.
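
To make the relationship between comprehensive and coverage concrete, here is a minimal sketch of one possible way to split a test list across platforms so that every test runs somewhere exactly once. The round-robin scheme and the function name `split_coverage` are assumptions for illustration; the proposal above does not prescribe how the split is made.

```python
def split_coverage(tests, platforms):
    """Assign each test to exactly one platform, round-robin."""
    if not platforms:
        raise ValueError("at least one platform is required")
    suites = {platform: [] for platform in platforms}
    # Sort first so the assignment is reproducible regardless of input order.
    for index, test in enumerate(sorted(tests)):
        suites[platforms[index % len(platforms)]].append(test)
    return suites


# Test and platform names borrowed from this thread, for illustration only.
comprehensive = [
    "inline_post",
    "subhourly_post",
    "nco_ensemble",
    "custom_GFDLgrid",
    "MET_verification",
]
coverage = split_coverage(comprehensive, ["hera", "jet", "cheyenne"])
for platform, tests in coverage.items():
    print(platform, tests)
```

A real split would also have to respect per-platform known failures (for example, not assigning an HPSS test to a machine without HPSS access), which this sketch ignores.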

@mkavulich
Collaborator Author

Default wall times

The history of default wall times is a bit all over the place. Originally it was much harder to specify on a case-by-case basis, so the default wall times were made very long to ensure jobs wouldn't time out. We now have a lot more flexibility, so we should re-visit the topic.

Why do we care?

Theoretically, jobs are only charged for the amount of time used (at least, that is the case on Cheyenne and NOAA RDHPC platforms). So there is seemingly no harm in having extra-long wall times specified for tests.

However, there are some scenarios where unnecessarily long wall times can be inconvenient or even harmful. First, one failure mode of the UFS Weather Model is that when it experiences a fault, it does not actually exit but hangs, so the job wastes core hours on what is usually the most expensive task. Second, paring down submission wall times should help jobs get into and out of the queue faster, since on most if not all machines, jobs with shorter wall times are prioritized.

What should we do?

Most tasks have relatively reasonable default wall times of 30 minutes or less. However, the RUN_FCST step has a very long default wall time of 4.5 hours, which is far longer than we need for testing basic functionality. I propose we modify all WE2E test files to reduce this wall time as much as possible. Most tests should complete this step in less than 30 minutes, and certainly within an hour, so I would propose that as a starting point. This would have the added benefit of capping the theoretical wall time of the full set of WE2E tests. Many tests could also have the wall times of other tasks reduced, such as WTIME_GET_EXTRN_LBCS and WTIME_GET_EXTRN_ICS for tests where the data is stored locally.
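
As a rough illustration of how a tighter wall time might be chosen, the sketch below pads an observed runtime by a safety factor, caps it at one hour, and formats it as `HH:MM:SS`. The safety factor, the cap, and the helper name `propose_walltime` are all hypothetical choices, not values taken from the workflow.

```python
import math


def propose_walltime(observed_minutes, safety_factor=2.0, cap_minutes=60):
    """Return an HH:MM:SS wall time: observed runtime padded, then capped."""
    if observed_minutes <= 0:
        raise ValueError("observed_minutes must be positive")
    padded = math.ceil(observed_minutes * safety_factor)
    minutes = min(padded, cap_minutes)
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"


print(propose_walltime(12))  # padded to 24 minutes
print(propose_walltime(45))  # padded value exceeds the cap, so one hour
```

If a test's observed runtime is itself close to or above the cap, that is a sign the test should be shortened rather than the cap raised.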

@gsketefian
Collaborator

@mkavulich Here are some randomly arranged thoughts/questions on the issues raised above:

  1. Predefined domains:
    I don't think anyone is using EMC_AK, EMC_HI, EMC_PR, or EMC_GU, so we can remove those. @JeffBeck-NOAA was for a while using the GSL_HAFS... domains, but I think he's done with those (Jeff, can you confirm?). If so, we can remove those.
    Finally, I thought the WoFS_3km domain was put in relatively recently (less than a year ago) in anticipation of further additions to the SRW App by NSSL, but I don't know where that stands. @ywangwof, do you have any info/thoughts on this domain?
  2. SUBCONUS domains:
    The SUBCONUS_Ind_3km domain was (and will be?) used in trainings, as you point out. I and @JeffBeck-NOAA added the RRFS_SUBCONUS_3km in response to a request by Chunhua Zhou a couple of years ago, but I don't know if she's still using it. If not, we should remove it. (I can't find her github username to tag her here.)
  3. Additional custom ESG domains:
    I made 6 such domains for one of the SRW trainings. I have ones over Peru and New Zealand for tests over the Southern Hemisphere. I also have ones over California, Central Asia, and over the Indian Ocean. Let me know if you want the numbers for those. It was actually a useful exercise to try out these grids because I found a bug in the weather model by testing in the southern hemisphere with a certain kind of write-component grid (I forget which).
  4. GFDLgrid:
    Whether to remove this is a question for higher ups, e.g. Curtis (I can't find his username!) and @JacobCarley-NOAA. There was some talk of there being nested grids within the regional domain in the near future in the weather model. I imagine GFDL would add that capability only for regional grids carved out of the global cubed sphere grid, in which case we'd need the GFDLgrid capability. But if it can also be done with ESG grids, then no, we don't need to keep the GFDLgrid capability. Removing it would not only reduce the testing load but also simplify the make_grid task.
  5. When adding a new feature, should it be required that one or more WE2E tests should be added to test it? Obviously, it's more work, but if not added, that feature may break in future PRs without anyone being aware. @MichaelLueken thoughts?

@MichaelLueken
Collaborator

@gsketefian When adding new features, I agree that it would be a best practice to either add a new WE2E test so that the new functionality can be tested, or modify an existing WE2E test so that it also tests the new feature. Since the idea of this issue is ultimately to reduce the number of WE2E tests, I would like to stress that modifying existing WE2E tests to cover new features would be the best path.

@mkavulich
Collaborator Author

Underutilized nodes

This has been brought up a time or two before, but it is worth revisiting because we now have the ability to do something about it. Currently, for all jobs except run_fcst, the number of cores per node is hard-coded rather than scaled to the available cores per node. Using all cores on each node (except on machines like Hera, where specifying a single core results in partial node use) would be an easy way to slightly reduce cost.
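
A minimal sketch of the scaling idea, assuming a task needs a known number of MPI ranks and the machine's cores per node is available as a number. The helper `nodes_and_ppn` and the example core counts are hypothetical and do not reflect any particular machine file.

```python
import math


def nodes_and_ppn(total_tasks, cores_per_node):
    """Pack total_tasks onto as few nodes as possible.

    Returns (nodes, tasks_per_node), with tasks_per_node <= cores_per_node.
    """
    if total_tasks < 1 or cores_per_node < 1:
        raise ValueError("total_tasks and cores_per_node must be >= 1")
    nodes = math.ceil(total_tasks / cores_per_node)
    # Spread the tasks evenly over the nodes that are needed anyway.
    tasks_per_node = math.ceil(total_tasks / nodes)
    return nodes, tasks_per_node


print(nodes_and_ppn(48, 36))
print(nodes_and_ppn(24, 40))
```

Deriving both numbers from the machine's core count, instead of hard-coding cores per node, keeps a task from requesting more nodes than it needs on machines with larger nodes.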

@mkavulich
Collaborator Author

Cost and state of existing tests

Core-hour costs for all tests, taken from example runs, can be found in this spreadsheet: https://docs.google.com/spreadsheets/d/1npubj78GW1a3htk8Ksh_rDPiIRkjlaWNXo-gEQgpxXE/edit

Hera

  • Fundamental: 172.36
  • Comprehensive: 14607.47
    • Known failures:
      • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR DEAD 9.47
      • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 DEAD 8.68
  • All: 22572.74
    • Known failures:
      • subhourly_post_ensemble_2mems DEAD 20.7
      • nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR DEAD 3.72
      • get_from_HPSS_ics_HRRR_lbcs_RAP DEAD 5.86
      • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR DEAD 9.47
      • subhourly_post DEAD 22.91
      • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 DEAD 8.68

Jet

  • Fundamental: 261.64
  • Comprehensive: 10491.02
    • Known failures:
      • grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta DEAD 12.28
      • nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 DEAD 7.51
      • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 DEAD 12.71
  • All: 15273.93
    • Known failures:
      • grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta DEAD 567.57
      • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 DEAD 6.62
      • subhourly_post DEAD 6.32
      • pregen_grid_orog_sfc_climo DEAD 9.87
      • nco_ensemble DEAD 88.99
      • subhourly_post_ensemble_2mems DEAD 11.15
      • MET_verification_only_vx DEAD 0
      • nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ DEAD 13.27
      • nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 DEAD 6.98

Cheyenne

  • Fundamental: 345.62
  • Comprehensive: 28450.64
    • Known failures:
      • grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 DYING 4903.5
      • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 DEAD 11.92
      • grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 DEAD 2764.3
  • All: 29770.64
    • Known failures:
      • grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 DYING 4903.5
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2021062000 DEAD 1.77
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019101818 DEAD 1.76
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019101818 DEAD 1.72
      • subhourly_post_ensemble_2mems DEAD 13.87
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2021010100 DEAD 1.72
      • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 DEAD 11.92
      • get_from_HPSS_ics_RAP_lbcs_RAP DEAD 2.85
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2020022600 DEAD 1.76
      • get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2 DEAD 5.46
      • subhourly_post DEAD 8.39
      • grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta DEAD 56.77
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio DEAD 1.72
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2020022518 DEAD 1.74
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2020022600 DEAD 1.77
      • get_from_HPSS_ics_HRRR_lbcs_RAP DEAD 2.73
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200 DEAD 1.78
      • nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR DEAD 4.44
      • get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me DEAD 60.88
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021010100 DEAD 1.76
      • grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 DEAD 2764.3
      • get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS DEAD 2.79
      • specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS DEAD 1.71
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h DEAD 1.73
      • nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ DEAD 0.24
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200 DEAD 1.74
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2020022518 DEAD 1.76

@mkavulich
Collaborator Author

Plan

Here is an outline of the plan to overhaul, consolidate, improve, and expand the coverage of the WE2E test suites.

Stage 1: Consolidate existing tests

This step will involve combining tests of different parts of the system into a single test. For example, a test designed to test a particular grid and physics suite can have inline post added to it rather than having a stand-alone inline post test.

Stage 2: Pare down long/expensive tests

Ideally a WE2E test should take 30 minutes or less. Some tests are much longer than this, and should be reduced as much as possible.

Stage 3: Overhaul suites

As per this comment above, restore old behavior of fundamental and comprehensive suites, and add "coverage" test type. In addition, modify "comprehensive" suite so that known failures for a given platform are omitted.

Stage 4: Add new capabilities/tests

Add un-tested options to existing tests as much as possible. Add new tests where necessary.

@mkavulich
Collaborator Author

Unit tests

We are now at the point where we could be testing some actual workflow capabilities (not just workflow generation) using unit tests. For example, tests for retrieving specific data (such as the wflow_features/config.get_from_HPSS* workflow end-to-end tests) could be replaced by unit tests on ush/retrieve_data.py.

As efforts to pythonize the workflow continue, we should continue to lean more and more heavily on lightweight unit tests like these rather than bulky WE2E tests.
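
For illustration only, a unit test of this kind might look like the sketch below. It does not use the real interface of ush/retrieve_data.py: `fill_template` is a hypothetical stand-in defined inside the example, and the file-name template is made up, so that the test runs without HPSS or network access.

```python
import unittest
from datetime import datetime


def fill_template(template, cycle, fcst_hr):
    """Hypothetical stand-in: expand a file-name template for one cycle/hour."""
    return template.format(
        yyyymmdd=cycle.strftime("%Y%m%d"),
        hh=cycle.strftime("%H"),
        fcst_hr=fcst_hr,
    )


class TestFillTemplate(unittest.TestCase):
    def test_cycle_and_forecast_hour(self):
        # The template below is invented for this example.
        name = fill_template(
            "gfs.{yyyymmdd}/{hh}/gfs.t{hh}z.f{fcst_hr:03d}",
            datetime(2022, 4, 4, 0),
            6,
        )
        self.assertEqual(name, "gfs.20220404/00/gfs.t00z.f006")


# Run the test case explicitly so the example works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFillTemplate)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A test like this finishes in milliseconds and needs no batch queue, which is the main attraction compared with a full end-to-end run that exists only to check file retrieval.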

mkavulich moved this from Todo to In Progress in RRFS Merge to SRW on Mar 15, 2023
@christinaholtNOAA
Collaborator

Update: I am rebasing the latest changes from develop onto the feature branch, and plan to open a PR this afternoon.

Looks like significant cost savings for the comprehensive tests -- another 30% off the top.

@mkavulich
Collaborator Author

mkavulich commented Mar 21, 2023

Major takeaways from first PR (#686):

  • Comprehensive tests now take ~1 hour vs ~2 hours on Hera (depends on queue wait times), core hours reduced ~33% (~14k --> ~9k), all while increasing the test coverage of various capabilities.
  • Total number of unique tests reduced from 93 to 71, 46 of which are included in the "comprehensive" suite. See note below about the tests left off of this list.

Test changes

Removed grid

  • RRFS_SUBCONUS_3km
    • Un-tested, superseded by other 3km domains

Removed tests

  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional
    • This is a near-duplicate of grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot just without the plotting. Since plotting tasks are incredibly cheap, no need for two different tests
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
    • Identical to grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR, but with RAP lbcs (BCs combo covered by other tests on this grid)
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
    • Identical to grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta, but with HRRR lbcs (BCs combo covered by other tests on this grid)
  • grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_GFS_v15p2
    • GFS_v15p2 suite is not designed for 3km resolution, and we already have another test for that anyway
  • community_ensemble_008mems
    • This was designed to test functionality that no longer exists (matching leading zeroes in task names)
  • custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_TRUE
    • This functionality is already tested by custom_GFDLgrid
  • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019101818, get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2021010100, get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019101818, and get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021010100
    • Stated purpose was essentially to test random dates from HPSS, this is redundant to other date-based tests
  • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2021062000
    • Redundant to get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h

Combined tests

In the cases where one test was absorbed into another, a symlink remains for the old test.

| Test1 | Test2 | Combined test | Core hours saved vs. running both tests (from Hera, rounded) |
| --- | --- | --- | --- |
| grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta | community_ensemble_2mems_stoch | community_ensemble_2mems_stoch | 30 |
| grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp | community_ensemble_2mems | grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp | 79 |
| grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 | inline_post | grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 | 13 |
| grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR | specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE | grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR | 35 |
| grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta | specify_DOT_OR_USCORE | grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta | 13 |
| grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 | specify_RESTART_INTERVAL | specify_RESTART_INTERVAL | 10 |
| grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR | MET_verification | grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_GFS_v16 | 0 |
| grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 | MET_ensemble_verification | grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR | 0 |
| Total | | | 180 |

Changed tests

Two new tests that had not previously been included in the comprehensive suite are now added:

  • grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0
  • grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0

A number of tests still do not exist in the comprehensive suite because they just have not been added yet. Most of these are tests that are known to fail on one or more platforms; specifically a lot are tests sourcing input data from HPSS. I will have these handled more gracefully in the second round of changes, so they can be included only on platforms where they are known to succeed.

One test is still left out of the comprehensive suite: grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta, which tests the very large RRFS_NA_3km domain. It takes twice as long as the next longest test (nearly 2 hours on Hera), and so should not be run outside of special occasions until we can reduce that time.

Several tests were modified to run quicker and more efficiently:

  • grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml: Changed physics suite to more resolution-appropriate and cheaper FV3_GFS_v16, reduced forecast hours from 6 to 3. Saved ~50% core hours (~1600 -> ~800)
  • grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR: Changed physics suite to more resolution-appropriate and cheaper FV3_HRRR, reduced forecast hours from 6 to 3. Saved ~50% core hours (~3300 -> ~1700)
  • grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16: Reduced forecast hours from 6 to 3. Saved ~50% core hours (~1300 -> ~700)
  • grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16: Reduced forecast hours from 6 to 3. Saved ~50% core hours (~1100 -> ~600)
  • grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta: Reduced forecast hours from 6 to 3, increased DT_ATMOS to 40, increased OMP_NUM_THREADS_MAKE_OROG and NNODES_MAKE_ICS. Saved ~60% core hours (~3800 -> ~1700)

Additionally, the SUBCONUS_Ind_3km-domain tests were re-arranged to ensure all input IC/LBC combinations are tested on a 3-km domain.

MichaelLueken pushed a commit that referenced this issue Mar 29, 2023
This is the first set of changes overhauling the WE2E test suites. This represents steps 1 and 2 of the overhaul process as described in issue #587. In addition, some quality-of-life improvements to the WE2E test scripts are included.
christinaholtNOAA moved this from In Progress to In Review in RRFS Merge to SRW on Apr 24, 2023
@mkavulich
Collaborator Author

mkavulich commented Apr 26, 2023

#732 Accomplishes most of the points mentioned in this issue. Still remaining are:

MichaelLueken pushed a commit that referenced this issue Apr 26, 2023
…rovements!) (#732)

This test continues the overhaul of WE2E test suites as described in Issue #587 (specifically stage 3 and parts of stage 4 in this comment). The changes are summarized below, roughly in order of importance.
* "fundamental" tests are replaced by "coverage" test suites. "fundamental" tests are returned to their original purpose: a lightweight set of tests to be run the same on all platforms. "coverage" tests now evenly distribute all comprehensive tests across all platform/compiler combinations for use in Jenkins testing.
* "comprehensive" test list is updated to include all tests (except current known failures). For platforms that have known failures (for example, HPSS tests on platforms without HPSS access), comprehensive.<platform>[.<compiler>] files are included to automatically run only the tests expected to succeed
* Fix several existing failures
  * Use correct date format in grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0
  * In config_parser.py, when populating a jinja template, keep dates in string format rather than converting to a datetime object (this fixes problem with get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS)
  * Fix unit tests for retrieve_data.py, there was a bug causing all tests to be run in nested subdirectories that eventually leads to failure when running all tests including HPSS retrieval
* Remove several "get_from_HPSS" tests in favor of new unit tests for HPSS data in test_retrieve_data.yaml
* Add several more dates and data sources to unit tests in test_retrieve_data.yaml
* The example config files in the ush/ directory (config.community.yaml and config.nco.yaml) are now included as WE2E tests (symbolically linked in the tests/WE2E/test_configs/default_configs/ directory)
* Remove long-known failing test grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 (WE2E test "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16" fails with segmentation fault at run_fcst step #359). This is now an old capability with only legacy support (global spectral model was retired in 2019) and there are no immediate plans to fix the bug.
* WE2E_summary*.txt files are now written to the experiment directory rather than tests/WE2E
* Updated data_locations.yaml for latest RAP files on HPSS
* Reduce timeouts and delays between calls to wget to speed up remote data retrieval
* run_MET_GridStat_vx_APCP tasks fail randomly on occasion; increasing maxtries to 2 mitigates this problem
* Swap test of restart capability from grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8 for coverage reasons
* Convert some print_info_msg messages to logging.debug calls to allow suppression of superfluous output if desired
* A few miscellaneous minor fixes to log messages
* Made the format of some docstrings more consistent
* Removed some outdated documentation on validating config.yaml
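
To illustrate the comprehensive.<platform>[.<compiler>] lookup mentioned in the list above, here is a minimal sketch that picks the most specific suite file present in a directory. The fallback order, the function name `pick_suite_file`, and the example file names are assumptions for illustration; the actual selection logic lives in the WE2E scripts and may differ.

```python
import tempfile
from pathlib import Path


def pick_suite_file(suite_dir, suite, platform, compiler=None):
    """Return the most specific existing suite file, or None if none exists."""
    candidates = []
    if compiler:
        candidates.append(f"{suite}.{platform}.{compiler}")
    candidates.append(f"{suite}.{platform}")
    candidates.append(suite)
    for name in candidates:
        path = Path(suite_dir) / name
        if path.is_file():
            return path
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files standing in for real suite lists.
    for name in ("comprehensive", "comprehensive.cheyenne"):
        (Path(tmp) / name).write_text("")
    chosen_cheyenne = pick_suite_file(tmp, "comprehensive", "cheyenne", "gnu").name
    chosen_hera = pick_suite_file(tmp, "comprehensive", "hera", "intel").name

print(chosen_cheyenne, chosen_hera)
```

Under this scheme a platform only needs its own file when it has known failures to exclude; every other platform falls back to the generic list.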