[develop] Introduce test cases from ufs-case-studies platform thru WE2E #822

Merged

clouden90
Contributor

@clouden90 clouden90 commented Jun 5, 2023

DESCRIPTION OF CHANGES:

The UFS Case Studies Platform provides a set of cases that reveal the forecast challenges of NOAA's operational Global Forecast System (GFS). Here we introduce one of these cases, 2020 Cold Air Damming, into UFS SRW through the WE2E testing framework. A YAML config file is added, and exregional_get_extrn_mdl_files.sh is moderately modified. With this new capability, users can run any test case from the UFS Case Studies Platform directly through the WE2E framework without additional steps (e.g., downloading ICS/LBCS data from the platform first). Users can still modify the YAML file to suit their needs (e.g., increase the forecast length, or try a different grid resolution or CCPP suite).

Additionally, we added the CCPP-SCM User and Technical Guide as a reference in Section 8.2 for users who are interested in running the single-column model.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

On Level 1 systems, git clone the feature branch, navigate to the ufs-srweather-app folder, and check out the external repositories. Then navigate to the tests folder and follow the instructions below:

./build.sh orion intel
module use ../modulefiles
module load wflow_orion
conda activate regional_workflow
cd WE2E/
./run_WE2E_tests.py -t 2020_CAD -m orion -a epic-ps --expt_basedir "ufs_case_studies" --exec_subdir=install_intel/exec -q

The modeled T2M was compared with the RAP analysis, and the conclusions are consistent with the results shown here.

[Figure: 2020_CAD_t2m]

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

@MichaelLueken MichaelLueken changed the title Introduce test cases from ufs-case-studies platform thru WE2E [develop] Introduce test cases from ufs-case-studies platform thru WE2E Jun 5, 2023
@MichaelLueken MichaelLueken added the enhancement New feature or request label Jun 5, 2023
michelleharrold pushed a commit to michelleharrold/ufs-srweather-app that referenced this pull request Jun 7, 2023
@EdwardSnyder-NOAA
Collaborator

EdwardSnyder-NOAA commented Jun 12, 2023

The 2020_CAD experiment passed on Cheyenne intel for me, but I had to get the initial conditions manually because the get_extrn_* steps failed. Cheyenne's compute node had trouble connecting to S3 to download the initial condition tar file, so I had to download and stage it locally. I had the same problem on Jet, so I'm curious how you were able to run the get_extrn_* tasks on the other Tier 1 platforms. From my understanding, the compute nodes don't have internet access.

@clouden90
Contributor Author

clouden90 commented Jun 12, 2023

> The 2020_CAD experiment passed on Cheyenne intel for me, but I had to manually get the initial conditions as the get_extrn_* steps failed. Cheyenne's compute node had trouble connecting to S3 to download the initial condition tar file, so I had to download and stage it locally. I had the same problem on Jet, so I'm curious how you were able to run the get_extrn_* tasks on the other tier one platforms? From my understanding, the compute nodes don't have internet access.

Thanks for testing, Ed. Good catch. Compute nodes normally do not have internet access. I have tested the 2020_CAD experiment on Hera, Orion, and Gaea. Hera and Orion have a service partition, so you can submit jobs with internet access. Gaea has specific nodes that allow data transfer, and the associated changes are included in this PR. Unfortunately, I do not have access to Jet or Cheyenne, but I guess they may have similar partitions? @MichaelLueken, do you have any input?
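To tell quickly whether a given node or partition can reach the download host at all, a small connectivity probe can be run there before submitting the get_extrn_* tasks. The following is a minimal Python sketch, not part of the SRW App; the function name `can_reach` and its defaults are assumptions made for illustration.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host name and opens the socket;
        # DNS failures, refusals, and timeouts are all subclasses of OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `can_reach("s3.amazonaws.com")` from a compute node and from a service-partition node would show whether only the latter can open an HTTPS connection. The host name here is a placeholder; substitute the endpoint the tasks actually download from.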

@MichaelLueken
Collaborator

@clouden90 I can't speak on Cheyenne, but looking through the coverage and functional tests on Jet, there shouldn't be an issue pulling data from HPSS or AWS. We will likely need to stage data on Cheyenne if we want to run these tests on that machine, but there are get_from_HPSS and get_from_AWS tests run on Jet (and Hera), so that shouldn't be an issue.

I'm currently working on testing this PR on Jet and will let you know if I encounter this issue as well.

@MichaelLueken
Collaborator

@clouden90 In the ush/machine/*.yaml files, the necessary partitions to pull data from the internet should already be defined. While this isn't the case for Cheyenne (it only has access to the regular partition), the rest of the machines appear to be split between the compute node partition and the service partition, which should allow access to the internet. When I run the 2020_CAD test on Jet, I see the following:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2020_CAD                                                           COMPLETE              34.66
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              34.66
----------------------------------------------------------------------------------------------------
Detailed summary of experiment 2020_CAD
in directory /mnt/lfs1/NAGAPE/epic/Michael.Lueken/expt_dirs/2020_CAD
                                        | Status    | Walltime   | Core hours used
----------------------------------------------------------------------------------------------------
make_grid_202002031200                    SUCCEEDED          19.0           0.13
get_extrn_ics_202002031200                SUCCEEDED         457.0           0.13
get_extrn_lbcs_202002031200               SUCCEEDED         840.0           0.23
make_orog_202002031200                    SUCCEEDED          42.0           0.28
make_sfc_climo_202002031200               SUCCEEDED          41.0           0.55
make_ics_mem000_202002031200              SUCCEEDED         212.0           2.83
make_lbcs_mem000_202002031200             SUCCEEDED         224.0           2.99
run_fcst_mem000_202002031200              SUCCEEDED         444.0          23.68
run_post_mem000_f000_202002031200         SUCCEEDED          69.0           0.92
run_post_mem000_f001_202002031200         SUCCEEDED          49.0           0.65
run_post_mem000_f002_202002031200         SUCCEEDED          16.0           0.21
run_post_mem000_f003_202002031200         SUCCEEDED          46.0           0.61
run_post_mem000_f004_202002031200         SUCCEEDED          45.0           0.60
run_post_mem000_f005_202002031200         SUCCEEDED          48.0           0.64
run_post_mem000_f006_202002031200         SUCCEEDED          16.0           0.21
----------------------------------------------------------------------------------------------------
Total                                     COMPLETE                         34.66

The test is running without issue for me on Jet.
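As a quick cross-check of the summary above, the per-task core hours can be summed and compared with the reported total. This is a minimal Python sketch and not part of the WE2E framework; the regular expression and the names `SUMMARY`, `ROW`, and `parse_tasks` are illustrative assumptions, and the rows are copied from the Jet output above.

```python
import re

# Task rows copied from the detailed summary of experiment 2020_CAD on Jet.
SUMMARY = """\
make_grid_202002031200                    SUCCEEDED          19.0           0.13
get_extrn_ics_202002031200                SUCCEEDED         457.0           0.13
get_extrn_lbcs_202002031200               SUCCEEDED         840.0           0.23
make_orog_202002031200                    SUCCEEDED          42.0           0.28
make_sfc_climo_202002031200               SUCCEEDED          41.0           0.55
make_ics_mem000_202002031200              SUCCEEDED         212.0           2.83
make_lbcs_mem000_202002031200             SUCCEEDED         224.0           2.99
run_fcst_mem000_202002031200              SUCCEEDED         444.0          23.68
run_post_mem000_f000_202002031200         SUCCEEDED          69.0           0.92
run_post_mem000_f001_202002031200         SUCCEEDED          49.0           0.65
run_post_mem000_f002_202002031200         SUCCEEDED          16.0           0.21
run_post_mem000_f003_202002031200         SUCCEEDED          46.0           0.61
run_post_mem000_f004_202002031200         SUCCEEDED          45.0           0.60
run_post_mem000_f005_202002031200         SUCCEEDED          48.0           0.64
run_post_mem000_f006_202002031200         SUCCEEDED          16.0           0.21
"""

# One row: task name, status, walltime, core hours, separated by whitespace.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+([\d.]+)\s+([\d.]+)$")


def parse_tasks(text):
    """Yield (task, status, walltime, core_hours) for each task row in text."""
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            task, status, walltime, core_hours = match.groups()
            yield task, status, float(walltime), float(core_hours)


tasks = list(parse_tasks(SUMMARY))
total_core_hours = round(sum(t[3] for t in tasks), 2)
all_succeeded = all(t[1] == "SUCCEEDED" for t in tasks)
slowest = max(tasks, key=lambda t: t[2])

print(total_core_hours, all_succeeded, slowest[0])
```

The sum comes to 34.66, matching the reported total, and the task with the longest walltime is get_extrn_lbcs_202002031200, one of the two data-retrieval steps.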

Collaborator

@MichaelLueken MichaelLueken left a comment

@clouden90 These changes look good to me!

I was able to successfully run the new 2020_CAD test on Jet. I wouldn't expect the new test to work on Cheyenne, because data must be pre-staged on that machine, but the rest of the machines should be able to pull the necessary data from AWS. So, while I will give my approval, I'd like to see whether @EdwardSnyder-NOAA is still encountering issues with running the test on Hera, Jet, or Orion.

@EdwardSnyder-NOAA
Collaborator

I was able to get data on Jet now, so I'm not sure what went wrong the first time. Do we want this test to run on Cheyenne? If so, we should pre-stage the data there, which is something I can help with. I'll approve the PR once the data is staged or if the decision is not to run this on Cheyenne.

@clouden90
Contributor Author

> I was able to get data on Jet now, so not sure what went wrong the first time. Are we wanting this test to run on Cheyenne? If so, we should pre-stage the data there, which is something I can help with. I'll approve the PR once the data is staged or if the decision is not to run this on Cheyenne.

@EdwardSnyder-NOAA: Thanks for testing, and I'm glad to hear that you can now pass the test on Jet. Ideally, it would be great if we could make this test work on all the Tier 1 NOAA machines, including Cheyenne. Please note that the end date for this specific deliverable is 6/23. Do you think it's possible to have the pre-staged data ready on Cheyenne before that? In the meantime, I can add a note in the test config YAML file to notify users that this test will require pre-staged data on Cheyenne. Does this sound good?

@EdwardSnyder-NOAA
Collaborator

EdwardSnyder-NOAA commented Jun 15, 2023

@clouden90 - Yeah, we can pre-stage the data by then. It looks like this data is FV3GFS, so I'll place it with the other case/test input model data here: /glade/work/epicufsrt/contrib/UFS_SRW_data/develop/input_model_data/FV3GFS/nemsio/2020020312

@MichaelLueken
Collaborator

@clouden90 Before this PR is merged, since @EdwardSnyder-NOAA is working on staging the data on Cheyenne for the new test, the new 2020_CAD test should be added to the tests/WE2E/machine_suites/comprehensive* files, to ensure that the test is run on every machine as part of the comprehensive testing. I'd also recommend that the new test get added to one of the tests/WE2E/machine_suites/coverage.* files, so that it is run regularly as part of the Jenkins automated testing.
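Registering a test in a suite can be sketched as an idempotent append. This minimal Python example assumes that each machine_suites file is a plain-text list with one test name per line; that layout and the helper name `add_test_to_suite` are assumptions for illustration, not something confirmed in this thread.

```python
from pathlib import Path


def add_test_to_suite(suite_file: Path, test_name: str) -> bool:
    """Append test_name to suite_file unless it is already listed.

    Returns True if the file was modified, False if the test was present.
    """
    lines = suite_file.read_text().splitlines() if suite_file.exists() else []
    if test_name in (line.strip() for line in lines):
        return False
    lines.append(test_name)
    # Rewrite the file with exactly one trailing newline.
    suite_file.write_text("\n".join(lines) + "\n")
    return True
```

Calling it once per file, for example `add_test_to_suite(Path("tests/WE2E/machine_suites/coverage.orion"), "2020_CAD")`, leaves the file untouched if the test is already listed, so the call is safe to repeat.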

@clouden90
Contributor Author

> @clouden90 - Yeah, we can pre-stage the data by then. It looks like this data is FV3GFS, so I'll place it with the other case/test input model data here: /glade/work/epicufsrt/contrib/UFS_SRW_data/develop/input_model_data/FV3GFS/nemsio/2020020312

Thanks @EdwardSnyder-NOAA for the support! Since I do not have an account on Cheyenne, would you mind rerunning the test on Cheyenne once the pre-staged data is ready? Thanks

@EdwardSnyder-NOAA
Collaborator

> > @clouden90 - Yeah, we can pre-stage the data by then. It looks like this data is FV3GFS, so I'll place it with the other case/test input model data here: /glade/work/epicufsrt/contrib/UFS_SRW_data/develop/input_model_data/FV3GFS/nemsio/2020020312
>
> Thanks @EdwardSnyder-NOAA for the support! Since I do not have an account on Cheyenne, would you mind rerunning the test on Cheyenne once the pre-staged data is ready? Thanks

The data has been staged on Cheyenne and the test passed successfully. These are the changes (highlighted by **) I made to the config.2020_CAD.yaml in order for the get_extrn_* tasks to fetch the data locally:

task_get_extrn_ics:
  EXTRN_MDL_NAME_ICS: **FV3GFS**
  FV3GFS_FILE_FMT_ICS: nemsio
  **USE_USER_STAGED_EXTRN_FILES: true**
task_get_extrn_lbcs:
  EXTRN_MDL_NAME_LBCS: **FV3GFS**
  LBC_SPEC_INTVL_HRS: 3
  FV3GFS_FILE_FMT_LBCS: nemsio
  **USE_USER_STAGED_EXTRN_FILES: true**

@clouden90
Contributor Author

@EdwardSnyder-NOAA, thanks again for staging the data on Cheyenne and sharing the changes! I will add a note in the description section that includes your modifications, for users who are interested in running this test on Cheyenne.

@MichaelLueken
Collaborator

@EdwardSnyder-NOAA Unfortunately, if the USE_USER_STAGED_EXTRN_FILES variable is set to true in the config.2020_CAD.yaml file, then the data needs to be staged on every machine the test will run on. If pre-staged data isn't found, the test will fail.
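One way to picture the constraint: the staged-data flag is only safe to enable where the staging directory actually holds files. The Python sketch below builds per-machine overrides on that basis. It is a hypothetical helper, not something the workflow does itself; the function name `extrn_overrides` is an assumption, and only the task and variable names are taken from the YAML shared earlier in this thread.

```python
from pathlib import Path


def extrn_overrides(staged_dir: Path) -> dict:
    """Build overrides for the get_extrn_* tasks.

    USE_USER_STAGED_EXTRN_FILES is enabled only when staged_dir exists and
    contains at least one entry; otherwise the download path stays in effect.
    """
    has_data = staged_dir.is_dir() and any(staged_dir.iterdir())
    return {
        "task_get_extrn_ics": {"USE_USER_STAGED_EXTRN_FILES": has_data},
        "task_get_extrn_lbcs": {"USE_USER_STAGED_EXTRN_FILES": has_data},
    }
```

With this approach a single test definition could be kept, with the flag decided per machine at setup time, rather than hard-coding true and requiring staged data everywhere.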

@clouden90 Following the update to the develop branch this morning, there is now a conflict in Components.rst. Please merge the latest develop into your branch and correct the conflict, then we should be able to move forward. Thanks!

@clouden90
Contributor Author

> @EdwardSnyder-NOAA Unfortunately, if the USE_USER_STAGED_EXTRN_FILES variable is set to true in the config.2020_CAD.yaml file, then the data needs to be staged on all machines that it will be run on. If pre-staged data isn't found, then the test will fail.
>
> @clouden90 Following the update to the develop branch this morning, there is now a conflict in Components.rst. Please merge the latest develop into your branch and correct the conflict, then we should be able to move forward. Thanks!

@MichaelLueken, thanks! I have merged the latest develop and added the 2020_CAD test to comprehensive.orion and coverage.orion. @EdwardSnyder-NOAA's suggestions have also been added as a note in the description section for users who are interested in running this test on Cheyenne.

@clouden90
Contributor Author

@EdwardSnyder-NOAA, as @MichaelLueken mentioned, if the USE_USER_STAGED_EXTRN_FILES variable is set to true but pre-staged data isn't found, the test will fail. I have added your modification to the description section of config.2020_CAD.yaml. Feel free to let me know if you have any comments, suggestions, or concerns. If you find the changes acceptable, could you approve the pull request at your convenience? Thanks

Collaborator

@EdwardSnyder-NOAA EdwardSnyder-NOAA left a comment

Thanks for adding that note! LGTM.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Jun 21, 2023
@MichaelLueken
Collaborator

@clouden90 The Jenkins automated tests passed on Cheyenne, Hera, and Jet. The tests failed on Orion due to the inability to clone the ccpp-physics repository (a known issue that requires git/2.28.0 to be loaded in the .bashrc file on the machine before the tests can run). I am currently running the Jenkins tests manually on Orion. Once they are complete, I will move forward with merging this PR.

@MichaelLueken
Collaborator

MichaelLueken commented Jun 21, 2023

@clouden90 The manually submitted Jenkins tests on Orion have all passed. Moving forward with merging this PR now.

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
deactivate_tasks                                                   COMPLETE               1.07
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             760.15
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta   COMPLETE             265.20
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot        COMPLETE             140.06
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta            COMPLETE              15.93
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp  COMPLETE              14.14
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR              COMPLETE             384.49
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              29.58
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16    COMPLETE             281.31
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0         COMPLETE              14.95
nco                                                                COMPLETE               7.78
2020_CAD                                                           COMPLETE              31.48
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1946.14

@MichaelLueken MichaelLueken merged commit 4932f02 into ufs-community:develop Jun 21, 2023
@clouden90 clouden90 mentioned this pull request Jun 23, 2023
MichaelLueken pushed a commit that referenced this pull request Jun 26, 2023
The reference of CCPP-SCM was missing in PR #822. Here, we add it back.
Labels
enhancement New feature or request run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests