Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Update WM and UPP hashes and minor rearrangement of WE2E coverage tests that fail on certain platforms #799

Merged

Conversation

MichaelLueken
Copy link
Collaborator

@MichaelLueken MichaelLueken commented May 15, 2023

DESCRIPTION OF CHANGES:

  • The WM and UPP hashes in the Externals.cfg file were updated to current versions.
  • The nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km test that is constantly failing on Cheyenne GNU and the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test on Hera Intel have been moved to platforms where the tests successfully run.
  • Minor changes to the DT_ATMOS for the RRFS_CONUS_25km grid were made to allow the WE2E tests run after updating to the current UFS-WM hash. Please note that there is still an issue with the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test on Jet.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

TESTS CONDUCTED:

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental (coverage) test suite
  • comprehensive tests (specify which if a subset was used)

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes

Copy link
Collaborator

@danielabdi-noaa danielabdi-noaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@mkavulich mkavulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelLueken Sorry I am not very up-to-date on current development due to other obligations: what commit(s) resulted in the failure for the nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km and grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta tests? We shouldn't be allowing existing failures unless there is a very good reason for this.

This also reminds me of the issue around the known failure of grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta (#731). I didn't have time to attempt a fix of this, but I was at least able to point to the change that resulted in this failure. We need to have someone assigned to fix these failures prior to the release, it shouldn't be something we just sweep under the rug.

@MichaelLueken
Copy link
Collaborator Author

@MichaelLueken Sorry I am not very up-to-date on current development due to other obligations: what commit(s) resulted in the failure for the nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km and grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta tests? We shouldn't be allowing existing failures unless there is a very good reason for this.

This also reminds me of the issue around the known failure of grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta (#731). I didn't have time to attempt a fix of this, but I was at least able to point to the change that resulted in this failure. We need to have someone assigned to fix these failures prior to the release, it shouldn't be something we just sweep under the rug.

@mkavulich No worries. The failures in nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km and grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta were first seen for the PRs that came in directly after PR #732 was merged - PRs #745 and #729, respectively.

The nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km test passes on any machine that isn't Cheyenne, so it might be a similar issue to what was encountered in PR #732 with the grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2test. The failure is also very reminiscent - only failing intermittently in the run_post tasks.

The grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test is failing in the run_fcst step, similar to what you noted in issue #731 for the grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test.

Since this PR is fairly open-ended, I can certainly attempt to work through these failures as part of this PR, rather than comment them out in the coverage tests.

For instance, it's possible that the reverting the modifications made to the model_configure file will allow both the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta and grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta tests to pass.

@mkavulich
Copy link
Collaborator

Thanks for the reply @MichaelLueken. I think it is okay to re-allocate tests to different machines if they are failing only on one or two (likely machine-specific memory/node problems), but these all-platform run_fcst failures are more serious. I apologize for bringing this up as a roadblock on your PR since you didn't introduce these failures, but I feel like it is very important to avoid setting the precedent of simply kicking the can down the road on test failures. At the very least, any known failure should be documented as a high-priority issue.

I'm curious to hear your thoughts on the best way forward here. I am willing to help investigate and open issues for existing failures if that would be helpful, but I do not have time to try to debug model segfaults and related mysterious errors caused by namelist/configure/openmp changes that I'm not very familiar with. If there is a solid plan with people assigned to solve the problem, I would be okay with temporarily removing these failing tests and merging this PR, but there must be a specific plan to resolve these failures soon, prior to the release. Especially since the grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta domain is specifically used for the RRFS.

@MichaelLueken
Copy link
Collaborator Author

@mkavulich It looks like the failure with the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR is due to changes from a separate PR in the ufs-weather-model. I'm currently testing the hash from #1720, and this test is still failing. This might be something that we can correct in the SRW, but I need to still find which hash is causing the failure first.

… to pass with current UFS-WM hash, revert modulefiles/build_gaea_intel.lua to current develop, and comment out Gaea in Jenkinsfile.
@MichaelLueken
Copy link
Collaborator Author

The modulefiles/build_gaea_intel.lua modulefile has been reverted to the what is currently in develop. Once the weather model has merged @natalie-perlin's PR to use the latest hpc-stack location, a PR will be opened in the SRW App to make the same update. In the interim, Gaea has been deactivated in the Jenkinsfile (the SRW App is unable to build without an updated modulefile). Applied @mkavulich's fix to DT_ATMOS (reducing the value from 180 to 150) for the RRFS_CONUS_25km grid. Currently, the only test that is failing is the Jet test - grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR. Will continue to try and find the hash that causes this test to fail and will hopefully be able to correct it at that time.

@MichaelLueken MichaelLueken changed the title [develop] Update WM, UPP, and UFS_UTILS hashes, comment out WE2E coverage tests that consistently fail, and update Gaea modulefile [develop] Update WM and UPP hashes and minor rearrangement of WE2E coverage tests that fail on certain platforms May 26, 2023
@MichaelLueken
Copy link
Collaborator Author

@mkavulich I have gone through all of the preceding hashes and it looks like the issue with the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test was brought about by the changes in UFS-WM hash 889254a and is associated with PR #1658. I will now look into what changes might be causing this test to fail. If you see anything in this PR that looks like the culprit, please let me know. Thank you very much!

@mkavulich
Copy link
Collaborator

Thanks @MichaelLueken for your continued work on this. I have some welcome (if weird) news! I made a run of the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR case on Jet with an increased timestep of DT_ATMOS=20 to see if I could get the failure to occur closer to the first output forecast hour (to see if I could directly observe some of the developing numerical instability). And yet when I did so, the case ran to completion! I also tried DT_ATMOS=30 and DT_ATMOS=45 and they both succeeded as well...so I think we can be comfortable increasing the default DT_ATMOS for that domain to either of those values! 30 is probably safer.

My strong suspicion is that the changes in ufs-community/ccpp-physics#43 by Joe Olson are responsible for the failures, since HRRR includes the MYNN-EDMF scheme that was heavily modified in that PR. It is very curious that increasing the time step would resolve this failure, I may drop Joe a note to see if this behavior makes any sense to him.

@MichaelLueken
Copy link
Collaborator Author

@mkavulich Very welcome news indeed! Thank you so much for assisting with trying to figure out what was causing issues with the updated WM hash! I've set the DT_ATMOS to 30 for this grid and have resubmitted the test. Once it is complete, I will push the change. Have a great weekend!

…llow grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test to run to completion on Jet.
@MichaelLueken
Copy link
Collaborator Author

@mkavulich The changes made to the DT_ATMOS value to make the specify_template_filenames, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2, and GST_release_public_v1 tests pass using the updated WM hash, decreasing the value from 180 to 150, is now causing the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot test to fail on Cheyenne Intel. Values of 160 and 170 were also tested for this case, but ultimately resulted in failure. Just wanted to let you know why this work hasn't been merged as of this time.

@mkavulich
Copy link
Collaborator

@MichaelLueken It doesn't make any sense at all that changing the time step would cause plotting to fail (even less sense than increasing time steps to solve a CFL violation; at least that is all related to the dynamical core!). If valid output is produced, then the plotting scripts should be able to plot it.

Does the test succeed if you manually change DT_ATMOS back to 180 for that plotting test?

@MichaelLueken
Copy link
Collaborator Author

@MichaelLueken It doesn't make any sense at all that changing the time step would cause plotting to fail (even less sense than increasing time steps to solve a CFL violation; at least that is all related to the dynamical core!). If valid output is produced, then the plotting scripts should be able to plot it.

Does the test succeed if you manually change DT_ATMOS back to 180 for that plotting test?

@mkavulich I completely agree. Using DT_ATMOS of 150, everything successfully passes, except for the plotting task. If the rest of the tasks successfully complete, then the plotting script should be able to successfully plot. However, this isn't the case. Yes, changing DT_ATMOS back to 180 allows the plotting test to successfully run. When DT_ATMOS was set to 160 and 170, there were failures in the post tasks. Further decreasing DT_ATMOS to 140 causes the run_fcst job to fail. The failure in the plotting script using DT_ATMOS 150 appears to be related to a self-intersection point:

ERROR: TopologyException: side location conflict at -97.608377902731846 34.009777791998118
INFO: Self-intersection at or near point -97.878776042255282 34.00918460186481

Again, setting DT_ATMOS to 180 solves this failure. Of course, using this value also causes three tests to fail.

@mkavulich
Copy link
Collaborator

Further decreasing DT_ATMOS to 140 causes the run_fcst job to fail.

This is problematic. Similar to the AK domain problem, but in this case it's not using the MYNN-EDMF scheme....so your guess is as good as mine.

The hacky solution would be to change back to DT_ATMOS=180 just for the plotting test. But I am worried that we are just masking potential problems here that users will start to run into if they make any modifications. Do you know if there is anyone on the weather model team more familiar with the dynamical core that can offer us any assistance here?

@MichaelLueken
Copy link
Collaborator Author

@mkavulich I've reached out to Jong to see if he can recommend someone that can hopefully assist with this issue.

michelleharrold pushed a commit to michelleharrold/ufs-srweather-app that referenced this pull request Jun 7, 2023
* add metplus paths

* add run_vx.local for jet

Co-authored-by: Edward Snyder <[email protected]>
@MichaelLueken
Copy link
Collaborator Author

Following the discussion during the SRW CM meeting, I will update the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot test config file to use DT_ATMOS = 180 and then create an issue on the issue.

@MichaelLueken
Copy link
Collaborator Author

The Jenkins test on Cheyenne GNU successfully passed. Moving forward with merging this PR now. For more details on the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot test filaure, please see issue #827.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants