[develop] Update WM and UPP hashes and minor rearrangement of WE2E coverage tests that fail on certain platforms #799
Conversation
…t out WE2E coverage tests that are currently failing.
…ld and run on Gaea
LGTM
@MichaelLueken Sorry, I am not very up to date on current development due to other obligations: what commit(s) resulted in the failures for the nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km and grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta tests? We shouldn't be allowing existing failures unless there is a very good reason for it.

This also reminds me of the issue around the known failure of grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta (#731). I didn't have time to attempt a fix for it, but I was at least able to point to the change that introduced the failure. We need to have someone assigned to fix these failures prior to the release; it shouldn't be something we just sweep under the rug.
@mkavulich No worries. The failures in the … Since this PR is fairly open-ended, I can certainly attempt to work through these failures as part of this PR, rather than comment them out in the coverage tests. For instance, it's possible that reverting the modifications made to the …
Thanks for the reply @MichaelLueken. I think it is okay to re-allocate tests to different machines if they are failing only on one or two (likely machine-specific memory/node problems), but these all-platform run_fcst failures are more serious. I apologize for bringing this up as a roadblock on your PR since you didn't introduce these failures, but I feel it is very important to avoid setting the precedent of simply kicking the can down the road on test failures. At the very least, any known failure should be documented as a high-priority issue.

I'm curious to hear your thoughts on the best way forward here. I am willing to help investigate and open issues for existing failures if that would be helpful, but I do not have time to try to debug model segfaults and related mysterious errors caused by namelist/configure/OpenMP changes that I'm not very familiar with. If there is a solid plan with people assigned to solve the problem, I would be okay with temporarily removing these failing tests and merging this PR, but there must be a specific plan to resolve these failures soon, prior to the release. Especially since the …
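As a side note on the re-allocation itself: a quick way to confirm that a test moved off one machine still has a home somewhere is to cross-check the per-platform coverage lists. The sketch below assumes those lists are plain-text files named coverage.&lt;platform&gt; under tests/WE2E/machine_suites/ with one test name per line; the path, naming, and the helper itself are illustrative assumptions, not part of this PR.

```python
#!/usr/bin/env python3
"""Hypothetical helper: report which platform(s) run each WE2E coverage test.

Assumes per-platform coverage lists are plain-text files named
coverage.<platform> under tests/WE2E/machine_suites/, one test per line.
"""
from collections import defaultdict
from pathlib import Path

SUITE_DIR = Path("tests/WE2E/machine_suites")

def coverage_map(suite_dir: Path = SUITE_DIR) -> dict:
    """Map each test name to the platforms whose coverage list contains it."""
    tests = defaultdict(list)
    for coverage_file in sorted(suite_dir.glob("coverage.*")):
        platform = coverage_file.name.split("coverage.", 1)[1]
        for line in coverage_file.read_text().splitlines():
            name = line.strip()
            if name and not name.startswith("#"):
                tests[name].append(platform)
    return tests

if __name__ == "__main__":
    for name, platforms in sorted(coverage_map().items()):
        flag = " (multiple platforms)" if len(platforms) > 1 else ""
        print(f"{name}: {', '.join(platforms)}{flag}")
```

Running it before and after a rearrangement and diffing the output makes it easy to spot a test that was dropped from every platform rather than moved.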
@mkavulich It looks like the failure with the …
… to pass with current UFS-WM hash, revert modulefiles/build_gaea_intel.lua to current develop, and comment out Gaea in Jenkinsfile.
… allow unit tests to pass.
@mkavulich I have gone through all of the preceding hashes and it looks like the issue with the …
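For readers curious what "going through all of the preceding hashes" can look like in practice, here is a hypothetical sketch of walking the ufs-weather-model commits between the previously pinned hash and the candidate hash, rebuilding, and rerunning the failing test at each step. The checkout path, placeholder hashes, build flags, and the run_one_we2e_test.sh wrapper are all assumptions rather than the actual commands used here; git bisect run would do the same walk more efficiently.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: find the first WM commit that breaks a WE2E test."""
import subprocess

WM_DIR = "sorc/ufs-weather-model"      # assumed checkout location of the WM external
OLD_HASH = "<previously pinned hash>"  # placeholder
NEW_HASH = "<candidate hash>"          # placeholder
FAILING_TEST = "grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR"

def run(cmd, cwd=None):
    """Run a command, echo it, and return its exit code."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd, cwd=cwd, check=False).returncode

# Oldest-first list of WM commits between the two pinned hashes.
commits = subprocess.check_output(
    ["git", "rev-list", "--reverse", f"{OLD_HASH}..{NEW_HASH}"],
    cwd=WM_DIR, text=True).split()

for commit in commits:
    run(["git", "checkout", commit], cwd=WM_DIR)
    if run(["./devbuild.sh", "--platform=hera", "--compiler=intel"]) != 0:
        print(f"build failed at {commit}")
        break
    # run_one_we2e_test.sh is a hypothetical wrapper that submits the WE2E
    # test and blocks until the workflow reports success or failure.
    if run(["./run_one_we2e_test.sh", FAILING_TEST]) != 0:
        print(f"first failing WM commit: {commit}")
        break
```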
Thanks @MichaelLueken for your continued work on this. I have some welcome (if weird) news! I made a run of the …

My strong suspicion is that the changes in ufs-community/ccpp-physics#43 by Joe Olson are responsible for the failures, since HRRR includes the MYNN-EDMF scheme that was heavily modified in that PR. It is very curious that increasing the time step would resolve this failure; I may drop Joe a note to see if this behavior makes any sense to him.
@mkavulich Very welcome news indeed! Thank you so much for helping to figure out what was causing issues with the updated WM hash! I've set DT_ATMOS to 30 for this grid and have resubmitted the test. Once it is complete, I will push the change. Have a great weekend!
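For context, a per-test DT_ATMOS override of this kind is just a small edit to the test's configuration. The sketch below assumes the test is configured through a YAML file with DT_ATMOS under a task_run_fcst block; the file path and key placement are assumptions and may not match the configs touched in this PR.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of overriding DT_ATMOS for a single WE2E test config."""
import yaml  # PyYAML

# Assumed location and layout of the WE2E test config (may differ in the repo).
CONFIG = ("tests/WE2E/test_configs/grids_extrn_mdls_suites_community/"
          "config.grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR.yaml")

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

print("current DT_ATMOS:", cfg.get("task_run_fcst", {}).get("DT_ATMOS"))

# Shorten the time step for this grid only (value quoted from the comment above).
cfg.setdefault("task_run_fcst", {})["DT_ATMOS"] = 30

with open(CONFIG, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```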
…llow grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test to run to completion on Jet.
…-srweather-app into feature/WM-hash-update
@mkavulich The changes made to the DT_ATMOS value to make the …
@MichaelLueken It doesn't make any sense at all that changing the time step would cause plotting to fail (even less sense than increasing the time step to solve a CFL violation; at least that is all related to the dynamical core!). If valid output is produced, then the plotting scripts should be able to plot it. Does the test succeed if you manually change DT_ATMOS back to 180 for that plotting test?
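To make the CFL aside concrete, here is an illustrative back-of-the-envelope advective Courant number for the RRFS_CONUS_25km grid; the 50 m/s wind speed is an assumed round number, not a value from this PR:

$$C = \frac{u\,\Delta t}{\Delta x}, \qquad \frac{50 \times 180}{25\,000} \approx 0.36, \qquad \frac{50 \times 150}{25\,000} \approx 0.30$$

Both values sit well below 1 and only shrink as the time step shrinks, which is why failures that appear at the shorter time steps and vanish at DT_ATMOS=180 run against the usual stability intuition.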
@mkavulich I completely agree. Using DT_ATMOS of 150, everything passes except for the plotting task. If the rest of the tasks complete successfully, then the plotting script should be able to plot the output; however, this isn't the case. Yes, changing DT_ATMOS back to 180 allows the plotting test to run successfully. When DT_ATMOS was set to 160 or 170, there were failures in the post tasks, and decreasing it further to 140 causes the run_fcst job to fail. The failure in the plotting script using DT_ATMOS of 150 appears to be related to a self-intersection point:

…
Again, setting DT_ATMOS to 180 solves this failure. Of course, using this value also causes three tests to fail.
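The "self-intersection point" wording is characteristic of the topology errors GEOS/Shapely raise when a polygon boundary crosses itself. The snippet below is a generic illustration of that failure mode and one common repair, not code taken from the SRW plotting scripts.

```python
#!/usr/bin/env python3
"""Generic illustration of a self-intersecting polygon and a common repair."""
from shapely.geometry import Polygon
from shapely.validation import explain_validity

# A bow-tie polygon: its two edges cross, so the boundary self-intersects.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

print(bowtie.is_valid)            # False
print(explain_validity(bowtie))   # e.g. "Self-intersection[1 1]"

# A zero-width buffer is the classic trick for splitting the shape into
# valid pieces before handing it to a plotting library.
repaired = bowtie.buffer(0)
print(repaired.is_valid, repaired.geom_type)  # True; a Polygon or MultiPolygon
```

Newer Shapely releases also provide shapely.validation.make_valid, which keeps both lobes of the bow-tie instead of relying on the buffer(0) trick.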
This is problematic. It's similar to the AK domain problem, but in this case the test isn't using the MYNN-EDMF scheme, so your guess is as good as mine. The hacky solution would be to change back to DT_ATMOS=180 just for the plotting test, but I am worried that we are just masking potential problems here that users will start to run into if they make any modifications. Do you know if there is anyone on the weather model team more familiar with the dynamical core who can offer us any assistance here?
@mkavulich I've reached out to Jong to see if he can recommend someone who can hopefully assist with this issue.
* add metplus paths * add run_vx.local for jet Co-authored-by: Edward Snyder <[email protected]>
Following the discussion during the SRW CM meeting, I will update the …
…V3GFS_suite_GFS_v17_p8_plot WE2E test.
The Jenkins test on Cheyenne GNU successfully passed. Moving forward with merging this PR now. For more details on the …
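For readers less familiar with how these hashes are pinned: Externals.cfg is the INI-style file read by manage_externals (checkout_externals), and a hash update like this one is an edit to the hash field of the relevant component section. Below is a minimal sketch of listing what is pinned; the section and field names are simply read from whatever the file contains, and nothing in it is specific to this PR.

```python
#!/usr/bin/env python3
"""Minimal sketch: list the components pinned in Externals.cfg."""
import configparser

cfg = configparser.ConfigParser()
cfg.read("Externals.cfg")

for section in cfg.sections():
    if section == "externals_description":
        continue  # schema metadata, not a pinned component
    ref = (cfg[section].get("hash")
           or cfg[section].get("tag")
           or cfg[section].get("branch"))
    print(f"{section}: {cfg[section].get('repo_url', '?')} @ {ref}")
```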
DESCRIPTION OF CHANGES:
- The WM and UPP hashes in the Externals.cfg file were updated to current versions.
- The nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km test that is constantly failing on Cheyenne GNU and the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test on Hera Intel have been moved to platforms where the tests successfully run.
- Changes to DT_ATMOS for the RRFS_CONUS_25km grid were made to allow the WE2E tests to run after updating to the current UFS-WM hash. Please note that there is still an issue with the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test on Jet.

Type of change
TESTS CONDUCTED:
CHECKLIST