-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sporadic floating point errors in a2b_edge.F90 for regional configurations #346
Comments
@SamuelTrahanNOAA I am seeing same bug behavior with ufs-community/ufs-weather-model#2362. It points to #344 |
Hi, all. This PR only makes changes to the diagnostic omega and not to any
of the prognostic variables so there is probably some loop index or bounds
error. Does debug mode include bounds checking? Also check the subroutine
calls closely to make sure array bounds are set up correctly.
Thanks,
Lucas
…On Thu, Jul 11, 2024 at 1:29 PM JONG KIM ***@***.***> wrote:
@SamuelTrahanNOAA <https://github.com/SamuelTrahanNOAA> I am seeing same
bug behavior with ufs-community/ufs-weather-model#2362
<ufs-community/ufs-weather-model#2362>.
—
Reply to this email directly, view it on GitHub
<#346 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AMUQRVHY2RGSRTZJVBAMHETZL26HRAVCNFSM6AAAAABKXL6OR6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRTGUYDAMRSGM>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Yes, debug mode includes bounds checking and tests for various floating-point errors. In most cases, the dynamical core fails in debug mode in regional tests, where it aborts due to floating point exceptions. Many debug tests are already disabled because of existing unknown bugs. We really can't afford to disable the remaining tests due to new bugs. |
I am confused now. Is this in the global-nested configuration (as the title
suggests), or only in regional domains?
…On Thu, Jul 11, 2024 at 2:04 PM Samuel Trahan (NOAA contractor) < ***@***.***> wrote:
Yes, debug mode includes bounds checking and tests for various
floating-point errors.
In most cases, the dynamical core fails in debug mode in regional tests
with floating point exceptions. Many debug tests are already disabled
because of existing unknown bugs. We really can't afford to disable the
remaining tests due to new bugs.
—
Reply to this email directly, view it on GitHub
<#346 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AMUQRVBUCTG2Y25W7FJZIO3ZL3CJDAVCNFSM6AAAAABKXL6OR6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRTGU2TSNZQGQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Regional configurations. I've corrected the title; sorry about that. |
These are the last three tests that have failed for me:
The only commonalities I see are:
|
@SamuelTrahanNOAA 's list is correct. I am still running on Derecho and Gaea to make sure again. If 577fd5e is going to be reverted, then I can turn off https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L35-L36 |
I am a little confused that why #344 is the issue. I do not think to revert dycore update is an option. We need this update for GFSv17/GEFSv13. It is better to identify what is really causing the problem. |
@XiaqiongZhou-NOAA my mistake! ufs-community/ufs-weather-model#2327 doesn't have a test case changed. @lharris4 @SamuelTrahanNOAA 577fd5e can be reverted w/o any change on UFS-WM level. |
I don't know why @jkbk2004 mentioned that test, but it is not one of the ones that is failing for me. |
@SamuelTrahanNOAA I was confused. @lharris4 @laurenchilutti @bensonr Can we make a decision to revert 577fd5e ? |
@XiaqiongZhou-NOAA These are the only tests that fail for me:
I explain in detail in my comment #346 (comment) |
If you would like this reverted, we should do it via a PR so you can rerun the UFS tests. If Lucas and Rusty agree, I can put in a PR with this Merge being reverted for you to test. |
Lauren:
@SamuelTrahanNOAA |
Try running on HERA. I haven't tested this on Hercules, so I don't know if it'll fail there. Uninitialized memory and out-of-bounds accesses can be troublesome like that. Change one little thing, and the contents of that memory are different. |
@SamuelTrahanNOAA I ran on hera/hercules/gaea/derecho. It's random behavior but those 3 cases are commonly crashing same line 382 of atmos_cubed_sphere/model/a2b_edge.F90. |
I've tried two changes:
EDIT: Updated comment to reflect that in item 1, the results changed for the two jobs that ran to completion. |
@SamuelTrahanNOAA Can you try this change in a2b_edge.F90 diff --git a/model/a2b_edge.F90 b/model/a2b_edge.F90
index c4530a1..0c5de7e 100644
--- a/model/a2b_edge.F90
+++ b/model/a2b_edge.F90
@@ -377,8 +377,8 @@ contains
if (gridstruct%bounded_domain) then
- do j=js-2,je+1+2
- do i=is-2,ie+1+2
+ do j=js,je+1
+ do i=is,ie+1
qout(i,j) = 0.25*(qin(i-1,j-1)+qin(i,j-1)+qin(i-1,j)+qin(i,j))
enddo
enddo
diff --git a/model/dyn_core.F90 b/model/dyn_core.F90
index 15df82f..f469e30 100644
--- a/model/dyn_core.F90
+++ b/model/dyn_core.F90
@@ -166,6 +166,12 @@ public :: dyn_core, del2_cubed, init_ijk_mem
integer :: kmax=1
real, parameter :: rad2deg = 180./pi
+#ifdef OVERLOAD_R4
+ real, parameter:: real_snan=real(Z'FFBFFFFF')
+#else
+ real, parameter:: real_snan=real(Z'FFF7FFFFFFFFFFFF')
+#endif
+
contains
!-----------------------------------------------------------------------
@@ -1627,6 +1633,9 @@ integer :: is, ie, js, je
js = bd%js
je = bd%je
+ pin = real_snan
+ pb = real_snan
+
!$OMP parallel do default(none) shared(is,ie,js,je,npz,ua,va,gridstruct,pem,npx,npy,ng,om) &
!$OMP private(n, pdx, pdy, pin, pb, up, vp, grad, v3)
do k=1,npz |
Dusan's fix worked for me. All three jobs succeeded the first time. |
@SamuelTrahanNOAA let me test on gaea/hercules/hera. |
All those cases pass ok Hera/Hercules/Gaea/Derecho.
@DusanJovic-NOAA @SamuelTrahanNOAA Will you create PR ? |
I'd rather not do it since this is neither my fix nor my code, and I have too much going on already. @DusanJovic-NOAA - Can you do the PR? |
Opened PR #349 |
PR #349 merged into dev/emc |
Describe the bug
Regional configurations of UFS FV3 abort sporadically with a floating-point exception in subroutine a2b_ord2 in FV3/atmos_cubed_sphere/model/a2b_edge.F90 when compiled in debug mode. The crash is here:
Full stack trace
The crash is a floating-point exception. There are only additions and multiplications, so the exception is probably from a NaN. This could be due to uninitialized memory, or due to not filling boundary conditions (which are initialized with signalling NaN).
Crashes seem to start after #344 was merged. If so, that PR shouldn't have been merged; the regression test system should've detected this problem. Unfortunately, the ufs-weather-model regression test system is presently unable to detect the difference between a crash and a test's results changing. A fix for the regression test system bug is being tested now.
Unfortunately, we're stuck with broken authoritative branches until this bug is fixed.
From skimming the changes in #344, my best guess is that some parts of the omga array are uninitialized for regional cases due to removing the initialization loop. I haven't had a chance to test that hypothesis yet.
To Reproduce
The fix for the regression test system is in this PR:
That is being tested now. Once it's merged, model crashes will be detectable in regression tests once again.
Expected behavior
Model runs to completion when compiled in debug mode.
System Environment
UFS Weather Model regression test system with Intel compiler on Hera. That's Intel 2021.5.0 with IMPI 2021.5.1 and FMS 2023.04 using Spack Stack 1.6.0.
Here's the uname -a output from a login node:
Linux hfe09 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Wed Sep 20 15:55:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Additional context
Can't think of anything.
The text was updated successfully, but these errors were encountered: