sporadic floating point errors in a2b_edge.F90 for regional configurations #346

Closed
SamuelTrahanNOAA opened this issue Jul 11, 2024 · 25 comments

Comments

@SamuelTrahanNOAA

Describe the bug

Regional configurations of UFS FV3 abort sporadically with a floating-point exception in subroutine a2b_ord2 in FV3/atmos_cubed_sphere/model/a2b_edge.F90 when compiled in debug mode. The crash is here:

    if (gridstruct%grid_type < 3) then

       if (gridstruct%bounded_domain) then

          do j=js-2,je+1+2   
             do i=is-2,ie+1+2
                qout(i,j) = 0.25*(qin(i-1,j-1)+qin(i,j-1)+qin(i-1,j)+qin(i,j)) ! <------- crashes here
             enddo
          enddo

       else
Full stack trace
112:
112: WARNING from PE   112: atmos_modeldefine_blocks_packed: domain (  33  19) is not an even divisor with definition (  32) - blocks will not be uniform with a remainder of   19
112:
112: [h11c41:455655:0:455655] Caught signal 8 (Floating point exception: floating-point invalid operation)
112: ==== backtrace (tid: 455655) ====
112:  0 0x00000000000534e9 ucs_debug_print_backtrace()  ???:0
112:  1 0x0000000000012cf0 __funlockfile()  :0
112:  2 0x0000000004ba5714 a2b_edge_mod_mp_a2b_ord2_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/a2b_edge.F90:382
112:  3 0x0000000002bccce6 L_dyn_core_mod_mp_adv_pe__1630__par_loop0_2_108()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1665
112:  4 0x000000000013fbb3 __kmp_invoke_microtask()  ???:0
112:  5 0x00000000000bbfac __kmp_fork_call()  /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:2111
112:  6 0x000000000007dcb5 __kmpc_fork_call()  /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_csupport.cpp:358
112:  7 0x0000000002bc674f dyn_core_mod_mp_adv_pe_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1630
112:  8 0x0000000002b689ea dyn_core_mod_mp_dyn_core_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1280
112:  9 0x0000000002ce48d4 fv_dynamics_mod_mp_fv_dynamics_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/fv_dynamics.F90:683
112: 10 0x00000000028bd928 atmosphere_mod_mp_atmosphere_dynamics_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90:683
112: 11 0x00000000020b079c atmos_model_mod_mp_update_atmos_model_dynamics_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_model.F90:880
112: 12 0x0000000001b4014c module_fcst_grid_comp_mp_fcst_run_phase_1_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/module_fcst_grid_comp.F90:1330
112: 13 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 14 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 15 0x000000000094dbea ESMCI::VMK::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:1247
112: 16 0x000000000121eeaf ESMCI::VM::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 17 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 18 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 19 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 20 0x0000000001b0b54e fv3atm_cap_mod_mp_modeladvance_phase1_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/fv3_cap.F90:1077
112: 21 0x0000000001b0a615 fv3atm_cap_mod_mp_modeladvance_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/fv3_cap.F90:1026
112: 22 0x00000000006aba58 ESMCI::MethodElement::execute()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
112: 23 0x00000000006ab9ba ESMCI::MethodTable::execute()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
112: 24 0x00000000006aa582 c_esmc_methodtableexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
112: 25 0x000000000047c492 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
112: 26 0x0000000004e0e71d nuopc_modelbase_mp_routine_run_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2212
112: 27 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 28 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 29 0x000000000094d9da ESMCI::VMK::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
112: 30 0x000000000121eeaf ESMCI::VM::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 31 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 32 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 33 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 34 0x00000000008d1317 nuopc_driver_mp_routine_executegridcomp_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3694
112: 35 0x00000000008d0b6a nuopc_driver_mp_executerunsequence_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3940
112: 36 0x00000000006aba58 ESMCI::MethodElement::execute()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
112: 37 0x00000000006ab9ba ESMCI::MethodTable::execute()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
112: 38 0x00000000006aa582 c_esmc_methodtableexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
112: 39 0x000000000047c492 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
112: 40 0x00000000008cdbb2 nuopc_driver_mp_routine_run_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3615
112: 41 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 42 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 43 0x000000000094d9da ESMCI::VMK::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
112: 44 0x000000000121eeaf ESMCI::VM::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 45 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 46 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 47 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 48 0x000000000042fae6 MAIN__()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/driver/UFS.F90:406
112: 49 0x000000000042bfa2 main()  ???:0
112: 50 0x000000000003ad85 __libc_start_main()  ???:0
112: 51 0x000000000042beae _start()  ???:0
112: =================================
112: forrtl: error (75): floating point exception
112: Image              PC                Routine            Line        Source
112: fv3.exe            000000000C1EE34B  Unknown               Unknown  Unknown
112: libpthread-2.28.s  0000150AC4D0BCF0  Unknown               Unknown  Unknown
112: fv3.exe            0000000004BA5714  a2b_edge_mod_mp_a         382  a2b_edge.F90
112: fv3.exe            0000000002BCCCE6  dyn_core_mod_mp_a        1665  dyn_core.F90
112: libiomp5.so        0000150AC7D74BB3  __kmp_invoke_micr     Unknown  Unknown
112: libiomp5.so        0000150AC7CF0FAC  __kmp_fork_call       Unknown  Unknown
112: libiomp5.so        0000150AC7CB2CB5  __kmpc_fork_call      Unknown  Unknown
112: fv3.exe            0000000002BC674F  dyn_core_mod_mp_a        1630  dyn_core.F90
112: fv3.exe            0000000002B689EA  dyn_core_mod_mp_d        1280  dyn_core.F90
112: fv3.exe            0000000002CE48D4  fv_dynamics_mod_m         683  fv_dynamics.F90
112: fv3.exe            00000000028BD928  atmosphere_mod_mp         683  atmosphere.F90
112: fv3.exe            00000000020B079C  atmos_model_mod_m         880  atmos_model.F90
112: fv3.exe            0000000001B4014C  module_fcst_grid_        1330  module_fcst_grid_comp.F90
112: fv3.exe            0000000000AA2644  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA61EF  Unknown               Unknown  Unknown
112: fv3.exe            000000000094DBEA  Unknown               Unknown  Unknown
112: fv3.exe            000000000121EEAF  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA3A8A  Unknown               Unknown  Unknown
112: fv3.exe            0000000000970D50  Unknown               Unknown  Unknown
112: fv3.exe            0000000000CA5351  Unknown               Unknown  Unknown
112: fv3.exe            0000000001B0B54E  fv3atm_cap_mod_mp        1077  fv3_cap.F90
112: fv3.exe            0000000001B0A615  fv3atm_cap_mod_mp        1026  fv3_cap.F90
112: fv3.exe            00000000006ABA58  Unknown               Unknown  Unknown
112: fv3.exe            00000000006AB9BA  Unknown               Unknown  Unknown
112: fv3.exe            00000000006AA582  Unknown               Unknown  Unknown
112: fv3.exe            000000000047C492  Unknown               Unknown  Unknown
112: fv3.exe            0000000004E0E71D  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA2644  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA61EF  Unknown               Unknown  Unknown
112: fv3.exe            000000000094D9DA  Unknown               Unknown  Unknown
112: fv3.exe            000000000121EEAF  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA3A8A  Unknown               Unknown  Unknown
112: fv3.exe            0000000000970D50  Unknown               Unknown  Unknown
112: fv3.exe            0000000000CA5351  Unknown               Unknown  Unknown
112: fv3.exe            00000000008D1317  Unknown               Unknown  Unknown
112: fv3.exe            00000000008D0B6A  Unknown               Unknown  Unknown
112: fv3.exe            00000000006ABA58  Unknown               Unknown  Unknown
112: fv3.exe            00000000006AB9BA  Unknown               Unknown  Unknown
112: fv3.exe            00000000006AA582  Unknown               Unknown  Unknown
112: fv3.exe            000000000047C492  Unknown               Unknown  Unknown
112: fv3.exe            00000000008CDBB2  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA2644  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA61EF  Unknown               Unknown  Unknown
112: fv3.exe            000000000094D9DA  Unknown               Unknown  Unknown
112: fv3.exe            000000000121EEAF  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA3A8A  Unknown               Unknown  Unknown
112: fv3.exe            0000000000970D50  Unknown               Unknown  Unknown
112: fv3.exe            0000000000CA5351  Unknown               Unknown  Unknown
112: fv3.exe            000000000042FAE6  MAIN__                    406  UFS.F90
112: fv3.exe            000000000042BFA2  Unknown               Unknown  Unknown
112: libc-2.28.so       0000150AC4756D85  __libc_start_main     Unknown  Unknown
112: fv3.exe            000000000042BEAE  Unknown               Unknown  Unknown

The crash is a floating-point exception. There are only additions and multiplications, so the exception is probably from a NaN. This could be due to uninitialized memory, or due to not filling boundary conditions (which are initialized with signalling NaN).
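
To illustrate that reasoning, here is a minimal stand-alone sketch (not FV3 code; the program name and array sizes are made up for illustration). It plants a signalling NaN in one of the four input cells and evaluates the same four-point average. Built without floating-point trapping, it should simply report a NaN result; built with trapping enabled (for example Intel's -fpe0 or gfortran's -ffpe-trap=invalid), I'd expect it to abort at the addition, much like the crash above.

```fortran
program snan_stencil_demo
  ! Hypothetical stand-alone demo; not part of FV3.
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_is_nan, &
                                           ieee_signaling_nan
  implicit none
  real :: qin(0:1,0:1)   ! the four cells surrounding one corner point
  real :: qout

  qin = 1.0
  ! Pretend one cell was never filled and still holds the sNaN fill value.
  qin(0,0) = ieee_value(qin(0,0), ieee_signaling_nan)

  ! Same four-point average as the line that crashes in a2b_ord2.
  qout = 0.25*(qin(0,0) + qin(1,0) + qin(0,1) + qin(1,1))

  print *, 'qout is NaN: ', ieee_is_nan(qout)
end program snan_stencil_demo
```

The point is only that an addition is enough to raise "invalid operation" once any operand is a signalling NaN; no division or square root is needed.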

Crashes seem to start after #344 was merged. If so, that PR shouldn't have been merged; the regression test system should've detected this problem. Unfortunately, the ufs-weather-model regression test system is presently unable to detect the difference between a crash and a test's results changing. A fix for the regression test system bug is being tested now.

Unfortunately, we're stuck with broken authoritative branches until this bug is fixed.

From skimming the changes in #344, my best guess is that some parts of the omga array are uninitialized for regional cases due to removing the initialization loop. I haven't had a chance to test that hypothesis yet.

To Reproduce

  1. Set up on Hera the ufs-weather-model regression test system to not retry jobs, and not delete logs or run directories.
  2. Run all ufs-weather-model regression tests that have both "conus13km" and "debug" in their name.
  3. Check for floating point exceptions in failed tests before the regression test system deletes the logs.

The fix for the regression test system is in this PR:

That is being tested now. Once it's merged, model crashes will be detectable in regression tests once again.

Expected behavior
Model runs to completion when compiled in debug mode.

System Environment
UFS Weather Model regression test system with Intel compiler on Hera. That's Intel 2021.5.0 with IMPI 2021.5.1 and FMS 2023.04 using Spack Stack 1.6.0.

Here's the uname -a output from a login node:

Linux hfe09 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Wed Sep 20 15:55:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional context
Can't think of anything.

@jkbk2004

jkbk2004 commented Jul 11, 2024

@SamuelTrahanNOAA I am seeing the same bug behavior with ufs-community/ufs-weather-model#2362. It points to #344.

@lharris4
Contributor

lharris4 commented Jul 11, 2024 via email

@SamuelTrahanNOAA
Author

SamuelTrahanNOAA commented Jul 11, 2024

Yes, debug mode includes bounds checking and tests for various floating-point errors.

In most cases, the dynamical core fails in debug mode in regional tests, where it aborts due to floating point exceptions. Many debug tests are already disabled because of existing unknown bugs. We really can't afford to disable the remaining tests due to new bugs.

@lharris4
Contributor

lharris4 commented Jul 11, 2024 via email

@SamuelTrahanNOAA changed the title from "sporadic floating point errors in a2b_edge.F90 for nested configurations" to "sporadic floating point errors in a2b_edge.F90 for regional configurations" on Jul 11, 2024

@SamuelTrahanNOAA
Author

> I am confused now. Is this in the global-nested configuration (as the title suggests), or only in regional domains?

Regional configurations. I've corrected the title; sorry about that.

@SamuelTrahanNOAA
Author

These are the last three tests that have failed for me:

  • conus13km_debug_intel = regional, no nests, uncoupled, no threads, FV3_HRRR physics suite
  • conus13km_debug_2threads_intel = regional, no nests, uncoupled, two OpenMP threads, FV3_HRRR physics suite
  • hafs_regional_storm_following_1nest_atm_ocn_debug_intel = regional, one moving nest, ocean coupling, two OpenMP threads, FV3_HAFS_v1_gfdlmp_tedmf_nonsst physics suite

The only commonalities I see are:

  • FV3 regional
  • Intel compiler with -DDEBUG=ON

@jkbk2004

@SamuelTrahanNOAA's list is correct. I am still running on Derecho and Gaea to double-check. If 577fd5e is going to be reverted, then I can turn off https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L35-L36

@XiaqiongZhou-NOAA
Contributor

XiaqiongZhou-NOAA commented Jul 11, 2024

I am a little confused about why #344 is the issue.
To clarify:
First of all: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L35-L36 is not related to 577fd5e.
Second, #344 does not change the result.
Is this commit causing the model crash?

I do not think reverting the dycore update is an option. We need this update for GFSv17/GEFSv13. It is better to identify what is really causing the problem.

@jkbk2004

@XiaqiongZhou-NOAA my mistake! ufs-community/ufs-weather-model#2327 doesn't have a test case changed. @lharris4 @SamuelTrahanNOAA 577fd5e can be reverted w/o any change on UFS-WM level.

@SamuelTrahanNOAA
Author

> Second, #344 does not change the result.

It doesn't change the result when the job succeeds.

The problem is that the job doesn't succeed reliably, after #344 is merged.

@SamuelTrahanNOAA
Author

> First of all: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L35-L36 is not related to 577fd5e.

I don't know why @jkbk2004 mentioned that test, but it is not one of the ones that is failing for me.

@jkbk2004

> > First of all: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L35-L36 is not related to 577fd5e.
>
> I don't know why @jkbk2004 mentioned that test, but it is not one of the ones that is failing for me.

@SamuelTrahanNOAA I was confused. @lharris4 @laurenchilutti @bensonr Can we make a decision to revert 577fd5e?

@SamuelTrahanNOAA
Author

SamuelTrahanNOAA commented Jul 11, 2024

@XiaqiongZhou-NOAA These are the only tests that fail for me:

  • conus13km_debug_intel
  • conus13km_debug_2threads_intel
  • hafs_regional_storm_following_1nest_atm_ocn_debug_intel

I explain in detail in my comment #346 (comment)

@laurenchilutti
Contributor

If you would like this reverted, we should do it via a PR so you can rerun the UFS tests. If Lucas and Rusty agree, I can put in a PR with this Merge being reverted for you to test.

@XiaqiongZhou-NOAA
Contributor

> If you would like this reverted, we should do it via a PR so you can rerun the UFS tests. If Lucas and Rusty agree, I can put in a PR with this Merge being reverted for you to test.

Lauren:
Please hold this.

> @XiaqiongZhou-NOAA These are the only tests that fail for me:
>
>   • conus13km_debug_intel
>   • conus13km_debug_2threads_intel
>   • hafs_regional_storm_following_1nest_atm_ocn_debug_intel
>
> I explain in detail in my comment #346 (comment)

@SamuelTrahanNOAA
These tests run OK for me on Hercules. How can I reproduce your failed cases? What else needs to be changed?

@SamuelTrahanNOAA
Author

> These tests run OK for me on Hercules. How can I reproduce your failed cases? What else needs to be changed?

Try running on HERA. I haven't tested this on Hercules, so I don't know if it'll fail there. Uninitialized memory and out-of-bounds accesses can be troublesome like that. Change one little thing, and the contents of that memory are different.

@jkbk2004

@SamuelTrahanNOAA I ran on Hera/Hercules/Gaea/Derecho. The failures are random, but those 3 cases commonly crash at the same place: line 382 of atmos_cubed_sphere/model/a2b_edge.F90.

@SamuelTrahanNOAA
Author

SamuelTrahanNOAA commented Jul 11, 2024

I've tried two changes:

  1. Default pass_full_omega_to_physics_in_non_hydrostatic_mode to .true. With this change, hafs_regional_storm_following_1nest_atm_ocn_debug_intel failed the first try in a2b_edge.F90. The other two tests succeeded on the first try (but the results changed).
  2. Restore the initialization loop on line 826 which sets omga(i,j,k) = delp(i,j,k)/delz(i,j,k)*w(i,j,k). With this change, all three tests fail reliably in the usual way.

EDIT: Updated comment to reflect that in item 1, the results changed for the two jobs that ran to completion.

@DusanJovic-NOAA
Contributor

@SamuelTrahanNOAA Can you try this change in a2b_edge.F90 and dyn_core.F90?

diff --git a/model/a2b_edge.F90 b/model/a2b_edge.F90
index c4530a1..0c5de7e 100644
--- a/model/a2b_edge.F90
+++ b/model/a2b_edge.F90
@@ -377,8 +377,8 @@ contains

        if (gridstruct%bounded_domain) then

-          do j=js-2,je+1+2
-             do i=is-2,ie+1+2
+          do j=js,je+1
+             do i=is,ie+1
                 qout(i,j) = 0.25*(qin(i-1,j-1)+qin(i,j-1)+qin(i-1,j)+qin(i,j))
              enddo
           enddo
diff --git a/model/dyn_core.F90 b/model/dyn_core.F90
index 15df82f..f469e30 100644
--- a/model/dyn_core.F90
+++ b/model/dyn_core.F90
@@ -166,6 +166,12 @@ public :: dyn_core, del2_cubed, init_ijk_mem
   integer :: kmax=1
   real, parameter    ::     rad2deg = 180./pi

+#ifdef OVERLOAD_R4
+  real, parameter:: real_snan=real(Z'FFBFFFFF')
+#else
+  real, parameter:: real_snan=real(Z'FFF7FFFFFFFFFFFF')
+#endif
+
 contains

 !-----------------------------------------------------------------------
@@ -1627,6 +1633,9 @@ integer :: is,  ie,  js,  je
       js  = bd%js
       je  = bd%je

+      pin = real_snan
+      pb = real_snan
+
 !$OMP parallel do default(none) shared(is,ie,js,je,npz,ua,va,gridstruct,pem,npx,npy,ng,om) &
 !$OMP                          private(n, pdx, pdy, pin, pb, up, vp, grad, v3)
 do k=1,npz
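
One possible reading of why the narrower loop bounds help, sketched as a stand-alone program (not FV3 code): with the original bounds the stencil reads qin from is-3 to ie+3, while with the narrowed bounds it reads only is-1 to ie+1. The sketch assumes, purely for illustration, a halo width of 3 and an input that holds valid data only in the compute domain plus one halo cell; the thread does not confirm that this is exactly what happens to pin in adv_pe. Quiet NaNs stand in for the unfilled cells so the demo runs to completion even with trapping enabled.

```fortran
program a2b_halo_demo
  ! Hypothetical stand-alone demo; sizes and fill pattern are assumptions.
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_is_nan, &
                                           ieee_quiet_nan
  implicit none
  integer, parameter :: is = 1, ie = 6, js = 1, je = 6
  integer, parameter :: ng = 3               ! assumed halo width
  real :: qin (is-ng:ie+ng, js-ng:je+ng)
  real :: qout(is-ng:ie+ng, js-ng:je+ng)
  integer :: nwide, nnarrow

  ! Stand-in for an unfilled halo: NaN everywhere, then valid data
  ! only in the compute domain plus one halo cell on each side.
  qin = ieee_value(1.0, ieee_quiet_nan)
  qin(is-1:ie+1, js-1:je+1) = 1.0

  nwide   = count_nan(is-2, ie+1+2, js-2, je+1+2)   ! original bounds
  nnarrow = count_nan(is,   ie+1,   js,   je+1)     ! bounds from the diff

  print *, 'wide loop   NaN count:', nwide
  print *, 'narrow loop NaN count:', nnarrow

contains

  ! Apply the four-point average over the given index range and
  ! count how many results are NaN.
  integer function count_nan(i0, i1, j0, j1) result(nbad)
    integer, intent(in) :: i0, i1, j0, j1
    integer :: i, j
    nbad = 0
    do j = j0, j1
      do i = i0, i1
        qout(i,j) = 0.25*(qin(i-1,j-1) + qin(i,j-1) + qin(i-1,j) + qin(i,j))
        if (ieee_is_nan(qout(i,j))) nbad = nbad + 1
      end do
    end do
  end function count_nan
end program a2b_halo_demo
```

Under those assumptions the wide loop picks up NaNs from the outer halo while the narrow loop does not, which matches the symptom that the crash goes away once the loop stops at js..je+1 and is..ie+1.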

@SamuelTrahanNOAA
Author

Dusan's fix worked for me. All three jobs succeeded the first time.
Can other people confirm it works for them?

@jkbk2004

> Dusan's fix worked for me. All three jobs succeeded the first time. Can other people confirm it works for them?

@SamuelTrahanNOAA let me test on gaea/hercules/hera.

@jkbk2004

All those cases pass OK on Hera/Hercules/Gaea/Derecho.

PASS -- TEST 'conus13km_debug_intel' [17:58, 14:25](1242 MB)
PASS -- TEST 'conus13km_debug_qr_intel' [17:58, 14:48](919 MB)
PASS -- TEST 'conus13km_debug_2threads_intel' [10:53, 08:11](1165 MB)
PASS -- TEST 'hafs_regional_storm_following_1nest_atm_ocn_debug_intel' [19:03, 13:08](563 MB)

@DusanJovic-NOAA @SamuelTrahanNOAA Will you create a PR?

@SamuelTrahanNOAA
Author

I'd rather not do it since this is neither my fix nor my code, and I have too much going on already.

@DusanJovic-NOAA - Can you do the PR?

@DusanJovic-NOAA
Contributor

> I'd rather not do it since this is neither my fix nor my code, and I have too much going on already.
>
> @DusanJovic-NOAA - Can you do the PR?

Opened PR #349

@bensonr
Contributor

bensonr commented Aug 8, 2024

PR #349 merged into dev/emc

@bensonr closed this as completed on Aug 8, 2024