SIGFPE in ch4Mod with DEBUG=TRUE #1729

Closed
glemieux opened this issue Apr 29, 2022 · 10 comments · Fixed by #1723
Labels
bfb (bit-for-bit), bug (something is working incorrectly)

Comments

@glemieux
Collaborator

Brief summary of bug

While rebuilding dependencies on my workstation I ran into a problem similar to an old issue: #1013. The code builds successfully but crashes almost immediately at run time with a floating-point exception in ch4Mod when DEBUG=TRUE.

General bug information

CTSM version you are using: ctsm5.1.dev091

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: Debug mode with gnu compiler on local ubuntu workstation with both mct and nuopc drivers

Details of bug

The issue appears to be similar to #1013. It occurs in the code block just prior to the one reported in the referenced issue:

if (.not. lake .and. usephfact .and. pH(c) > pHmin .and.pH(c) < pHmax) then

Talking with @ekluzek, we surmise that this might be addressable with a newer gnu compiler version.

Important details of your setup / configuration so we can reproduce the bug

Machine: lobata
OS: Pop!_OS 20.04 (Ubuntu based distro)
Compiler: gfortran 9.4.0
Dependencies: openmpi 4.0.3, netcdf-c 4.8.1, netcdf-fortran 4.5.4, hdf5 1.12.1, esmf 8.2.0

Dependencies have been built with parallel enabling options.

Important output or errors that show the problem

 clmfates_interfaceMod.F90:: reading fates_leaf_vcmax25top

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f3d219d3d21 in ???
#1  0x7f3d219d2ef5 in ???
#2  0x7f3d216980bf in ???
#3  0x56186ce3df9e in ch4_prod
        at /home/glemieux/Repos/ctsm/src/biogeochem/ch4Mod.F90:2635
#4  0x56186ce7aa75 in __ch4mod_MOD_ch4
        at /home/glemieux/Repos/ctsm/src/biogeochem/ch4Mod.F90:2069
#5  0x56186c661e55 in __clm_driver_MOD_clm_drv
        at /home/glemieux/Repos/ctsm/src/main/clm_driver.F90:1187
#6  0x56186c61c2a6 in modeladvance
        at /home/glemieux/Repos/ctsm/src/cpl/nuopc/lnd_comp_nuopc.F90:886
#7  0x7f3d22f52a1a in _ZNK5ESMCI13MethodElement7executeEPvPi
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
#8  0x7f3d22f53abf in _ZN5ESMCI11MethodTable7executeENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPvPiPb
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
#9  0x7f3d22f52549 in c_esmc_methodtableexecute_
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
#10  0x7f3d2326195f in __esmf_attachmethodsmod_MOD_esmf_methodgridcompexecute
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1288
#11  0x7f3d240a6d2d in __nuopc_modelbase_MOD_routine_run
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2218
#12  0x7f3d22c3949e in _ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_FTable.C:2167
#13  0x7f3d22c362b0 in ESMCI_FTableCallEntryPointVMHop
@wwieder
Contributor

wwieder commented Apr 29, 2022

I was getting a similar error message with DEBUG=TRUE recently (using tag dev_074), which seemed to trigger at the new year for some reason. I no longer needed to run in DEBUG mode, and turning it off resolved my issue, so I didn't think much of it. Happy to resurrect the case if it's helpful.

@ekluzek
Collaborator

ekluzek commented Apr 29, 2022

@wwieder can you give us the specs of that case? Where were you running, and what compiler versions and all that. Just pointing to the case directory would be sufficient. And didn't we have a different issue for the new year that's resolved now?

@billsacks
Member

Yes, this does look essentially the same as the issue in #1013 - the code only works if the compiler is using short-circuit logic. It looks like the conditional should be refactored to:

if (.not. lake .and. usephfact) then
   if (pH(c) > pHmin .and. pH(c) < pHmax) then
      ! body of conditional
   end if
end if

The problem is that pH is NaN if usephfact is false (which appears to be the default).

I think we should fix this since it looks like an easy fix and it could appear with other compilers as well: The Fortran standard makes no guarantees about short circuiting.
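To make the reasoning concrete, here is a minimal standalone sketch of the nested-guard pattern. It is not the ch4Mod code: the scalar variables are hypothetical stand-ins for the real ones, the pHmin and pHmax values are illustrative only, and the NaN is created explicitly with ieee_value to mimic an unset pH. Because the comparison sits inside the outer block, it is never evaluated when usephfact is false, whether or not the compiler short-circuits.

```fortran
program nested_guard_sketch
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none

  integer, parameter :: r8 = selected_real_kind(12)

  ! Hypothetical scalar stand-ins for the ch4Mod variables (assumption).
  logical  :: lake, usephfact, applied
  real(r8) :: pH, pHmin, pHmax

  lake      = .false.
  usephfact = .false.      ! the thread says false appears to be the default
  pHmin     = 2.0_r8       ! illustrative value only
  pHmax     = 9.0_r8       ! illustrative value only

  ! Mimic an unset pH: a quiet NaN (only the kind of the first argument is used).
  pH = ieee_value(pH, ieee_quiet_nan)

  applied = .false.
  if (.not. lake .and. usephfact) then
     ! pH is read only after we know usephfact is true.
     if (pH > pHmin .and. pH < pHmax) then
        applied = .true.
     end if
  end if

  if (applied) then
     print *, 'pH factor applied'
  else
     print *, 'pH factor skipped; pH was never compared'
  end if
end program nested_guard_sketch
```

One way to compare the two forms would be to build this with a floating-point trapping option (for example gfortran's -ffpe-trap=invalid, which is an assumption about what DEBUG enables here, not something confirmed in this thread) and then collapse the two if statements into one: the single-line form is free to evaluate the NaN comparison, the nested form is not.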

@billsacks added the "bug" and "tag: simple bfb" labels on Apr 29, 2022
@billsacks
Member

Bigger question: we don't appear to have any tests with usephfact set to true. Do we actually still want to support that option?

@ekluzek
Collaborator

ekluzek commented Apr 29, 2022

@ckoven can you comment on @billsacks's question above -- is usephfact a useful option to support? It looks like one problem with it is that we don't have soil pH data, but we will be getting soil pH with ctsm5.2, so perhaps this will be a useful option at that point? And if so, maybe it should stay in and we'll activate it in ctsm5.2?

@wwieder
Contributor

wwieder commented Apr 29, 2022

@ekluzek Looking closer, I'm not sure this will be helpful or similar. My case directory is here:
/glade/work/wwieder/ctsm/flexLeafCN/cime/scripts/ctsm51d074_2deg_GSWP3V1_hist_cnSlope0

Not sure which of the many log files may be helpful here, but this is not a FATES case, and from what I recall the case seemed to fail with negative CH4 concentrations with DEBUG=TRUE. This was NOT at the new year, however, but on Dec 22...

/glade/scratch/wwieder/archive/ctsm51d074_2deg_GSWP3V1_hist_cnSlope0/logs/cesm.log.3893142.chadmin1.ib0.cheyenne.ucar.edu.220422-205201

@glemieux
Collaborator Author

glemieux commented Apr 29, 2022

@billsacks I just tried out your suggestion at Erik's encouragement on my machine with a simple 1x1 brazil ctsm-fates one year case using the nuopc driver and it ran to completion.

@ckoven
Contributor

ckoven commented Apr 29, 2022

@ekluzek I am not familiar with the CH4 pH sensitivity code, I think that it came in via Lei Meng's paper https://bg.copernicus.org/articles/9/2793/2012/ ?

@dlawrenncar
Contributor

dlawrenncar commented Apr 29, 2022 via email

@ekluzek
Collaborator

ekluzek commented Apr 30, 2022

@dlawrenncar I think it's fine to leave it in, especially because it will likely be useful with ctsm5.2. But we should also use this as an example that illustrates the cost of non-functioning complexity. usephfact makes the code more complex and harder to read and understand, but until soil pH is in, it won't be functional. Yet in this case its mere existence in the code caused an issue. We don't always have an example of the cost of maintaining complexity, but we can use this as one. Maintaining complexity that's required is the cost of doing business, but maintaining complexity for something that's not functional and may only be used in the "future" isn't always good to do. I think sometimes we err too much on the side of leaving in complexity that might be useful, rather than making the code easier to maintain. An agile software development adage is to only implement what you need right now -- only add the stuff that you need in the future when it's actually needed. I just want to leave this as food for thought. This kind of decision on "should we keep this" is going to be constantly coming up in the code.

7 participants