
Balance Check failure in fire runs #378

Closed

jkshuman opened this issue May 5, 2018 · 47 comments
@jkshuman commented May 5, 2018

Fire runs are failing, apparently due to a BalanceCheck error. This happens in both CLM4.5 and CLM5 runs at year 5 with 2 PFTs (tropical tree and grass). Non-fire runs haven't failed through year 10, but I will resubmit them to run longer.
ctsm git hash: 2dba074 fates git hash: f8d7693
Here is the create case statement:
./create_newcase --case ${casedir}${CASE_NAME} --res f09_f09 --compset 2000_DATM%GSWP3v1_CLM45%FATES_SICE_SOCN_RTM_SGLC_SWAV --run-unsupported

From within cesm.log (continuing through to the end of the log):
396: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
396: nstep = 96934
396: errsol = -1.031027636599902E-007
529: Large Dir Radn consvn error 87346.4733653322 1 2
529: diags 46218.1932574409 -0.338494232152740 589450.614042712
529: -394259.718697869
529: lai_change 0.000000000000000E+000 0.000000000000000E+000
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: 6.38062653664038 0.000000000000000E+000 0.000000000000000E+000
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: 0.000000000000000E+000
529: elai 0.000000000000000E+000 0.000000000000000E+000 0.961064260932761
529: 0.000000000000000E+000 0.000000000000000E+000 0.958469792135196
529: 0.000000000000000E+000 0.000000000000000E+000 0.122722763358372
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: esai 0.000000000000000E+000 0.000000000000000E+000 3.893573906723917E-002
529: 0.000000000000000E+000 0.000000000000000E+000 3.883117669682943E-002
529: 0.000000000000000E+000 0.000000000000000E+000 4.984874625802597E-003
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: ftweight 1.00000000000000 0.000000000000000E+000
529: 0.000000000000000E+000 1.00000000000000 0.000000000000000E+000
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: 0.000000000000000E+000
529: cp 9.580078716659667E-011 1
529: bc_in(s)%albgr_dir_rb(ib) 0.557730205770928
529: >5% Dif Radn consvn error -2474470293.77894 1 2
529: diags 639144447.809849 -10366553911.8306 6420139512.41898
529: lai_change 0.000000000000000E+000 0.000000000000000E+000
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: 6.38062653664038 0.000000000000000E+000 0.000000000000000E+000
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: 0.000000000000000E+000
529: elai 0.000000000000000E+000 0.000000000000000E+000 0.961064260932761
529: 0.000000000000000E+000 0.000000000000000E+000 0.958469792135196
529: 0.000000000000000E+000 0.000000000000000E+000 0.122722763358372
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: esai 0.000000000000000E+000 0.000000000000000E+000 3.893573906723917E-002
529: 0.000000000000000E+000 0.000000000000000E+000 3.883117669682943E-002
529: 0.000000000000000E+000 0.000000000000000E+000 4.984874625802597E-003
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
529: ftweight 0.000000000000000E+000 0.000000000000000E+000
529: 37.4271707468345 0.000000000000000E+000 0.000000000000000E+000
529: 37.4271707468345 0.000000000000000E+000 0.000000000000000E+000
529: 31.0465442101942 0.000000000000000E+000 0.000000000000000E+000
529: 0.000000000000000E+000
529: cp 9.580078716659667E-011 1
529: bc_in(s)%albgr_dif_rb(ib) 0.557730205770928
529: rhol 0.100000001490116 0.100000001490116 0.100000001490116
529: 0.449999988079071 0.449999988079071 0.349999994039536
529: ftw 1.00000000000000 1.00000000000000 0.000000000000000E+000
529: 0.000000000000000E+000
529: present 1 0 0
529: CAP 1.00000000000000 0.000000000000000E+000 0.000000000000000E+000
465: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
465: nstep = 96935
465: errsol = -1.048202307174506E-007
433: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
433: nstep = 96935
433: errsol = -1.017730255625793E-007
358: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
358: nstep = 96936
358: errsol = -1.278503987123258E-007
432: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
432: nstep = 96936
432: errsol = -1.040576194100140E-007
431: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
431: nstep = 96936
431: errsol = -1.129041606873216E-007
466: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
466: nstep = 96936
466: errsol = -1.248336616299639E-007
433: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
433: nstep = 96936
433: errsol = -1.003071474769968E-007
529: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
529: nstep = 96936
529: errsol = 1.383552742595384E-005
529: clm model is stopping - error is greater than 1e-5 (W/m2)
529: fsa = 12787101170.2958
529: fsr = -12787101148.9356
529: forc_solad(1) = 2.30644280577964
529: forc_solad(2) = 3.71261017842798
529: forc_solai(1) = 8.37364785641270
529: forc_solai(2) = 6.96748048376436
529: forc_tot = 21.3601813243847
529: clm model is stopping
529: calling getglobalwrite with decomp_index= 39670 and clmlevel= pft
529: local patch index = 39670
529: global patch index = 15897
529: global column index = 8008
529: global landunit index = 2104
529: global gridcell index = 494
529: gridcell longitude = 290.000000000000
529: gridcell latitude = -15.5497382198953
529: pft type = 1
529: column type = 1
529: landunit type = 1
529: ENDRUN:
529: ERROR in BalanceCheckMod.F90 at line 543
529:
529:
529:
529:
529:
529: ERROR: Unknown error submitted to shr_abort_abort.
413: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
413: nstep = 96936
413: errsol = -1.288894111439731E-007
397: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
397: nstep = 96937
397: errsol = -1.022812625706138E-007
319: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
319: nstep = 96937
319: errsol = -1.036731305248395E-007
395: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
395: nstep = 96937
395: errsol = -1.211479911944480E-007
432: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
432: nstep = 96937
432: errsol = -1.264885440832586E-007
464: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
464: nstep = 96937
464: errsol = -1.101450379792368E-007
431: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
431: nstep = 96937
431: errsol = -1.387476800118748E-007
433: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
433: nstep = 96937
433: errsol = -1.261905708815902E-007
529:Image PC Routine Line Source
529:cesm.exe 0000000001237DAD Unknown Unknown Unknown
529:cesm.exe 0000000000D1B432 shr_abort_mod_mp_ 114 shr_abort_mod.F90
529:cesm.exe 0000000000503CD5 abortutils_mp_end 77 abortutils.F90
529:cesm.exe 0000000000677E2D balancecheckmod_m 543 BalanceCheckMod.F90
529:cesm.exe 000000000050AF77 clm_driver_mp_clm 924 clm_driver.F90
529:cesm.exe 00000000004F9516 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90
529:cesm.exe 0000000000430E14 component_mod_mp_ 688 component_mod.F90
529:cesm.exe 0000000000417D59 cime_comp_mod_mp_ 2652 cime_comp_mod.F90
529:cesm.exe 0000000000430B3D MAIN__ 68 cime_driver.F90
529:cesm.exe 0000000000415C5E Unknown Unknown Unknown
529:libc-2.19.so 00002AAAB190AB25 __libc_start_main Unknown Unknown
529:cesm.exe 0000000000415B69 Unknown Unknown Unknown
529:MPT ERROR: Rank 529(g:529) is aborting with error code 1001.
529: Process ID: 53637, Host: r12i2n18, Program: /glade2/scratch2/jkshuman/Fire0504_Obrienh_Saldaa_Saldal_agb1zero_2PFT_1x1_2dba074_f8d7693/bld/cesm.exe
529: MPT Version: SGI MPT 2.15 12/18/16 02:58:06
529:
529:MPT: --------stack traceback-------
0: memory_write: model date = 60715 0 memory = 65749.16 MB (highwater) 102.04 MB (usage) (pe= 0 comps= ATM ESP)
529:MPT: Attaching to program: /proc/53637/exe, process 53637
529:MPT: done.
529:MPT: Try: zypper install -C "debuginfo(build-id)=3d290be00d48b823d3b71df2249e80d881bc473d"
529:MPT: (no debugging symbols found)...done.
529:MPT: Try: zypper install -C "debuginfo(build-id)=5409c48fdb15e90649c1407e444fbe31d6dc8ec1"
529:MPT: (no debugging symbols found)...done.
529:MPT: [Thread debugging using libthread_db enabled]
529:MPT: Using host libthread_db library "/glade/u/apps/ch/os/lib64/libthread_db.so.1".
529:MPT: Try: zypper install -C "debuginfo(build-id)=e97cfdb062d6f0c41073f2109a7605d0ae991c03"
529:MPT: (no debugging symbols found)...done.
529:MPT: Try: zypper install -C "debuginfo(build-id)=f43d7754940a14ffe3d9bd8fc9472ffbbfead544"
529:MPT: (no debugging symbols found)...done.
529:MPT: Try: zypper install -C "debuginfo(build-id)=0ea764119690f32c98faae9a63a73f35ed8b1099"
529:MPT: (no debugging symbols found)...done.
529:MPT: Try: zypper install -C "debuginfo(build-id)=15916519d9dbaea26ec88427460b4cedb9c0a6ab"
529:MPT: (no debugging symbols found)...done.
529:MPT: Try: zypper install -C "debuginfo(build-id)=79264652a62453da222372a430cd9351d4bbcbde"
529:MPT: (no debugging symbols found)...done.
529:MPT: Try: zypper install -C "debuginfo(build-id)=68682e9ac223d269cbecb94315fcec5e16b32bfb"
529:MPT: (no debugging symbols found)...done.
529:MPT: 0x00002aaaafac141c in waitpid () from /glade/u/apps/ch/os/lib64/libpthread.so.0
529:MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.19-35.1.x86_64
529:MPT: (gdb) #0 0x00002aaaafac141c in waitpid ()
529:MPT: from /glade/u/apps/ch/os/lib64/libpthread.so.0
529:MPT: #1 0x00002aaab16215d6 in mpi_sgi_system (
529:MPT: #2 MPI_SGI_stacktraceback (
529:MPT: header=header@entry=0x7ffffffeeb70 "MPT ERROR: Rank 529(g:529) is aborting with error code 1001.\n\tProcess ID: 53637, Host: r12i2n18, Program: /glade2/scratch2/jkshuman/Fire0504_Obrienh_Saldaa_Saldal_agb1zero_2PFT_1x1_2dba074_f8d7693/bld"...) at sig.c:339
529:MPT: #3 0x00002aaab1574d6f in print_traceback (ecode=ecode@entry=1001)
529:MPT: at abort.c:227
529:MPT: #4 0x00002aaab1574fda in PMPI_Abort (comm=, errorcode=1001)
529:MPT: at abort.c:66
529:MPT: #5 0x00002aaab157528d in pmpi_abort
()
529:MPT: from /opt/sgi/mpt/mpt-2.15/lib/libmpi.so
529:MPT: #6 0x0000000000e191a9 in shr_mpi_mod_mp_shr_mpi_abort_ ()
529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/share/util/shr_mpi_mod.F90:2132
529:MPT: #7 0x0000000000d1b4d8 in shr_abort_mod_mp_shr_abort_abort_ ()
529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/share/util/shr_abort_mod.F90:69
529:MPT: #8 0x0000000000503cd5 in abortutils_mp_endrun_globalindex_ ()
529:MPT: at /glade/p/work/jkshuman/git/ctsm/src/main/abortutils.F90:77
529:MPT: #9 0x0000000000677e2d in balancecheckmod_mp_balancecheck_ ()
529:MPT: at /glade/p/work/jkshuman/git/ctsm/src/biogeophys/BalanceCheckMod.F90:543
529:MPT: #10 0x000000000050af77 in clm_driver_mp_clm_drv_ ()
529:MPT: at /glade/p/work/jkshuman/git/ctsm/src/main/clm_driver.F90:924
529:MPT: #11 0x00000000004f9516 in lnd_comp_mct_mp_lnd_run_mct_ ()
529:MPT: at /glade/p/work/jkshuman/git/ctsm/src/cpl/lnd_comp_mct.F90:451
529:MPT: #12 0x0000000000430e14 in component_mod_mp_component_run_ ()
529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/drivers/mct/main/component_mod.F90:688
529:MPT: #13 0x0000000000417d59 in cime_comp_mod_mp_cime_run_ ()
529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/drivers/mct/main/cime_comp_mod.F90:2652
529:MPT: #14 0x0000000000430b3d in MAIN__ ()
529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/drivers/mct/main/cime_driver.F90:68
529:MPT: #15 0x0000000000415c5e in main ()
529:MPT: (gdb) A debugging session is active.
529:MPT:
529:MPT: Inferior 1 [process 53637] will be detached.
529:MPT:
529:MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
529:MPT: Detaching from program: /proc/53637/exe, process 53637
529:
529:MPT: -----stack traceback ends-----
-1:MPT ERROR: MPI_COMM_WORLD rank 529 has terminated without calling MPI_Finalize()
-1: aborting job

@jkshuman commented May 6, 2018

@ekluzek @rosiealice @rgknox @ckoven
I am getting a balance check error in the fire runs. This is using the latest FATES version, which incorporates the memory leak fix, merged with an added history variable from my branch. The error that is written out is within the CLM BalanceCheckMod.F90. The system is down, and I can't get more information at the moment. When I was looking at it last night, I submitted the run with a switch from nyears to nmonths. As I was watching the file list in the case/run folder, the cesm.log would pop up and then disappear, and I was not able to see if it finally appeared last night. I haven't seen that behavior before (inability to write the cesm.log). I did cancel the run and restart, and it was the same behavior where the cesm.log would appear and disappear. I will try resubmitting with stop_option set to ndays; maybe it isn't completing the month? Any advice/help on what to look for would be appreciated.

Erik - does this look at all similar to the balance check error we saw in the past?

@rgknox commented May 7, 2018

Some things I'm noticing:
The radiation solution errors are quite large; at that magnitude I would not be surprised if they generate a NaN or cause anarchy anywhere downstream in the code.
These errors appear to be triggered over and over again in the same patch. The patch area is ~e-11 in size, which seems like maybe it should be culled?
In the arrays that are printed out (lai_change, elai, ftweight, etc.) I'm surprised that there are some lai_change values (which is change in light level per change in lai, maybe...) where I see no tai. But it's hard to tell why this is so.
I'm wondering if perhaps the "ftweight" variable is being filled incorrectly, maybe because there is something special about the grasses. I can't really tell exactly what is happening, though; also, the diagnostic that writes this out uses canopy layer 1 for ftweight, but ncl_p for the others...

Do these runs have grasses with some structural biomass, or are they 0 structure/sap?

@jkshuman commented May 7, 2018 via email

@jkshuman commented May 7, 2018

A run which uses allom_latosa_int = default and allom_agb1 = 0.0001 for grass also fails in year 5 with fire. (This is a bad case name, as it uses default allometry; will fix that...) /glade2/scratch2/jkshuman/Fire0507_Obrienh_Saldaa_Saldal_latosa_int_default_2PFT_1x1_2dba074_f8d7693/run
Similar failure message in year 5. In the cesm.log there is a set of "NetCDF: Invalid dimension ID or name" statements, followed by patch trimming, then solar radiation balance check errors, more patch trimming, and more radiation balance check errors, again pointing to CLM BalanceCheckMod.F90 line 543.

WARNING:: BalanceCheck, solar radiation balance error (W/m2)
334: nstep = 96938
334: errsol = -1.311063329012541E-007
330: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
330: nstep = 96938
330: errsol = -1.427682150278997E-007
529:Image PC Routine Line Source
529:cesm.exe 0000000001237DAD Unknown Unknown Unknown
529:cesm.exe 0000000000D1B432 shr_abort_mod_mp_ 114 shr_abort_mod.F90
529:cesm.exe 0000000000503CD5 abortutils_mp_end 77 abortutils.F90
529:cesm.exe 0000000000677E2D balancecheckmod_m 543 BalanceCheckMod.F90
529:cesm.exe 000000000050AF77 clm_driver_mp_clm 924 clm_driver.F90
529:cesm.exe 00000000004F9516 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90
529:cesm.exe 0000000000430E14 component_mod_mp_ 688 component_mod.F90
529:cesm.exe 0000000000417D59 cime_comp_mod_mp_ 2652 cime_comp_mod.F90
529:cesm.exe 0000000000430B3D MAIN__ 68 cime_driver.F90
529:cesm.exe 0000000000415C5E Unknown Unknown Unknown
529:libc-2.19.so 00002AAAB190AB25 __libc_start_main Unknown Unknown
529:cesm.exe 0000000000415B69 Unknown Unknown Unknown

@jkshuman commented May 7, 2018

That is the right case name; Obrien Salda is the default allometry...
Too many iterations on this.

@jkshuman commented May 8, 2018

@rgknox @rosiealice I did another set of runs, single-PFT and 2-PFT, for a regional domain in South America. Both failures show the same set of solar radiation balance check errors. I include pieces of the cesm.log for the failed runs below.

general case statement:
./create_newcase --case ${casedir}${CASE_NAME} --res f09_f09 --compset 2000_DATM%GSWP3v1_CLM45%FATES_SICE_SOCN_RTM_SGLC_SWAV --run-unsupported

1 PFT (no fire) for Grass and Trop Tree completed to year 21 with reasonable biomass and distribution.
1 PFT (Fire) for Trop Tree completed through year 21.
1 PFT (Fire) for Grass failed at year 11. (cesm.log piece below)

2 PFT (Fire) for Trop Tree and Grass failed at year 5. (cesm.log piece after the fire grass log)

/glade2/scratch2/jkshuman/Fire_Grass_1x1_2dba074_f8d7693/run
Errors:
clmfates_interfaceMod.F90:: reading froz_q10
217: NetCDF: Invalid dimension ID or name
217: NetCDF: Invalid dimension ID or name
217: NetCDF: Invalid dimension ID or name
217: NetCDF: Invalid dimension ID or name
217: NetCDF: Invalid dimension ID or name
217: NetCDF: Variable not found
217: NetCDF: Variable not found
0:(seq_domain_areafactinit) : min/max mdl2drv 1.00000000000000 1.00000000000000 areafact_a_ATM
0:(seq_domain_areafactinit) : min/max drv2mdl 1.00000000000000 1.00000000000000 areafact_a_ATM
102: trimming patch area - is too big 1.818989403545856E-012
109: trimming patch area - is too big 1.818989403545856E-012
467: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
467: nstep = 192742
467: errsol = -1.090609771381423E-007

(and from further within the cesm.log...)
WARNING:: BalanceCheck, solar radiation balance error (W/m2)
202: nstep = 195723
202: errsol = -1.013256678561447E-007
180:Image PC Routine Line Source
180:cesm.exe 0000000001237DAD Unknown Unknown Unknown
180:cesm.exe 0000000000D1B432 shr_abort_mod_mp_ 114 shr_abort_mod.F90
180:cesm.exe 0000000000503D97 abortutils_mp_end 43 abortutils.F90
180:cesm.exe 000000000050329C lnd_import_export 419 lnd_import_export.F90
180:cesm.exe 00000000004F9557 lnd_comp_mct_mp_l 457 lnd_comp_mct.F90
180:cesm.exe 0000000000430E14 component_mod_mp_ 688 component_mod.F90
180:cesm.exe 0000000000417D59 cime_comp_mod_mp_ 2652 cime_comp_mod.F90
180:cesm.exe 0000000000430B3D MAIN__ 68 cime_driver.F90
180:cesm.exe 0000000000415C5E Unknown Unknown Unknown
180:libc-2.19.so 00002AAAB190AB25 __libc_start_main Unknown Unknown
180:cesm.exe 0000000000415B69 Unknown Unknown Unknown
180:MPT ERROR: Rank 180(g:180) is aborting with error code 1001.
180: Process ID: 70276, Host: r2i2n9, Program: /glade2/scratch2/jkshuman/Fire_Grass_1x1_2dba074_f8d7693/bld/cesm.exe
180: MPT Version: SGI MPT 2.15 12/18/16 02:58:06

/glade2/scratch2/jkshuman/Fire0507_Obrienh_Saldaa_Saldal_2PFT_1x1_2dba074_f8d7693/run

WARNING:: BalanceCheck, solar radiation balance error (W/m2)
330: nstep = 96938
330: errsol = -1.427682150278997E-007
529:Image PC Routine Line Source
529:cesm.exe 0000000001237DAD Unknown Unknown Unknown
529:cesm.exe 0000000000D1B432 shr_abort_mod_mp_ 114 shr_abort_mod.F90
529:cesm.exe 0000000000503CD5 abortutils_mp_end 77 abortutils.F90
529:cesm.exe 0000000000677E2D balancecheckmod_m 543 BalanceCheckMod.F90
529:cesm.exe 000000000050AF77 clm_driver_mp_clm 924 clm_driver.F90
529:cesm.exe 00000000004F9516 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90
529:cesm.exe 0000000000430E14 component_mod_mp_ 688 component_mod.F90
529:cesm.exe 0000000000417D59 cime_comp_mod_mp_ 2652 cime_comp_mod.F90
529:cesm.exe 0000000000430B3D MAIN__ 68 cime_driver.F90
529:cesm.exe 0000000000415C5E Unknown Unknown Unknown
529:libc-2.19.so 00002AAAB190AB25 __libc_start_main Unknown Unknown
529:cesm.exe 0000000000415B69 Unknown Unknown Unknown
529:MPT ERROR: Rank 529(g:529) is aborting with error code 1001.
529: Process ID: 47973, Host: r5i4n34, Program: /glade2/scratch2/jkshuman/Fire0507_Obrienh_Saldaa_Saldal_2PFT_1x1_2dba074_f8d7693/bld/cesm.exe
529: MPT Version: SGI MPT 2.15 12/18/16 02:58:06
529:
529:MPT: --------stack traceback-------
0: memory_write: model date = 60715 0 memory = 129228.42 MB (highwater) 102.11 MB (usage) (pe= 0 comps= ATM ESP)
529:MPT: Attaching to program: /proc/47973/exe, process 47973
529:MPT: done.

529: gridcell longitude = 290.000000000000
529: gridcell latitude = -15.5497382198953

@rgknox commented May 9, 2018

@jkshuman , can you provide a link to the branch you are using, I can't find hash f8d7693

@jkshuman commented May 9, 2018

It is a merge between the memory leak commit and my added crown area history field. Here is a link, but this may not have the memory leak commit; I don't recall if I pushed those changes. Cheyenne is still down, so I can't update at the moment.
https://github.com/jkshuman/fates/tree/hio_crownarea_si_pft_sync

@jkshuman commented May 9, 2018

Cheyenne is still down, so I am putting the link to my crown area history variable branch in this issue as well. The failing runs were on a merge branch created from master (the #372 memory leak fix) and my crown area branch (link below).
https://github.com/jkshuman/fates/tree/hio_crownarea_si_pft

@jkshuman commented:

I updated the sync branch with the failing branch code. https://github.com/jkshuman/fates/tree/hio_crownarea_si_pft_sync

@rosiealice commented May 11, 2018 via email

@jkshuman commented:

Running 1-PFT grass, 1-PFT trop tree, and 2-PFT, all with fire on CLM4.5 (paths below).
New set of runs being created with this branch (crown area history merged with the #379 canopy photo fix):
https://github.com/jkshuman/fates/tree/hio_crownarea_si_pft_379canopy_photo_fix

./create_newcase --case ${casedir}${CASE_NAME} --res f09_f09 --compset 2000_DATM%GSWP3v1_CLM45%FATES_SICE_SOCN_RTM_SGLC_SWAV --run-unsupported

/glade2/scratch2/jkshuman/Fire_Grass_1x1_2dba074_5dda57b
/glade2/scratch2/jkshuman/Fire_Obrien_Salda_TropTree_1x1_2dba074_5dda57b
/glade2/scratch2/jkshuman/Fire_Obrienh_Saldaa_Saldal_2PFT_1x1_2dba074_5dda57b

@jkshuman commented May 11, 2018 via email

@rgknox commented May 11, 2018

looks like my single site run at:

gridcell longitude = 290.000000000000
gridcell latitude = -15.5497382198953

did not generate the error after 30 years.

I will try to look through and see if I added some configuration that was different.

Run directory:

/glade2/scratch2/rgknox/jkstest-1pt-v0/run

Uses this parameter file:

/glade/u/home/rgknox/param_file_2PFT_Obrienh_Saldaa_Saldal_05042018.nc

@jkshuman commented:

Was this with fire for CLM4.5?

@rgknox commented May 11, 2018

I noticed this in the parameter file:

fates_leaf_xl = 0.1, 0.1, -0.3

This may be fine; it just caught my eye. xl is the leaf orientation index, which I think I recall can be negative, but we should double-check whether our formulation allows it.

@rgknox commented May 11, 2018

yeah, that parameter seems fine, false alarm

@jkshuman commented May 11, 2018 via email

@rgknox commented May 11, 2018

OK, thanks. A new single-site run on Cheyenne is going, now using SPITFIRE.

My current guess as to what is happening is that we are running into a problem with near-zero biomass or leaves, which is the product of fire turning over an all-grass patch. It's possible the recent bug fix addressed this, but we will see.

@jkshuman commented:

@rgknox another set of runs is going with pull request #382. The 1-PFT runs with fire are still going (tree at year 21, grass at year 2; slow in the queue?). The 2-PFT run (trop tree and grass) failed in year 6 with a similar set of errors: BalanceCheckMod.F90 line 543, BalanceCheck solar radiation balance error.
/glade/scratch/jkshuman/archive/Fire_Obrienh_Saldaa_Saldal_2PFT_SA1x1_2dba074_0f0c41c/
New location:
gridcell longitude = 305.000000000000
gridcell latitude = -23.0890052356021

From cesm.log
WARNING:: BalanceCheck, solar radiation balance error (W/m2)
235: nstep = 119564
235: errsol = -1.108547849071329E-007
252: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
252: nstep = 119565
252: errsol = -1.065200194716454E-007
0: memory_write: model date = 71029 0 memory = 128919.57 MB (highwater) 101.85 MB (usage) (pe= 0 comps= ATM ESP)
467: trimming patch area - is too big 1.818989403545856E-012
545: trimming patch area - is too big 1.818989403545856E-012
353: trimming patch area - is too big 1.818989403545856E-012
390: trimming patch area - is too big 1.818989403545856E-012
513: trimming patch area - is too big 1.818989403545856E-012
506: trimming patch area - is too big 1.818989403545856E-012
535: trimming patch area - is too big 1.818989403545856E-012
446: trimming patch area - is too big 1.818989403545856E-012
469: trimming patch area - is too big 1.818989403545856E-012
477: trimming patch area - is too big 1.818989403545856E-012
326: trimming patch area - is too big 1.818989403545856E-012
403: trimming patch area - is too big 1.818989403545856E-012
69: trimming patch area - is too big 1.818989403545856E-012
239: trimming patch area - is too big 1.818989403545856E-012
70: trimming patch area - is too big 1.818989403545856E-012
218: trimming patch area - is too big 1.818989403545856E-012
257: trimming patch area - is too big 1.818989403545856E-012
75: trimming patch area - is too big 1.818989403545856E-012
330: trimming patch area - is too big 1.818989403545856E-012
170: trimming patch area - is too big 1.818989403545856E-012
200: trimming patch area - is too big 1.818989403545856E-012
198: trimming patch area - is too big 1.818989403545856E-012
255: trimming patch area - is too big 1.818989403545856E-012
80: trimming patch area - is too big 1.818989403545856E-012
219: trimming patch area - is too big 1.818989403545856E-012
118: trimming patch area - is too big 1.818989403545856E-012
119: trimming patch area - is too big 1.818989403545856E-012
202: >5% Dif Radn consvn error -1.05825538715178 1 2
202: diags 7.96359955072742 -54.6696896639910 38.3301532002546
202: lai_change 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000
202: elai 0.796415587611356 0.000000000000000E+000 0.961509001506293
202: 0.000000000000000E+000 0.000000000000000E+000 0.961509001506293
202: 0.000000000000000E+000 0.000000000000000E+000 0.234465085324267
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: esai 9.096157657329497E-002 0.000000000000000E+000 3.849099849370675E-002
202: 0.000000000000000E+000 0.000000000000000E+000 3.849099849370675E-002
202: 0.000000000000000E+000 0.000000000000000E+000 9.398288976575598E-003
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: ftweight 1.267302001703947E-002 0.000000000000000E+000
202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000
202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000
202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000
202: cp 6.405767903805394E-010 1
202: bc_in(s)%albgr_dif_rb(ib) 0.190858817093915
202: rhol 0.100000001490116 0.100000001490116 0.100000001490116
202: 0.449999988079071 0.449999988079071 0.349999994039536
202: ftw 1.00000000000000 1.00000000000000 0.000000000000000E+000
202: 0.000000000000000E+000
202: present 1 0 0
202: CAP 1.00000000000000 0.000000000000000E+000 0.000000000000000E+000
331: Large Dir Radn consvn error 87300236774.1395 1 2
331: diags 35545013833.8197 -1.718567028306606E-002 -793747809365.306
331: 496278040697.993
331: lai_change 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000
331: elai 0.776682425289442 0.000000000000000E+000 0.961569569355599
331: 0.000000000000000E+000 0.000000000000000E+000 0.961569569355599
331: 0.000000000000000E+000 0.000000000000000E+000 0.227539226615268
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: esai 9.093202219977818E-002 0.000000000000000E+000 3.843043064440077E-002
331: 0.000000000000000E+000 0.000000000000000E+000 3.843043064440077E-002
331: 0.000000000000000E+000 0.000000000000000E+000 9.101385150350671E-003
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: ftweight 0.143517787345916 0.000000000000000E+000
331: 0.856482212654084 0.000000000000000E+000 0.000000000000000E+000
331: 0.856482212654084 0.000000000000000E+000 0.000000000000000E+000
331: 0.856482212654084 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000
331: cp 2.006325586387992E-009 1
331: bc_in(s)%albgr_dir_rb(ib) 0.220000000000000
331: dif ground absorption error 1 1 -2.968510966153521E+017
331: -2.968510966153521E+017 2 2 1.00000000000000
331: >5% Dif Radn consvn error 4.270016056591235E+016 1 2
331: diags 1.669646990961853E+016 -3.805783289940412E+017 2.374544661398212E+017
331: lai_change 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000
331: elai 0.776682425289442 0.000000000000000E+000 0.961569569355599
331: 0.000000000000000E+000 0.000000000000000E+000 0.961569569355599
331: 0.000000000000000E+000 0.000000000000000E+000 0.227539226615268
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: esai 9.093202219977818E-002 0.000000000000000E+000 3.843043064440077E-002
331: 0.000000000000000E+000 0.000000000000000E+000 3.843043064440077E-002
331: 0.000000000000000E+000 0.000000000000000E+000 9.101385150350671E-003
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
331: ftweight 7.801052745940848E-002 0.000000000000000E+000
331: 143.470563918829 0.000000000000000E+000 0.000000000000000E+000
331: 143.470563918829 0.000000000000000E+000 0.000000000000000E+000
331: 143.470563918829 0.000000000000000E+000 0.000000000000000E+000
331: 0.000000000000000E+000
331: cp 2.006325586387992E-009 1
331: bc_in(s)%albgr_dif_rb(ib) 0.220000000000000
331: rhol 0.100000001490116 0.100000001490116 0.100000001490116
331: 0.449999988079071 0.449999988079071 0.349999994039536
331: ftw 1.00000000000000 0.143517787345916 0.000000000000000E+000
331: 0.856482212654084
331: present 1 0 1
331: CAP 0.143517787345916 0.000000000000000E+000 0.856482212654084
331: there is still error after correction 1.00000000000000 1
331: 2
202: >5% Dif Radn consvn error -1.07307654594231 1 2
202: diags 8.03407121904317 -55.1147964199711 38.6409503555679
202: lai_change 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000
202: elai 0.796415587611356 0.000000000000000E+000 0.961509001506293
202: 0.000000000000000E+000 0.000000000000000E+000 0.961509001506293
202: 0.000000000000000E+000 0.000000000000000E+000 0.234465085324267
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: esai 9.096157657329497E-002 0.000000000000000E+000 3.849099849370675E-002
202: 0.000000000000000E+000 0.000000000000000E+000 3.849099849370675E-002
202: 0.000000000000000E+000 0.000000000000000E+000 9.398288976575598E-003
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: ftweight 1.267302001703947E-002 0.000000000000000E+000
202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000
202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000
202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000
202: cp 6.405767903805394E-010 1
202: bc_in(s)%albgr_dif_rb(ib) 0.190744628923151
202: rhol 0.100000001490116 0.100000001490116 0.100000001490116
202: 0.449999988079071 0.449999988079071 0.349999994039536
202: ftw 1.00000000000000 1.00000000000000 0.000000000000000E+000
202: 0.000000000000000E+000
202: present 1 0 0
202: CAP 1.00000000000000 0.000000000000000E+000 0.000000000000000E+000
331: energy balance in canopy 26844 , err= -11.9593662381158
331: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
331: nstep = 119588
331: errsol = -1323.30638249407
331: clm model is stopping - error is greater than 1e-5 (W/m2)
331: fsa = -7.745702732785249E+017
331: fsr = 7.745702732785236E+017
331: forc_solad(1) = 5.51145480639649
331: forc_solad(2) = 8.61256572561393
331: forc_solai(1) = 16.1417364406403
331: forc_solai(2) = 13.0406255214228
331: forc_tot = 43.3063824940735
331: clm model is stopping
331: calling getglobalwrite with decomp_index= 26844 and clmlevel= pft
331: local patch index = 26844
331: global patch index = 9516
331: global column index = 4795
331: global landunit index = 1267
331: global gridcell index = 296
331: gridcell longitude = 305.000000000000
331: gridcell latitude = -23.0890052356021
331: pft type = 1
331: column type = 1
331: landunit type = 1
331: ENDRUN:
331: ERROR in BalanceCheckMod.F90 at line 543
331:
331:

@rosiealice commented May 15, 2018 via email

@rgknox commented May 15, 2018

agreed @rosiealice , whatever is wrong, seems to be mediated by ftweight

@rgknox commented May 15, 2018

I will try to reproduce errors in that last post.

@jkshuman , could you post your create_case execution and any environment modifiers?

relevant parameters:

fates_paramfile = '/glade/p/work/jkshuman/FATES_data/parameter_files/param_file_2PFT_Obrienh_Saldaa_Saldal_05072018.nc'
 use_fates = .true.
 use_fates_ed_prescribed_phys = .false.
 use_fates_ed_st3 = .false.
 use_fates_inventory_init = .false.
 use_fates_logging = .false.
 use_fates_planthydro = .false.
 use_fates_spitfire = .true.
fsurdat = '/glade/scratch/jkshuman/sfcdata/surfdata_0.9x1.25_16pfts_Irrig_CMIP6_simyr2000_SA.nc'

@jkshuman commented:

OK, I have it down to days. It seems to be hung up, but I will restart from this case in debug mode and take a close look at ftweight. Going to use the 2-PFT case, as the 1-PFT trop tree run made it out to 51 years with fire; this seems to be a grass-and-fire issue. But I may try the single-PFT grass as well...
/glade2/scratch2/jkshuman/archive/Fire_Obrienh_Saldaa_Saldal_2PFT_SA1x1_2dba074_0f0c41c/

/glade2/scratch2/jkshuman/archive/Fire_Grass_SA_1x1_2dba074_0f0c41c/

@jkshuman commented:

path to restart files for 2PFT case:
/glade/scratch/jkshuman/archive/Fire_Obrienh_Saldaa_Saldal_2PFT_SA1x1_2dba074_0f0c41c/rest

path to my script for creating the case, and relevant params below:
/glade/p/work/jkshuman/FATES_data/case_fire_TreeGrass_tropics

./create_newcase --case ${casedir}${CASE_NAME} --res f09_f09 --compset 2000_DATM%GSWP3v1_CLM45%FATES_SICE_SOCN_RTM_SGLC_SWAV --run-unsupported
./xmlchange STOP_OPTION=ndays
./xmlchange STOP_N=1
./xmlchange REST_OPTION=ndays
./xmlchange RESUBMIT=50

./xmlchange JOB_WALLCLOCK_TIME=1:00

./xmlchange DATM_MODE=CLMGSWP3v1
./xmlchange DATM_CLMNCEP_YR_ALIGN=1985
./xmlchange DATM_CLMNCEP_YR_START=1985
./xmlchange DATM_CLMNCEP_YR_END=2004

./xmlchange RTM_MODE=NULL
./xmlchange ATM_DOMAIN_FILE=domain.lnd.fv0.9x1.25_gx1v6.SA.nc
./xmlchange ATM_DOMAIN_PATH=/glade/scratch/jkshuman/sfcdata
./xmlchange LND_DOMAIN_FILE=domain.lnd.fv0.9x1.25_gx1v6.SA.nc
./xmlchange LND_DOMAIN_PATH=/glade/scratch/jkshuman/sfcdata
./xmlchange CLM_USRDAT_NAME=SAmerica

./xmlchange NTASKS_ATM=-1
./xmlchange NTASKS_CPL=-15
./xmlchange NTASKS_GLC=-15
./xmlchange NTASKS_OCN=-15
./xmlchange NTASKS_WAV=-15
./xmlchange NTASKS_ICE=-15
./xmlchange NTASKS_LND=-15
./xmlchange NTASKS_ROF=-15
./xmlchange NTASKS_ESP=-15

@jkshuman commented:

The relevant parameters in user_nl_clm are as you have them listed above.

@rosiealice commented:

I think we need to look at why ftweight is >1. ftweight is the same as canopy_area_profile, which is set on:

currentPatch%canopy_area_profile(cl,ft,iv) = currentPatch%canopy_area_profile(cl,ft,iv) + &

I'd put a write statement there to catch anything going over 1... (or a slightly bigger number, so we don't get all these 10^-12 edge cases), and then print out the c_area, total_canopy_area, etc. if that happens. If you've got the runs down to days it shouldn't take long to find the culprit there. I'd be quite surprised if the ftweight wasn't the culprit here.
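As a rough illustration of that suggestion, here is a small standalone Fortran sketch of the kind of guard being described; the variable names and values are placeholders for illustration, not the actual FATES identifiers:

! Standalone sketch of the suggested debug guard: after accumulating the
! canopy area profile for a layer, warn if the per-layer sum drifts above 1
! by more than a small tolerance and dump the quantities that built it.
! All names and values here are illustrative, not the FATES variables.
program canopy_area_guard
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: area_tol = 1.0e-9_r8    ! ignore harmless ~1e-12 roundoff
  real(r8) :: canopy_area_profile(2,3)           ! (canopy layer, pft), sample data
  real(r8) :: c_area, total_canopy_area, layer_sum
  integer  :: cl, ft

  ! fabricated example inputs standing in for one cohort's contribution
  c_area            = 45.0_r8
  total_canopy_area = 40.0_r8
  do cl = 1, 2
     do ft = 1, 3
        canopy_area_profile(cl,ft) = 0.4_r8
     end do
  end do

  do cl = 1, 2
     layer_sum = sum(canopy_area_profile(cl,:))
     if (layer_sum > 1.0_r8 + area_tol) then
        write(*,*) 'canopy area profile sum exceeds 1 in layer', cl
        write(*,*) '  sum               =', layer_sum
        write(*,*) '  c_area            =', c_area
        write(*,*) '  total_canopy_area =', total_canopy_area
     end if
  end do
end program canopy_area_guard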

@rgknox commented May 15, 2018

So I was able to trigger an error using just cell -20.09N 305E, and your 2PFT case. The fail happens on April 17th of the 7th year.

FATES Dynamics:    7-04-17
0:forrtl: error (73): floating divide by zero
0:Image              PC                Routine            Line        Source             
0:cesm.exe           0000000003E1CF91  Unknown               Unknown  Unknown
0:cesm.exe           0000000003E1B0CB  Unknown               Unknown  Unknown
0:cesm.exe           0000000003DCCBC4  Unknown               Unknown  Unknown
0:cesm.exe           0000000003DCC9D6  Unknown               Unknown  Unknown
0:cesm.exe           0000000003D4C4B9  Unknown               Unknown  Unknown
0:cesm.exe           0000000003D58AE9  Unknown               Unknown  Unknown
0:libpthread-2.19.s  00002AAAAFAC1870  Unknown               Unknown  Unknown
0:cesm.exe           0000000002B8581B  dynpatchstateupda         189  dynPatchStateUpdaterMod.F90
0:cesm.exe           0000000000A1CCCC  dynsubgriddriverm         284  dynSubgridDriverMod.F90
0:cesm.exe           000000000087E555  clm_driver_mp_clm         306  clm_driver.F90
0:cesm.exe           000000000084B5B9  lnd_comp_mct_mp_l         451  lnd_comp_mct.F90
0:cesm.exe           000000000046BD2D  component_mod_mp_         688  component_mod.F90
0:cesm.exe           000000000043C474  cime_comp_mod_mp_        2652  cime_comp_mod.F90
0:cesm.exe           00000000004543B7  MAIN__                     68  cime_driver.F90
0:cesm.exe           0000000000415A5E  Unknown               Unknown  Unknown
0:libc-2.19.so       00002AAAB190AB25  __libc_start_main     Unknown  Unknown
0:cesm.exe           0000000000415969  Unknown               Unknown  Unknown
-1:MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
-1:	aborting job
MPT: Received signal 6

@jkshuman commented May 15, 2018 via email

@jkshuman commented:

Got it to the day of failure (October 30, year 7). Will kick it off in debug mode to see if I get the same error as you did, @rgknox (similar error as previous, and same location: lon = 305, lat = -23.089).
from cesm.log
bc_in(s)%albgr_dif_rb(ib) 0.220000000000000
331: rhol 0.100000001490116 0.100000001490116 0.100000001490116
331: 0.449999988079071 0.449999988079071 0.349999994039536
331: ftw 1.00000000000000 0.143517787251814 0.000000000000000E+000
331: 0.856482212748186
331: present 1 0 1
331: CAP 0.143517787251814 0.000000000000000E+000 0.856482212748186
331: there is still error after correction 1.00000000000000 1
331: 2
202: >5% Dif Radn consvn error -1.07341422635010 1 2
202: diags 8.03574910457470 -55.1258110560189 38.6485853190346
202: lai_change 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000
202: elai 0.796415126488024 0.000000000000000E+000 0.961509014797645
202: 0.000000000000000E+000 0.000000000000000E+000 0.961509014797645
202: 0.000000000000000E+000 0.000000000000000E+000 0.234466930897031
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: esai 9.096157669642455E-002 0.000000000000000E+000 3.849098520235514E-002
202: 0.000000000000000E+000 0.000000000000000E+000 3.849098520235514E-002
202: 0.000000000000000E+000 0.000000000000000E+000 9.398356961483976E-003
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: ftweight 1.267295049486910E-002 0.000000000000000E+000
202: 29.1628509591272 0.000000000000000E+000 0.000000000000000E+000
202: 29.1628509591272 0.000000000000000E+000 0.000000000000000E+000
202: 29.1628509591272 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000
202: cp 6.410821458268472E-010 1
202: bc_in(s)%albgr_dif_rb(ib) 0.190743513017422
202: rhol 0.100000001490116 0.100000001490116 0.100000001490116
202: 0.449999988079071 0.449999988079071 0.349999994039536
202: ftw 1.00000000000000 1.00000000000000 0.000000000000000E+000
202: 0.000000000000000E+000
202: present 1 0 0
202: CAP 1.00000000000000 0.000000000000000E+000 0.000000000000000E+000
331: energy balance in canopy 26844 , err= -11.9601284804630
331: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
331: nstep = 119588
331: errsol = 724.693617505926
331: clm model is stopping - error is greater than 1e-5 (W/m2)
331: fsa = -7.745702333124070E+017
331: fsr = 7.745702333124078E+017
331: forc_solad(1) = 5.51145480639649
331: forc_solad(2) = 8.61256572561393
331: forc_solai(1) = 16.1417364406403
331: forc_solai(2) = 13.0406255214228
331: forc_tot = 43.3063824940735
331: clm model is stopping
331: calling getglobalwrite with decomp_index= 26844 and clmlevel= pft
331: local patch index = 26844
331: global patch index = 9516
331: global column index = 4795
331: global landunit index = 1267
331: global gridcell index = 296
331: gridcell longitude = 305.000000000000
331: gridcell latitude = -23.0890052356021
331: pft type = 1
331: column type = 1
331: landunit type = 1
331: ENDRUN:
331: ERROR in BalanceCheckMod.F90 at line 543
331:
331:
331:

@rgknox commented May 16, 2018

Here is a print message at the time of the fail; this is from subroutine set_new_weights() in dynPatchStateUpdaterMod.F90.

The problem is triggered because, from the second-to-last step to the last, the bare-ground patch goes to a weight of zero, while somehow its old (previous) weight was negative.

print*,bounds%begp,bounds%endp,p,this%pwtgcell_old(p),this%pwtgcell_new(p)

0:           1          32           3  0.998904682346343     0.998904682346344     
0:           1          32           3  0.998904682346344     0.998904682346344     
0:           1          32           3  0.998904682346344     0.998904682346344     
0:           1          32           1 -2.218013955499719E-016  0.000000000000000E+000
 subroutine set_new_weights(this, bounds)
    !                                                                                                                                                                                        
    ! !DESCRIPTION:                                                                                                                                                                          
    ! Set subgrid weights after dyn subgrid updates                                                                                                                                          
    !                                                                                                                                                                                        
    ! !USES:                                                                                                                                                                                 
    !                                                                                                                                                                                        
    ! !ARGUMENTS:                                                                                                                                                                            
    class(patch_state_updater_type), intent(inout) :: this
    type(bounds_type), intent(in) :: bounds
    !                                                                                                                                                                                        
    ! !LOCAL VARIABLES:                                                                                                                                                                      
    integer :: p

    character(len=*), parameter :: subname = 'set_new_weights'
    !-----------------------------------------------------------------------                                                                                                                 

    do p = bounds%begp, bounds%endp
       this%pwtgcell_new(p) = patch%wtgcell(p)
       this%dwt(p) = this%pwtgcell_new(p) - this%pwtgcell_old(p)
       if (this%dwt(p) > 0._r8) then
          print*,bounds%begp,bounds%endp,p,this%pwtgcell_old(p),this%pwtgcell_new(p)
          this%growing_old_fraction(p) = this%pwtgcell_old(p) / this%pwtgcell_new(p)
          this%growing_new_fraction(p) = this%dwt(p) / this%pwtgcell_new(p)
       else
          ! These values are unused in this case, but set them to something reasonable for                                                                                                   
          ! safety. (We could set them to NaN, but that requires a more expensive                                                                                                            
          ! subroutine call, using the shr_infnan_mod infrastructure.)                                                                                                                       
          this%growing_old_fraction(p) = 1._r8
          this%growing_new_fraction(p) = 0._r8
       end if
    end do

  end subroutine set_new_weights
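For reference, here is a minimal standalone sketch (not the CTSM source) of how the weights printed above can trip this routine: a slightly negative old weight together with a new weight of exactly zero makes dwt positive, so the growing branch is taken and the growing_old_fraction division hits a zero denominator, consistent with the divide-by-zero reported from dynPatchStateUpdaterMod.F90.

! Standalone illustration of the failure path described above; values are
! taken from the print statement, everything else is illustrative.
program dwt_divide_by_zero
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: wt_old, wt_new, dwt, growing_old_fraction

  wt_old = -2.218013955499719e-16_r8   ! old bare-patch weight from the print above
  wt_new =  0.0_r8                     ! new bare-patch weight from the print above
  dwt    =  wt_new - wt_old            ! = +2.2e-16 > 0, so the growing branch is taken

  if (dwt > 0.0_r8) then
     if (wt_new > 0.0_r8) then
        growing_old_fraction = wt_old / wt_new
        write(*,*) 'growing_old_fraction =', growing_old_fraction
     else
        write(*,*) 'wt_new is zero: the division above would be a divide by zero'
     end if
  end if
end program dwt_divide_by_zero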

@rgknox commented May 16, 2018

The interface call wrap_update_hlmfates_dyn(), in clmfates_interfaceMod.F90, is responsible for calculating these weights.

We sum up the canopy fractions, via this output boundary condition:

this%fates(nc)%bc_out(s)%canopy_fraction_pa(1:npatch)

But if this sum is above 1, which it shouldn't be, we will have problems and will calculate a negative bare-patch area. Somehow that is happening in this run. I put a breakpoint where this endrun used to be:

https://github.com/ESCOMP/ctsm/blob/master/src/utils/clmfates_interfaceMod.F90#L830
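A minimal sketch of the failure mode described here, assuming the bare-ground patch fraction is derived as one minus the sum of the per-patch canopy fractions (placeholder names, not the actual interface code):

! Illustrative only: sum the per-patch canopy fractions handed out of FATES and
! derive the bare-ground fraction as the remainder.  If the sum creeps above 1,
! the bare fraction goes negative, which is the symptom discussed above.
program bare_fraction_check
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: npatch = 3
  real(r8) :: canopy_fraction_pa(npatch), frac_sum, bare_frac

  ! fabricated patch fractions whose sum slightly exceeds 1
  canopy_fraction_pa = (/ 0.40_r8, 0.35_r8, 0.2500000000000002_r8 /)

  frac_sum  = sum(canopy_fraction_pa(1:npatch))
  bare_frac = 1.0_r8 - frac_sum

  if (bare_frac < 0.0_r8) then
     write(*,*) 'sum of patch canopy fractions exceeds 1:', frac_sum
     write(*,*) 'bare-ground fraction would be negative :', bare_frac
     ! one possible guard (an illustrative choice): clip to zero and renormalize
     bare_frac = 0.0_r8
     canopy_fraction_pa = canopy_fraction_pa / frac_sum
  end if
  write(*,*) 'bare fraction used:', bare_frac
end program bare_fraction_check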

@rgknox commented May 16, 2018

I think one bug is that we are not zeroing out bc_out(s)%canopy_fraction_pa(1:npatch) in the subroutine that fills it, update_hlm_dynamics(). So if we shrink the total number of patches, we have an extra index that is still contributing to total patch area. I will test this.
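A small sketch of the zero-then-fill idea, assuming a fixed-size boundary-condition array that is refilled on every call (placeholder names and sizes, not the real bc_out structure):

! Illustrative only: reset the whole array before refilling the currently used
! patch indices, so a shrinking patch count cannot leave stale area in the tail.
program zero_before_fill
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: maxpatch = 10
  real(r8) :: canopy_fraction_pa(maxpatch)
  integer  :: npatch, ifp

  canopy_fraction_pa = 0.25_r8     ! stale values from a previous, larger patch count
  npatch = 3                       ! patch count after patches were fused/removed

  canopy_fraction_pa(:) = 0.0_r8   ! zero everything first ...
  do ifp = 1, npatch               ! ... then fill only the used indices
     canopy_fraction_pa(ifp) = real(ifp, r8) * 0.1_r8
  end do

  write(*,*) 'sum over used patches :', sum(canopy_fraction_pa(1:npatch))
  write(*,*) 'sum over whole array  :', sum(canopy_fraction_pa)   ! now the same
end program zero_before_fill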

@rgknox commented May 16, 2018

Actually, that probably wasn't the problem... although zeroing would have been better, we should only be passing the used indexes in that array...

@rosiealice commented May 16, 2018 via email

@jkshuman commented:

I have been focusing on the fire runs. With the updates to master and continued testing, the fail still occurs for grass and for tree/grass runs with fire. I had a tree fire run which completed through year 51 with reasonable biomass. My 2-PFT debug fire run is still in the queue, so no update there.

With grass the difference is that when it burns, it burns completely. So, this could be a response to the grass flammability specifically and, as @rosiealice said, completely burned patches.

@rgknox commented May 16, 2018

For the problem I'm currently working through (which may or may not be related to what is ultimately killing Jackie's runs), one issue is that total_canopy_area is exceeding patch area. We currently don't force total_canopy_area to be equal to or less than patch area.

I'm also noticing that when we do canopy promotion/demotion, we have a fairly relaxed tolerance on layer-area exceedance of patch area: 1e-4.

I'm wondering if grasses give the canopy demotion/promotion scheme a particularly challenging time at layering. Maybe in this specific case we are left with not-so-precise canopy areas, which is creating weirdness?
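As an illustration of the constraint being discussed, here is a standalone sketch that caps total_canopy_area at the patch area and flags exceedances larger than a 1e-4-style relative tolerance; names and values are placeholders, not the FATES code:

! Illustrative only: never let the summed canopy area exceed the patch area,
! and report exceedances that are larger than the promotion/demotion tolerance.
program clamp_canopy_area
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: area_tol = 1.0e-4_r8
  real(r8) :: patch_area, total_canopy_area

  patch_area        = 120.0_r8
  total_canopy_area = 120.00003_r8    ! fabricated slight exceedance

  if (total_canopy_area > patch_area) then
     if (total_canopy_area - patch_area > area_tol * patch_area) then
        write(*,*) 'large canopy-area exceedance:', total_canopy_area - patch_area
     end if
     total_canopy_area = patch_area   ! cap canopy area at the patch area
  end if
  write(*,*) 'total_canopy_area used:', total_canopy_area
end program clamp_canopy_area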

@rgknox commented May 17, 2018

Here is an error log that I think corroborates the ftweight issue. During leaf_area_profile() we construct several canopy-layer x pft x leaf-layer arrays; cpatch%canopy_area_profile(cl,ft,iv) is converted directly into ftweight. We have a few checks in the scheme, which can be switched on, one of which fails gracefully if canopy_area_profile exceeds 1.0 for any given layer.

FATES: A canopy_area_profile exceeded 1.0
 cl:            1
 iv:            1
 sum(cpatch%canopy_area_profile(cl,:,iv)):    1.65653669059244     
 FATES: cohorts in layer cl =            1  0.376936443831203     
  7.401777278905496E-009  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           3  0.274264111110705     
 FATES: cohorts in layer cl =            1   4.47710468466018     
  1.069014260600514E-009  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           1  3.961106027654241E-002
 FATES: cohorts in layer cl =            1   4.79421520149869     
  5.313109854499176E-010  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           1  1.968710076741488E-002
 FATES: cohorts in layer cl =            1   5.13024998876371     
  6.459332537834644E-010  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           1  2.393429348254634E-002
 FATES: cohorts in layer cl =            1   5.79933797252383     
  3.505819861862652E-008  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           1   1.29904012495523     

In this case, we have a few cohorts contributing crown area to the offending layer, layer 1. Layer 1 is also the top layer, and it should be assumed there is an understory layer also. The cohorts appear to be normal, no nans, no garbage values...
It is a small patch in terms of area, and it has a combination of PFT1 and PFT 3 in that layer.

Note that the area fraction of the last cohort is 130% of the area. I'm not sure why the other cohorts are sharing the top layer (cl==1) with it, if this cohort, which is the largest, is filling that layer completely. This is particularly strange/wrong because we have grasses sharing that layer with a couple of 5 cm cohorts.

I'm wondering if this is a precision problem, as indicated in a post above. The area on this patch is very small, but large enough to keep. Although, the promotion/demotion precision is about 4 orders of magnitude larger than the size of the patch...

@jkshuman commented Jun 6, 2018

New runs using 1) rgknox promotion/demotion updates (PR #388), 2) updated API 4.0.0, and 3) updated CTSM changes. Two runs: one using CLM4.5 and one using CLM5, each with 2 PFTs (TropTree and Grass) and active fire.

clm45 completed to year 63 and still running, in queue at the moment. /glade2/scratch2/jkshuman/archive/Fire_rgknox_area_fixes_clm45_2PFT_1x1_692ba82_992e968/lnd/hist

clm5 failed in year 6 with an error in EDPatchDynamicsMod.F90 associated with high fire area and patch trimming.
/glade2/scratch2/jkshuman/Fire_rgknox-area-fixes_2PFT_1x1_692ba82_992e968/run

from cesm.log
very high fire areas 0.983208971507476 0.983208971507476
413: Projected Canopy Area of all FATES patches
413: cannot exceed 1.0
517: trimming patch area - is too big 1.818989403545856E-012
570: trimming patch area - is too big 1.818989403545856E-012
533: trimming patch area - is too big 1.818989403545856E-012
110: trimming patch area - is too big 1.818989403545856E-012
110: patch area correction produced negative area 10000.0000000000
110: 1.818989403545856E-012 -4.939832763539551E-013
61: trimming patch area - is too big 1.818989403545856E-012
443: trimming patch area - is too big 1.818989403545856E-012
110: ENDRUN:
110: ERROR in EDPatchDynamicsMod.F90 at line 722
110:
110:
110:
110:
110:
110:
110: ERROR: Unknown error submitted to shr_abort_abort.
431: Projected Canopy Area of all FATES patches
431: cannot exceed 1.0

@rgknox commented Jun 6, 2018

@jkshuman, that new fail is an error check that I put into the branch you are currently testing.

What happened is that the model determined that the total patch area exceeded 10,000 m2, so it removes the excess from one of its patches. But we have been removing it from the oldest patch, and up until now we have never checked whether that patch has enough area to donate.

This can be solved by removing the area from the largest patch, instead of the oldest patch.

I will make a correction and update the branch.
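A standalone sketch of that correction, assuming the excess over the 10,000 m2 site area is simply subtracted from the largest patch rather than the oldest one (illustrative arrays, not the FATES patch structure):

! Illustrative only: the largest patch, unlike a near-zero-area oldest patch,
! has area available to donate for the correction.
program correct_from_largest_patch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: site_area = 10000.0_r8
  integer, parameter :: npatch = 4
  real(r8) :: patch_area(npatch), excess
  integer  :: ilargest

  ! fabricated areas whose sum is slightly above 10,000 m2, with a tiny oldest patch
  patch_area = (/ 1.8e-12_r8, 6000.0_r8, 3000.0_r8, 1000.0000000005_r8 /)

  excess = sum(patch_area) - site_area
  if (excess > 0.0_r8) then
     ilargest = maxloc(patch_area, dim=1)
     patch_area(ilargest) = patch_area(ilargest) - excess   ! take it from the largest patch
  end if
  write(*,*) 'corrected total area:', sum(patch_area)
end program correct_from_largest_patch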

@rgknox commented Jun 6, 2018

Updated the branch. Here is the change:

e85b681

@jkshuman , I will fire off some tests.

@rgknox commented Jun 6, 2018

Hold a moment before testing, though; it needs a quick tweak. I forgot to declare "nearzero".

@rosiealice commented Jun 6, 2018 via email

@rgknox commented Jun 6, 2018

@jkshuman @rosiealice and I had a review and discussion of changes in PR #388. Added some updates to code per our discussion. @jkshuman I'm going to pass it through the regression tests now.

@jkshuman commented Jun 7, 2018

Revising this to correct my mistaken runs from earlier. Confirmed that the branch code pulled in the correct changes from the rgknox repo.
Updated code with more rgknox-area-fixes (commit 658064e) and CTSM changes. Similar setup: CLM4.5 and CLM5 with active fire and 2 PFTs (trop tree and grass) for the South America region.
CLM5 successfully running into year 18, and still going...
CLM45 successfully running into year 20, and still going...

clm5: /glade/scratch/jkshuman/archive/Fire_rgknox_areafixes_0607_2PFT_1x1_fdce2b2_26542ea/
clm45:/glade/scratch/jkshuman/archive/Fire_rgknox_areafixes_0607_clm45_2PFT_1x1_fdce2b2_26542ea/

@jkshuman commented Jun 8, 2018

Runs are up to year 92 for CLM5 and year 98 for CLM4.5. I am going to call this closed and will open a new issue if anything else comes up, as the code has diverged since this was opened...
To summarize: fixes included pull requests #382 and #388 and @rgknox's fixes in repo branches for fates and ctsm.
ctsm branch from rgknox_ctsm_repo-protectbaresoilfrac
fates branch from rgknox-area-fix merged with master sci.1.14.0_api.4.0.0

branch details for ctsm and fates below.

fates git log details:
26542ea (HEAD, rgknox-areafix-0607_api4.0.0) Merge branch 'rgknox-area-fixes' into rgknox-areafix-0607_api4.0.0
ce689da (rgknox-area-fixes) Merge branch 'rgknox-area-fixes' of https://github.com/rgknox/fates into rgknox-area-fixes
658064e (rgknox_repo/rgknox-area-fixes) Updated some comments, added back protections on patch canopy areas exceeding 1 during the output boundary condition preparations.
c357399 Merge branch 'rgknox-area-fixes' of github.com:rgknox/fates into rgknox-area-fixes
e85b681 Fixed area checking logic on their sum to 10k
0f2003b Merge remote-tracking branch 'rgknox_repo/rgknox-area-fixes' into rgknox-area-fixes
34bfcdb Resolved conflict in EDCanopyStructureMod, used HEAD over master
5e92e69 (master) Merge remote-tracking branch 'ngeet_repo/master'
14aeb4f (tag: sci.1.14.0_api.4.0.0, ngeet_repo/master) Merge pull request #381 from rgknox/rgknox-soildepth-clm5

ctsm git log details:
fdce2b2 (HEAD, rgknox_ctsm_repo/rgknox-fates-protectbaresoilfrac, rgknox-fates-protectbaresoilfrac, fates_next_api_rgknox_protectbaresoilfrac) Protected fates calculation of bare-soil area to not go below 0
692ba82 (origin/fates_next_api, fates_next_api) Merge pull request #375 from rgknox/rgknox-fates-varsoildepth
1cdd0e6 Merge pull request #390 from ckoven/fateshistdims
8eb90b1 (rgknox_ctsm_repo/rgknox-fates-varsoildepth) Changed a 1.0 r4 to r8
e9b7b68 Updating fates external to sci.1.14.0_api.4.0.0

jkshuman closed this as completed Jun 8, 2018
@rosiealice commented Jun 8, 2018 via email
