Hanging case #267

jkshuman · 2017-08-30T19:30:58Z

This is related to Issue #250, but uses the most up-to-date fates-clm and fates code (version 0115fbc). The behavior is the same as in Issue #250: the model runs, then hangs, and will not restart. Previous runs related to Issue #250 showed that this happens with fire active as well as without fire, and with 6 PFTs as well as with 1 PFT. The time of the hang seems to be random.

I am using the recent default parameter file (fates_params_2troppftclones.c170810.nc) without modifications. Again, this uses the most up-to-date fates-clm and fates code (version 0115fbc). Case details:

./create_newcase -case /glade/p/work/jkshuman/FATES_cases/Debug/Debug0825_Fire_clmED_4x5_Default2PFT_GSWP3_BGC -res f45_f45 -compset 2000_DATM%QIA_CLM45%ED_SICE_SOCN_RTM_SGLC_SWAV

./xmlchange STOP_OPTION=nyears
./xmlchange DATM_MODE=CLMGSWP3
./xmlchange DATM_CLMNCEP_YR_ALIGN=1985
./xmlchange DATM_CLMNCEP_YR_START=1985
./xmlchange DATM_CLMNCEP_YR_END=2004
(Cheyenne PE layout from Erik)
./xmlchange NTASKS_ATM=-1
./xmlchange NTASKS_CPL=-15
./xmlchange NTASKS_GLC=-15
./xmlchange NTASKS_OCN=-15
./xmlchange NTASKS_WAV=-15
./xmlchange NTASKS_ICE=-15
./xmlchange NTASKS_LND=-15
./xmlchange NTASKS_ROF=-15
./xmlchange NTASKS_ESP=-15
./xmlchange ROOTPE_ATM=0
./xmlchange ROOTPE_CPL=-1
./xmlchange ROOTPE_GLC=-1
./xmlchange ROOTPE_OCN=-1
./xmlchange ROOTPE_WAV=-1
./xmlchange ROOTPE_ICE=-1
./xmlchange ROOTPE_LND=-1
./xmlchange ROOTPE_ROF=-1
./xmlchange ROOTPE_ESP=-1

user_nl_clm
use_ed=.true.
use_fates_spitfire=.true.

jkshuman · 2017-08-30T19:34:58Z

@ekluzek @rgknox @ckoven @rosiealice
This case was run over the weekend, and I have it up to the day of the hang: Year 26, Month 09, Day 28.
In Debug mode the run terminates rather than hangs. Within EDCanopyStructureMod there are multiple NaNs in the canopy demotion section of the code (sum_weights, weight, excl_weight). These are NaN on some processors but not others.

serbinsh · 2017-08-30T19:36:34Z

@jkshuman Do you get a PIO write error? I am still getting those in some runs after some length of time due to NaNs, which I currently assume are related to NEP. That is, I sometimes have a run that suddenly crashes because NaNs are being written to the netCDF output.

jkshuman · 2017-08-30T19:39:32Z

@serbinsh - Thank you, but I haven't seen a PIO error.

serbinsh · 2017-08-30T19:39:36Z

For example

 clmfates_interfaceMod.F90:: reading fates_low_moisture_Slope
 clmfates_interfaceMod.F90:: reading fates_mid_moisture_Coeff
 clmfates_interfaceMod.F90:: reading fates_mid_moisture_Slope
 clmfates_interfaceMod.F90:: reading fates_alpha_FMC
 clmfates_interfaceMod.F90:: reading fates_max_decomp
 clmfates_interfaceMod.F90:: reading q10_mr
 clmfates_interfaceMod.F90:: reading froz_q10
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Variable not found
 NetCDF: Variable not found
 NetCDF: Numeric conversion not representable
 pio_support::pio_die:: myrank=          -1 : ERROR: pionfwrite_mod::write_nfdarray_double:         250 : NetCDF: Numeric conversion not representable
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

serbinsh · 2017-08-30T19:40:10Z

@jkshuman sorry, I posted that right after you replied....seems like you are having a different issue.

jkshuman · 2017-08-30T19:43:12Z

@serbinsh with termination (in Debug) I get this:

194: end run_mct
-1:MPT ERROR: MPI_COMM_WORLD rank 319 has terminated without calling MPI_Finalize()
-1: aborting job
MPT: Received signal 6

jkshuman · 2017-08-30T19:43:50Z

When it's not in Debug, the run will terminate only due to time out without any errors of note.

jkshuman · 2017-08-30T19:48:33Z

On Tuesday I updated the code to write a message when variables in canopy demotion are NaN, and added a temp variable for part of an expression (dbh_comp_excln = currentCohort%dbh**ED_val_comp_excln). With the rebuild the code is slightly altered, and in the interactive debugger it stepped further along than the previous version. Also in this version, the temp variable "dbh_comp_excln" seems to hold the value from a previous cohort. There was at least one case where "dbh_comp_excln" was zero, but the debugger and the values for currentCohort%dbh said it should have a value. Through all of this currentCohort%excl_weight was always a NaN. currentCohort%excl_weight = 1.0_r8/(currentCohort%dbh**ED_val_comp_excln)

At the suggestion of @ekluzek I have added an "end run" to the write statements to abort on the NaN. Next I will try Charlie's strict PPA by modifying the param file to have a negative value for ED_val_comp_excln.

serbinsh · 2017-08-30T20:02:00Z

Hmmm....that error message is a bit strange...it does seem like the run hung and then was killed by a scheduler?

jkshuman · 2017-08-30T20:02:02Z

restart file:
/glade/scratch/jkshuman/Debug0825_Fire_clmED_4x5_Default2PFT_GSWP3_BGC/run

ekluzek · 2017-08-30T20:16:01Z

@serbinsh the error Jackie showed occurs because the run dies from a floating point trap, which shuts down one of the MPI tasks without calling MPI_Finalize to shut down the MPI job cleanly. In the cases where it hangs it doesn't report an error, but eventually reaches the queue wall clock limit. It might then report a different error, but only after however long the wall clock limit was set for the job.
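
For illustration only (a Python sketch, not FATES code): in a build without floating-point trapping, a NaN does not stop the program by itself. It propagates through later arithmetic, and every ordered comparison against it is false, so a guard such as `if (cc_loss > 0._r8)` is quietly skipped. Python raises on float division by zero, so the IEEE results are produced explicitly here; the helper name `untrapped_divide` is made up for this example. Whether this behavior is what ends in the hang is not established in this thread.

```python
import math

def untrapped_divide(num, den):
    """Mimic IEEE 754 division with traps disabled (hypothetical helper)."""
    if den == 0.0:
        if num == 0.0 or math.isnan(num):
            return math.nan                      # 0/0 -> NaN
        return math.copysign(math.inf, num)      # x/0 -> +/-inf
    return num / den

lossarea = 10.0
weight = untrapped_divide(0.0, 0.0)   # excl_weight = 0 and sum_weights = 0
cc_loss = lossarea * weight           # the NaN propagates through the product

print(math.isnan(cc_loss))   # the NaN survived the multiplication
print(cc_loss > 0.0)         # ordered comparisons with NaN evaluate to False
```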

jkshuman · 2017-08-30T20:19:52Z

After adding the "end run" if NAN, I rebuilt and am running in the interactive debugger from the daily restart file. Things progressed along apparently without canopy demotion for quite a while, but things just terminated due to a floating point error. Error text below.

Error from Allinea DDT:
maxcohorts exceeded 5.500000000000001E-002
maxcohorts exceeded 6.050000000000001E-002
maxcohorts exceeded 6.655000000000001E-002
maxcohorts exceeded 7.320500000000002E-002
maxcohorts exceeded 5.500000000000001E-002
forrtl: error (65): floating invalid
Image PC Routine Line Source
cesm.exe 0000000003F8BDE1 Unknown Unknown Unknown
cesm.exe 0000000003F89F1B Unknown Unknown Unknown
cesm.exe 0000000003F3C884 Unknown Unknown Unknown
cesm.exe 0000000003F3C696 Unknown Unknown Unknown
cesm.exe 0000000003EBC179 Unknown Unknown Unknown
cesm.exe 0000000003EC82CC Unknown Unknown Unknown
libpthread-2.19.s 00002AAAAF8BB870 Unknown Unknown Unknown
cesm.exe 00000000012A6908 edcanopystructure 247 EDCanopyStructureMod.F90
cesm.exe 000000000130F907 edmainmod_mp_ed_u 398 EDMainMod.F90
cesm.exe 00000000008EFB4E clmfatesinterface 646 clmfates_interfaceMod.F90
cesm.exe 00000000008984DC clm_driver_mp_clm 876 clm_driver.F90
cesm.exe 000000000084C569 lnd_comp_mct_mp_l 443 lnd_comp_mct.F90
cesm.exe 0000000000461405 component_mod_mp_ 681 component_mod.F90
cesm.exe 0000000000434EF0 cesm_comp_mod_mp_ 2649 cesm_comp_mod.F90
cesm.exe 0000000000449F13 MAIN__ 67 cesm_driver.F90
cesm.exe 000000000040831E Unknown Unknown Unknown
libc-2.19.so 00002AAAB02C1B25 __libc_start_main Unknown Unknown
cesm.exe 0000000000408229 Unknown Unknown Unknown

rgknox · 2017-08-30T20:32:18Z

Hi @jkshuman

Can you copy and paste the code around line 247 of EDCanopyStructureMod.F90 for your version of the code? Sorry if I lost the thread, but that line is a do loop header on the master branch.

rosiealice · 2017-08-30T20:32:30Z

What's on line 247 in your EDCanopyStructureMod.F90 file?

rosiealice · 2017-08-30T20:32:46Z

Sorry, cross posted w Ryan.

rgknox · 2017-08-30T20:33:08Z

sorry, cross posted w Rosie

rgknox · 2017-08-30T20:33:24Z

ok, that last post was just to be cheeky

ckoven · 2017-08-30T20:33:34Z

i was wondering the same thing

jkshuman · 2017-08-30T20:35:30Z

line 247: weight = currentCohort%excl_weight/sum_weights(i)
Per ddt, termination due to currentCohort%excl_weight=0 in line 247.
line 188 sets this: currentCohort%excl_weight = 1.0_r8/(currentCohort%dbh**ED_val_comp_excln)
According to ddt:
currentCohort%dbh = 26.6106
ED_val_comp_excln = 0.1

Why is currentCohort%excl_weight zero?
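
As a quick sanity check, the line 188 expression can be evaluated by hand with the two values reported by ddt. This is a Python sketch of the arithmetic only; the variable names are lower-cased copies of the Fortran ones.

```python
dbh = 26.6106              # currentCohort%dbh as reported by ddt
ed_val_comp_excln = 0.1    # ED_val_comp_excln as reported by ddt

# Same form as line 188: excl_weight = 1.0_r8/(dbh**ED_val_comp_excln)
excl_weight = 1.0 / (dbh ** ed_val_comp_excln)
print(excl_weight)
```

With these inputs the result is roughly 0.72, not zero, so a zero excl_weight would not be expected from these two values alone.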

jkshuman · 2017-08-30T20:37:59Z

currentCohort => currentPatch%tallest
do while (associated(currentCohort))
if(currentCohort%canopy_layer == i)then !All the trees in this layer need to lose some area...

                  if (ED_val_comp_excln .ge. 0) then
                     weight = currentCohort%excl_weight/sum_weights(i)
                     cc_loss = lossarea*weight !what this cohort has to lose. 
                  else
                     ! in deterministic ranking mode, cohort loss is not renormalized
                     cc_loss = currentCohort%excl_weight
                  endif

jkshuman · 2017-08-30T20:39:51Z

           ! Correct the demoted cohorts for  
            if (ED_val_comp_excln .ge. 0) then !set to 0.1 in default param file JKS
            do while (associated(currentCohort))
               if(currentCohort%canopy_layer  ==  i) then

                     if (shr_infnan_isnan(currentCohort%excl_weight)) then !there is a nan JKS
                        write(fates_log(),*) 'excl_weight_NAN: excl weight, dbh, site lat and lon',  currentCohort%excl_weight,& !JKS
                            currentCohort%dbh, currentsite%lat, currentsite%lon 
                        call endrun(msg=errMsg(sourcefile, __LINE__))
                     endif

                  weight = currentCohort%excl_weight/sumdiff(i)   !excl_weight and sumdiff should not be zero JKS  

                     if (shr_infnan_isnan(weight)) then !there is a nan JKS
                        write(fates_log(),*) 'weight_NAN: weight, dbh, site lat and lon',  weight,& !JKS
                            currentCohort%dbh, currentsite%lat, currentsite%lon 
                        call endrun(msg=errMsg(sourcefile, __LINE__))
                     endif

                  currentCohort%excl_weight = min(currentCohort%c_area/lossarea, weight)
                  sum_weights(i) = sum_weights(i) + currentCohort%excl_weight
               endif
               currentCohort => currentCohort%shorter      
            enddo
            endif

            currentCohort => currentPatch%tallest
            do while (associated(currentCohort))      
               if(currentCohort%canopy_layer == i)then !All the trees in this layer need to lose some area...

                  if (ED_val_comp_excln .ge. 0) then
                     weight = currentCohort%excl_weight/sum_weights(i)
                     cc_loss = lossarea*weight !what this cohort has to lose. 
                  else
                     ! in deterministic ranking mode, cohort loss is not renormalized
                     cc_loss = currentCohort%excl_weight
                  endif
                  if (cc_loss > 0._r8) then

                  !-----------Split and copy boundary cohort-----------------!

rgknox · 2017-08-30T20:47:57Z

My first thought: when we are comparing ED_val_comp_excln .ge. 0, we are comparing a real number to an integer. Maybe change that hard-coded 0 to 0.0_r8. Without consulting the literature, it's possible that ED_val_comp_excln is being forced to an integer, which would make it zero and ... who knows.

Are you sure that the debugger last stepped through the code that is bounded by the if(ED_val_comp_excln.ge.0)?

rosiealice · 2017-08-30T20:48:46Z

Can you attach the source file to the thread, so we can check out the line #'s?

rosiealice · 2017-08-30T20:50:08Z

I think it was a 0.1 in the parameter file (and in the DDT output) that we were looking at.

rosiealice · 2017-08-30T20:50:20Z

'it' being ED_val_comp_excln

rgknox · 2017-08-30T20:50:22Z

Also, the calculation of weight when excl_weight = 0.0 should be fine. It's sum_weights(i) being zero that will cause a problem. Is sum_weights also zero?

line 247: weight = currentCohort%excl_weight/sum_weights(i)
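
One possible way to make this failure explicit is to check the denominator before normalizing. The following is a minimal Python sketch under that assumption, not the FATES implementation; `cohort_losses` and the `tiny` threshold are hypothetical names and values chosen for the example.

```python
def cohort_losses(lossarea, excl_weights, tiny=1.0e-12):
    """Split lossarea across cohorts in proportion to their exclusion weights."""
    total = sum(excl_weights)
    # "not total > tiny" is also true when total is NaN, because every
    # ordered comparison against NaN is False.
    if not total > tiny:
        raise ValueError(f"sum of exclusion weights is {total!r}; cannot normalize")
    return [lossarea * w / total for w in excl_weights]

losses = cohort_losses(10.0, [0.72, 0.36])
print(losses)
```

Calling `cohort_losses(10.0, [0.0, 0.0])` raises immediately at the point where the sum is found to be zero, rather than letting a 0/0 result flow into later steps.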

jkshuman · 2017-08-30T20:50:56Z

I am not sure if it stepped through that section. There were a lot of points where there was no canopy demotion, and maybe I was stepping too quickly. After this meeting I will set up new breakpoints to trigger at that section under those conditions.

Your idea about being forced to integer sounds reasonable.

jkshuman · 2017-08-30T20:51:24Z

yes - sum weights is also zero.

jkshuman · 2017-08-30T20:53:29Z

line 236: sum_weights(i) = sum_weights(i) + currentCohort%excl_weight

propagation of problem with excl_weight

rosiealice · 2017-08-30T20:53:56Z

but excl_weight is no longer nan in this run? Or do we not know yet because it exited with the endrun... And yes, NCAR-wide meeting-to-disseminate-as-yet-unknown-but-definitely-bad-news in 10 mins :/

rgknox · 2017-09-11T18:23:59Z

@rosiealice : I noticed that when we are promoting and demoting, we are trying to get a target layer area that almost exactly matches patch area. Is there any reason we can't target an area that is like 95-99% of the ground area? This is more aligned with imperfect plasticity, and would prevent this error.

rosiealice · 2017-09-11T18:55:40Z

We could try that, yes. My worry is that we'd need to add a new flux into surfacealbedo to let the light through the non-filled areas of the canopy, but it -might- work anyway. I also thought that one might recalculate 'z' with z = NumPotentialCanopyLayers(currentPatch,include_substory=.false.) before going into this check?

rgknox · 2017-09-11T19:34:12Z

@rosiealice : that certainly wouldn't hurt. Although it is calculated at line 768, and no promotion/demotions are called after that.

jkshuman · 2017-09-11T20:23:56Z

I put in a print statement just before the if(((arealayer(i_lyr)-currentPatch%area)) > 0.0001) which gives the below values. Another print statement just before the endrun in this section does not print. Also the messages inside this if statement do not print.

areacheck_debug_1: layer: 1 ,z: 2 ,area layer:
137: 3654.07888223682 ,patch area: 3654.07888223682 ,diff:
137: 0.000000000000000E+000
137: areacheck_debug_1: layer: 2 ,z: 2 ,area layer:
137: 1699.19457388288 ,patch area: 3654.07888223682 ,diff:
137: -1954.88430835394
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 2916.76937228685 ,patch area: 3012.68079465575 ,diff:
137: -95.9114223689003
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 269.236258578231 ,patch area: 291.452826085509 ,diff:
137: -22.2165675072779
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 1.70952507795106 ,patch area: 1.84609385135598 ,diff:
137: -0.136568773404919
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 39.3409980661348 ,patch area: 262.523928568438 ,diff:
137: -223.182930502304
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 0.217118701101852 ,patch area: 1.67866830300434 ,diff:
137: -1.46154960190249
124: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
124: nstep = 2827729
124: errsol = -1.421994966221973E-007
121: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
121: nstep = 2827729
121: errsol = -1.363614501315169E-007
118: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
118: nstep = 2827729
118: errsol = -1.155913764705474E-007

rosiealice · 2017-09-11T20:32:51Z

Hmm. Maybe some multi-processor weirdness going on there (not printing the endrun write statement), since it did seem to clearly crash in that place (assuming that's still the case)? Can you copy & paste the write statements/section of code in here, just so we're all on the same page?

jkshuman · 2017-09-11T20:33:48Z

   ! ----------- Final Check On Layer Area ------------
   do i_lyr = 1,z

      call CanopyLayerArea(currentPatch,i_lyr,arealayer(i_lyr))

         write(fates_log(),*) 'areacheck_debug_1:',' layer:', i_lyr,',z: ',z,',area layer:',arealayer(i_lyr),&
               ',patch area:',currentPatch%area,',diff:',arealayer(i_lyr)-currentPatch%area !JKS
      
      if(((arealayer(i_lyr)-currentPatch%area)) > 0.0001)then
         write(fates_log(),*) 'problem with canopy area', arealayer(i_lyr), currentPatch%area, &
               arealayer(i_lyr) - currentPatch%area,missing_area  
         write(fates_log(),*) 'lat:',currentpatch%siteptr%lat
         write(fates_log(),*) 'lon:',currentpatch%siteptr%lon
         write(fates_log(),*) 'i_lyr: ',i_lyr,' of z: ',z
         currentCohort => currentPatch%tallest
         do while (associated(currentCohort))
            if(currentCohort%canopy_layer == i_lyr)then
               write(fates_log(),*) ' c_area: ', &
                     c_area(currentCohort),' dbh: ',currentCohort%dbh,' n: ',currentCohort%n
            endif
            currentCohort => currentCohort%shorter
         enddo

         write(fates_log(),*) 'areacheck_debug_2',' layer: ', i_lyr, 'z ',z, 'area layer ',arealayer(i_lyr),&
               'patch area ',currentPatch%area, 'diff ',arealayer(i_lyr)-currentPatch%area !JKS


         call endrun(msg=errMsg(sourcefile, __LINE__))
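      endif

      if ( i_lyr > 1) then
         if ( (arealayer(i_lyr) - arealayer(i_lyr-1) )>1e-11 ) then
            write(fates_log(),*) 'smaller top layer than bottom layer ',arealayer(i_lyr),arealayer(i_lyr-1), &
                  currentPatch%area,currentPatch%spread(i_lyr-1:i_lyr)
         endif
      endif
   enddo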
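
To make the pass/fail condition of this check concrete, here is a small Python sketch of the same comparison applied to area values copied from log lines posted in this thread. The function name `layer_overfilled` is made up for the example; only the 0.0001 threshold comes from the Fortran above.

```python
AREA_TOLERANCE = 0.0001  # same threshold as the Fortran if statement

def layer_overfilled(area_layer, patch_area, tol=AREA_TOLERANCE):
    """True when the layer area exceeds the patch area by more than tol."""
    return (area_layer - patch_area) > tol

# (area layer, patch area) pairs copied from the areacheck_debug_1 output
samples = [
    (3654.07888223682, 3654.07888223682),  # diff 0.0
    (1699.19457388288, 3654.07888223682),  # underfilled layer, negative diff
    (4196.68289866283, 4196.50281576499),  # diff of about 0.18
]

results = [layer_overfilled(layer, patch) for layer, patch in samples]
print(results)
```

Only the last pair trips the check. Underfilled layers pass because the comparison is one-sided.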

rosiealice · 2017-09-11T20:36:41Z

Is there something one can do with the system flush idea that might deconvolve the write statements? I don't really understand how that works yet...

rosiealice · 2017-09-11T20:38:23Z

It's also odd that it doesn't write out those lat, lon statements, etc. before crashing, isn't it?

jkshuman · 2017-09-11T20:38:41Z

yep. will add in the sys_flush.

rosiealice · 2017-09-11T20:57:00Z

@rgknox , in the vanilla 'endrun' command, is there a way of sending it an error message that includes any of the properties of these variables? (z, i_lyr, etc.?)

The other thing we could try, backing up, is adding another of these
z = NumPotentialCanopyLayers(currentPatch,include_substory=.false.)

before going into that if statement?

jkshuman · 2017-09-11T21:09:18Z

@rgknox provided me with a helpful grep to find his error text.
cat filename | grep -i 'problem with canopy area'
They were not where I expected them to be, but it does indeed fail as he told it to. Here it is from the cesm log. Also, @rgknox is on the case and has ideas of how this might get resolved. He is implementing a potential fix.

45: read: fire_fuel_sav_acc
555: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
555: 3224.23987314057 ,patch area: 5717.64422608806 ,diff:
555: -2493.40435294749
380: areacheck_debug_1: layer: 1 ,z: 2 ,area layer:
380: 4196.68289866283 ,patch area: 4196.50281576499 ,diff:
380: 0.180082897834836
380: problem with canopy area 4196.68289866283 4196.50281576499
380: 0.180082897834836 0.000000000000000E+000
380: lat: 34.0000000000000
380: lon: 75.0000000000000
380: i_lyr: 1 of z: 2
380: c_area: 26.0983212075636 dbh: 153.469697348304 n:
380: 0.114454664808694
380: c_area: 34.7325966558851 dbh: 140.460213804226 n:
380: 0.165904242880867
380: c_area: 21.3893743582923 dbh: 134.518542440786 n:
380: 0.108073905367945
380: c_area: 15.9736830259743 dbh: 120.824477977309 n:
380: 9.279898353429825E-002
380: c_area: 17.9118968554050 dbh: 119.942334478646 n:
380: 0.105055040797760
380: c_area: 15.2433610824897 dbh: 105.657949334449 n:
380: 0.105425936919977
380: c_area: 27.7460281719833 dbh: 104.525774959189 n:
380: 0.194603196956469
190: areacheck_debug_1: layer: 1 ,z: 2 ,area layer:
190: 120.110504231918 ,patch area: 120.110504231918 ,diff:
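
Since output from different tasks is interleaved in the cesm log, one way to read a single task's messages in order is to group lines by their numeric prefix. The Python sketch below assumes the leading "NNN:" is the MPI task number, as it appears to be in the excerpt above; `group_by_rank` is a hypothetical helper, not part of CESM or FATES.

```python
import re
from collections import defaultdict

# Leading "NNN:" prefix (possibly negative, as in "-1:MPT ERROR ...")
RANK_PREFIX = re.compile(r"^\s*(-?\d+):\s?(.*)$")

def group_by_rank(lines):
    """Return {rank: [message, ...]} keeping the original order per rank."""
    by_rank = defaultdict(list)
    for line in lines:
        match = RANK_PREFIX.match(line)
        if match:
            by_rank[int(match.group(1))].append(match.group(2))
    return by_rank

sample = [
    "555: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:",
    "380: areacheck_debug_1: layer: 1 ,z: 2 ,area layer:",
    "380: problem with canopy area 4196.68289866283 4196.50281576499",
    "555: -2493.40435294749",
]

grouped = group_by_rank(sample)
for message in grouped[380]:
    print(message)
```

Lines without a numeric prefix are skipped by this sketch.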

rgknox · 2017-09-11T21:12:11Z

I think the problem is this: Demotion is the first step in a sequence of layer area changing events. The demotion step checks on every layer to make sure that it has not exceeded patch area; and I think it is doing it correctly. But after this, we do two cohort fusions and a promotion step. I think that the cohort fusions are perturbing the canopy area just enough so that it fails that last check.

I think a first-step solution is to encapsulate the demotion, fusion, promotion and last fusion all within a while loop that does an area check. I'm currently working on this.

If that does not solve the problem, the final solution is to force crown area conservation into cohort fusion.
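
The control flow of the first-step solution could be sketched as follows. This is a toy Python illustration of the loop structure only, not the code in the actual fix: the patch dictionary, the four step functions, the shrinking "fusion_error" and the iteration cap are all invented for the example.

```python
def rebalance_layers(patch, demote, fuse, promote, area_ok, max_iter=10):
    """Repeat demotion, fusion, promotion, fusion until the area check passes."""
    for iteration in range(1, max_iter + 1):
        demote(patch)
        fuse(patch)
        promote(patch)
        fuse(patch)
        if area_ok(patch):
            return iteration
    raise RuntimeError("layer area did not settle within max_iter passes")

# Toy stand-ins: fusion nudges the layer area by an error that shrinks each call.
patch = {"area": 100.0, "layer_area": 103.0, "fusion_error": 0.5}

def demote(p):
    p["layer_area"] = min(p["layer_area"], p["area"])  # trim layer to patch area

def fuse(p):
    p["layer_area"] += p["fusion_error"]               # fusion perturbs the area
    p["fusion_error"] *= 0.01

def promote(p):
    pass                                               # no-op in this toy

def area_ok(p):
    return (p["layer_area"] - p["area"]) <= 0.0001

passes = rebalance_layers(patch, demote, fuse, promote, area_ok)
print(passes)
```

In this toy the first pass fails the final check because fusion runs after demotion, and a second pass brings the layer back within tolerance. That is the situation the while loop is meant to handle.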

rosiealice · 2017-09-11T21:43:01Z

Do you mean cohort fusions?

rgknox · 2017-09-11T21:44:57Z

yup, cohort fusions, sorry for the confusion!

rgknox · 2017-09-12T00:31:46Z

@jkshuman: I would like to check my fixes to see if they reproduce this error. What namelist and parameter values were used for the above crash? e.g.: strict PPA? which parameter file?

jkshuman · 2017-09-12T15:54:05Z

Strict PPA with the default 2 tropical tree parameter file you created. Fire is active.

rgknox · 2017-09-12T17:52:18Z

I think I have a branch that should hopefully address the new problem.

https://github.com/rgknox/fates/tree/rgknox-layering-sequence

I did a 50 year science regression test on it against master and the two showed indistinguishable results for the 1x1 brazil test case. @jkshuman , when you have time, could you test this branch? I'm curious if you would be able to restart your current point of failure with the new branch.

jkshuman · 2017-09-12T21:01:00Z

Looks good! Running along from restart at 162-05-27 and into month 09! Will let you know if it continues to my requested end date...

jkshuman · 2017-09-13T16:24:48Z

@rgknox simulation still going, but up to year 272. I would call that success! I am going to kill the run at 300 years. I am reading through the reorganization of EDCanopyStructureMod, and it is easier to follow things with your updates. Thank you for your help on finding and fixing this!

rgknox · 2017-09-13T18:33:25Z

Great news. I'm going to push these changes into the existing pull request. If your run finishes its 300 years without problem, I will kick off another round of regression tests on the pull request.
Could you point me towards the output directory again? I might run the acre diagnostic package on it to make sure there are no artifacts or shifts in the time series around the time we swap in the new branch.

jkshuman · 2017-09-13T19:09:43Z

will keep you posted. It is at year 289 at the moment.
output directory for archive:
(years 0-162)
/glade2/scratch2/jkshuman/archive/DebugStrictPPA_0908_Fire_clmED_4x5_Default2PFT_GSWP3_BGC/lnd/hist
(years 162-302)
/glade2/scratch2/jkshuman/archive/DebugStrictPPA_0912_Fire_clmED_4x5_Default2PFT_GSWP3_BGC

jkshuman · 2017-09-13T22:11:58Z

@rgknox I am killing the simulation with files in archive up to year 302. I give you the pleasure of closing this issue, and declaring yourself king of the lab!

rgknox · 2017-09-13T22:42:11Z

Even Steven will be watching me, if I celebrate too much, he will create a new bug for me sooner than later.

rgknox · 2017-09-13T23:13:33Z

The biomass projections of three tropical sites in your run look un-suspicious (note my units on biomass should be MgC/ha, not kgC/ha). Carbon fluxes are per square meter. For reference, Chambers et al. 2004 found annual mean NPP of 0.9 kgC/m2/year, and HR of 0.85 kgC/m2/year at zf2. The inventory there has roughly 300 MgC/ha.
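
Because the biomass here is per hectare while the fluxes are per square meter, a small conversion helps when comparing the two. The Python sketch below only restates the unit relations (1 Mg = 1000 kg, 1 ha = 10,000 m2); the function name is made up for the example.

```python
KG_PER_MG = 1000.0     # 1 Mg (tonne) = 1000 kg
M2_PER_HA = 10000.0    # 1 hectare = 10,000 m2

def mgc_per_ha_to_kgc_per_m2(value):
    """Convert a carbon stock from MgC/ha to kgC/m2."""
    return value * KG_PER_MG / M2_PER_HA

inventory_kgc_m2 = mgc_per_ha_to_kgc_per_m2(300.0)
print(inventory_kgc_m2)
```

So the roughly 300 MgC/ha inventory corresponds to about 30 kgC/m2, the same area basis as the NPP and HR figures.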

ckoven · 2017-09-13T23:34:21Z

@rgknox asked me to make a map of biomass from jackie's run as a test of non-craziness, see attached.

rgknox · 2017-09-13T23:42:05Z

Thanks @ckoven , that was also partially to test my testing scripts. Can't expect too much with only one tropical pft on a global run, but I think this helps sanity check the bug fix.

rgknox · 2017-09-13T23:44:16Z

Although I do wonder why Virginia is more abundant than the Amazon, and why Florida has so little abundance... Maybe soils?

jkshuman · 2017-09-14T18:26:04Z

Thank you @ckoven and @rgknox . The totecosysC was very stable for the last 150 years of simulation. I stared at that for a while yesterday. There was variation in burn area and lai. I will remember to post that next time. I look forward to the pull request with these updates!

rgknox mentioned this issue Sep 13, 2017

bugfix on understory demotion #271

Merged

rgknox closed this as completed Oct 3, 2017
