Hanging case #267
Comments
@ekluzek @rgknox @ckoven @rosiealice |
@jkshuman Do you get a PIO write error? I am still getting those in some runs after some length of time due to NaNs, which I currently assume are related to NEP. That is, I sometimes have a run that suddenly crashes due to NaNs being written to the netCDF output. |
@serbinsh - Thank you, but I haven't seen a PIO error. |
For example
|
@jkshuman sorry, I posted that right after you replied....seems like you are having a different issue. |
@serbinsh with termination (in Debug) I get this: 194: end run_mct |
When it's not in Debug, the run will terminate only due to time out without any errors of note. |
On Tuesday I updated the code to write out when variables in canopy demotion are NaN, and added a temp variable for part of an expression (dbh_comp_excln = currentCohort%dbh**ED_val_comp_excln). After the rebuild the code is slightly altered, and in the interactive debugger it stepped further along than the previous version did. In this version the temp variable "dbh_comp_excln" also seems to hold the value from a previous cohort: there was at least one case where it was zero although the debugger values for currentCohort%dbh said it should be nonzero. Through all of this currentCohort%excl_weight was always NaN, where currentCohort%excl_weight = 1.0_r8/(currentCohort%dbh**ED_val_comp_excln). At the suggestion of @ekluzek I have added an "end run" to the write statements to abort on the NaN. Next I will try Charlie's strict PPA by modifying the param file to have a negative value for ED_val_comp_excln. |
Hmmm....that error message is a bit strange...it does seem like the run hung and then was killed by a scheduler? |
restart file: |
@serbinsh the error Jackie showed is because it dies due to a floating point trap error, and that shuts down one of the MPI tasks without having called a MPI_finalize to shut down the MPI job cleanly. In the cases where it hangs it doesn't report an error, but eventually reaches the queue wall clock limit. It might then report a different error, but it takes however long the wall clock limit was set on the job. |
After adding the "end run" if NaN, I rebuilt and am running in the interactive debugger from the daily restart file. Things progressed, apparently without canopy demotion, for quite a while, but then the run terminated due to a floating point error. Error text from Allinea DDT below: |
Hi @jkshuman Can you copy and paste the code around line 247 of EDCanopyStructureMod.F90 for your version of the code. Sorry if I lost the thread, but that line is a do loop header on the master branch. |
What's on line 247 in your EDCanopyStructureMod.F90 file? |
Sorry, cross posted w Ryan. |
sorry, cross posted w Rosie |
ok, that last post was just to be cheeky |
i was wondering the same thing |
line 247: weight = currentCohort%excl_weight/sum_weights(i) why is that currentCohort%excl_weight zero?? |
currentCohort => currentPatch%tallest
|
|
My first thought: when we are comparing ED_val_comp_excln .ge. 0, we are comparing a real number to an integer. Maybe change that hard-coded 0 to 0.0_r8. Without consulting the literature, it's possible that ED_val_comp_excln is being forced to an integer, which would make it zero and ... who knows. Are you sure that the run last stepped through the code that is bound by the if(ED_val_comp_excln.ge.0)? |
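A quick standalone check of the integer-versus-real idea, as a minimal sketch rather than anything taken from FATES. In a mixed comparison the Fortran standard converts the integer operand to real, so the real value itself is not truncated; writing 0.0_r8 is still clearer. The program name and the variable comp_excln are stand-ins for illustration only.

```fortran
program mixed_compare
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: comp_excln   ! hypothetical stand-in for ED_val_comp_excln

  comp_excln = 0.1_r8      ! the value reported from the parameter file

  ! The integer literal 0 is converted to real for the comparison;
  ! comp_excln itself is never truncated.
  print *, 'comp_excln .ge. 0      :', comp_excln .ge. 0
  print *, 'comp_excln .ge. 0.0_r8 :', comp_excln .ge. 0.0_r8

  ! Truncation only happens on an explicit conversion or on
  ! assignment to an integer variable.
  print *, 'int(comp_excln)        :', int(comp_excln)

  ! Self-check: the two forms of the comparison must agree.
  if ((comp_excln .ge. 0) .neqv. (comp_excln .ge. 0.0_r8)) then
     error stop 'integer and real comparisons disagree'
  end if
end program mixed_compare
```

If both comparisons agree here, the hard-coded 0 is probably not what zeroes the weight, and the breakpoints inside the if block are the better lead.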
Can you attach the source file to the thread, so we can check out the line #'s? |
I think it was a 0.1 in the parameter file (and in the DDT output) that we were looking at. |
'it' being ED_val_comp_excln |
Also,
|
I am not sure if it stepped through that section. There were a lot of points where there was no canopy demotion, and maybe I was stepping too quickly. After this meeting I will set up new breakpoints to trigger at that section under those conditions. Your idea about being forced to integer sounds reasonable. |
yes - sum weights is also zero. |
line 236: sum_weights(i) = sum_weights(i) + currentCohort%excl_weight propagation of problem with excl_weight |
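To make the propagation concrete, here is a small self-contained sketch of the same accumulate-then-normalise pattern with guards added. It is not the FATES code: the cohort diameters are made up, and setting the weight of a bad cohort to zero is only one possible policy, chosen here to keep the toy running.

```fortran
program guarded_weights
  use, intrinsic :: ieee_arithmetic, only: ieee_is_finite
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: comp_excln = 0.1_r8   ! value quoted in the thread
  real(r8) :: dbh(3), excl_weight(3), sum_weights, weight
  integer :: ic

  ! Hypothetical cohort diameters; the zero mimics a bad cohort.
  dbh = [12.0_r8, 30.0_r8, 0.0_r8]

  ! Accumulate the exclusion weights, refusing to divide by a bad dbh.
  sum_weights = 0.0_r8
  do ic = 1, size(dbh)
     if (dbh(ic) > 0.0_r8) then
        excl_weight(ic) = 1.0_r8 / (dbh(ic)**comp_excln)
     else
        write(*,*) 'bad dbh for cohort', ic, dbh(ic), ' weight set to zero'
        excl_weight(ic) = 0.0_r8
     end if
     sum_weights = sum_weights + excl_weight(ic)
  end do

  ! A zero or non-finite sum would turn every normalised weight into NaN or Inf.
  if (.not. ieee_is_finite(sum_weights) .or. sum_weights <= 0.0_r8) then
     error stop 'sum_weights is zero or not finite; refusing to normalise'
  end if

  do ic = 1, size(dbh)
     weight = excl_weight(ic) / sum_weights
     write(*,*) 'cohort', ic, ' weight', weight
  end do
end program guarded_weights
```

The point of the guard before the second loop is that a single bad excl_weight otherwise contaminates sum_weights and then every cohort's weight, which matches what the debugger showed at lines 236 and 247.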
but excl_weight is no longer nan in this run? Or do we not know yet because it exited with the endrun... And yes, NCAR-wide meeting-to-disseminate-as-yet-unknown-but-definitely-bad-news in 10 mins :/ |
@rosiealice : I noticed that when we are promoting and demoting, we are trying to get a target layer area that almost exactly matches patch area. Is there any reason we can't target an area that is like 95-99% of the ground area? This is more aligned with imperfect plasticity, and would prevent this error. |
We could try that, yes. My worry is that we'd need to add a new flux into
surfacealbedo to let the light through the non-filled areas of the canopy,
but it -might- work anyway.
I also thought that one might recalculate 'z' with
z = NumPotentialCanopyLayers(currentPatch,include_substory=.false.)
before going into this check?
On 11 September 2017 at 12:30, Ryan Knox wrote:
@rosiealice: I noticed that when we are promoting and demoting, we are trying to get a target layer area that almost exactly matches patch area. Is there any reason we can't target an area that is like 95-99% of the ground area? This is more aligned with imperfect plasticity, and would prevent this error.
--
-----------------------------------------------------------------
Dr Rosie A. Fisher
Staff Scientist
Terrestrial Sciences Section
Climate and Global Dynamics
National Center for Atmospheric Research
1850 Table Mesa Drive
Boulder, Colorado, 80305
USA.
+1 303-497-1706
http://www.cgd.ucar.edu/staff/rfisher/
|
@rosiealice : that certainly wouldn't hurt. Although it is calculated at line 768, and no promotion/demotions are called after that. |
I put in a print statement just before the if(((arealayer(i_lyr)-currentPatch%area)) > 0.0001) which gives the below values. Another print statement just before the endrun in this section does not print. Also the messages inside this if statement do not print. areacheck_debug_1: layer: 1 ,z: 2 ,area layer: |
Hmm. Maybe some multi-processor weirdness going on there (not printing the
endrun write statement), since it did seem to clearly crash in that place?
(Assuming that's still the case.) Can you copy & paste the write
statements/section of code in here, just so we're all on the same page?
On 11 September 2017 at 14:24, jkshuman wrote:
I put in a print statement just before the if(((arealayer(i_lyr)-currentPatch%area)) > 0.0001) which gives the below values. Another print statement just before the endrun in this section does not print. Also the messages inside this if statement do not print.
areacheck_debug_1: layer: 1 ,z: 2 ,area layer:
137: 3654.07888223682 ,patch area: 3654.07888223682 ,diff:
137: 0.000000000000000E+000
137: areacheck_debug_1: layer: 2 ,z: 2 ,area layer:
137: 1699.19457388288 ,patch area: 3654.07888223682 ,diff:
137: -1954.88430835394
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 2916.76937228685 ,patch area: 3012.68079465575 ,diff:
137: -95.9114223689003
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 269.236258578231 ,patch area: 291.452826085509 ,diff:
137: -22.2165675072779
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 1.70952507795106 ,patch area: 1.84609385135598 ,diff:
137: -0.136568773404919
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 39.3409980661348 ,patch area: 262.523928568438 ,diff:
137: -223.182930502304
137: areacheck_debug_1: layer: 1 ,z: 1 ,area layer:
137: 0.217118701101852 ,patch area: 1.67866830300434 ,diff:
137: -1.46154960190249
124: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
124: nstep = 2827729
124: errsol = -1.421994966221973E-007
121: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
121: nstep = 2827729
121: errsol = -1.363614501315169E-007
118: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
118: nstep = 2827729
118: errsol = -1.155913764705474E-007
|
|
Is there something one can do with the system flush idea that might
deconvolve the write statements? I don't really understand how that works
yet...
On 11 September 2017 at 14:33, jkshuman wrote:
   call CanopyLayerArea(currentPatch,i_lyr,arealayer(i_lyr))
   write(fates_log(),*) 'areacheck_debug_1:',' layer:', i_lyr,',z: ',z,',area layer:',arealayer(i_lyr),&
        ',patch area:',currentPatch%area,',diff:',arealayer(i_lyr)-currentPatch%area !JKS
   if(((arealayer(i_lyr)-currentPatch%area)) > 0.0001)then
      write(fates_log(),*) 'problem with canopy area', arealayer(i_lyr), currentPatch%area, &
           arealayer(i_lyr) - currentPatch%area,missing_area
      write(fates_log(),*) 'lat:',currentpatch%siteptr%lat
      write(fates_log(),*) 'lon:',currentpatch%siteptr%lon
      write(fates_log(),*) 'i_lyr: ',i_lyr,' of z: ',z
      currentCohort => currentPatch%tallest
      do while (associated(currentCohort))
         if(currentCohort%canopy_layer == i_lyr)then
            write(fates_log(),*) ' c_area: ', &
                 c_area(currentCohort),' dbh: ',currentCohort%dbh,' n: ',currentCohort%n
         endif
         currentCohort => currentCohort%shorter
      enddo
      write(fates_log(),*) 'areacheck_debug_2',' layer: ', i_lyr, 'z ',z, 'area layer ',arealayer(i_lyr),&
           'patch area ',currentPatch%area, 'diff ',arealayer(i_lyr)-currentPatch%area !JKS
      call endrun(msg=errMsg(sourcefile, __LINE__))
   endif
   if ( i_lyr > 1) then
      if ( (arealayer(i_lyr) - arealayer(i_lyr-1) )>1e-11 ) then
         write(fates_log(),*) 'smaller top layer than bottom layer ',arealayer(i_lyr),arealayer(i_lyr-1), &
              currentPatch%area,currentPatch%spread(i_lyr-1:i_lyr)
      endif
   endif
enddo !
|
It's also odd that it doesn't write out those lat, lon statements, etc. before crashing, isn't it? |
yep. will add in the sys_flush. |
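On the flush question, a minimal sketch of the idea using only the standard Fortran flush statement: buffered output can be lost when a task aborts, so the diagnostics are flushed before stopping. This is a standalone toy, not the model's logging code; the areas are made-up numbers, and in the model the host wrapper (the "sys_flush" mentioned above) and fates_log() would take the place of flush and output_unit.

```fortran
program flush_before_abort
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: area_layer, patch_area

  ! Made-up values chosen so that the check below fails.
  area_layer = 3654.08_r8
  patch_area = 3654.07_r8

  if (area_layer - patch_area > 0.0001_r8) then
     write(output_unit,*) 'problem with canopy area', area_layer, patch_area, &
          area_layer - patch_area
     ! Push the buffered diagnostics out before the task dies.
     flush(output_unit)
     error stop 'canopy area exceeds patch area'
  end if
end program flush_before_abort
```

This toy aborts on purpose. If the messages still do not appear with a flush in place, that would suggest the crash happens before the write statements are reached rather than the output being lost in a buffer.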
@rgknox , in the vanilla 'endrun' command, is there a way of sending it an error message that includes any of the properties of these variables? (z, i_lyr, etc.?) The other thing we could try, backing up, is adding another of these before going into that if statement? |
@rgknox provided me with a helpful grep to find his error text. 45: read: fire_fuel_sav_acc |
I think the problem is this: Demotion is the first step in a sequence of layer area changing events. The demotion step checks on every layer to make sure that it has not exceeded patch area; and I think it is doing it correctly. But after this, we do two cohort fusions and a promotion step. I think that the cohort fusions are perturbing the canopy area just enough so that it fails that last check. I think a first step solution, is to encapsulate the demotion, fusion, promotion and last fusion all within a while loop, that is doing an area check. I'm currently working on this. If that does not solve the problem, the final solution is to force crown area conservation into cohort fusion. |
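The proposed control flow can be sketched roughly as follows. This is a toy with scalar areas and stub routines, not the code in the branch: demote, promote and fuse are hypothetical stand-ins, the numbers are made up, and the assumption that the fusion error shrinks on each pass is there only so that the toy converges. The iteration cap is also an assumption, added so that a non-converging case aborts instead of hanging.

```fortran
program layering_sequence_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: area_tol = 1.0e-4_r8   ! same tolerance as the area check quoted earlier
  integer,  parameter :: max_iter = 20          ! hypothetical safety cap
  real(r8) :: patch_area, layer_area, fusion_error
  integer :: iter

  patch_area   = 100.0_r8   ! made-up numbers
  layer_area   = 100.5_r8   ! start with an over-full layer
  fusion_error = 0.01_r8    ! toy area error introduced by a fusion

  ! Repeat demotion, fusion, promotion, fusion until the area check passes.
  do iter = 1, max_iter
     call demote(layer_area, patch_area)
     call fuse(layer_area, fusion_error)
     call promote(layer_area, patch_area)
     call fuse(layer_area, fusion_error)
     if (layer_area - patch_area <= area_tol) exit
  end do

  ! A do loop that runs to completion leaves iter at max_iter + 1.
  if (iter > max_iter) error stop 'layer area did not converge'
  write(*,*) 'converged after', iter, 'passes; excess area =', layer_area - patch_area

contains

  subroutine demote(area, limit)
    ! Toy demotion: trim the layer back to the patch area.
    real(r8), intent(inout) :: area
    real(r8), intent(in)    :: limit
    area = min(area, limit)
  end subroutine demote

  subroutine promote(area, limit)
    ! Toy promotion: fill an under-full layer up to the patch area.
    real(r8), intent(inout) :: area
    real(r8), intent(in)    :: limit
    if (area < limit) area = limit
  end subroutine promote

  subroutine fuse(area, err)
    ! Toy fusion: perturbs the area slightly; the error shrinks tenfold per call.
    real(r8), intent(inout) :: area, err
    area = area + err
    err  = 0.1_r8 * err
  end subroutine fuse

end program layering_sequence_sketch
```

If the real fusion error does not shrink between passes, a loop like this would hit the cap every time, which is where forcing crown area conservation into cohort fusion would come in.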
Do you mean cohort fusions? |
yup, cohort fusions, sorry for the confusion! |
@jkshuman: I would like to check my fixes to see if they reproduce this error. What namelist and parameter values were used for the above crash? e.g.: strict PPA? which parameter file? |
Strict ppa with default file 2 tropical tree file you created. Fire is
active.
|
I think I have a branch that should hopefully address the new problem. https://github.com/rgknox/fates/tree/rgknox-layering-sequence I did a 50 year science regression test on it against master and the two showed indistinguishable results for the 1x1 brazil test case. @jkshuman , when you have time, could you test this branch? I'm curious if you would be able to restart your current point of failure with the new branch. |
Looks good! Running along from restart at 162-05-27 and into month 09! Will let you know if it continues to my requested end date... |
@rgknox simulation still going, and up to year 272. I would call that success! I am going to kill the run at 300 years. I am reading through the reorganization of EDCanopyStructureMod, and it is easier to follow things with your updates. Thank you for your help on finding and fixing this! |
Great news. I'm going to push these changes into the existing pull request. If your run finishes its 300 years without problem, I will kick off another round of regression tests on the pull request. |
will keep you posted. It is at year 289 at the moment. |
@rgknox I am killing the simulation with files in archive up to year 302. I give you the pleasure of closing this issue, and declaring yourself king of the lab! |
Even Steven will be watching me, if I celebrate too much, he will create a new bug for me sooner than later. |
@rgknox asked me to make a map of biomass from jackie's run as a test of non-craziness, see attached. |
Thanks @ckoven , that was also partially to test my testing scripts. Can't expect too much with only one tropical pft on a global run, but I think this helps sanity check the bug fix. |
Although I do wonder why Virginia is more abundant than the Amazon, and why Florida has so little abundance... Maybe soils? |
This is related to Issue #250, but uses the most up-to-date fates-clm and fates code (version 0115fbc). Behavior is the same as in Issue #250: the model runs, then hangs, and will not restart. In previous runs related to Issue #250 this happened with fire active as well as without fire, and with 6 PFTs as well as with 1 PFT. The time of the hang seems to be random.
I am using the recent default parameter file (fates_params_2troppftclones.c170810.nc) without modifications.
Case details:
./create_newcase -case /glade/p/work/jkshuman/FATES_cases/Debug/Debug0825_Fire_clmED_4x5_Default2PFT_GSWP3_BGC -res f45_f45 -compset 2000_DATM%QIA_CLM45%ED_SICE_SOCN_RTM_SGLC_SWAV
./xmlchange STOP_OPTION=nyears
./xmlchange DATM_MODE=CLMGSWP3
./xmlchange DATM_CLMNCEP_YR_ALIGN=1985
./xmlchange DATM_CLMNCEP_YR_START=1985
./xmlchange DATM_CLMNCEP_YR_END=2004
(Cheyenne PE layout from Erik)
./xmlchange NTASKS_ATM=-1
./xmlchange NTASKS_CPL=-15
./xmlchange NTASKS_GLC=-15
./xmlchange NTASKS_OCN=-15
./xmlchange NTASKS_WAV=-15
./xmlchange NTASKS_ICE=-15
./xmlchange NTASKS_LND=-15
./xmlchange NTASKS_ROF=-15
./xmlchange NTASKS_ESP=-15
./xmlchange ROOTPE_ATM=0
./xmlchange ROOTPE_CPL=-1
./xmlchange ROOTPE_GLC=-1
./xmlchange ROOTPE_OCN=-1
./xmlchange ROOTPE_WAV=-1
./xmlchange ROOTPE_ICE=-1
./xmlchange ROOTPE_LND=-1
./xmlchange ROOTPE_ROF=-1
./xmlchange ROOTPE_ESP=-1
user_nl_clm
use_ed=.true.
use_fates_spitfire=.true.