
failure in ERS 3 month nocomp test #1051

Closed · mvertens opened this issue Jul 11, 2023 · 55 comments · Fixed by #1098

Comments
@mvertens

I am trying to create longer runs using the latest CTSM in NorESM configurations (with the latest CMEPS, CDEPS, etc). The following 2 tests fail restarts:
ERS_Lm3.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire

ERS_Ld90.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire

However what is interesting is that this test passes restart:
ERS_Ld90.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire.20230707_114133_1m5tec/

Since the restart test fails, we cannot really start any longer runs until we resolve this problem.

@mvertens
Author

@rosiealice @glemieux - I was not sure where to post this. The code is very close to ctsm5.1.dev129.

@rosiealice
Contributor

I guess a higher-level question is whether we should push this NOCOMP compset back into CTSM to streamline the testing, given that nocomp is a pretty core use case of FATES at the moment. Should we discuss this on a FATES SE call one day? (Those are at 9 pm on Mondays, Norway time, Mariana.)

I don't know what the difference is between these two test variants...

@glemieux
Contributor

Hi @mvertens, I'll try replicating this test. As @rosiealice asked, what's the difference between the two ERS_Ld90 tests? Were they using different fates tags?

@mvertens
Author

@glemieux - thanks so much for your quick response to this.
Sorry, I had the wrong test written down.
It should be
ERS_Ld120.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire
that fails, and the following passes:
ERS_Ld90.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire
I also tried ERS_Ld30 and ERS_Ld60 and those passed as well.

@glemieux
Contributor

glemieux commented Jul 11, 2023

I can confirm that I'm seeing similar COMPARE_base_rest failures using the following test case setup:

ERS_Lm3.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel.clm-FatesColdNoComp

Here's what I checked varying the durations:

  • Lm3: fails
  • Ld90: passes
  • Ld120: fails
  • Lm6: fails

Here are other checks I did using Lm3:

  • CRUv7 versus GSWP3v1: both fail
  • FatesCold versus FatesColdNoComp: FatesCold passes
  • ctsm5.1dev126 versus ctsm5.1.dev129: both fail

This is making me wonder if #897 is actually related to this issue.

@mvertens
Author

@glemieux - thanks for confirming this. The one way I have always used to iron out restart problems is to get close to where the problem is and write history files every time step - for an initial run and a restart run. I can try to narrow this down some more. Clearly stopping somewhere between day 90 and day 120 is causing the problem. So say we know that stopping at day 100 and restarting will not work, and let's assume that just looking at the day boundary is good enough. Then one way to debug this is the following, assuming that coupler history files and clm history files are written every time step:

  1. Do a run until day 99 - and write a restart at the end.
  2. Restart from day 99 and run for 50 time steps (assuming half-hour time steps), writing a restart at day 100.
  3. Save the history files.
  4. Now restart again at day 100 and run for 3 time steps. Compare the history files for time steps 48-50 and see where the differences start.

I've usually been successful using this approach - and just used it to find a restart problem in a compset with ocean, ice, wave and datm. Anyway - I'm happy to help as well - and am also happy to clarify things if the explanation above has been confusing.
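For concreteness, here is a rough sketch of how steps 1, 2 and 4 could be set up with CIME xmlchange commands between submissions (these are not the exact commands used here; the STOP/REST values assume half-hour coupling steps, and hist_nhtfrq = 1 in user_nl_clm together with the coupler-history settings gives per-timestep output):

./xmlchange STOP_OPTION=ndays,STOP_N=99,REST_OPTION=ndays,REST_N=99
./xmlchange HIST_OPTION=nsteps,HIST_N=1
./xmlchange CONTINUE_RUN=TRUE,STOP_OPTION=nsteps,STOP_N=50,REST_OPTION=nsteps,REST_N=48
./xmlchange CONTINUE_RUN=TRUE,STOP_OPTION=nsteps,STOP_N=3

The first command covers the run out to day 99, the third continues through the day-100 boundary while writing a restart exactly at day 100, and the last restarts from day 100 for the 3 comparison timesteps.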

@rgknox
Contributor

rgknox commented Jul 13, 2023

@mvertens do you know what grid-cell is showing the diff?

Also, just to confirm, I see you have a FatesCold test that passes with Lm3. Have you found any non-nocomp tests failing, or are the failures only in nocomp configurations?

@mvertens
Author

@rgknox - yes that is correct. The only failures I have found are in the one nocomp configuration. I have only added 4 regression tests at this point to the NorESM CTSM fork - as a way of moving forward to run this with CAM60, CICE and BLOM.
ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.betzy_intel.clm-FatesColdTreeDamage (Overall: PASS) details:
ERS_D_Ld3_PS.f09_g17.I2000Clm50FatesRs.betzy_intel.clm-FatesCold (Overall: PASS) details:
ERS_Lm13.f45_f45_mg37.I2000Clm50Fates.betzy_intel.clm-FatesColdNoComp (Overall: FAIL) details:
FAIL ERS_Lm13.f45_f45_mg37.I2000Clm50Fates.betzy_intel.clm-FatesColdNoComp COMPARE_base_rest
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.betzy_intel.clm-FatesColdSatPhen (Overall: PASS) details:

@glemieux
Contributor

glemieux commented Sep 29, 2023

A bunch of us met today to brainstorm next steps on how to approach this issue. I reported that yesterday I ran a few ERS FatesColdNoComp test mods at different durations and narrowed the restart point down to some time between 52 and 58 days (Ld100 and Ld110, respectively); the problem shows up first in southern latitudes (New Zealand and southern Argentina). The issue presents itself right at the restart start-up time. There are some additional nearby gridcells that start showing issues towards the end of the run, suggesting that this issue might be more widespread in the boreal region but just swamped by the plotting range. The latest test runs are with the latest ctsm and fates tags.

Some additional notes from the meeting:

  • Could we be missing a variable in the restart file? Is the restart using the cold start value that just happens to be extremely small?
  • Given the evidence that this is happening in the southern boreal region, could this have something to do with deciduous phenology?
  • If the restart issue appears using a fully spun-up case, we probably should accelerate addressing "Create a non cold-start FATES test" (ESCOMP/CTSM#1277)

The following actions are going to be taken:

  1. @mvertens will run a restart test with a spin-up dataset from @rosiealice
  2. @rgknox will investigate running with evergreens only

We've agreed to meet in two weeks time to discuss progress.

@mvertens
Author

mvertens commented Oct 3, 2023

I have an update on action 2.

  • I am using ctsm5.1dev130-noresm_v1 from https://github.com/NorESMhub/CTSM
  • I tried to use a spun-up CLM dataset that Rosie provided and the model crashed in FATES. I can provide that error if someone wants it.
  • I decided to simply do a 10-year spin up starting from a cold start using the following test:
    SMS.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire
  • Starting from the restart file at year 10, I did 2 runs: a 4-timestep initial run, and a 2-timestep initial run followed by a 2-timestep restart run. I was able to show that both the restart files at timestep 3 and the history files at timestep 3 (clm2.h0.2000-01-01-05400.nc) differed in a small number of variables.
  • The history files comparison showed the following diffs:
RMS FATES_CBALANCE_ERROR             2.6008E-18            NORMALIZED  2.3165E+00
RMS FATES_DISTURBANCE_RATE_P2P       3.3688E-03            NORMALIZED  2.0000E+00
RMS FATES_DISTURBANCE_RATE_POTENTIAL 3.3688E-03            NORMALIZED  2.0000E+00
RMS FATES_DISTURBANCE_RATE_TREEFALL  3.3688E-03            NORMALIZED  2.0000E+00
RMS FATES_ERROR_EL                   2.6008E-18            NORMALIZED  2.3165E+00
RMS FATES_FRAGMENTATION_SCALER_SL    2.4246E-01            NORMALIZED  2.0358E+00
RMS FATES_PRIMARY_PATCHFUSION_ERR    8.1046E-14            NORMALIZED  2.0000E+00
  • The restart files showed the following diffs:
RMS tempavg_bgnpp                           NaN            NORMALIZED  0.0000E+00
RMS annsum_counter_ch4                      NaN            NORMALIZED  0.0000E+00
RMS tempavg_somhr                           NaN            NORMALIZED  0.0000E+00
RMS tempavg_finrw                           NaN            NORMALIZED  0.0000E+00
RMS fates_fcansno_pa                 1.6970E-03            NORMALIZED  1.7092E-07
RMS fates_solar_zenith_angle_pa      1.5994E-04            NORMALIZED  1.6109E-08
RMS fates_gnd_alb_dif                6.2440E-03            NORMALIZED  1.5925E+00
RMS fates_gnd_alb_dir                6.3061E-03            NORMALIZED  1.6566E+00
RMS fates_errfates_vec_001           9.5475E-14            NORMALIZED  5.6401E+00

I am really perplexed as to why the albedos are different, and I would need some advice on how to proceed; I have not looked at FATES for several years. I am happy to add new variables to the restart files. Would another chat help?
Should we wait for Rosie to come back?
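As a side note for anyone reproducing this: RMS / NORMALIZED lines in this format are what the cprnc netCDF comparison tool in CIME prints, so diffs like the ones above can presumably be regenerated with something along the lines of (paths illustrative):

cprnc run_a/clm2.h0.2000-01-01-05400.nc run_b/clm2.h0.2000-01-01-05400.nc | grep RMS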

@mvdebolskiy

@rgknox can this be due to #428 and ESCOMP/CTSM#548? Do you remember what tests were performed when merging that PR?

@ckoven
Contributor

ckoven commented Oct 3, 2023

@mvertens one question I have is about the time stepping that you did. Much of the FATES-specific processing only occurs once per day, i.e. every 48th timestep. Was the daily timestep one of the two timesteps in your restart run? If not, that says the daily FATES code shouldn't even be triggered in the experiment you did, despite you showing differences in things like the history-reported disturbance rates. If that is the case, it could considerably narrow the set of possible errors here.

@mvertens
Author

mvertens commented Oct 4, 2023

@ckoven - the daily time step was not one of the two timesteps in my restart. I started at the beginning of the day and did 4 timesteps in one run, and 2 + 2 in the initial-plus-restart run. I then compared the two runs at timestep 4.

@rgknox
Contributor

rgknox commented Oct 4, 2023

@mvdebolskiy that could be a good lead. We had a test suite in 2018 similar to what we have now, but it may not have contained tests that were long enough.

@rgknox
Contributor

rgknox commented Oct 4, 2023

@mvdebolskiy, I'm noticing that there is similar code doing the same thing in the call to PatchNorman radiation during the restart and in the normal call sequence. Even if this is not the problem we are tackling in this thread, I think it would be better to have the zeroing and initialization contained in one routine that is called from both locations.

See: https://github.com/NGEET/fates/blob/main/main/FatesRestartInterfaceMod.F90#L3588-L3644
See: https://github.com/NGEET/fates/blob/main/biogeophys/EDSurfaceAlbedoMod.F90#L106-L164

@rgknox
Contributor

rgknox commented Oct 4, 2023

Looks like this line is different in the restart process vs normal timestepping:
https://github.com/NGEET/fates/blob/sci.1.67.4_api.27.0.0/main/FatesRestartInterfaceMod.F90#L3630

@rgknox
Contributor

rgknox commented Oct 4, 2023

I ran a test where I modified the line linked in my previous post to match that of the call in EDSurfaceAlbedoMod, but it did not remove the error. I think we are getting closer though; the diffs that you uncovered in the restart file, @mvertens, do seem to point to something in the radiation restart.

@mvdebolskiy

@rgknox
In ED_Norman_Radiation(), the assignment starts from the first PFT:

if(currentpatch%nocomp_pft_label.ne.nocomp_bareground)then

In update_3dpatch_radiation() there is no such condition. Do I understand correctly that, when reading from the restart, the bareground PFT will get values from update_3dpatch_radiation() in nocomp mode, while in competition mode it will just get the CLM restart values?
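A minimal sketch of the kind of guard being discussed, assuming the restart-side code walks the patch list the same way the run-time code does (the loop structure and pointer names here are illustrative, not the actual update_3dpatch_radiation code):

! Sketch only: mirror the bareground filter used by ED_Norman_Radiation on the
! restart path, so that in nocomp mode the bare-ground patch keeps its CLM
! restart values instead of being overwritten here.
currentPatch => currentSite%oldest_patch
do while (associated(currentPatch))
   if (currentPatch%nocomp_pft_label .ne. nocomp_bareground) then
      ! ... per-patch radiation restart work, as in the existing routine ...
   end if
   currentPatch => currentPatch%younger
end do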

@rosiealice
Contributor

A thought: if this is a radiation code thing, does it fail the test with the new two-stream scheme?

@rgknox
Contributor

rgknox commented Oct 5, 2023

@rosiealice it's worth testing.
@mvdebolskiy I think you found a real bug there.
Update: I applied the filter @mvdebolskiy pointed out to the restart process, but I am still generating differences in an ERS test.

@rgknox
Contributor

rgknox commented Oct 5, 2023

A couple of other things I tried:

  • Running with only evergreens: failure persists
  • Running in debug mode with gnu: failure persists

I ran the latter test because I was concerned the radiation scheme might have been pulling from indices outside the allocation bounds of arrays, and thus mixing in uninitialized data. It was a long shot, but this was not the issue.

@ckoven
Contributor

ckoven commented Oct 5, 2023

@mvertens this is probably a separate issue from the fundamental one(s) responsible for the problems here, but the history fields FATES_DISTURBANCE_RATE_P2P, FATES_DISTURBANCE_RATE_POTENTIAL and FATES_DISTURBANCE_RATE_TREEFALL that are showing up as non-bit-for-bit in your experiment, despite being tied to the daily timestep, are all either removed or reworked as part of #1040. I will put in code to make sure that the underlying variable for these is restarted correctly in that PR. [Edit: I have now put this in via commit 853efcf, which is on the #1040 branch]

@mvdebolskiy

mvdebolskiy commented Oct 10, 2023

Small update: I've expanded @mvertens's investigation.
I've run 2 runs from a restart file that was produced by a one-year run from cold start with compset 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV and no modifications.
The two runs have the following xml changes:

./xmlchange STOP_N=2,STOP_OPTION=ndays
./xmlchange REST_OPTION=nsteps,REST_N=2
./xmlchange RESUBMIT=0
./xmlchange RUN_REFDATE=2001-01-01
./xmlchange RUN_STARTDATE=2001-01-01

and

./xmlchange STOP_N=1,STOP_OPTION=ndays
./xmlchange REST_OPTION=nsteps,REST_N=2
./xmlchange RESUBMIT=1
./xmlchange RUN_REFDATE=2001-01-01
./xmlchange RUN_STARTDATE=2001-01-01

After looking at the difference between the 2 files, I've found that pfts1d_active is different in 1 value.

Looking further, I've found that active columns and pfts are set up by calling set_active(), which calls is_active_p() during initialization and restart. However, when FATES is running, patch%wtcol is assigned patch%wt_ed here, and that update for wtcol happens only once a day.

The restart files for 2001-01-01-00000 are identical for both fates_nocomp = .true. and fates_nocomp = .false. However, subsequent restarts show differences.

For fates_nocomp = .true.:

RMS pfts1d_wtxy                      4.1023E-13            NORMALIZED  9.0097E-12
 RMS pfts1d_wtlnd                     4.2316E-13            NORMALIZED  4.0344E-12
 RMS pfts1d_wtcol                     4.2316E-13            NORMALIZED  1.5587E-12
 RMS DZSNO                            2.4262E-09            NORMALIZED  6.8734E-08
 RMS ZSNO                             2.8319E-09            NORMALIZED  1.8190E-08

And for fates_nocomp=.false.:

RMS pfts1d_active                    5.8456E-03            NORMALIZED  1.2956E-02
 RMS pfts1d_wtxy                      1.4848E-03            NORMALIZED  2.9640E-02
 RMS pfts1d_wtlnd                     1.5813E-03            NORMALIZED  1.3703E-02
 RMS pfts1d_wtcol                     1.5813E-03            NORMALIZED  5.2942E-03
 RMS DZSNO                            7.9897E-07            NORMALIZED  2.2642E-05

This is not a full list, since those are a bit long.

I will check if this happens with fates_spitfire_mode=0 tomorrow.

Forgot to mention: the values are different for around 2800 patches out of around 32k.

@mvdebolskiy

mvdebolskiy commented Oct 12, 2023

OK, reading through FATES calls during restart, I've found that cohorts are being terminated during the reading of the restart file for the failing tests.

@rgknox
Contributor

rgknox commented Oct 12, 2023

That could certainly explain the diffs. I see we call restart() in the clmfates_interfaceMod, and in that routine we call update_site(), which calls termination here:
https://github.com/NGEET/fates/blob/sci.1.67.5_api.27.0.0/main/EDMainMod.F90#L808-L809

Looking for more instances.

@rosiealice
Contributor

Nice find @mvdebolskiy !

@mvdebolskiy

@rgknox More specifically: the termination happens in canopy_structure(), at call index 14, after the cohort fuse.

@rosiealice
Contributor

It is indeed confusing why things should be terminating on restart, and why this is triggered after rather than before the restart occurs. I also agree that we should not, in principle, need to call canopy_structure during the restart process. Doing away with it might prevent this termination from occurring, but it might also mask the reason why the conditions for termination change. It would probably be useful to track down what is triggering the termination here.

@mvdebolskiy

mvdebolskiy commented Oct 13, 2023

fates branch I am working in
ctsm

Regarding failing pfts: it's not just 5 and 6, but also 1, 3, 9 and 12.

@mvdebolskiy

mvdebolskiy commented Oct 13, 2023

Things from today's meeting:

  • Check that init_cold is not called during restart read
  • Make NCL_p a restart variable and switch off the canopy_structure call during restart read
  • Try to fuse on the restart write to see if there are cohorts to fuse
  • Dump small-n cohorts into the log on restart write and read
  • Get rid of the variables FATES_DISTURBANCE_RATE_P2P, FATES_DISTURBANCE_RATE_POTENTIAL and FATES_DISTURBANCE_RATE_TREEFALL from the history files, so the tests don't fail because of them

@glemieux
Contributor

During the meeting I ran a 110-day 1x1_brazil test at @rosiealice's suggestion, to see if we could narrow this down to a specific site and make troubleshooting a little easier. I wasn't able to get the issue to replicate in this configuration.

I also tried adding a check for hlm_is_restart in canopy_structure to avoid the do while(area_not_balanced) loop, hoping to avoid the termination calls here:

! Does any layer have excess area in it? Keep going until it does not...
patch_area_counter = 0
area_not_balanced = .true.
do while(area_not_balanced)
   ! ---------------------------------------------------------------------------
   ! Demotion Phase: Identify upper layers that are too full, and demote them to
   ! the layers below.
   ! ---------------------------------------------------------------------------
   ! Its possible that before we even enter this scheme
   ! some cohort numbers are very low. Terminate them.
   call terminate_cohorts(currentSite, currentPatch, 1, 12, bc_in)
   ! Calculate how many layers we have in this canopy
   ! This also checks the understory to see if its crown
   ! area is large enough to warrant a temporary sub-understory layer
   z = NumPotentialCanopyLayers(currentPatch,currentSite%spread,include_substory=.false.)
   do i_lyr = 1,z ! Loop around the currently occupied canopy layers.
      call DemoteFromLayer(currentSite, currentPatch, i_lyr, bc_in)
   end do
   ! After demotions, we may then again have cohorts that are very very
   ! very sparse, remove them
   call terminate_cohorts(currentSite, currentPatch, 1,13,bc_in)
   call fuse_cohorts(currentSite, currentPatch, bc_in)
   ! Remove cohorts for various other reasons
   call terminate_cohorts(currentSite, currentPatch, 2,13,bc_in)
   ! ---------------------------------------------------------------------------------------
   ! Promotion Phase: Identify if any upper-layers are underful and layers below them
   ! have cohorts that can be split and promoted to the layer above.
   ! ---------------------------------------------------------------------------------------
   ! Re-calculate Number of layers without the false substory
   z = NumPotentialCanopyLayers(currentPatch,currentSite%spread,include_substory=.false.)
   ! We only promote if we have at least two layers
   if (z>1) then
      do i_lyr=1,z-1
         call PromoteIntoLayer(currentSite, currentPatch, i_lyr)
      end do
      ! Remove cohorts that are incredibly sparse
      call terminate_cohorts(currentSite, currentPatch, 1,14,bc_in)
      call fuse_cohorts(currentSite, currentPatch, bc_in)
      ! Remove cohorts for various other reasons
      call terminate_cohorts(currentSite, currentPatch, 2,14,bc_in)
   end if
   ! ---------------------------------------------------------------------------------------
   ! Check on Layer Area (if the layer differences are not small
   ! Continue trying to demote/promote. Its possible on the first pass through,
   ! that cohort fusion has nudged the areas a little bit.
   ! ---------------------------------------------------------------------------------------
   z = NumPotentialCanopyLayers(currentPatch,currentSite%spread,include_substory=.false.)
   area_not_balanced = .false.
   do i_lyr = 1,z
      call CanopyLayerArea(currentPatch,currentSite%spread,i_lyr,arealayer(i_lyr))
      if( ((arealayer(i_lyr)-currentPatch%area)/currentPatch%area > area_check_rel_precision) .or. &
          ((arealayer(i_lyr)-currentPatch%area) > area_check_precision ) ) then
         area_not_balanced = .true.
      endif
   enddo
   ! ---------------------------------------------------------------------------------------
   ! Gracefully exit if too many iterations have gone by
   ! ---------------------------------------------------------------------------------------
   patch_area_counter = patch_area_counter + 1
   if(patch_area_counter > max_patch_iterations .and. area_not_balanced) then
      write(fates_log(),*) 'PATCH AREA CHECK NOT CLOSING'
      write(fates_log(),*) 'patch area:',currentpatch%area
      do i_lyr = 1,z
         write(fates_log(),*) 'layer: ',i_lyr,' area: ',arealayer(i_lyr)
         write(fates_log(),*) 'rel error: ',(arealayer(i_lyr)-currentPatch%area)/currentPatch%area
         write(fates_log(),*) 'abs error: ',arealayer(i_lyr)-currentPatch%area
      enddo
      write(fates_log(),*) 'lat:',currentSite%lat
      write(fates_log(),*) 'lon:',currentSite%lon
      write(fates_log(),*) 'spread:',currentSite%spread
      currentCohort => currentPatch%tallest
      do while (associated(currentCohort))
         write(fates_log(),*) 'coh ilayer:',currentCohort%canopy_layer
         write(fates_log(),*) 'coh dbh:',currentCohort%dbh
         write(fates_log(),*) 'coh pft:',currentCohort%pft
         write(fates_log(),*) 'coh n:',currentCohort%n
         write(fates_log(),*) 'coh carea:',currentCohort%c_area
         ipft=currentCohort%pft
         write(fates_log(),*) 'maxh:',prt_params%allom_dbh_maxheight(ipft)
         write(fates_log(),*) 'lmode: ',prt_params%allom_lmode(ipft)
         write(fates_log(),*) 'd2bl2: ',prt_params%allom_d2bl2(ipft)
         write(fates_log(),*) 'd2bl_ediff: ',prt_params%allom_blca_expnt_diff(ipft)
         write(fates_log(),*) 'd2ca_min: ',prt_params%allom_d2ca_coefficient_min(ipft)
         write(fates_log(),*) 'd2ca_max: ',prt_params%allom_d2ca_coefficient_max(ipft)
         currentCohort => currentCohort%shorter
      enddo
      call endrun(msg=errMsg(sourcefile, __LINE__))
   end if
enddo ! do while(area_not_balanced)

This hit an endrun during the restart here almost immediately after restart initialization:

if ( currentPatch%total_canopy_area>currentPatch%area ) then
   if ( currentPatch%total_canopy_area-currentPatch%area > 0.001_r8 ) then
      write(fates_log(),*) 'FATES: canopy area bigger than area', &
           currentPatch%total_canopy_area ,currentPatch%area, &
           currentPatch%total_canopy_area -currentPatch%area,&
           currentPatch%nocomp_pft_label
      call endrun(msg=errMsg(sourcefile, __LINE__))

cheyenne folder: /glade/u/home/glemieux/scratch/ctsm-tests/tests_issue1051_d110_f45_areabalavoid

@rgknox
Contributor

rgknox commented Oct 13, 2023

@glemieux , are you saying that the model crashes if we don't call that promotion/demotion code during the restart sequence?

@rgknox
Contributor

rgknox commented Oct 13, 2023

My tests are now passing when bypassing canopy_structure() via update_site() on the restart, as well as adding patch%ncl_p to the restart.
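For reference, a minimal sketch of what that bypass could look like; the flag name and exact argument list are assumptions for illustration, not the code on the branches linked below:

! Sketch: thread a restart flag into the site update so the canopy
! promotion/demotion pass only runs during normal timestepping.
! At the call site in the HLM-FATES interface restart sequence:
call ed_update_site(currentSite, bc_in, bc_out, is_restarting=.true.)

! ...and inside ed_update_site:
if (.not. is_restarting) then
   call canopy_structure(currentSite, bc_in)
end if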

@rgknox
Contributor

rgknox commented Oct 13, 2023

Can someone else test these branches to see if these fixes work for them? I'll try some more, longer tests to see if the fix holds up:

https://github.com/rgknox/ctsm/tree/fates-nocomp-fix

The fates branch was added to the externals of the clm branch btw.

https://github.com/rgknox/fates/tree/nocomp-fix-tests

fates-side diff for convenience: https://github.com/NGEET/fates/compare/main...rgknox:fates:nocomp-fix-tests?w=1

@glemieux
Contributor

@glemieux , are you saying that the model crashes if we don't call that promotion/demotion code during the restart sequence?

@rgknox yes that's what I was seeing. This was with the latest ctsm tag using sci.1.67.2_api.27.0.0.

@glemieux
Contributor

@rgknox I confirmed that the branch fix worked for me as well. This was a 110-day, f45 test on Cheyenne.
Cheyenne folder: /glade/u/home/glemieux/scratch/ctsm-tests/tests_issue1051_d110_f45_rgknox-nocompfix

@rosiealice
Contributor

Amazing!

@mvdebolskiy

Checked that init_cold is not called on restart. Will try to test rgknox's fix with a longer test over the weekend.

@mvertens
Author

mvertens commented Oct 14, 2023 via email

@mvdebolskiy

@rgknox I am a bit confused. You have added the variables in FatesRestartInterfaceMod as r8, but in FatesPatchMod they are integers.

@rgknox
Contributor

rgknox commented Oct 16, 2023

@mvdebolskiy, I believe you are correct: the patch%ncl_p variable should be stored as a "cohort_int" and be accessed via this%rvars(***)%int1d. Thanks for catching that. The compiler was doing us a favor and converting from real to int for us. I'll make the change.
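A minimal sketch of the distinction, using the member name mentioned above (the variable index ir_ncl_pa and the offset io_idx are placeholders for illustration, not the actual names in FatesRestartInterfaceMod):

! Sketch: an integer patch quantity should round-trip through the integer
! restart buffer (int1d) rather than the r8 buffer, so no implicit
! real <-> integer conversion is involved.
! on restart write:
this%rvars(ir_ncl_pa)%int1d(io_idx) = currentPatch%ncl_p
! on restart read:
currentPatch%ncl_p = this%rvars(ir_ncl_pa)%int1d(io_idx)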

@mvdebolskiy

Quick question:
Do I understand correctly that canopy layers 1 ... nclmax go in order from tallest to shortest?

@rgknox
Contributor

rgknox commented Oct 17, 2023

Yes, layer 1 is the uppermost canopy layer, and then it works downwards.

@rgknox
Contributor

rgknox commented Oct 27, 2023

While #1098 seems to enable some long restarts to pass, it does not seem to address the whole problem.
I still generate failures for longer ERS tests, for example 11-month runs with full FATES on an f45 grid.
I'm planning a few different tests:

  1. trying fates_comp_excln = -1 (i.e. strict rank-based promotion/demotion) instead of probabilistic
  2. again, removing deciduous for evergreen only
  3. I may try with fixed biogeog/nocomp (my full-FATES runs generate the restart fail), but I believe @mvdebolskiy might be generating restart compare fails with this setting already?
  4. fire/no fire

@rgknox
Contributor

rgknox commented Oct 28, 2023

I believe that this call to canopy spread is also a problem for restarts: https://github.com/NGEET/fates/blob/main/main/EDMainMod.F90#L789

If you look at the spread calculation, it is an "incrementer", i.e. each call modifies the spread a little more until the canopy goes all the way towards closed. This is problematic if it is called more times in the restart sequence than in the original sequence. I turned this call off on the restart, and the ERS 11-month test passed (a sketch of this kind of guard follows the test list below). I'll add this change to the testing branch and continue to see if we can find more problems.

Other tests that also pass:

ERS_Lm15.f45_f45_mg37.I2000Clm51Fates.cheyenne_intel.clm-Fates
ERS_Lm25_P144x1.f10_f10_mg37.I2000Clm51Fates.cheyenne_intel.clm-FatesColdNoComp
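For completeness, a minimal sketch of the guard described above, in the same spirit as the earlier ed_update_site sketch (the flag name is again an assumption, not the code on the testing branch):

! Sketch: canopy_spread nudges currentSite%spread a little further toward a
! closed canopy every time it is called, so it must not be called an extra
! time while state is being rebuilt from a restart file.
if (.not. is_restarting) then
   call canopy_spread(currentSite)
end if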

@rosiealice
Contributor

Hi Ryan,

Nice catch, this sounds like a very viable culprit to me...

@mvertens
Author

mvertens commented Oct 30, 2023 via email

@rgknox
Contributor

rgknox commented Oct 30, 2023

@mvertens I think that's a great idea. I was wondering how the queue wall-time is calculated for the tests, to make sure that we don't push up against the maximum. I was also playing around with maximizing run time while keeping the wall-time somewhat consistent with other tests. For instance, in that Lm25 test I doubled the core count by using the P144x1 setting, and it took around 1000 seconds (~17 minutes), which is not too bad. I think that is a reasonable wall-time to include in the test suite. Any ideas/brainstorming are welcome.
I was also going to try creating a test that initializes the vegetation demographics from initialization files. I think this will go a long way towards making our tests more robust. If we do this, I'd like us to integrate the pull request that updates allometry (#1093) before we set the initialization files up.

@rgknox
Contributor

rgknox commented Oct 30, 2023

I see that we have control over wallclock time in testlist_clm.xml, so this is good.

@rgknox
Contributor

rgknox commented Oct 30, 2023

@github-project-automation github-project-automation bot moved this from ❕Todo to ✔ Done in FATES issue board Nov 10, 2023
peterdschwartz added a commit to E3SM-Project/E3SM that referenced this issue Jan 30, 2024
… next (PR #6018)

This pull request updates the ed_update_site call in elmfates_interfacemod to pass a flag for when this procedure is called during restart.
This update should be coordinated with NGEET/fates#1098, which addresses the long-duration exact restart issue NGEET/fates#1051.

Additionally this pull request resolves #5548 by expanding the fates regression test coverage to include
more run mode options for fates at a variety of resolutions and runtimes.

[non-BFB] for FATES
Fixes #5548
peterdschwartz added a commit to E3SM-Project/E3SM that referenced this issue Feb 1, 2024
This pull request updates the ed_update_site call in elmfates_interfacemod to pass a flag
for when this procedure is called during restart. This update should be coordinated with NGEET/fates#1098,
which addresses the long-duration exact restart issue NGEET/fates#1051.

Additionally this pull request resolves #5548 by expanding the fates regression
test coverage to include more run mode options for fates at a variety of resolutions and runtimes.

[non-BFB] for FATES
Fixes #5548