failure in ERS 3 month nocomp test #1051
@rosiealice @glemieux - I was not sure where to post this. The code is very close to ctsm5.1.dev129.
I guess a higher-level question is whether we should push this NOCOMP compset back into CTSM to streamline the testing, given that nocomp is a pretty core use case of FATES at the moment. Should we discuss this on a FATES SE call one day? (Those are at 9pm on Mondays in Norway, Mariana.) I don't know what the difference is between these two test variants...
Hi @mvertens, I'll try replicating this test. As @rosiealice asked, what's the difference between the two test variants?
@glemieux - thanks so much for your quick response to this.
I can confirm that I'm seeing similar COMPARE_base_rest failures using the following test case setup:
Here's what I checked varying the durations:
Here are other checks I did using
This is making me wonder if #897 is actually related to this issue.
@glemieux - thanks for confirming this. The one way I have always used to iron out restart problems is to get close to where the problem is and write history files every time step, for both an initial run and a restart run. I can try to narrow this down some more. Clearly, stopping somewhere between day 90 and day 120 is causing the problem. So say we know that stopping at day 100 and restarting will not work. Let's assume that just looking at the day boundary is good enough. Then one way to debug this is the following - assume that coupler history files and clm history files are written every time step.
I've usually been successful using this approach - I just did that to track down a restart problem using a compset with ocean, ice, wave and datm. Anyway, I'm happy to help as well, and am also happy to clarify things if the explanation above has been confusing.
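For reference, here is a minimal sketch of the kind of user_nl_clm settings that every-timestep clm history output implies (the values are my assumptions for a half-hourly timestep; a positive hist_nhtfrq counts model timesteps between writes and hist_mfilt sets the number of samples per history file; the coupler history frequency is controlled separately through the driver namelist):

```
hist_nhtfrq = 1    ! write the primary history tape every model timestep
hist_mfilt  = 48   ! pack one model day (48 half-hour steps) of samples per file
```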
@mvertens do you know which grid cell is showing the diff? Also, just to confirm, I see you have a FatesCold test that passes with Lm3. Have you found any non-nocomp tests failing, or are the failures just in nocomp configurations?
@rgknox - yes, that is correct. The only failures I have found are in the one nocomp configuration. I have only added 4 regression tests at this point to the NorESM CTSM fork, as a way of moving forward to run this with CAM60, CICE and BLOM.
A bunch of us met today to brainstorm next steps on how to approach this issue. I reported that yesterday I ran a few ERS FatesColdNoComp test mods at different durations and narrowed the restart failure down to some time between 52 and 58 days (Ld100 and Ld110, respectively); it shows up first in southern latitudes (New Zealand and southern Argentina). The issue presents itself right at the start-up time. There are some additional nearby gridcells that start showing issues towards the end of the run, suggesting that this issue might be more widespread in the boreal region but just swamped out because of the plotting range. The latest test runs are with the latest ctsm and fates tags. Some additional notes from the meeting:
The following actions are going to be taken:
We've agreed to meet in two weeks' time to discuss progress.
I have an update on action 2.
I am really perplexed as to why the albedos are different, and I would really need some advice on how to proceed with this. I have not looked at FATES for several years. I am happy to add new variables to the restart files. Would another chat help?
@rgknox can this be due to #428 and ESCOMP/CTSM#548? Do you remember what tests were performed when merging that PR?
@mvertens one question I have is on the time stepping that you did. Many of the FATES-specific processes only occur once per day, i.e. every 48th timestep. Was the daily timestep one of the two timesteps in your restart run? Because if not, that says that the daily FATES code shouldn't even be triggered in the experiment you did, despite you showing differences in things like the history-reported disturbance rates. If that is the case, it could considerably narrow the set of possible errors here.
@ckoven - the daily time step was not one of the two timesteps in my restart. I started at the beginning of the day and did 4 timesteps in one run and 2/2 in the restart run. I then compared the two runs at timestep 4.
@mvdebolskiy that could be a good lead. We had a test suite in 2018 similar to what we have now, but it may not have contained tests that were long enough.
@mvdebolskiy, I'm noticing that there is similar code doing the same thing in the call to PatchNorman radiation during the restart and the normal call sequence. Even if this is not the problem we are tackling in this thread, I think it would be better to have the zero'ing and initializations contained in one routine that is called from both locations. See: https://github.com/NGEET/fates/blob/main/main/FatesRestartInterfaceMod.F90#L3588-L3644
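A generic sketch of that refactor (toy module and routine names, not the actual FATES code): keep the zero'ing/initialization in a single routine and call it from both the restart path and the normal call sequence, so the two can never drift apart.

```fortran
! Toy sketch of the suggested consolidation; names are made up, not FATES code.
module rad_init_demo
   implicit none
contains
   ! the single place that zeroes/initializes the radiation scratch state
   subroutine zero_rad_state(albd, albi)
      real, intent(out) :: albd(:), albi(:)
      albd = 0.0
      albi = 0.0
   end subroutine zero_rad_state
end module rad_init_demo

program call_from_both_paths
   use rad_init_demo, only : zero_rad_state
   implicit none
   real :: albd(2), albi(2)
   call zero_rad_state(albd, albi)   ! normal timestepping call sequence
   call zero_rad_state(albd, albi)   ! restart call sequence reuses the same code
   print *, 'initialized:', albd, albi
end program call_from_both_paths
```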
Looks like this line is different in the restart process vs normal timestepping:
I ran a test where I modified the line linked in my previous post to match that of the call in EDSurfaceAlbedoMod, but it did not remove the error. I think we are getting closer though; the diffs that you uncovered in the restart file, @mvertens, do seem to point to something in the radiation restart.
@rgknox fates/biogeophys/EDSurfaceAlbedoMod.F90, line 100, at commit 21e18c6
In the
A thought: if this is a radiation code thing, does it fail the test with the new two-stream scheme?
@rosiealice it's worth testing.
A couple of other things I tried: I ran the latter test because I was concerned the radiation scheme may have been pulling from indices outside of the allocation bounds on arrays, and thus mixing in uninitialized data. It was a long shot, but this was not the issue.
@mvertens this is probably a separate issue from the fundamental one(s) responsible for the problems here, but the history fields
Small update, I've expanded @mvertens' investigations:
and
After looking at the difference between the 2 files, I've found that
Looking further, I've found that active columns and pfts are set up through calling
The restart files for 2001-01-01-00000 are identical for both
For
And for
This is not a full list, since those are a bit long. I will check if this happens with
Forgot to mention: the values are different for around 2800 patches out of around 32k.
OK, reading through FATES calls during restart, I've found that there are cohorts being terminated during the reading of the restart file for the failing tests.
That could certainly explain the diffs. I see we call restart() in the clmfates_interfaceMod, and in that routine we call update_site(), which calls termination here:
Looking for more instances.
Nice find, @mvdebolskiy!
@rgknox More specifically, the termination happens in canopy_structure(), at call index 14, after the cohort fuse.
It is indeed confusing why things should be terminating on restart, and why this is triggered after and not before the restart occurs. I also agree that we should not, in principle, need to call canopy_structure during the restarting process. Doing away with it might prevent this termination from occurring, but might actually mask the reason why the conditions wrt the termination change. It would probably be useful to track down what is triggering the termination here.
fates branch I am working in:
Regarding failing pfts: it's not just 5 and 6 but 1, 3, 9, 12.
Things from today's meeting:
During the meeting I ran a 110-day
I also tried adding a check for
fates/biogeochem/EDCanopyStructureMod.F90, lines 190 to 300, at commit a3048a6
This hit an
fates/biogeochem/EDCanopyStructureMod.F90, lines 1421 to 1427, at commit a3048a6
cheyenne folder:
@glemieux, are you saying that the model crashes if we don't call that promotion/demotion code during the restart sequence?
My tests are now passing when bypassing canopy_structure() via update_site() on the restart, as well as adding patch%ncl_p to the restart.
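For illustration, here is a toy sketch of the pattern this fix amounts to (routine and flag names are assumptions, not the actual FATES/CTSM interface): the site update is told whether it is being driven from a restart read, and skips the canopy reorganization step in that case.

```fortran
! Toy illustration of guarding a state-changing step behind a restart flag;
! not FATES code, and the names are invented for this sketch.
program restart_guard_demo
   implicit none
   call update_site(is_restarting=.false.)  ! normal timestepping path
   call update_site(is_restarting=.true.)   ! restart-read path
contains
   subroutine update_site(is_restarting)
      logical, intent(in) :: is_restarting
      print *, 'rebuild site/patch/cohort bookkeeping'
      if (.not. is_restarting) then
         ! only the normal sequence may reorganize the canopy; doing it during
         ! the restart read changes state relative to the continuous run
         print *, 'canopy_structure: fuse/terminate/promote/demote'
      end if
   end subroutine update_site
end program restart_guard_demo
```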
Can someone else test these branches to see if these fixes work for them? I'll try some more, longer tests to see if those work.
https://github.com/rgknox/ctsm/tree/fates-nocomp-fix
The fates branch was added to the externals of the clm branch, btw: https://github.com/rgknox/fates/tree/nocomp-fix-tests
fates-side diff for convenience: https://github.com/NGEET/fates/compare/main...rgknox:fates:nocomp-fix-tests?w=1
@rgknox I confirmed that the branch fix worked for me as well. This was a 110-day, f45 test on Cheyenne.
Amazing!
Checked that init_cold is not called on restart. Will try to test rgknox's fix with a longer test over the weekend.
Fantastic news!!!!
@rgknox I am a bit confused. You have added variables in
@mvdebolskiy, I believe you are correct; the patch%ncl_p variable should be stored as a "cohort_int" and be accessed via this%rvars(***)%int1d. Thanks for catching that. The compiler was doing us a favor and converting from real to int for us. I'll make the change.
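As a small aside on the real-to-int point, here is a toy round trip (not the FATES restart API) showing why an integer state such as the canopy-layer count is safer in an integer buffer: the implicit conversion truncates rather than rounds, so it only works as long as the stored real sits exactly on an integer value.

```fortran
! Toy demo of integer state stored via real vs. integer buffers; not FATES code.
program int_vs_real_restart
   implicit none
   real    :: rbuf
   integer :: ibuf, ncl_read

   rbuf = 2                 ! exact small integers survive a real round trip...
   ncl_read = rbuf
   print *, 'clean real round trip     :', ncl_read   ! 2

   rbuf = 3.0 - 1.0e-6      ! ...but any perturbation is silently truncated
   ncl_read = rbuf
   print *, 'perturbed real round trip :', ncl_read   ! 2, not 3

   ibuf = 3                 ! an integer buffer carries the value exactly
   ncl_read = ibuf
   print *, 'integer round trip        :', ncl_read   ! 3
end program int_vs_real_restart
```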
Quick question:
Yes, layer 1 is the uppermost canopy layer, and then it works downwards.
While #1098 seems to enable some long restarts to pass, it does not seem to address the whole problem.
I believe that this call to canopy spread is also a problem for restarts: https://github.com/NGEET/fates/blob/main/main/EDMainMod.F90#L789 If you look at the spread calculation, it is an "incrementer", i.e. it will modify the spread a little more each call until the canopy goes all the way towards closed. This is problematic if it is called more times in the restart sequence than in the original sequence. I turned this call off on the restart, and the ERS 11-month test passed. I'll add this change to the testing branch and continue to see if we can find more problems. Other tests that also pass: ERS_Lm15.f45_f45_mg37.I2000Clm51Fates.cheyenne_intel.clm-Fates
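To see why an incrementing update breaks exact restart, here is a small self-contained toy (the numbers and names are made up, not the FATES spread formula): the final state depends on how many times the increment has been applied, so one extra call during the restart sequence changes the answer relative to the continuous run.

```fortran
! Toy demonstration of why an "incrementer" called an extra time on restart
! breaks bit-for-bit agreement; not FATES code, values are arbitrary.
program spread_incrementer_demo
   implicit none
   real    :: spread_cont, spread_rest
   integer :: i

   spread_cont = 0.5
   do i = 1, 4                          ! uninterrupted run: 4 daily calls
      call increment_spread(spread_cont)
   end do

   spread_rest = 0.5
   do i = 1, 2                          ! first segment: 2 daily calls
      call increment_spread(spread_rest)
   end do
   call increment_spread(spread_rest)   ! extra call made during the restart sequence
   do i = 1, 2                          ! restarted segment: 2 daily calls
      call increment_spread(spread_rest)
   end do

   print *, 'continuous run:', spread_cont
   print *, 'restarted run :', spread_rest   ! differs, so COMPARE_base_rest fails

contains
   subroutine increment_spread(spread)
      real, intent(inout) :: spread
      spread = min(1.0, spread + 0.01)   ! nudge spread toward a closed canopy
   end subroutine increment_spread
end program spread_incrementer_demo
```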
Hi Ryan, nice catch; this sounds like a very viable culprit to me...
Hi Ryan, great work! Do we want to do a longer run to verify that restarts are good out to longer times? Mariana
@mvertens I think that's a great idea. I was wondering how the queue wall-time is calculated for the tests, to make sure that we don't push up against the maximum. I was also playing around with maximizing run time while keeping the wall-time somewhat consistent with other tests. For instance, in that Lm25 test I tried, I doubled the core count by using the P144x1 setting, and that took around 1000 seconds (17 minutes), which is not too bad. I think that is a reasonable wall-time to include in the test suite. Any ideas/brainstorming are welcome.
I see that we have control over wallclock time in testlist_clm.xml, so this is good.
New test prototype added: https://github.com/ESCOMP/CTSM/pull/2199/files#diff-e9fa716c208fbff28067b20a7bdb1b23ca47f77e28e587c691b486aeed9aca3bR1565
Looking for feedback.
… next (PR #6018) This pull request updates the ed_update_site call in elmfates_interfacemod to pass a flag for when this procedure is called during restart. This update should be coordinated with NGEET/fates#1098, which addresses the long-duration exact restart issue NGEET/fates#1051. Additionally, this pull request resolves #5548 by expanding the fates regression test coverage to include more run-mode options for fates at a variety of resolutions and runtimes. [non-BFB] for FATES Fixes #5548
I am trying to create longer runs using the latest CTSM in NorESM configurations (with the latest CMEPS, CDEPS, etc). The following 2 tests fail restarts:
ERS_Lm3.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire
ERS_Ld90.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire
However, what is interesting is that this test passes restart:
ERS_Ld90.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire.20230707_114133_1m5tec/
Since the restart test fails, we cannot really start any longer runs until we resolve this problem.