LWISO_Ld10.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_gnu.clm-coldStart FAIL in ctsm5.2 #2041

Closed
slevis-lmwg opened this issue Jun 22, 2023 · 13 comments · Fixed by #2053

@slevis-lmwg
Contributor

Brief summary of bug

This test fails in ctsm5.2
LWISO_Ld10.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_gnu.clm-coldStart
with
ERROR in CompareBulkToTracer: tracer does not agree with bulk water

General bug information

CTSM version you are using: [output of git describe]
alpha-ctsm5.2.mksrf.16_ctsm5.1.dev123

Does this bug cause significantly incorrect results in the model's science? [Yes / No]
Yes

Configurations affected:
See test name above.

Details of bug

This is the first time that we have tried to run the test suites with ctsm5.2. The failing test uses a new fsurdat in ctsm5.2:
/glade/p/cesmdata/cseg/inputdata/lnd/clm2/surfdata_esmf/ctsm5.2.0/surfdata_10x15_hist_78pfts_CMIP6_2000_c230517.nc
...relative to ctsm5.1:
/glade/p/cesmdata/cseg/inputdata/lnd/clm2/surfdata_map/release-clm5.0.18/surfdata_10x15_hist_78pfts_CMIP6_simyr2000_c190214.nc

@billsacks, @ekluzek suggested that you may have quicker insight into this failure than we would.

@slevis-lmwg self-assigned this Jun 22, 2023
@slevis-lmwg added the bug (something is working incorrectly) label Jun 22, 2023
@slevis-lmwg
Contributor Author

slevis-lmwg commented Jun 26, 2023

Met with @billsacks and @ekluzek and decided the following:

  • understand why ctsm5.1 does not trigger the same error
  • look for a possible connection with the subroutine UpdateState_RemoveCanfallFromCanopy, and see Bill's notes in UpdateState_RemoveSnowUnloading
  • apply the truncate small values function? Increase the error tolerance?
  • tried running in _D_ (debug) mode and got the same error
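
To make the "increase error tolerance" option concrete, here is a minimal Python sketch of a bulk-versus-tracer consistency check with an adjustable relative tolerance. It is not the CTSM CompareBulkToTracer routine (which is Fortran); the function name `first_mismatch`, its arguments, and the default tolerance are assumptions made for illustration only.

```python
import math

def first_mismatch(bulk, tracer, ratio, rel_tol=1.0e-13):
    """Return the index of the first tracer value that deviates from
    bulk * ratio by more than rel_tol (relative), or None if all agree.

    Hypothetical helper: name, signature and default tolerance are assumptions.
    """
    for i, (b, t) in enumerate(zip(bulk, tracer)):
        expected = b * ratio
        # abs_tol=0.0 keeps the test purely relative; exact zeros still match.
        if not math.isclose(t, expected, rel_tol=rel_tol, abs_tol=0.0):
            return i
    return None

# Three sample values in the style of the snocan output in this issue.
bulk = [0.0, 3.4990048392281783e-08, 2.7864890851993229e-307]
tracer = [0.0, 3.4990048392281787e-18, 2.7865391357262557e-317]

print(first_mismatch(bulk, tracer, ratio=1.0e-10))
print(first_mismatch(bulk, tracer, ratio=1.0e-10, rel_tol=1.0e-4))
```

Loosening the tolerance hides the mismatch rather than removing it, which is one reason to weigh it against truncating the tiny values instead.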

@slevis-lmwg
Contributor Author

@billsacks @ekluzek
I have promising news and am interested in your thoughts:

  1. A diff of the lnd_in files from the ctsm5.1 and ctsm5.2 tests showed that the latter sets convert_ocean_to_land = .false.
  2. In both versions, the test uses clm50 (I2000Clm50BgcCrop).
  3. In ctsm5.2 I saw that, if I changed to clm51, I would get convert_ocean_to_land = .true.
  4. I tried that and the test passed!
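
The namelist comparison in step 1 can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_namelist` and `namelist_diff` are hypothetical, the parser handles only simple one-line `name = value` entries, and the two sample texts are made-up stand-ins, not the actual lnd_in files from these tests.

```python
def parse_namelist(text):
    """Collect simple `name = value` pairs from Fortran-namelist-style text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, group headers (&group), terminators (/) and comments (!).
        if not line or line.startswith(("&", "/", "!")) or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().rstrip(",")
    return settings

def namelist_diff(old_text, new_text):
    """Map each differing setting to its (old, new) values; None means absent."""
    old, new = parse_namelist(old_text), parse_namelist(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }

# Made-up sample contents, for illustration only.
OLD = """&clm_inparm
 fsurdat = 'old_surfdata.nc'
 use_crop = .true.
/
"""
NEW = """&clm_inparm
 convert_ocean_to_land = .false.
 fsurdat = 'new_surfdata.nc'
 use_crop = .true.
/
"""

for name, (old_value, new_value) in namelist_diff(OLD, NEW).items():
    print(f"{name}: {old_value} -> {new_value}")
```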

@billsacks
Member

Thanks for your work on this and for providing this update @slevis-lmwg !

I have a few thoughts here:

(1) It seems like this points out a general need to update namelist_defaults so that all ctsm5.1 options are duplicated for ctsm5.2 if we are introducing a new ctsm5.2 physics option. (I had a vague recollection that we weren't going to have a new ctsm5.2 physics option, since the only difference was going to be new surface datasets which we were going to apply for all versions, but maybe I'm remembering wrong or maybe the plan has changed?)

(2) I don't think this fully explains why the test is newly failing now, since as you say, the test uses clm50 physics, and convert_ocean_to_land is always (implicitly) false on master, since that option isn't implemented there. But maybe there's some interaction between this option and changes in the new surface datasets... I guess I could believe that.

(3) I'm struggling a bit to see why setting convert_ocean_to_land = .true. would make things pass when having this be .false. leads to the error you're seeing. If I remember correctly, the error appeared in a vegetated PFT. One thing I wonder is if it's a zero-area (virtual) patch in the run with convert_ocean_to_land = .false. (i.e., patch%wtgcell(p) = 0). In principle I don't think that should cause problems, but maybe something is working wrong for the initialization or evolution of virtual columns in this respect? I think this is worth digging into a little more, because it feels like something is going wrong here....

@slevis-lmwg
Contributor Author

In case it might help me narrow things down, I tried the same test
LWISO_Ld10.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_gnu.clm-coldStart
with ctsm5.2 (so new code) but with the ctsm5.1 fsurdat (i.e. old file) and it worked.

I don't have a good understanding of the code, so I will spend some time trying to understand it.

@billsacks
Member

@slevis-lmwg based on that result and your previous one, the first thing I'd look at is whether there are changes in the subgrid breakdown for the grid cell with the failure: it seems like there may be a difference in the presence / absence of some subgrid tile in the old vs. new, maybe?

@slevis-lmwg
Contributor Author

@billsacks here's what I see with ncview...
At lon = 90, lat = 40 (i = 6, j = 13), looking at new minus old fsurdat files (i.e. the diffs):
PCT_CROP = -7.5e-4 (this consists of reductions and increases in the PCT_CFT of various cfts)
PCT_NATVEG = 7.5e-4 (this consists of increased PCT_NAT_PFT in one pft and reductions in a few others)
PCT_GLACIER, _LAKE, _WETLAND, _URBAN = 0 (new = old = 0)
PCT_GLC_MEC (and TOPO_GLC_MEC) differs in nglcec 9 and 10
For me, these differences do not raise red flags because many grid cells have such differences, as far as I can tell...

@billsacks in an earlier comment you wondered whether patch%wtgcell(p) = 0 in the failing grid cell and the answer is no. I added corresponding info to the error message:

20:iam = 20: gridcell longitude    =   90.0000000
20:iam = 20: gridcell latitude     =   40.0000000
20:iam = 20: pft      type         = 12
20:iam = 20: pft      wtcol        =    0.1068790
20:iam = 20: pft      wtgcell      =    0.1055263
20:iam = 20: column   type         = 1
20:iam = 20: column   wtlunit      =    1.0000000
20:iam = 20: landunit type         = 1
20:iam = 20: landunit wtgcell      =    0.9873434
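
As a quick sanity check on these numbers, the patch weight on the grid cell should be the product of the weights down the subgrid hierarchy (patch on column, column on landunit, landunit on grid cell). The sketch below assumes that relationship and uses the rounded values printed above; the function name `patch_weight_on_gridcell` is hypothetical.

```python
def patch_weight_on_gridcell(pft_wtcol, col_wtlunit, lun_wtgcell):
    """Multiply the weights down the subgrid hierarchy (assumed relationship)."""
    return pft_wtcol * col_wtlunit * lun_wtgcell

# Values copied from the error-message output above (rounded to 7 decimals).
derived = patch_weight_on_gridcell(0.1068790, 1.0000000, 0.9873434)
reported = 0.1055263

print("derived :", derived)
print("reported:", reported)
print("agree to within rounding:", abs(derived - reported) < 1.0e-6)
```

The derived and reported weights agree to within the rounding of the printed values, so the patch is a real, non-virtual patch covering roughly 10% of the grid cell.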

@billsacks
Member

Hmmmm, I am very puzzled. I am having trouble seeing why the change in convert_ocean_to_land has any bearing on this: Are there even any differences in the subgrid weights of this point when setting that to true vs false? It seems like there wouldn't be based on PCT_WETLAND being 0.

Have you been able to reproduce the failure (to verify that this wasn't just a machine glitch) and also reproduce the pass when setting convert_ocean_to_land = .true. (without any other changes)? I'm wondering if that was a red herring.

@slevis-lmwg
Contributor Author

I reproduced the error three times: once when running the test in _D_ debug mode, and twice when I added new outputs to the error message (pft wtcol, pft wtgcell).

Also I reran with convert_ocean_to_land = .true. and reproduced the PASS just now.

@slevis-lmwg
Contributor Author

I removed my last two posts because they did not add helpful information.

@slevis-lmwg
Contributor Author

slevis-lmwg commented Jul 2, 2023

Write statements from subroutine CanopyInterceptionAndThroughfall show the tracer and bulk values for begp=1 to endp=67, including the value for index = 21 that triggers the error. To make this easier to read, I removed the values after index = 27 because they are zeros, and I placed ** around the value at index = 21.

index = 21 seems to behave consistently with the other index values, so I don't think that these write statements raise red flags:

20: after UpdateState_RemoveSnowUnloading trac_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281783E-008   3.2818367146871103E-008   7.2038970438381224E-032   2.7632431574775100E-031   7.5773957284976151E-056   0.0000000000000000        **2.7864890851993229E-307**   2.9447375187772244E-283   2.9447375187772244E-283   0.0000000000000000        5.9707463362854256E-003   2.4230549570754562E-003   2.4230549571078448E-003
20: after UpdateState_RemoveSnowUnloading bulk_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281783E-008   3.2818367146871103E-008   7.2038970438381224E-032   2.7632431574775100E-031   7.5773957284976151E-056   0.0000000000000000        **2.7864890851993229E-307**   2.9447375187772244E-283   2.9447375187772244E-283   0.0000000000000000        5.9707463362854256E-003   2.4230549570754562E-003   2.4230549571078448E-003
20: i, begp, endp =           0           1          67

20: after UpdateState_RemoveSnowUnloading trac_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281783E-008   3.2818367146871103E-008   7.2038970438381224E-032   2.7632431574775100E-031   7.5773957284976151E-056   0.0000000000000000        **2.7864890851993229E-307**   2.9447375187772244E-283   2.9447375187772244E-283   0.0000000000000000        5.9707463362854256E-003   2.4230549570754562E-003   2.4230549571078448E-003
20: after UpdateState_RemoveSnowUnloading bulk_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281783E-008   3.2818367146871103E-008   7.2038970438381224E-032   2.7632431574775100E-031   7.5773957284976151E-056   0.0000000000000000        **2.7864890851993229E-307**   2.9447375187772244E-283   2.9447375187772244E-283   0.0000000000000000        5.9707463362854256E-003   2.4230549570754562E-003   2.4230549571078448E-003
20: i, begp, endp =           1           1          67

20: after UpdateState_RemoveSnowUnloading trac_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281785E-009   3.2818367146871105E-009   7.2038970438381229E-033   2.7632431574775102E-032   7.5773957284976148E-057   0.0000000000000000        **2.7864890851993229E-308**   2.9447375187772246E-284   2.9447375187772246E-284   0.0000000000000000        5.9707463362854260E-004   2.4230549570754563E-004   2.4230549571078449E-004
20: after UpdateState_RemoveSnowUnloading bulk_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281783E-008   3.2818367146871103E-008   7.2038970438381224E-032   2.7632431574775100E-031   7.5773957284976151E-056   0.0000000000000000        **2.7864890851993229E-307**   2.9447375187772244E-283   2.9447375187772244E-283   0.0000000000000000        5.9707463362854256E-003   2.4230549570754562E-003   2.4230549571078448E-003
20: i, begp, endp =           2           1          67

20: after UpdateState_RemoveSnowUnloading trac_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281787E-018   3.2818367146871103E-018   7.2038970438381234E-042   2.7632431574775101E-041   7.5773957284976162E-066   0.0000000000000000        **2.7865391357262557E-317**   2.9447375187772242E-293   2.9447375187772242E-293   0.0000000000000000        5.9707463362854255E-013   2.4230549570754567E-013   2.4230549571078447E-013
20: after UpdateState_RemoveSnowUnloading bulk_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281783E-008   3.2818367146871103E-008   7.2038970438381224E-032   2.7632431574775100E-031   7.5773957284976151E-056   0.0000000000000000        **2.7864890851993229E-307**   2.9447375187772244E-283   2.9447375187772244E-283   0.0000000000000000        5.9707463362854256E-003   2.4230549570754562E-003   2.4230549571078448E-003
20: i, begp, endp =           3           1          67

20: after UpdateState_RemoveSnowUnloading trac_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281781E-007   3.2818367146871102E-007   7.2038970438381222E-031   2.7632431574775100E-030   7.5773957284976154E-055   0.0000000000000000        **2.7864890851993229E-306**   2.9447375187772240E-282   2.9447375187772240E-282   0.0000000000000000        5.9707463362854256E-002   2.4230549570754562E-002   2.4230549571078449E-002
20: after UpdateState_RemoveSnowUnloading bulk_snocan =   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        3.4990048392281783E-008   3.2818367146871103E-008   7.2038970438381224E-032   2.7632431574775100E-031   7.5773957284976151E-056   0.0000000000000000        **2.7864890851993229E-307**   2.9447375187772244E-283   2.9447375187772244E-283   0.0000000000000000        5.9707463362854256E-003   2.4230549570754562E-003   2.4230549571078448E-003
20: i, begp, endp =           4           1          67

20:ERROR in CompareBulkToTracer: tracer does not agree with bulk water
20:Called from: after first stage of hydrology
20:Variable: snocan_patch
20:First difference at index: 21
20:Bulk  :   0.27864890851993229-306
20:Tracer:   0.27865391357262557-316
20:ratio:   0.10000000000000000E-09
20:Bulk*ratio:   0.27864892350960257-316
20:iam = 20: local  patch    index = 21
20:iam = 20: global patch    index = 2950
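
One observation about the numbers in this error message, offered as a property of IEEE 754 doubles rather than as a conclusion stated in the thread: the tracer value (about 2.79e-317) lies below the smallest normal double, so it is a subnormal number carrying far fewer significant digits than the bulk value. The Python sketch below illustrates this; the helper `relative_spacing` is a hypothetical name, and `math.ulp` requires Python 3.9 or later.

```python
import math
import sys

# Values copied from the error message above. Fortran prints 0.278...-306
# for 0.278...E-306 when the exponent needs three digits.
bulk = 2.7864890851993229e-307
ratio = 1.0e-10

def relative_spacing(x):
    """Gap to the next representable double above x, relative to x."""
    return math.ulp(x) / abs(x)

product = bulk * ratio

print("smallest normal double:", sys.float_info.min)
print("bulk * ratio is subnormal:", 0.0 < product < sys.float_info.min)
print("relative spacing at bulk:        ", relative_spacing(bulk))
print("relative spacing at bulk * ratio:", relative_spacing(product))
```

In the subnormal range the spacing between adjacent doubles is fixed, so the relative precision degrades as values shrink. A tracer computed through a slightly different sequence of operations than bulk * ratio can then differ by much more than a tolerance chosen for normal-range values.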

@slevis-lmwg
Contributor Author

Repeated the above test but in ctsm5.1 (dev129) with the corresponding old fsurdat. General behavior seems the same. No red flags from inspecting the cesm.log of this test against the cesm.log from ctsm5.2 (that stops with the error).

Next: I will look into using the truncate_small_values function.

@slevis-lmwg
Contributor Author

slevis-lmwg commented Jul 6, 2023

As recommended in the software eng. meeting (2023/7/6), I have confirmed that the I2000Clm50BgcCrop test with convert_ocean_to_land = .true. in ctsm5.2 fails with the exact same error as with convert_ocean_to_land = .false.
(in contrast to the I2000Clm51BgcCrop test, which I have shown passes).

Applying truncate_small_values to both tracer and bulk may correct the issue. However, that would not explain why, when running with ctsm51, I see bulk and tracer values of 1e-200 without the error being triggered.
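
The idea of truncating both fields together can be sketched as follows. This is not CTSM's truncate_small_values routine (which is Fortran and has its own interface); the function name `truncate_bulk_and_tracer` and the absolute threshold of 1e-30 are assumptions chosen purely for illustration. The point of the sketch is that the decision is made on the bulk value and applied to both arrays, so bulk and tracer are zeroed at the same indices.

```python
def truncate_bulk_and_tracer(bulk, tracer, threshold=1.0e-30):
    """Zero out bulk values smaller than threshold, and zero the tracer
    at the same indices so the two fields stay consistent.

    Hypothetical helper: name and threshold are assumptions.
    """
    new_bulk, new_tracer = [], []
    for b, t in zip(bulk, tracer):
        if abs(b) < threshold:
            new_bulk.append(0.0)
            new_tracer.append(0.0)
        else:
            new_bulk.append(b)
            new_tracer.append(t)
    return new_bulk, new_tracer

# Two sample values in the style of the snocan output in this issue.
bulk = [3.4990048392281783e-08, 2.7864890851993229e-307]
tracer = [3.4990048392281787e-18, 2.7865391357262557e-317]

new_bulk, new_tracer = truncate_bulk_and_tracer(bulk, tracer)
print(new_bulk)
print(new_tracer)
```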

@slevis-lmwg
Contributor Author

The test has now passed. I will commit and push the corresponding code change for review and discussion. I doubt that we will agree on my exact code change, but it's a start.
