Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grid cell-level error check for H2O #1228

Merged
merged 23 commits into from
Feb 20, 2021

Conversation

slevis-lmwg
Copy link
Contributor

Description of changes

Introducing a grid cell-level error check for H2O to address issue #201
Following the approach of pull requests #984 and #1022

Specific notes

Contributors other than yourself, if any:
@billsacks

CTSM Issues Fixed (include github issue #): #201

Are answers expected to change (and if so in what way)?
No

Any User Interface Changes (namelist or namelist defaults changes)?
No

Testing performed, if any:
ERI_D_Ld9.1x1_camdenNJ.I2000Clm50BgcCruGs.cheyenne_intel.clm-default PASS
ERI_D_Ld9.f10_f10_musgs.I1850Clm50Bgc.cheyenne_gnu.clm-default PASS
ERP_D_Ld5.f10_f10_musgs.IHistClm50BgcCrop.cheyenne_intel.clm-allActive PASS
ERS_Ly3_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropQianGs.cheyenne_gnu.clm-cropMonthOutput FAIL
The last test fails because I have not handled the special case of transient simulations, yet. I will address this in this PR.

Works for non-transient simulations. In this attempt I convert most
column-level variables to grid cell-level locally in subr. BalanceCheck.
Using grid cell-level variables calculated in subr. lnd2atm did not work
because subr. lnd2atm is called after subr. BalanceCheck.
@billsacks
Copy link
Member

@slevisconsulting please let me know when you feel this is ready for review.

@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Dec 8, 2020

@slevisconsulting please let me know when you feel this is ready for review.

@billsacks I would first like to discuss the fact that call lnd2atm occurs long enough after call BalanceCheck to be giving different grid cell-level (grc) values than the locally calculated ones in subr. BalanceCheck. In clm_driver you wrote a relevant comment:

! COMPILER_BUG(wjs, 2016-02-24, pgi 15.10) In principle, we should be able to make
[...]

Bill clarified that this comment does not pertain to call lnd2atm.

@billsacks
Copy link
Member

@slevisconsulting I don't understand the connection between that compiler bug comment and the difference in gridcell-level values between balance check and lnd2atm. Also, I looked at what is called between those two points, and the main thing that I could imagine changing some water terms is the call to ch4. Is there really some changes to water fluxes or states happening in the ch4 routine? Maybe we need to talk about this further.

@slevis-lmwg
Copy link
Contributor Author

@billsacks somehow I confused myself during our call today. Now I am convinced again that calling lnd2atm after calling BalanceCheck means that the _grc variables used in BalanceCheck come from the previous time step. This is why the _locgrc variables contain different values.

@billsacks
Copy link
Member

@billsacks somehow I confused myself during our call today. Now I am convinced again that calling lnd2atm after calling BalanceCheck means that the _grc variables used in BalanceCheck come from the previous time step. This is why the _locgrc variables contain different values.

Sorry, I don't understand what you mean here. Let me know if you'd like to talk more tomorrow.

@slevis-lmwg
Copy link
Contributor Author

I hope this explains it better... I am considering splitting my existing call BalanceCheck into two calls:

  1. Rename the existing call to call ColumnBalanceCheck and not move it.
  2. Place a new call GridcellBalanceCheck after the call lnd2atm in order to make use of existing _grc variables.

This differs from our approach for C, N, and CH4 where we used a single subroutine to perform both the column and the grid cell checks.

@billsacks
Copy link
Member

@slevisconsulting thanks for explaining. That is more clear. Your plan sounds fine, but I wonder if another possibility would be to move the column-level balance checks also to after the call to lnd2atm: At least for water, I would think that should give the same result, unless ch4 is doing something to the water variables... and in that case, we may run into trouble with the gridcell-level balance checks, too. My gut feeling is to slightly prefer that (moving the column balance checks to later), but I don't feel strongly at this point if you feel that would be a big can of worms.

@slevis-lmwg
Copy link
Contributor Author

Since I haven't tried either, yet, I will start with your preferred option of moving the whole call BalanceCheck to after the call lnd2atm. If it works, we'll go with that.

@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Dec 11, 2020

As agreed, I moved the call BalanceCheck to after the call lnd2atm.

I ran my single point test twice. The first time I kept the locally calculated grid cell-level variables. The second time I switched to the non-local endwb_grc (only). Both times I got the same result, which seems promising. I want to show you the test result @billsacks before I replace the remaining non-local variables:

So the test
ERI_D_Ld9.1x1_camdenNJ.I2000Clm50BgcCruRs.cheyenne_intel.clm-default
mostly passed but returned:
FAIL BASELINE /glade/p/cgd/tss/ctsm_baselines/ctsm5.1.dev018: DIFF
and I found only one variable in one cprnc.out file that's different:

ERRH2O   (lndgrid,time)  t_index =      1     1
1        1  (     1,     1) (     1,     1) (     1,     1) (     1,     1)
          1  -7.918655101265861E-16  -7.918655101265861E-16 7.7E-19 -7.918655101265861E-16 9.7E-04 -7.918655101265861E-16
          1  -7.926310619598004E-16  -7.926310619598004E-16         -7.926310619598004E-16         -7.926310619598004E-16
          1  (     1,     1) (     1,     1)
avg abs field values:    7.918655101265861E-16    rms diff: 7.7E-19   avg rel diff(npos):  9.7E-04
                                   7.926310619598004E-16                        avg decimal digits(ndif):  3.0 worst:  3.0
 RMS ERRH2O   7.6555E-19            NORMALIZED  9.6630E-04

Show stopper or expected in your opinion @billsacks ?

@billsacks
Copy link
Member

@slevisconsulting thanks a lot for the update and for your continued work on this. This is somewhere in between show stopper and expected. It definitely is NOT a show stopper: ERRH2O is the small difference between large terms, so it is normal for the normalized RMS difference to be large for that variable even when there are only very small changes in the underlying terms. However, I think our expectation was that moving the end balance check call shouldn't change things at all, because we didn't expect any changes to water fluxes or states in between the old placement and the new placement; do you agree? If so, I think this could warrant a bit more investigation so that we understand why this is changing, to make sure that it is safe to move this call (e.g., looking at which component(s) of errh2o are changing, and seeing where in the code they change between the old and new placement). Does that make sense to you?

@slevis-lmwg
Copy link
Contributor Author

Yes, this makes sense, and I will try to track it down.

@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Dec 16, 2020

I have solved it. Subroutine lnd2atm changes qrgwl like this:
qflx_qrgwl_col(c) = qflx_qrgwl_col(c) + qflx_liq_from_ice_col(c)
Commenting out the above line eliminates the ERRH2O diff that I posted on Dec 10 and introduces a QRGWL diff:

QRGWL   (lndgrid,time)  t_index =      1     1
1        1  (     1,     1) (     1,     1) (     1,     1) (     1,     1)
          1   0.000000000000000E+00   0.000000000000000E+00 1.3E-21  0.000000000000000E+00 1.0E+00  0.000000000000000E+00
          1   1.285966334008362E-21   1.285966334008362E-21          1.285966334008362E-21          1.285966334008362E-21
          1  (     1,     1) (     1,     1)
avg abs field values:    0.000000000000000E+00    rms diff: 1.3E-21   avg rel diff(npos):  1.0E+00
                                   1.285966334008362E-21                        avg decimal digits(ndif):  0.0 worst:  0.0
 RMS QRGWL  1.2860E-21            NORMALIZED  2.0000E+00

The alternative of not commenting out the above line and removing qflx_liq_from_ice_col from the accounting, results in a change of order of operations and a diff in ERRH2O that is smaller than the one I posted on Dec 10.

If we want no diffs from the baseline, we need to split subr. BalanceCheck into the column part and the grid cell part. I am open to your recommendations @billsacks

@billsacks
Copy link
Member

@slevisconsulting I feel it's totally fine to have answer changes in ERRH2O for this PR, as long as we understand them and are fairly confident that the column-level balance check is still working correctly.

If I understand your comment correctly, one of your options involves:

  • Moving the BalanceCheck call (including the column-level part of this) to after lnd2atm
  • Making no changes in lnd2atm
  • In the column-level balance check, adding(?) qflx_liq_from_ice_col in the parenthesized accumulation of fluxes in the errh2o calculation, to avoid double-counting of that term

If so, I'd support that solution.

In looking into this, though, I saw another possible solution that may be a bit more intuitive for the sake of long-term maintenance and would keep the column-level balance check more similar to the gridcell-level, at the expense of adding an extra column-level variable to waterlnd2atm_type: I noticed that the column-level errh2o calculation currently includes:

                  - qflx_ice_runoff_snwcp(c) &
                  - qflx_ice_runoff_xs(c)    &

In lnd2atm, it looks like the sum of those two terms gets put into qflx_ice_runoff_col; then either (a) qflx_ice_runoff_col is kept as is and qflx_liq_from_ice_col is kept at 0, or (b) qflx_liq_from_ice_col is set to qflx_ice_runoff_col and qflx_ice_runoff_col is set to 0. So I think what this means (but it would be good to double-check it) is that you could:

  • Add qflx_ice_runoff_col to waterlnd2atm_type (alongside the existing gridcell-level version of that flux, named qflx_rofice_grc) rather than having it be a local variable
  • Replace the two terms in the code snippet above with the single qflx_ice_runoff_col
  • (Then there's no need to include qflx_liq_from_ice_col in the errh2o calculation)

I'd slightly prefer this alternative. But since you've been looking at this for a while, I'm fine going with whichever seems better to you.

@slevis-lmwg
Copy link
Contributor Author

I did it the way that you suggested and got bfb the same ERRH2O diff in the cprnc.out file as with my earlier test. Next I will replace the rest of the locally calculated _grc terms with their non-local counterparts.

These tests PASS:
ERP_D.f10_f10_musgs.IHistClm50Bgc.cheyenne_gnu.clm-decStart
ERS_Ly3_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropQianRs.cheyenne_gnu.clm-cropMonthOutput
@slevis-lmwg
Copy link
Contributor Author

@billsacks
Ready for review. In the meantime I will draft the ChangeLog.

Comment on lines 480 to 481
! qflx_liq_dynbal_grc => waterflux_inst%qflx_liq_dynbal_grc , & ! Input: [real(r8) (:) ] slevis: place holder
! qflx_ice_dynbal_grc => waterflux_inst%qflx_ice_dynbal_grc , & ! Input: [real(r8) (:) ] slevis: place holder
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oups, I will remove these two in the next commit.

Fix ndep from coupler

There was a bug in ndep forcing (forc_ndep_grc) when receiving ndep from
CAM (ndep_from_cpl true). This first appeared in ctsm5.1.dev002. NDEP
forcings have been garbage since then when receiving ndep from CAM.

This tag fixes that issue.
doc/ChangeLog Outdated

[X] ctsm5_0-nwp

[X] clm4_5
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am realizing that I should have considered the rms diff (not the normalized diff), and I should remove the Xs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

@slevis-lmwg
Copy link
Contributor Author

Test suite on izumi:
OK (due to expected diffs from dev018)

Copy link
Member

@billsacks billsacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good. Thanks so much for your work on it, and I'm really sorry it took me so long to review it! I have some relatively minor comments below. In addition, I have two other comments:

c2l_scale_type= 'urbanf', l2g_scale_type='unity' )
do g = bounds%begg, bounds%endg
water_inst%waterdiagnosticbulk_inst%tws_grc(g) = water_inst%waterdiagnosticbulk_inst%tws_grc(g) + water_inst%wateratm2lndbulk_inst%volr_grc(g) / grc%area(g) * 1.e-3_r8
water_inst%waterdiagnosticbulk_inst%tws_grc(g) = water_inst%waterbalancebulk_inst%endwb_grc(g) + water_inst%wateratm2lndbulk_inst%volr_grc(g) / grc%area(g) * 1.e-3_r8
Copy link
Member

@billsacks billsacks Jan 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • [optional] I noticed that this line and some of the others in this file are really long. I don't understand why nag didn't complain, but would you be willing to break these long lines so they are less than 132 characters?

@@ -124,6 +133,9 @@ subroutine InitAllocate(this, bounds, tracer_vars)
call AllocateVar1d(var = this%errh2o_col, name = 'errh2o_col', &
container = tracer_vars, &
bounds = bounds, subgrid_level = BOUNDS_SUBGRID_COLUMN)
call AllocateVar1d(var = this%errh2o_grc, name = 'errh2o_grc', &
Copy link
Member

@billsacks billsacks Jan 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • [optional] If we don't output this to the history file, then we could make it subroutine-local rather than adding it to the class.

Comment on lines 258 to 281
! wa(c) gets added to liquid_mass in ComputeLiqIceMassNonLake
if(use_aquifer_layer) then
do fc = 1, num_nolakec
c = filter_nolakec(fc)
if (col%hydrologically_active(c)) then
if(zwt(c) <= zi(c,nlevsoi)) then
wa(c) = aquifer_water_baseline
end if
end if
end do
endif

! NOTES subroutines Compute*Mass* are in TotalWaterAndHeatMod.F90
! endwb is calculated in HydrologyDrainageMod & LakeHydrologyMod
call ComputeWaterMassNonLake(bounds, num_nolakec, filter_nolakec, &
waterstate_inst, waterdiagnostic_inst, &
subtract_dynbal_baselines = .false., &
water_mass = begwb_col(begc:endc))

call ComputeWaterMassLake(bounds, num_lakec, filter_lakec, &
waterstate_inst, lakestate_inst, &
add_lake_water_and_subtract_dynbal_baselines = .false., &
water_mass = begwb_col(begc:endc))

Copy link
Member

@billsacks billsacks Jan 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It looks like this block duplicates code in BeginWaterColumnBalanceSingle. There's enough subtle complexity here that I'd be more comfortable if this were moved into a shared subroutine, like ComputeBegwbCol. (It looks like you'll need to pass in a begwb_col variable, since in the column version you are setting waterbalance_inst%begwb_col whereas in the gridcell version you are setting a local variable.)

Copy link
Member

@billsacks billsacks Jan 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, as I think about this more: the main subtlety relates to the setting of wa(c) = aquifer_water_baseline under certain conditions. I forget exactly what this is doing – I remember this was a real subtlety that tripped me up a few years ago, and I know I have some notes about this that I could dig up if needed – but I'm thinking that this isn't directly related to balance checks, so should maybe be extracted to its own routine (like reset_wa) that is called early in the driver sequencing (exactly when may depend on when wa and zwt are set... my guess is that you'd want to do this before initializing the gridcell balance checks, but you may want to confirm that wa and zwt aren't changed between there and the initialization of the column balance checks). If you do that, then I think the remaining duplication (the two calls to ComputeWaterMass*) can stay as is - i.e., you don't need to do my above suggestion of introducing a shared routine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented out the if(use_aquifer_layer) code here (see cb6a840)
...and this test passed:
ERP_D.f10_f10_musgs.IHistClm50Bgc.cheyenne_gnu.clm-decStart
If the izumi test-suite also passes, I will move the if(use_aquifer_layer) code to its own routine as you suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Izumi test-suite OK (expected diffs from baseline, which for now is dev019 in my case).

Copy link
Member

@billsacks billsacks Jan 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just spent too much time trying to get my head back into this issue. I think this is okay the way you've done it now - just doing this reset in BeginWaterGridcellBalanceSingle. The one question I have is this: When we initialize a new column (due to dynamic landunits/columns), we set wa in the new column to wa in some template column on that grid cell. This initialization happens after the gridcell balance check initialization. With the old logic (where wa was reset to aquifer_water_baseline in the column-level balance check initialization), this resetting of wa happened after the dynamic landunit column initialization.

So my question is this... I think mainly for @swensosc -

  • Is it a problem when use_aquifer_layer=true, if for one time step (the first timestep after initialization of a new column in a transient case) we have not reset wa(c) = aquifer_water_baseline? i.e., is it critical that wa(c) is reset to aquifer_water_baseline at the beginning of every time step when use_aquifer_layer = true, or is it okay for this not to be the case for one time step?

I think that, if the answer is that we need to do this resetting every time step, then the right place to do it is in dynInitColumnsMod:copy_state (I think that doing this resetting at the start of the column balance check, where it was done before, could potentially break the gridcell-level balance check).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slevisconsulting I would give a different explanation than the one you gave. First off, while the dynamic landuse adjustment did play in to our discussion of this (and influenced our choice of how to implement this), the non-conservation term in the latest implementation really is unrelated to dynamic landuse, at least the way that I'm thinking about it: the adjustments in wa due to dynamic landuse are already accounted for by the dynbal dribblers. Instead, I would explain this as follows:

We want to adjust wa back to the baseline under certain conditions. The right way to do this might be to use explicit fluxes from some other state, but in this case we don't have a source to pull from, so we just adjust wa without any explicit fluxes. Because we do this adjustment before initializing the column-level balance check, the column-level check is completely unaware of the adjustment. However, since this adjustment happens after initializing the gridcell-level balance check, we do have to account for this adjustment in the gridcell-level balance check. The normal way to account for an adjustment like this would be to include the flux in the balance check. But here we don't have an explicit flux, so instead track the non-conservation state. (In principle, we could calculate an explicit flux and use that, but we don't gain anything from using an explicit flux in this case.)

(This doesn't explain why we chose to use this implementation. That explanation would tie-in with subtleties of dynamic land use. I can take a stab at explaining that, too, if you'd like.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included your explanation in the latest commit. Thanks, @billsacks

Explaining the subtleties of dynamic landuse that come into play seems like a good idea for when we look at this again later. I for one will not remember these subtleties even a couple of months from now :-)

In the explanation, I am interested in understanding why it makes sense to place the code below at the initialization rather than the finalizing of the grc-level balance check:

if (use_aquifer_layer) then
       call c2g( bounds, &
            wa_reset_nonconservation_gain_col(begc:endc), &
            wa_reset_nonconservation_gain_grc(begg:endg), &
            c2l_scale_type='urbanf', l2g_scale_type='unity' )
else
       wa_reset_nonconservation_gain_grc(begg:endg) = 0._r8
end if

[...]

    do g = begg, endg
       begwb_grc(g) = begwb_grc(g) - qflx_liq_dynbal_left_to_dribble(g)  &
                                   - qflx_ice_dynbal_left_to_dribble(g)  &
                                   - wa_reset_nonconservation_gain_grc(g)
    end do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be abs. clear, if this were moved to the finalizing of the grc-level bal. check, the sign in front of wa_reset_... would change to +.

So again, does this code need to be run at initialization instead of when finalizing? If either is ok, I'm wondering whether it is more intuitive at the finalizing because the wa_reset_... term is from the same time step.

Copy link
Member

@billsacks billsacks Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Ah, yes, sorry: I missed that when reviewing your latest changes. Indeed, I was imagining that wa_reset_nonconservation_gain_grc would adjust endwb_grc, not begwb_grc. I think you want to subtract it from endwb_grc. (Imagine that wa was 4000 at the start of the time step, and the baseline is 5000. Then wa would have been increased by 1000, and wa_reset_nonconservation_gain will also be 1000. To keep the balance check working, I think we need to subtract this 1000 from endwb... if you feel otherwise, please say so, because I'm not 100% sure about this.)

If you want to say something about the relationship to dynamic landuse, I would probably put a comment like this in BeginWaterColumnBalanceSingle: This resetting of wa needs to be done after dynamic landuse so that it is done after initializing any new columns; therefore, we cannot do this resetting before initializing the gridcell-level balance check (since that is done before dynamic landuse). We do it here before initializing the column-level balance check so that we don't need to account for this non-conservation in the column-level balance check, although we still need to account for this non-conservation in the gridcell-level balance check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense, and I apologize for misunderstanding the first time around. Now I am less inclined to include this most recent explanation, so I will leave it out.

Comment on lines 288 to 291
do g = begg, endg
begwb_grc(g) = begwb_grc(g) - qflx_liq_dynbal_left_to_dribble(g) &
- qflx_ice_dynbal_left_to_dribble(g)
end do
Copy link
Member

@billsacks billsacks Jan 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The sign here (- instead of +) looks wrong. It very well may actually be right (I remember there being some not-entirely-intuitive things about the sign convention of these dynbal fluxes), but assuming it's right, it would be worth adding a comment here asserting this, and on the similar lines for endwb_grc. I can help you with the wording if you aren't sure yourself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggled with this during testing. I had a hard time convincing myself that the + sign was wrong. But as soon as I decided to flip the sign, the previously failing conservation test passed. Two hints led me to flip the sign:

  • The size of the conservation error
  • The - signs here:
    do g = bounds%begg, bounds%endg
    water_inst%waterlnd2atmbulk_inst%qflx_rofliq_qgwl_grc(g) = water_inst%waterlnd2atmbulk_inst%qflx_rofliq_qgwl_grc(g) - water_inst%waterfluxbulk_inst%qflx_liq_dynbal_grc(g)
    water_inst%waterlnd2atmbulk_inst%qflx_rofliq_grc(g) = water_inst%waterlnd2atmbulk_inst%qflx_rofliq_grc(g) - water_inst%waterfluxbulk_inst%qflx_liq_dynbal_grc(g)
    enddo

    do g = bounds%begg, bounds%endg
    water_inst%waterlnd2atmbulk_inst%qflx_rofice_grc(g) = water_inst%waterlnd2atmbulk_inst%qflx_rofice_grc(g) - water_inst%waterfluxbulk_inst%qflx_ice_dynbal_grc(g)
    enddo

I would not mind a suggestion regarding the wording, since you offered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this (I don't love it, but it's something...):

These dynbal dribblers actually store the delta state, (end - beg). Thus, the amount dribbled out is the negative of the amount stored in the dribblers. Similarly, to achieve conservation, we need to subtract (rather than add) the amount remaining to dribble.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @billsacks
I slightly reworded the comment and added to both locations.

I will not investigate this further unless you want me to; however, I can't help but wonder why the C-balance code uses + signs in CNBalanceCheck. Without digging deeper, I suspect an inconsistency in the handling of the C vs. H2O dribble terms. Let me know if you'd like me to look into it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right that the two are inconsistent, and that's my fault. In retrospect, I think the convention I used for the water and energy dribblers is counter-intuitive. Probably what we should do is to store the negative of what we currently store in those dribblers. To be honest, I'm not sure I have the mental energy to make this change right now, but I'm certainly open to your doing it if you think the benefit would be significant. One of the things stopping me from doing it is that it will cause incorrect fluxes in the first year of any transient run that starts mid-year from an old restart file (with the current sign convention). That probably won't impact many people (and the impact should be small), so that is probably okay if you are inclined to tackle this. I'm fine either way (leaving it for now or your spending the time to change this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine not addressing it now. If you don't mind, I can include some of this info in the comment that I'm adding about the - sign.


end if

! Snow balance check at the grid cell level.
Copy link
Member

@billsacks billsacks Jan 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Is this comment right ("at the grid cell level")? It looks to me like this check is column-level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops...

Comment on lines 645 to 646
call waterflux_inst%qflx_liq_dynbal_dribbler%get_amount_left_to_dribble_end(bounds, qflx_liq_dynbal_left_to_dribble(bounds%begg:bounds%endg))
call waterflux_inst%qflx_ice_dynbal_dribbler%get_amount_left_to_dribble_end(bounds, qflx_ice_dynbal_left_to_dribble(bounds%begg:bounds%endg))
Copy link
Member

@billsacks billsacks Jan 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Please break these long lines

@slevis-lmwg
Copy link
Contributor Author

I think it's important to add the handling discussed here: #201 (comment) . It looks like for_testing_zero_dynbal_fluxes isn't triggered in any aux_clm tests, but it is used in some fully-active (B compset) tests, via the allactive-cism-test_coupling testmod. I think (but am not positive) that you could reproduce the behavior in an I compset by running a 5-day smoke test with a compset with active CISM (compset ending with "G"), by creating a testmods directory that has the shell_commands, user_nl_cism and user_nl_clm from https://github.com/ESCOMP/CESM/tree/master/cime_config/testmods_dirs/allactive/cism/test_coupling . I don't think we need to add a test like this to the test suite, but it would be good to run this as a one-time thing to verify that we can get it to pass. I expect it will fail currently from this branch. My suggestion in the above comment was to add a new variable to track the lack of conservation, but I'm now thinking that it would be okay to just bypass the gridcell-level balance check if this flag is set. I'm okay with either approach, if you have a preference.

Sorry for having missed this comment in #201 and thank you for bringing it to my attention.

First I have confirmed that the following test passes without your recommended testmod:
SMS_Ld5.f10_f10_musgs.I1850Clm50SpG.cheyenne_intel.clm-glcMEC

Next I included your recommended testmod and the test fails with a grid cell level h2o balance error, as you expected.

Finally I added the bypass that you recommended and the test passes.

@billsacks I am revisiting this conversation bc I see failures in these glcMEC tests:

ERP_P180x2_D_Ld5.f19_g17_gl4.I1850Clm50BgcCropG.cheyenne_intel.clm-glcMEC_increase RUN
ERP_P36x2_D_Ld10.f10_f10_musgs.IHistClm50SpG.cheyenne_intel.clm-glcMEC_decrease RUN
ERS_D_Ld12.f10_f10_musgs.I1850Clm50BgcCropG.cheyenne_intel.clm-glcMEC_spunup_inc_dec_bgc RUN
SMS_Lm37.f10_f10_musgs.I1850Clm50SpG.cheyenne_intel.clm-glcMEC_long RUN

Our conversation left me with the impression that I would NOT need to add get_for_testing_zero_dynbal_fluxes = .true. to the user_nl_clm files of these tests. Did I misunderstand?

@slevis-lmwg
Copy link
Contributor Author

I have two other tests failing:
ERP_D_Ld3.f09_g17.I2000Clm50Sp.cheyenne_intel.clm-prescribed
ERS_Ly3_P72x2.f10_f10_musgs.IHistClm50BgcCropG.cheyenne_intel.clm-cropMonthOutput

In the first one of these, might you have expected the "prescribed" testmods to fail at the grid cell level?

The second seems like a legitimate error that I will have to investigate and it shows an error in the thousands of mm. This may be the answer to my previous post. The issue affecting the glcMEC cases may be the same as this one.

@billsacks
Copy link
Member

Good point about the prescribed soil moisture streams. Indeed, I see that prescribed soil moisture is applied after the gridcell-level balance check is initialized but before the column-level balance check is initialized (in PrescribedSoilMoistureInterp). I guess the two options are (1) bypass gridcell-level balance checks when use_soil_moisture_streams is true, or (2) add an extra tracking term that tracks the change in h2osoi_liq and h2osoi_ice due to prescribed soil moisture. From looking at PrescribedSoilMoistureInterp (relevant lines seem to be the last few lines of that routine) it looks like it would be pretty straightforward to do (2), but I'm not sure how much value that has. I'm pretty equal on the two so will defer to you if you have any preference.

Regarding the other failing tests, let me know if you want to talk about these. Your understanding is right that you should not need to add for_testing_zero_dynbal_fluxes.

@ekluzek
Copy link
Collaborator

ekluzek commented Feb 12, 2021

I agree with @billsacks that there's not a lot of value in the grid-cell level checks for prescribed soil moisture, so either of his solutions would be fine. And his analysis sounds right to me as well.

@billsacks
Copy link
Member

@slevisconsulting I looked at one of your failing cases (/glade/scratch/slevis/tests_0211-104010ch/SMS_Lm37.f10_f10_musgs.I1850Clm50SpG.cheyenne_intel.clm-glcMEC_long.GC.0211-104010ch_int/run). It looks like the error happens at the start of the year, which suggests to me that it's coming from dynamic landunits. (I was wondering if it's coming from the lnd -> glc fluxes or dynamic landunits; I think that if it were coming from the lnd -> glc fluxes, we'd likely see the error in other time steps.) Please let me know if you'd like me to look into this.

@slevis-lmwg
Copy link
Contributor Author

Thank you for your feedback @billsacks and @ekluzek

So the prescribed soil moisture is the easy one.

With the others, I now see that the common element in the compsets is the "G". I have opted to pursue the problem in the test ERS_D_Ld12.f10_f10_musgs.I1850Clm50BgcCropG.cheyenne_intel.clm-glcMEC_spunup_inc_dec_bgc because it crashes much sooner, in step 49. I will try to track it down and may request your help @billsacks if I find myself stuck.

@slevis-lmwg
Copy link
Contributor Author

@billsacks I will list a whole bunch of numbers here to show what I have found and to ask for guidance.

First the error message. The numbers suggest to me that the error comes from a "jump" in endwb_grc. My suspicion is supported when I look at the progression of begwb_grc --> endwb_grc --> begwb_grc for the last few tsteps. I have removed here any terms equal to 0 for brevity:

42: nstep                     =           49
42: errh2o_grc                =    8.86328450071071
42: forc_snow                 =   7.013212011131031E-006
42: endwb_grc                 =    47693.7457854829
42: begwb_grc                 =    47684.9363899826
42: qflx_evap_tot             =  -7.999220945582354E-004
42: qflx_surf                 =   2.381131620950172E-015
42: qflx_qrgwl                =   8.202903886681019E-002
42: qflx_ice_runoff           =  -2.798142588044916E-002
42: deltawb                   =    8.80939550027688
42: deltawb/dtime             =   4.894108611264932E-003
42: qflx_glcice_dyn_water_flux =  -6.483227540422337E-004
42: CTSM is stopping
42: calling getglobalwrite with decomp_index=          172  and clmlevel= gridcell
42: local  gridcell index =          172
42: global gridcell index =          259
42: gridcell longitude    =    330.000000000000
42: gridcell latitude     =    70.0000000000000
42: ERROR: ERROR in BalanceCheckMod.F90 at line 758

For g = 172 I show next the numbers that go through subroutine c2g. First for begwb_col --> begwb_grc and then for endwb_col --> endwb_grc. I only show the last three carr(c) terms that contribute to garr(g) because the "jump" occurs in the last two. Note that the last three terms belong to a landunit with 10 columns, so I'm thinking glacier landunit:

42: garr(g), carr(c), scale_c2l(c), scale_l2g(l), col%wtgcell(c), c, l
42:   22797.4114766509        55442.9889023135        1.00000000000000
42:   1.00000000000000       0.129199835661822             1977         563
42: garr(g), carr(c), scale_c2l(c), scale_l2g(l), col%wtgcell(c), c, l
42:   38532.5989420977        55442.9889023135        1.00000000000000
42:   1.00000000000000       0.283808426944137             1978         563
42: garr(g), carr(c), scale_c2l(c), scale_l2g(l), col%wtgcell(c), c, l
42:   47684.9363899826        55442.9889023135        1.00000000000000
42:   1.00000000000000       0.165076552131968             1979         563

The last garr value agrees with begwb_grc in the error msg above, and the last garr value in the next set agrees with endwb_grc in the error msg:

42: garr(g), carr(c), scale_c2l(c), scale_l2g(l), col%wtgcell(c), c, l
42:   22795.3828424543        55442.9889023135        1.00000000000000
42:   1.00000000000000       0.129199835661822             1977         563
42: garr(g), carr(c), scale_c2l(c), scale_l2g(l), col%wtgcell(c), c, l
42:   38537.0372663466        55442.9889023135        1.00000000000000
42:   1.00000000000000       0.283925068535321             1978         563
42: garr(g), carr(c), scale_c2l(c), scale_l2g(l), col%wtgcell(c), c, l
42:   47693.7457854829        55442.9889023135        1.00000000000000
42:   1.00000000000000       0.165155391158108             1979         563

Notice that carr(c) has not changed while wtgcell(c) has changed from the first set to the second set of last three values.

Now come the questions:
Should the carr(c) values have changed to compensate for the changed wtgcell(c)? Or should the adjustment have appeared in a flux somewhere?

@billsacks
Copy link
Member

@slevisconsulting Thanks a lot for digging into this. What you show agrees with what's expected for dynamic landunits - well, except for the balance error in the end: we do not adjust the column-level states when areas change, so we expect a jump in the total water state when areas change. This should be compensated by the "dynbal" adjustment terms: these adjustment terms are calculated by summing the gridcell water balance before and after the area adjustments (using the same routines as are used to compute begwb and endwb), and then the difference is put in the dynbal flux.

In standard cases, where the dynamic glaciers just happen once a year, most of this should end up in the dynbal dribbler; this should be true in the multi-year case, but not in this short test case you're running. However, in the short test case you're running here, the fluxes aren't generated on the year boundary, so the dribbling logic is different: any flux not generated on the year boundary should get fluxed out immediately. This should show up in this term:

       water_inst%waterlnd2atmbulk_inst%qflx_rofliq_qgwl_grc(g) = &
         water_inst%waterlnd2atmbulk_inst%qflx_rofliq_qgwl_grc(g) - &
         water_inst%waterfluxbulk_inst%qflx_liq_dynbal_grc(g)

via qflx_liq_dynbal_grc. It would be interesting to look at that particular term in your test case.

By the way: In the multi-year case, I think that the offset by the dynbal dribblers here:

          endwb_grc(g) = endwb_grc(g) - qflx_liq_dynbal_left_to_dribble(g)  &
                                      - qflx_ice_dynbal_left_to_dribble(g)  &
                                      - wa_reset_nonconservation_gain_grc(g)

should make the endwb close to the begwb, but apparently something isn't working right there, either.

I'd be happy to help look into this next week if you'd like.

@slevis-lmwg
Copy link
Contributor Author

I got the five G tests listed in this and its preceding post to PASS by introducing this change:

In subroutine dyn_water_content I changed subtract_dynbal_baselines from .true. to .false.
Issue #659 suggests that I should have gone the other way and instead changed this to .true. in subr. ComputeWaterMassNonLake.

The change causes additional variables to differ in comparison to the baseline (ERRH2O was expected):
EFLX_DYNBAL
FSH_TO_COUPLER
ICE_CONTENT1
LIQUID_CONTENT1
QFLX_ICE_DYNBAL
QFLX_LIQ_DYNBAL
QRUNOFF_ICE_TO_COUPLER
QRUNOFF_TO_COUPLER
TWS
VOLR
VOLRMCH
The ERS_D test (of the 5 that we're discussing here) also produced diffs in some CH4 variables, as might be expected from #658 . Issue #659 says that it's blocked by #658 .

Recommendations?

@billsacks
Copy link
Member

@slevisconsulting Ah, good work tracking this down. This makes total sense to me, and I'm kicking myself for not realizing this would be an issue.

Given #658 , I think what we need to do is to compute the begwb_grc and endwb_grc slightly differently from what is done for begwb_col and endwb_col: For the gridcell water balance, I think we need to include the dynbal baseline terms, so that the water balance check is consistent with what's used to calculate those dynbal terms.

For begwb_grc, that should be straightforward: Since you use a different subroutine for calculating begwb_grc vs. begwb_col, you can change BeginWaterGridcellBalanceSingle to set subtract_dynbal_baselines and add_lake_water_and_subtract_dynbal_baselines to true rather than false.

For endwb_grc, I think some slightly more extensive changes are needed: For the sake of calculating tws in lnd2atm, we want to go back to the old version, where we did the c2g call directly into tws_grc rather than into endwb_grc:

    call c2g( bounds, &
         water_inst%waterbalancebulk_inst%endwb_col(bounds%begc:bounds%endc), &
         water_inst%waterdiagnosticbulk_inst%tws_grc  (bounds%begg:bounds%endg), &
         c2l_scale_type= 'urbanf', l2g_scale_type='unity' )
    do g = bounds%begg, bounds%endg
       water_inst%waterdiagnosticbulk_inst%tws_grc(g) = water_inst%waterdiagnosticbulk_inst%tws_grc(g) + water_inst%wateratm2lndbulk_inst%volr_grc(g) / grc%area(g) * 1.e-3_r8
    enddo

This is important because tws needs to NOT take the dynbal baseline terms into account, due to #658 . But then we need to calculate endwb_grc similarly to begwb_grc - i.e., setting subtract_dynbal_baselines and add_lake_water_and_subtract_dynbal_baselines to true rather than false. Probably the best thing to do is to extract the core of BeginWaterGridcellBalanceSingle into a routine that has an extra argument giving the variable to set; then you can call it from BeginWaterGridcellBalanceSingle setting begwb_col and then again later setting endwb_col. This will have the side-benefit of consolidating some code that is currently duplicated between BeginWaterGridcellBalanceSingle and BalanceCheck.

Does that make sense? If not, we can chat.

By the way: I considered an alternative solution of making begwb_col and endwb_col also taking the dynbal baselines into account, and then doing a separate call for the sake of tws (rather than using endwb_col to set tws). I think that could mostly work, but it would also impact qflx_qrgwl (by roundoff, I think), leading to answer changes that would probably be best to avoid in this PR. But if you think that's a significantly better solution, I'm open to that as long as we can convince ourselves that the resulting answer changes are okay.

1) Addressed the former with some code reorg. At the crux of it though
was the need to pass subtract_dynbal_baselines = .true. and
add_lake_water_and_subtract_dynbal_baselines = .true. instead of .false.
when calling ComputeWaterMassNonLake and ComputeWaterMassLake at the
beginning and end of the time step.

2) Addressed the *clm-prescribed test failure by bypassing the
grc-level water balance check when use_soil_moisture_streams = .false.
@slevis-lmwg
Copy link
Contributor Author

The two test suites are running. I will let you know when OK.

@billsacks
Copy link
Member

billsacks commented Feb 19, 2021

@slevisconsulting thank you for these changes. I am mostly very happy with how your latest set of changes look! There is one change I would make. If you need to redo testing anyway, it would be great if you could fold this change in, if you agree with my suggestion. But if your testing all passes, we can fold this change into a follow-up tag (it's not worth redoing all the testing just for this change):

  • [no] I feel it would be better if the call to WaterGridcellBalance was moved out of BalanceCheck and instead was done from clm_driver, just before the call to BalanceCheck. I prefer this for two reasons:
    • It is more symmetrical: since the initial WaterGridcellBalance call happens from clm_driver, it is nice to have the final one happen there, too, so that you can more clearly see the order of when things happen in the driver
    • It is confusing to me to have both water_inst and individual components of water_inst passed in to the same routine; I've been trying to avoid this.

Let me know what you think. Again, if your testing passes, we can defer this to a follow-up tag.

@slevis-lmwg
Copy link
Contributor Author

I never hope for tests to fail, but I also dislike putting off small changes like this :-)
In any case, tests passed except for baseline tests due to the expected roundoff diffs in ERRH2O. HOWEVER this test
SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA.GC.0218-183310iz_nag
generated greater than roundoff diffs in a long list of variables in the h0, h1, and hi files, so something's up here. Any clues right off the bat?

(Summary file located here:
/fs/cgd/data0/slevis/git/gcell_bal_checks_cn/cime/tools/cprnc/cprnc.summary.0218-183310iz_nag.by_test)

@billsacks
Copy link
Member

I never hope for tests to fail, but I also dislike putting off small changes like this :-)

No problem putting it off: I have a backlog of some simple bfb changes, so I really need to make a simple bfb tag soon anyway, and this could go in there.

Regarding the one test with larger diffs: that seems really weird. Wasn't this passing before your latest changes? It seems weird that your latest changes would cause a failure in that test and nothing else. I'd probably try rerunning it before spending any time digging in, just in case there was some machine glitch. If you're still seeing diffs, I have no idea what could be going on.

@slevis-lmwg
Copy link
Contributor Author

I'm afraid the test still fails. I last ran the summary comparison on the izumi tests prior to updating to dev023 and, you're right that the test had passed at the time. Anyway, I'll investigate.

@billsacks
Copy link
Member

I'm afraid the test still fails. I last ran the summary comparison on the izumi tests prior to updating to dev023 and, you're right that the test had passed at the time. Anyway, I'll investigate.

Then I'd start by confirming that it passes if you run it from a fresh checkout of master.

@slevis-lmwg
Copy link
Contributor Author

Same error! This suggests that the archived dev023 test is wrong. How do you go about correcting such a situation?

@ekluzek
Copy link
Collaborator

ekluzek commented Feb 19, 2021

@slevisconsulting if the baseline is wrong it sounds like it's my fault since I made dev023. So I should redo it. Is it just one test thats bad? I'll look into why the baseline would be wrong. Possibly I reran it or something? I'm not quite sure...

@ekluzek
Copy link
Collaborator

ekluzek commented Feb 19, 2021

OK, there was some sort of glitch in my test case. But, I was able to rerun and it and now it passed as expected.

PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA CREATE_NEWCASE
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA XML
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA SETUP
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA SHAREDLIB_BUILD time=17
FAIL SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA NLCOMP
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA MODEL_BUILD time=21
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA SUBMIT
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA RUN time=83
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA GENERATE ctsm5.1.dev023
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA BASELINE ctsm5.1.dev022:
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA TPUTCOMP
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA MEMLEAK insuffiencient data for memleak test
PASS SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA SHORT_TERM_ARCHIVER

It looks like it archived the baseline as well. So I think you should be good to go now. Let me know if there's more trouble with it...

@slevis-lmwg
Copy link
Contributor Author

@ekluzek for some reason I still get the large diffs between dev023 master and baseline when I run
./create_test SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50Sp.izumi_nag.clm-ptsRLA -c ctsm5.1.dev023
in this directory
/fs/cgd/data0/slevis/git/latest_master/cime/scripts
Pls let me know if I'm doing something wrong...

@ekluzek
Copy link
Collaborator

ekluzek commented Feb 20, 2021

Hmmm. OK. Let me do some more work and make sure there isn't something different on the branch I tested and dev023. I'll try to replicate what you did as well. Sorry about this trouble. Let me get back to you...

@billsacks
Copy link
Member

I'd suggest at this point:

  • @slevisconsulting confirms that his branch gives bfb answers with master for this test (e.g., by running the test from master and comparing against the baselines he generated from his branch - presumably named dev024)
  • Then, once @slevisconsulting gives me the go-ahead, I can merge this to master
  • In the meantime, @ekluzek can confirm that there were no true problems here – but I see no reason why that investigation should hold up @slevisconsulting 's tag

@slevisconsulting I have moved my one outstanding request to #1286 . I'll take care of that since I have a clear vision of what I want and it will be quick to implement - and it's about time to make a simple bfb tag anyway.

@slevis-lmwg
Copy link
Contributor Author

I'd suggest at this point:

  • @slevisconsulting confirms that his branch gives bfb answers with master for this test (e.g., by running the test from master and comparing against the baselines he generated from his branch - presumably named dev024)
  • Then, once @slevisconsulting gives me the go-ahead, I can merge this to master
  • In the meantime, @ekluzek can confirm that there were no true problems here – but I see no reason why that investigation should hold up @slevisconsulting 's tag

@slevisconsulting I have moved my one outstanding request to #1286 . I'll take care of that since I have a clear vision of what I want and it will be quick to implement - and it's about time to make a simple bfb tag anyway.

I have confirmed (1).
@billsacks you have my go-ahead.

@billsacks billsacks merged commit f2faca6 into ESCOMP:master Feb 20, 2021
@slevis-lmwg slevis-lmwg deleted the gcell_bal_checks_h2o branch March 15, 2021 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done (non release/external)
Development

Successfully merging this pull request may close these issues.

3 participants