Failing ne0CONUSne30x8_ne0CONUSne30x8_mt12 in CESM testing #2544

Open
9 of 10 tasks
ekluzek opened this issue May 15, 2024 · 16 comments · Fixed by #2805 or #2901
Assignees
Labels
bug something is working incorrectly

Comments

@ekluzek
Collaborator

ekluzek commented May 15, 2024

Brief summary of bug

@fischer-ncar found the following test to be failing in cesm2_3_alpha17f testing:

PEND SMS_D_Ln9_P1280x1.ne0CONUSne30x8_ne0CONUSne30x8_mt12.FCnudged.derecho_intel.cam-outfrq9s SHAREDLIB_BUILD
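For readers less familiar with these long test names, here is a minimal sketch that splits one into its parts. It assumes the usual dot-separated layout (test type plus modifiers, grid, compset, machine_compiler, optional testmods joined by `--`); the `ParsedTest` type and `parse_test_name` function are hypothetical helpers for illustration, not part of CIME.

```python
from typing import List, NamedTuple


class ParsedTest(NamedTuple):
    test_type: str        # e.g. "SMS"
    modifiers: List[str]  # e.g. ["D", "Ln9", "P1280x1"]
    grid: str
    compset: str
    machine: str
    compiler: str
    testmods: List[str]   # empty when the name carries no testmods


def parse_test_name(name: str) -> ParsedTest:
    """Split a test name, assuming TYPE[_MODS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS]."""
    parts = name.split(".")
    if len(parts) not in (4, 5):
        raise ValueError(f"unexpected number of dot-separated fields in {name!r}")
    test_type, *modifiers = parts[0].split("_")
    # Assumption: the compiler is the text after the last underscore.
    machine, compiler = parts[3].rsplit("_", 1)
    testmods = parts[4].split("--") if len(parts) == 5 else []
    return ParsedTest(test_type, modifiers, parts[1], parts[2], machine, compiler, testmods)


if __name__ == "__main__":
    print(parse_test_name(
        "SMS_D_Ln9_P1280x1.ne0CONUSne30x8_ne0CONUSne30x8_mt12."
        "FCnudged.derecho_intel.cam-outfrq9s"
    ))
```

The status prefix (`PEND`) and the phase suffix (`SHAREDLIB_BUILD`) in the line above are not part of the test name and would need to be stripped before calling the helper.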

General bug information

CTSM version you are using: ctsm5.2.005

Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: CONUS VR grid for FCnudged compset

Details of bug

Important output or errors that show the problem

err=ERROR : CLM build-namelist::CLMBuildNamelist::setup_logic_initial_conditions() : use_init_interp is NOT synchronized with init_interp_attributes in the namelist_defaults file, this should be corrected there'

Definition of done:

  • Test SMS_Ln9.ne0CONUSne30x8_ne0CONUSne30x8_mt12.IHistClm50Sp.derecho_intel.clm-clm50cam6LndTuningMode_2013Start to show it fails
  • Change CONUS test in build-namelist test to do a transient use-case
  • Make sure we have f19 tests in build-namelist tester for transient
  • Make sure the namelist testing is doing transient tests with the correct start dates for the datasets we have for 1979-PD.
  • Run build-namelist tests to show those new tests fail
  • Fix use_init_interp for CLM50 so it works
  • Fix use_init_interp for f19
  • Show that build-namelist tests now pass
  • Test SMS_Ln9.ne0CONUSne30x8_ne0CONUSne30x8_mt12.IHistClm50Sp.derecho_intel.clm-clm50cam6LndTuningMode_2013Start to show it now works
  • @ekluzek test these to show that they work
    SMS_D_Ln9.f19_f19_mg17.FXHIST.derecho_intel.cam-outfrq9s_amie
    SMS_D_Ln9_P1280x1.ne0CONUSne30x8_ne0CONUSne30x8_mt12.FCnudged.derecho_intel.cam-outfrq9s
@ekluzek ekluzek added the bug something is working incorrectly label May 15, 2024
@ekluzek ekluzek added this to the cesm2_3_beta19 milestone May 15, 2024
@ekluzek
Collaborator Author

ekluzek commented May 15, 2024

We talked about this in CSEG and decided since this is only for a less used VR grid (and there is a namelist fix for it) that we would do this after the remove externals beta tag. @cacraigucar @adamrher

The namelist fix is to just add the following to user_nl_clm:

use_init_interp = .true.

I have NOT explicitly tested the above, but do believe it will work. If someone actually tries this -- let us know if it works.
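If anyone wants to script that workaround, a minimal sketch follows. It is as untested against a real case as the setting itself; `add_init_interp_workaround` and its `case_dir` argument are hypothetical names, and the only assumption about the file is that `!` starts a comment in `user_nl_clm`.

```python
from pathlib import Path


def add_init_interp_workaround(case_dir) -> bool:
    """Append 'use_init_interp = .true.' to user_nl_clm unless it is already set.

    Returns True if the file was changed, False if the setting was already there.
    """
    nl_file = Path(case_dir) / "user_nl_clm"
    existing = nl_file.read_text() if nl_file.exists() else ""
    for line in existing.splitlines():
        # Ignore anything after '!' (assumed namelist comment marker).
        setting = line.split("!")[0].replace(" ", "").lower()
        if setting.startswith("use_init_interp="):
            return False
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    nl_file.write_text(existing + separator + "use_init_interp = .true.\n")
    return True


if __name__ == "__main__":
    import sys
    changed = add_init_interp_workaround(sys.argv[1])
    print("updated user_nl_clm" if changed else "use_init_interp already set")
```

The helper leaves an existing `use_init_interp` line alone, whatever its value, so it never silently overrides a choice already made in the case.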

@adamrher
Contributor

We talked about this in CSEG and decided since this is only for a less used VR grid (and there is a namelist fix for it) that we would do this after the remove externals beta tag. @cacraigucar @adamrher

Fine by me. But for the record, this is probably the most used VR grid. But lately we have been more focused on the 1deg workhorse and so we haven't really been running the VR grids.

@ekluzek
Collaborator Author

ekluzek commented May 15, 2024

This is the same problem as #2520 and again is a result of the fragility issues identified in #2169.

FCnudged is a HIST compset currently with CLM50%SP that starts in 2013. The following test was run and PASSED in ctsm5.2.005 and looks to me like it's almost identical (other than using CLM60 rather than CLM50).

SMS_Ln9.ne0CONUSne30x8_ne0CONUSne30x8_mt12.IHistClm60Sp.derecho_intel.clm-clm60cam6LndTuningMode_2013Start

The build-namelist testing does NOT cover it because it does not run a transient case for this grid; it only covers the 1850 and 2000 control cases. So we should add a transient test in the namelist testing as well.

@slevis-lmwg
Contributor

As expected (see checklist at top of this issue):

  • ./create_test SMS_Ln9.ne0CONUSne30x8_ne0CONUSne30x8_mt12.IHistClm50Sp.derecho_intel.clm-clm50cam6LndTuningMode_2013Start fails with
    CLMBuildNamelist::setup_logic_initial_conditions() : use_init_interp is NOT synchronized with init_interp_attributes in the namelist_defaults file
  • ./build-namelist_test.pl fails in
916/3997 < FAIL> <Test Id: 916> <Desc: -res ne0np4CONUS.ne30x8 -bgc sp -use_case 20thC_transient -namelist '&a start_ymd=20130101/' -lnd_tuning_mode clm4_5_cam7.0 -phys clm4_5: lnd_in file exists>
949/3997 < FAIL> <Test Id: 949> <Desc: -res ne0np4CONUS.ne30x8 -bgc sp -use_case 20thC_transient -namelist '&a start_ymd=20130101/' -lnd_tuning_mode clm5_0_cam7.0 -phys clm5_0: lnd_in file exists>
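To pull the failing cases out of a long build-namelist_test.pl log, something like the sketch below could help. The line layout is inferred only from the two FAIL lines quoted above, so other status lines may not match; `parse_result_line` and `tuning_mode` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional


class NamelistTestResult(NamedTuple):
    index: int
    total: int
    status: str
    test_id: int
    description: str


# Layout assumed from the quoted lines:
#   916/3997 < FAIL> <Test Id: 916> <Desc: ...>
_LINE = re.compile(
    r"^(\d+)/(\d+)\s+<\s*(\w+)\s*>\s+<Test Id:\s*(\d+)>\s+<Desc:\s*(.*)>\s*$"
)


def parse_result_line(line: str) -> Optional[NamelistTestResult]:
    """Return the parsed fields, or None if the line does not match the layout."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    index, total, status, test_id, description = match.groups()
    return NamelistTestResult(int(index), int(total), status, int(test_id), description)


def tuning_mode(description: str) -> Optional[str]:
    """Extract the value given to -lnd_tuning_mode, if the option is present."""
    match = re.search(r"-lnd_tuning_mode\s+(\S+)", description)
    return match.group(1) if match else None


if __name__ == "__main__":
    import sys
    for raw in sys.stdin:
        result = parse_result_line(raw)
        if result is not None and result.status == "FAIL":
            print(result.test_id, tuning_mode(result.description))
```

Piping the tester output through the script would list each failing test id next to its land tuning mode, which makes it easier to see that both failures here are on the same resolution and start date.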

@slevis-lmwg
Contributor

Further updates appear in the PR.

@brian-eaton

I have just updated cam6_4_038 to use the same externals as cesm3_0_alpha03d (uses ctsm5.3.002). In addition to the test mentioned in this issue, I now get the same failure from test SMS_D_Ln9.f19_f19_mg17.FXHIST.derecho_intel.cam-outfrq9s_amie

@ekluzek
Collaborator Author

ekluzek commented Oct 7, 2024

@brian-eaton thanks for the update on the f19 test. Another part of this is that we are bringing in a 16-pft f19 finidat file for the PPE work, which will help with that particular problem.

@slevis-lmwg slevis-lmwg moved this from Stalled to In Progress in LMWG: Near Term Priorities Oct 17, 2024
@slevis-lmwg slevis-lmwg moved this from In Progress to Done in LMWG: Near Term Priorities Oct 22, 2024
@ekluzek ekluzek removed their assignment Nov 5, 2024
@ekluzek
Collaborator Author

ekluzek commented Nov 11, 2024

@slevis-lmwg I tried the F compset as well as the basic I compset and they both failed for me on the cesm3_0_beta04_changes branch for the CONUS grid. Can you look at the branch again and see what happened? There are many things that could have happened here.

Thanks.

@slevis-lmwg slevis-lmwg moved this from Done to Todo in LMWG: Near Term Priorities Nov 11, 2024
@slevis-lmwg
Contributor

From today's software meeting:
Troubleshooting this failure may end up on Erik's plate instead of SamL's, so I will add Erik back to this issue.

@slevis-lmwg
Contributor

slevis-lmwg commented Nov 23, 2024

I started bisecting through the cesm3_0_beta04_changes branch and submitted this test:
./create_test SMS_Ln9.ne0CONUSne30x8_ne0CONUSne30x8_mt12.IHistClm50Sp.derecho_intel.clm-clm50cam6LndTuningMode_2013Start
I got these three results that likely reproduce the failure that @ekluzek reported, BUT SEE important note after these three attempts:

  1. merge2716_bisect1 (git describe: ctsm5.3.009-63-g4d8f30827)
ERROR: Command /glade/work/slevis/git/mksurfdata_toolchain/bld/build-namelist failed rc=255
        err=ERROR : CLM build-namelist::CLMBuildNamelist::setup_logic_initial_conditions() : use_init_interp is NOT synchronized with init_interp_attributes in the namelist_defaults file, this should be corrected there
  2. merge2805_bisect2 (git describe: ctsm5.3.009-28-g932337a07) Same error as (1)
  3. premerge2805_bisect3 (git describe: ctsm5.3.009-11-ge90189d51)
ERROR: Command /glade/work/slevis/git/mksurfdata_toolchain/bld/build-namelist failed rc=255
        err=XML::Lite:The XML-like element starting at position 42513072 is incomplete. (Did you forget to escape a '<'?)

Looking in testlist_clm.xml, I see that in #2805 we converged on two CONUS tests that differ from the above:

SMS_Ln9.ne0CONUSne30x8_ne0CONUSne30x8_mt12.IHistClm50Sp.derecho_intel.clm-clm50cam7LndTuningMode_2013Start--clm-nofireemis
SMS_Ln9.ne0CONUSne30x8_ne0CONUSne30x8_mt12.IHistClm60Sp.derecho_intel.clm-clm60cam7LndTuningMode_2013Start--clm-nofireemis

Testing these tests instead:

  1. PEND merge2716_bisect1 (git describe: ctsm5.3.009-63-g4d8f30827): The two tests in testlist_clm.xml got past the error. They remained pending in the RUN phase for some reason, but since (2) worked, I will not investigate.
  2. PASS Last commit in cesm3_0_beta04_changes (git describe: ctsm5.3.012-115-ga9b02b749).
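The `git describe` strings quoted in the lists above follow the standard `<tag>-<commits since tag>-g<abbreviated hash>` form, so they can be split mechanically when keeping track of bisect points. A minimal sketch, with `parse_git_describe` as a hypothetical helper name:

```python
import re
from typing import NamedTuple


class Described(NamedTuple):
    tag: str
    commits_since_tag: int
    short_hash: str


# <tag>-<N>-g<hash>; the tag itself may contain dots or dashes.
_DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<hash>[0-9a-f]+)$")


def parse_git_describe(text: str) -> Described:
    """Split a 'git describe' string into tag, commit count, and short hash."""
    match = _DESCRIBE.match(text.strip())
    if match is None:
        raise ValueError(f"not in <tag>-<N>-g<hash> form: {text!r}")
    return Described(match["tag"], int(match["count"]), match["hash"])


if __name__ == "__main__":
    points = [
        "ctsm5.3.009-63-g4d8f30827",
        "ctsm5.3.009-28-g932337a07",
        "ctsm5.3.009-11-ge90189d51",
    ]
    for point in sorted(map(parse_git_describe, points),
                        key=lambda d: d.commits_since_tag):
        print(point)
```

The commit count is only a rough ordering hint between points that share a tag; on a branch with merges it does not replace `git bisect` or an ancestry check.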

@slevis-lmwg
Contributor

In today's stand-up with @ekluzek et al., I summarized that the original ne0CONUS test, which I fixed in this issue and which now fails again, used testmod clm50cam6LndTuningMode_2013Start, while the tests that we ended up adding to testlist_clm, which still pass, use testmod clm50cam7LndTuningMode_2013Start.

Erik now wonders whether we need both to work.

@ekluzek
Collaborator Author

ekluzek commented Nov 26, 2024

From looking at CAM testing we do need this to work with CAM6/CLM5 at least right now.

@ekluzek
Collaborator Author

ekluzek commented Nov 26, 2024

We should also add some tests for this to aux_clm as we want to make sure we have standalone tests for things that CAM is testing. I think CAM will also continue to test this configuration so we should make sure it is in place in our testing, so we don't break it for CAM.

@slevis-lmwg adding here:
As before, same comments apply to all the ne0 grids.

@slevis-lmwg
Contributor

slevis-lmwg commented Nov 26, 2024

From a Sprint perspective, I will resolve all the ne0 grids here, and I moved #2548 back to "done" for more clarity of what I'm doing.

@ekluzek ekluzek modified the milestones: cesm3_0_beta05, cesm3_0_beta06 Dec 4, 2024
@slevis-lmwg
Contributor

The next four tests PASS with code mods that I will push soon:

SMS_Ln9.ne0POLARCAPne30x4_ne0POLARCAPne30x4_mt12.IHistClm50Sp.derecho_intel.clm-clm50cam6LndTuningMode_1979Start--clm-nofireemis
SMS_Ln9.ne0CONUSne30x8_ne0CONUSne30x8_mt12.IHistClm50Sp.derecho_intel.clm-clm50cam6LndTuningMode_2013Start--clm-nofireemis
SMS_Ln9.ne0ARCTICne30x4_ne0ARCTICne30x4_mt12.IHistClm50Sp.derecho_intel.clm-clm50cam6LndTuningMode_1979Start--clm-nofireemis
SMS_Ln9.ne0ARCTICGRISne30x8_ne0ARCTICGRISne30x8_mt12.IHistClm50Sp.derecho_intel.clm-clm50cam6LndTuningMode_1979Start--clm-nofireemis
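The four names above share one pattern and differ only in the grid and the start year of the testmod. A small sketch that regenerates them, in case someone wants to script the submissions; the mapping is copied from the list above, and `vr_test_name` is a hypothetical helper, not an existing tool.

```python
# Grid name -> start year of the LndTuningMode testmod, as listed above.
VR_GRID_START_YEAR = {
    "ne0POLARCAPne30x4": 1979,
    "ne0CONUSne30x8": 2013,
    "ne0ARCTICne30x4": 1979,
    "ne0ARCTICGRISne30x8": 1979,
}


def vr_test_name(grid: str, start_year: int) -> str:
    """Build the CLM50/CAM6 test name used above for one variable-resolution grid."""
    return (
        f"SMS_Ln9.{grid}_{grid}_mt12.IHistClm50Sp.derecho_intel."
        f"clm-clm50cam6LndTuningMode_{start_year}Start--clm-nofireemis"
    )


if __name__ == "__main__":
    for grid, year in VR_GRID_START_YEAR.items():
        print(vr_test_name(grid, year))
```

Each printed name can be passed to `./create_test` as in the earlier comments.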

@slevis-lmwg
Contributor

I'm leaving this issue open for any testing that @ekluzek still needs to do.

Projects
Status: Done
4 participants