
Test fails with Meier roughness on and with threading and carbon isotopes and transient case #2238

Closed
ekluzek opened this issue Nov 6, 2023 · 4 comments
Labels
bug something is working incorrectly


@ekluzek
Collaborator

ekluzek commented Nov 6, 2023

Brief summary of bug

ERP_D_Ld10_P108x2.f10_f10_mg37.IHistClm51BgcCrop.cheyenne_intel.clm-ciso_decStart

is failing when Meier roughness is turned on in the CESM3_dev branch. It has trouble during initialization and then hangs after 2 model days.

General bug information

CTSM version you are using: branch_tags/CESM3_dev.n02_ctsm5.1.dev145-6-ga73ddf27d

Does this bug cause significantly incorrect results in the model's science? Unknown

Configurations affected:

It seems to affect only transient cases with carbon isotopes and threading turned on.

Details of bug

Carbon Isotope Tests that PASS:

ERP_D_Ld5.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_gnu.clm-ciso_flexCN_FUN
ERP_P36x2_D_Ld5.f10_f10_mg37.I1850Clm50Bgc.cheyenne_intel.clm-ciso
ERS_D_Ld5.f10_f10_mg37.IHistClm50BgcQian.cheyenne_intel.clm-ciso_bombspike1963DecStart
ERS_Ld3.f10_f10_mg37.I2000Clm51Bgc.cheyenne_intel.clm-ciso_cwd_hr
LGRAIN2_Ly1_P72x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput
LGRAIN2_Ly2_P72x1.f10_f10_mg37.I1850Clm45BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput
LREPRSTRUCT_Ly1_P72x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput
LREPRSTRUCT_Ly2_P72x1.f10_f10_mg37.I1850Clm45BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput
SMS_Ld5.f10_f10_mg37.ISSP245Clm50BgcCrop.cheyenne_gnu.clm-ciso_dec2050Start
SMS_Ld5.f10_f10_mg37.ISSP370Clm50BgcCrop.cheyenne_gnu.clm-ciso_dec2050Start
SMS_Ld5.f10_f10_mg37.ISSP585Clm50BgcCrop.cheyenne_intel.clm-ciso_dec2050Start
SSP_D_Ld4.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-ciso_rtmColdSSP

Note that the one threaded test above that passes is not transient (ERP_P36x2_D_Ld5.f10_f10_mg37.I1850Clm50Bgc.cheyenne_intel.clm-ciso).
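The pattern in the passing list can be checked mechanically. This is a minimal Python sketch, assuming CIME's test-name layout `TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS]`, where a `P<tasks>x<threads>` modifier sets the MPI task and thread counts. The helpers `parse_test` and `threaded_transient` are hypothetical. Treating `IHist`/`ISSP` compsets as transient is an assumption, as is treating a test with no explicit `x<threads>` as single-threaded; the real default depends on the machine's PE layout.

```python
import re

# Passing carbon-isotope tests, copied from the list above.
PASSING = [
    "ERP_D_Ld5.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_gnu.clm-ciso_flexCN_FUN",
    "ERP_P36x2_D_Ld5.f10_f10_mg37.I1850Clm50Bgc.cheyenne_intel.clm-ciso",
    "ERS_D_Ld5.f10_f10_mg37.IHistClm50BgcQian.cheyenne_intel.clm-ciso_bombspike1963DecStart",
    "ERS_Ld3.f10_f10_mg37.I2000Clm51Bgc.cheyenne_intel.clm-ciso_cwd_hr",
    "LGRAIN2_Ly1_P72x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput",
    "LGRAIN2_Ly2_P72x1.f10_f10_mg37.I1850Clm45BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput",
    "LREPRSTRUCT_Ly1_P72x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput",
    "LREPRSTRUCT_Ly2_P72x1.f10_f10_mg37.I1850Clm45BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput",
    "SMS_Ld5.f10_f10_mg37.ISSP245Clm50BgcCrop.cheyenne_gnu.clm-ciso_dec2050Start",
    "SMS_Ld5.f10_f10_mg37.ISSP370Clm50BgcCrop.cheyenne_gnu.clm-ciso_dec2050Start",
    "SMS_Ld5.f10_f10_mg37.ISSP585Clm50BgcCrop.cheyenne_intel.clm-ciso_dec2050Start",
    "SSP_D_Ld4.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-ciso_rtmColdSSP",
]

# The failing test from this issue.
FAILING = "ERP_D_Ld10_P108x2.f10_f10_mg37.IHistClm51BgcCrop.cheyenne_intel.clm-ciso_decStart"


def parse_test(name):
    """Hypothetical helper: extract thread count, transient flag and ciso flag.

    Assumes the layout TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS].
    """
    parts = name.split(".")
    test_field, compset = parts[0], parts[2]
    testmods = parts[4] if len(parts) > 4 else ""
    threads = 1  # assumption: no explicit x<threads> is treated as single-threaded
    for mod in test_field.split("_")[1:]:
        match = re.fullmatch(r"P(\d+)x(\d+)", mod)
        if match:
            threads = int(match.group(2))
    return {
        "threads": threads,
        # assumption: historical (IHist) and SSP (ISSP) compsets are the transient ones
        "transient": compset.startswith(("IHist", "ISSP")),
        "ciso": "ciso" in testmods,
    }


def threaded_transient(name):
    """True when a test combines carbon isotopes, a transient compset and >1 thread."""
    info = parse_test(name)
    return info["ciso"] and info["transient"] and info["threads"] > 1


if __name__ == "__main__":
    print([n for n in PASSING if threaded_transient(n)])  # []
    print(threaded_transient(FAILING))  # True
```

This only restates the observation that no passing test combines all three features; it says nothing about what causes the hang.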

Important details of your setup / configuration so we can reproduce the bug

I increased the number of processors to 108 tasks (144 tasks failed for the f10 grid) and gave it 2 hours and 40 minutes to run, which should be far more than needed. Even so, it seems to hang right at the year boundary (the run starts on Dec 30).

Important output or errors that show the problem

If not enough time or processors are given, the test may fail during the finidat interpolation alone.

cesm.log:

96:  nstep =           97  htop =   0.000000000000000E+000
96:iam = 96: local  patch    index = 39
96:iam = 96: global patch    index = 4323
96:iam = 96: global column   index = 2965
96:iam = 96: global landunit index = 289
96:iam = 96: global gridcell index = 97
96:iam = 96: gridcell longitude    =   30.0000000
96:iam = 96: gridcell latitude     =  -10.0000000
96:iam = 96: pft      type         = 78
96:iam = 96: column   type         = 278
96:iam = 96: landunit type         = 2
37:  nstep =           97  htop =   0.000000000000000E+000
37:iam = 37: local  patch    index = 79
37:iam = 37: global patch    index = 6583
37:iam = 37: global column   index = 4539
37:iam = 37: global landunit index = 461
37:iam = 37: global gridcell index = 146
37:iam = 37: gridcell longitude    =   30.0000000
37:iam = 37: gridcell latitude     =   30.0000000
37:iam = 37: pft      type         = 67
37:iam = 37: column   type         = 267
37:iam = 37: landunit type         = 2
40:  nstep =           97  htop =   0.000000000000000E+000
40:iam = 40: local  patch    index = 73
40:iam = 40: global patch    index = 6729
40:iam = 40: global column   index = 4643
40:iam = 40: global landunit index = 474
40:iam = 40: global gridcell index = 149
40:iam = 40: gridcell longitude    =   75.0000000
40:iam = 40: gridcell latitude     =   30.0000000
40:iam = 40: pft      type         = 23
40:iam = 40: column   type         = 223
40:iam = 40: landunit type         = 2
0: shr_file_mod.F90         912
0: This routine is depricated - use shr_log_setLogUnit instead        -132

lnd.log:

 hist_htapes_wrapup : Writing current time sample to local history file
 ./ERP_D_Ld10_P108x2.f10_f10_mg37.IHistClm51BgcCrop.cheyenne_intel.clm-ciso_decS
 tart.GC.ctsm51d145cesm3n3chlist.clm2.h1.2002-01-01-00000.nc at nstep =
          96  for history time interval beginning at    1.66666666666667
  and ending at    2.00000000000000


 hist_htapes_wrapup : Closing local history file
 ./ERP_D_Ld10_P108x2.f10_f10_mg37.IHistClm51BgcCrop.cheyenne_intel.clm-ciso_decS
 tart.GC.ctsm51d145cesm3n3chlist.clm2.h0.2002-01-01-00000.nc at nstep =
          96


 hist_htapes_wrapup : Closing local history file
 ./ERP_D_Ld10_P108x2.f10_f10_mg37.IHistClm51BgcCrop.cheyenne_intel.clm-ciso_decS
 tart.GC.ctsm51d145cesm3n3chlist.clm2.h1.2002-01-01-00000.nc at nstep =
          96

(shr_orb_params) ------ Computed Orbital Parameters ------
(shr_orb_params) Eccentricity      =   1.670285E-02
(shr_orb_params) Obliquity (deg)   =   2.343951E+01
(shr_orb_params) Obliquity (rad)   =   4.090966E-01
(shr_orb_params) Long of perh(deg) =   1.029298E+02
(shr_orb_params) Long of perh(rad) =   4.938056E+00
(shr_orb_params) Long at v.e.(rad) =  -3.246623E-02
(shr_orb_params) -----------------------------------------
 Get data for variable PCT_NAT_PFT for year         2002
 Get data for variable PCT_CROP for year         2002
 Get data for variable PCT_CFT for year         2002
 Get data for variable FERTNITRO_CFT for year         2002
 Get data for variable HARVEST_VH1 for year         2002
 Get data for variable HARVEST_VH2 for year         2002
 Get data for variable HARVEST_SH1 for year         2002
 Get data for variable HARVEST_SH2 for year         2002
 Get data for variable HARVEST_SH3 for year         2002
 Get data for variable PCT_LAKE for year         2002
 Get data for variable PCT_URBAN for year         2002
@ekluzek ekluzek added the bug something is working incorrectly label Nov 6, 2023
@ekluzek
Collaborator Author

ekluzek commented Nov 6, 2023

I found this on the CESM3_dev branch, but the same problem should exist on main-dev. I assume it goes back to ctsm5.1.dev137, when Meier roughness came in. We should also test it on the Meier branch to see whether the problem existed there, or whether it worked at first and then broke at some point.

@ekluzek
Collaborator Author

ekluzek commented Nov 30, 2023

Try this on ctsm5.1.dev154 or dev155.

@slevis-lmwg
Contributor

slevis-lmwg commented Dec 7, 2023

Answer to a question from today's (2023/12/7) software meeting:
The PEM/ERP issue was #2219; it was fixed in #2258, which came to main with dev154.

@ekluzek
Collaborator Author

ekluzek commented Dec 8, 2023

Since this test uses threading, I figured it was worth testing in addition to the tests verified previously. I tried it in ctsm5.1.dev155 and it PASSes, so I'm marking this as resolved with ctsm5.1.dev154.

@ekluzek ekluzek closed this as completed Dec 8, 2023