Meier roughness changes answers on processor count #2219
Comments
…excess ice off, add tests to expected fails
@slevis-lmwg and I discussed this and thought of the following list of possible things to look into:
- Testing the code
- Examining the code itself
- Checking the code history
- Git bisect, if there is a previous working version

@slevis-lmwg will start looking into this...
Initial attempts on izumi with dev145.
On the first one, it's a known issue: we need to use the _D (i.e., DEBUG) option. Try a grid somewhere in between...
Starting over because I was not setting
The first one still passes on izumi, so instead I reproduced Erik's result on cheyenne (test a bit modified from Erik's).

Now running faster tests for more efficient troubleshooting: the first passed, the second reproduced Erik's result. The same test in tag dev137 (where the roughness length mods came in) also reproduces Erik's result: FAIL COMPARE_base_modpes

Now I will move through PR #2045, filling this list with the most recent commit at the top:
I thought I'd check the "List of common problems" wiki page for other ideas of things to check, but discovered that we didn't have a section there on answer changes with changing processor count. So I added a section and did a bit of a brain dump of things I could think of.

I looked briefly through the latest version of master for uses of meier, looking for each of these issues, and nothing popped out to me, but I didn't spend long... and maybe there were changes on the cesm3 dev branch relative to master in this respect. Please add anything else you can think of to that wiki page - especially once you find the cause of the problem!

I don't know if this is something you're already doing, but if not, one other thing that might be helpful is to do vector (1-d) output of one of the variables that differs, then in python analyze which landunit type(s) are showing differences between the two cases. It looks like Meier roughness stuff has separate implementations for a few different landunit types, so this vector output might help narrow it down to a particular subroutine.
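A minimal Python sketch of that kind of analysis might look like the following. This is only an illustration of the suggestion above: the file names, the field name Z0M, and the mapping variable land1d_ityplunit are placeholders rather than names taken from this issue, and the field is assumed to be landunit-level.

```python
import numpy as np
import xarray as xr

# Placeholders: vector (1-d) history files from the two PE layouts being compared.
ds_base = xr.open_dataset("base.clm2.h1.nc")
ds_mod = xr.open_dataset("modpes.clm2.h1.nc")

var = "Z0M"  # placeholder: whichever landunit-level field differs
diff = (ds_mod[var] - ds_base[var]).isel(time=0).values
ltype = ds_base["land1d_ityplunit"].values  # landunit type of each vector element (assumed name)

# Group the nonzero differences by landunit type
for t in np.unique(ltype):
    d = np.nan_to_num(diff[ltype == t])
    if np.count_nonzero(d):
        print(f"landunit type {int(t)}: {np.count_nonzero(d)} points differ, "
              f"max |diff| = {np.max(np.abs(d)):.3e}")
```

Grouping the differences by subgrid type this way could point at which of the separate Meier implementations to look at first.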
I have two directories with Meier development work: /glade/work/oleson/ctsm_MeierZ0_part2
I just ran this test on both of these and they both failed with COMPARE_base_rest:
ERP_P72x2_Lm25.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_intel.clm-monthly
Looking back at my notes, it may be that I only ran the full test suite with ZengWang to confirm bfb. It looks like I may have run just a few SMS and ERS tests with Meier on, which passed.
Using this test (same as posted above)
I just found this:
The forc_hgt_u_patch etc. on the right-hand side should be indexed by "p", not "g".
In BareGroundFluxes.F90
Sorry, actually, they should be, e.g., forc_hgt_u_patch(p) = forc_hgt_u(g) + z0mg_patch(p) + displa(p)
Great catch, @olyson!!! We used to be able to rely on debug tests to catch issues like that, but now that indexing always starts at 1 for every subgrid level, indexing a patch variable by g just leads to wrong answers, not a model crash. I'll add this to the wiki page on things to look for.
Well, I'm not going to celebrate, because this is likely my fault. Sorry @slevis-lmwg and @ekluzek that you spent so much time on this.
Well, in the past we didn't need to be quite as careful about things like this because this type of error would typically be caught in debug runs. Losing the ability to catch these indexing problems was one of the main downsides of moving our array indexing to being 1-based.

I just added this section in the wiki, along with some suggested git grep commands:

The first git grep command turns up these lines you found. It occurred to me that this could be a good target for a CTSM-specific static analysis check that's run as a GitHub action: if these git greps turn up any hits, the check would fail.
But I just realized that most variables in the code have their suffix stripped off in the associate statement, making the checks I suggested a lot less useful. It's kind of fortunate that the suffix was not stripped in this particular case.
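Even so, a rough sketch of what such a check could look like is below, written as a small Python script that a CI job (e.g., a GitHub Action) could run and that fails on any hit. The regex is illustrative only, not the actual git grep commands from the wiki, and, as noted in the comment above, it would miss variables whose _patch suffix is stripped in an associate statement.

```python
#!/usr/bin/env python3
"""Illustrative static check: flag patch-level variables that look like they
are indexed by a gridcell, landunit, or column index. Exits nonzero if any
hits are found, so a CI job would fail."""
import re
import sys
from pathlib import Path

# Hypothetical pattern: a *_patch variable indexed by a bare g, l, or c.
SUSPICIOUS = re.compile(r"\w+_patch\s*\(\s*[glc]\s*\)", re.IGNORECASE)

hits = []
for src in Path("src").rglob("*.F90"):
    for lineno, line in enumerate(src.read_text(errors="ignore").splitlines(), 1):
        code = line.split("!", 1)[0]  # crude: drop Fortran trailing comments
        if SUSPICIOUS.search(code):
            hits.append(f"{src}:{lineno}: {line.strip()}")

if hits:
    print("Possible subgrid-indexing problems:\n" + "\n".join(hits))
    sys.exit(1)
print("No suspicious *_patch indexing found.")
```

A real check would probably need to parse the associate blocks (or the loop index declarations) to catch the stripped-suffix cases.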
It looks like the bug came in with this: but then, as @slevis-lmwg mentioned, it also came in later with this: So I don't quite understand that.
Keith, you're excellent at finding bugs in the code (even if you put them in sometimes)! Thanks for digging into this.
@olyson, thank you! It would likely have taken me hours of concentrated code review to spot that, even after having isolated the offending commit. The test now passes in dev145.
Yay! Great work everyone! As Bill has been suggesting, let's do our best to learn from this one and identify ways to try to catch similar ones in the future. Seems difficult to do, but ...
I created tag branch_tags/CESM3_dev.n02_ctsm5.1.dev145 to fix this on the CESM3_dev branch.
New params files for Meier roughness, MIMICS, SNICAR, and with changes to leafcn and k*_nonmyc
1) Start using existing new params file for Meier roughness: /glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/paramdata/ctsm51_params.RMz0.c231011.nc and include bug-fix ESCOMP#2219
2) Update forcing heights per ESCOMP#2071.
3) Update params file for MIMICS per ESCOMP#1845.
4) Make leafcn for pfts 15 and 16 the same per ESCOMP#2184.
5) Switch the values of params kc_nonmyc and kn_nonmyc per ESCOMP#2120.
6) Move SNICAR parameters to params file per ESCOMP#2247.
Changes answers. Details in PR ESCOMP#2258 and in the ChangeLog.
…excess ice off, add tests to expected fails (cherry picked from commit c1cadb3)
@samsrabin this is the PEM issue that we discussed (in case it helps with your similar issue).
I met with @samsrabin and it looks like the issue he's running into is this issue, as his version is dev142, where Meier was turned on. It turns out @billsacks had a wiki doc that talks about this issue and had good thoughts. @samsrabin tried one of the suggestions and it pulled up the Meier issue.
Yes, thanks Bill!
This was even virtual Bill, so he didn't know. Here's the page he made: https://github.com/ESCOMP/CTSM/wiki/List-of-common-problems I had trouble finding it because we need to organize some sub-pages to put similar groups of pages together. But the list is well done and helpful.
Brief summary of bug
PEM and ERP tests fail their exact (bit-for-bit) comparisons when Meier roughness is turned on.
General bug information
This was found on the CESM3_dev branch
CTSM version you are using: ctsm5.1.dev143-350-gda5beb800
Does this bug cause significantly incorrect results in the model's science? Probably
This kind of error can indicate that something is wrong in the science, so I added that label, but the cause and impact will have to be ascertained to find out if that really is the case. Since the changes are large (see below), it seems likely that there is a significant issue here for the science when Meier roughness is being used.
Configurations affected: When Meier roughness is turned on
Details of bug
PEM test shows...
FAIL PEM_D_Ld9.ne30pg3_t232.I1850Clm51BgcCrop.derecho_intel.clm-clm51cam6LndTuningMode COMPARE_base_modpes
ERP test shows a similar fail for COMPARE_base_rest
Important details of your setup / configuration so we can reproduce the bug
All of the files compared show differences. Here's a bit on the clm history file: the changes are quite large, which likely indicates a problem...