Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

externals update broke NEON features #2437

Closed
wwieder opened this issue Mar 25, 2024 · 27 comments · Fixed by #2482
Closed

externals update broke NEON features #2437

wwieder opened this issue Mar 25, 2024 · 27 comments · Fixed by #2482
Assignees
Labels
bug something is working incorrectly science Enhancement to or bug impacting science

Comments

@wwieder
Copy link
Contributor

wwieder commented Mar 25, 2024

Brief summary of bug

previously we were able to run postad cases using run_neon, but that functionality seems to have broken with externals update

General bug information

CTSM version you are using:
Feature works as intended with dev171, but is broken with dev172

Does this bug cause significantly incorrect results in the model's science? [Yes / No]
Yes, removes a helpful functionality in running neon cases

Configurations affected: [Fill this in if known.]
neon cases, but also potentially testing issues (@slevis-lmwg is noting with the sasu PR)

Details of bug

the same command using dev171 works

[~/CTSM/tools/site_and_regional] (dev172)
$./run_neon --neon-sites TOOL --run-type postad --output-root /glade/derecho/scratch/wwieder/neon_AK
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---
WARNING: No .input_data_list files found in dir 'Buildconf'
^[[O^[[ITime required to building the base case: 427.0643048286438 s.
using this version: latest
---- cloning the base case in /glade/derecho/scratch/wwieder/neon_AK/TOOL.postad
Traceback (most recent call last):
  File "./run_neon", line 48, in <module>
    main(__doc__)
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/run_neon.py", line 238, in main
    experiment,
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/neon_site.py", line 299, in run_case
    self.set_ref_case(case)
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/neon_site.py", line 356, in set_ref_case
    symlink_force(reffile, os.path.join(rundir, os.path.basename(reffile)))
  File "/glade/u/home/wwieder/CTSM/cime/CIME/utils.py", line 1473, in symlink_force
    raise e
  File "/glade/u/home/wwieder/CTSM/cime/CIME/utils.py", line 1467, in symlink_force
    os.symlink(target, link_name)
FileNotFoundError: [Errno 2] No such file or directory: '/glade/derecho/scratch/wwieder/neon_AK/TOOL.ad/run/TOOL.ad.clm2.r.1018-01-01-00000.nc' -> '/glade/derecho/scratch/wwieder/neon_AK/TOOL.postad/run/TOOL.ad.clm2.r.1018-01-01-00000.nc'

Important output or errors that show the problem

Previously noted in #2433, but creating a new issue here, as it's unrelated to the issue @TeaganKing addressed in #2435.

@slevis-lmwg
Copy link
Contributor

SASU PR #640 has a fix for the missing /run directory, but now gets a missing /timing directory.

@slevis-lmwg
Copy link
Contributor

A big picture question:
Do CIME developers test whether their mods break things elsewhere? If not, does it seem that they should?

@ekluzek
Copy link
Collaborator

ekluzek commented Mar 25, 2024

This is a good question @slevis-lmwg. There is internal CIME testing, and if there is functionality that we can isolate and need to be tested, we could add it to that level. However, I don't think that applies here. But, whenever it does it is something that we should make sure we have in place. Although I think it's likely that there should be more testing of CIME internals than is currently being done.

The thing we are doing here is using CIME internals in ways that aren't tested. The testing I mention above could help with that. Or not using things at this internal level. But, I think there's good reasons we have for using things at this level for this work.

The other kind of testing that can help find this in the CESM development process would be adding CESM prealpha testing. This would be caught when new CESM alpha tags are made due to external updates. The thing that we need there is to have a test that runs neon from aux_clm. I thought we had an issue for doing that, but I can't find it.

The thing we have for now is #2276, so I'm make a new issue for this.

@ekluzek
Copy link
Collaborator

ekluzek commented Mar 25, 2024

@TeaganKing it looks like the kind of checking that should be added here (in latest CTSM it will go in tower_site.py rather than run_neon) is to check if reffile exists (or at least it's parent directory), and also that rundir exists and can be written to. So a little bit of extra checking before the symlink_force. It's possible that reffile doesn't exist at the point this is done, so then you just need to make sure it's parent directory exists. This is applying what @slevis-lmwg did to run_neon....

@TeaganKing
Copy link
Contributor

Thanks for pointing this out, @ekluzek . I'll plan to make these changes on the next PLUMBER PR.

@slevis-lmwg
Copy link
Contributor

My most recent test suggests that the code change that I shared earlier today did not fix anything, so it was a false positive. I will continue looking.

@slevis-lmwg
Copy link
Contributor

The latest version of my failing test
SSPMATRIXCN_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianRs.izumi_intel.clm-ciso_monthly.20240325_185731_ajymv0
has now passed! I pushed a new commit with the code update to #640. The last two commits combined represent the fix:

--- a/cime_config/SystemTests/sspmatrixcn.py
+++ b/cime_config/SystemTests/sspmatrixcn.py
@@ -207,6 +207,8 @@ class SSPMATRIXCN(SystemTestsCommon):
                      linkfile = os.path.join(rundir, os.path.basename(item))
                      if os.path.exists(linkfile):
                          os.remove( linkfile )
+                     if not os.path.isdir(rundir):
+                         os.makedirs(rundir)
                      os.symlink(item, linkfile )

@ekluzek this had been your original suggestion, i.e. to just make the run directory before calling symlink.

@wwieder
Copy link
Contributor Author

wwieder commented Mar 29, 2024

here's my issue running a transient case with dev175 and dev171

File /glade/derecho/scratch/wwieder/neon_AK/MLBS.transient/LockedFiles/env_build.xml has been modified
found difference in CALENDAR : case 'GREGORIAN' locked 'NO_LEAP'
Setting build complete to False

@wwieder
Copy link
Contributor Author

wwieder commented Mar 29, 2024

I confirmed that this works in dev171

./run_neon --neon-sites MOAB --overwrite --experiment dev171_noRuntype --output-root /glade/derecho/scratch/wwieder/neon_AK

The base case uses NO_LEAP and the .transient case uses GREGORIAN

@wwieder
Copy link
Contributor Author

wwieder commented Mar 29, 2024

forcing the base case in dev175 works~

$./run_neon --neon-sites TALL --run-type transient --experiment dev175 --output-root /glade/derecho/scratch/wwieder/neon_AK --overwrite

now the base case and transient case are both Gregorian

@wwieder
Copy link
Contributor Author

wwieder commented Mar 29, 2024

now we still need to set up base cases and for ad and postad that are no_leap

@wwieder wwieder added tag: bug - impacts science next this should get some attention in the next week or two. Normally each Thursday SE meeting. and removed next this should get some attention in the next week or two. Normally each Thursday SE meeting. labels Apr 4, 2024
@ekluzek
Copy link
Collaborator

ekluzek commented Apr 5, 2024

Some ideas that I have, that I'll go over with @TeaganKing:

  • Differences in cime/CIME and cime/CIME/tests
    -- mpi-serial changes?
    -- threading shouldn't matter..
    -- performance testing changes shouldn't matter
    -- cprnc changes don't matter...

Do know that now aborts at the clone step because the basecase is NO_LEAP (I didn't see this change in behavior from the CIME diff)

  • unit/sys test of run_neon CIME behavior we rely on
    -- Things like: create a base case, setup, change settings, build (too long), run (too long), setup follow on cases that clone the base case
    -- Can probably do the steps that take too long with mocks, stubs, or cheats (set BUILD_COMPLETE to TRUE for example).

  • Add more error checking after doing CIME operations to make sure what's expected actually happens
    -- Check that new directories/files are created as expected
    -- Check any error codes or return values that occur
    -- Check that XML settings that are set do actually get set and verify that from a query (the names can change)
    -- Check that errors don't come up (run within try/catch statements?)

@ekluzek
Copy link
Collaborator

ekluzek commented Apr 5, 2024

In meeting with @TeaganKing we thought about next steps as:

  • @TeaganKing will track down the cime version with the behavior change in ctsm5.1.dev172. The two behavior changes that we know about now are: when rundir's are created and not being able to clone from a base case with a different CALENDAR.
  • @TeaganKing will update the NEON/PLUMBER2 branches to use the Externals.cfg from ctsm5.1.dev171 so that she can make progress there separated from this problem.
  • We will make a branch of cime to fix behavior needed for NEON (once we figure out what to do)
  • @ekluzek will create a branch of cime with a possible fix for the CALENDAR behavior issue
  • @TeaganKing will try out the cime branch and see if it fixes the behavior (this will help guide future plans on this)
  • @ekluzek and CTSM SE team will add a CTSM System test for aux_clm (and prealpha) that will test setting up and running a base case, with a follow on clone of it that changes the calendar. Since, we've had trouble with a system test of run_neon, this is something we could do sooner. This is from @TeaganKing thinking about this being a more general issue, and not just run_neon which is something a bit down the chain of standard running of CTSM (and this problem will happen for others doing that workflow).
  • As we work on this we will reach out to the CESM cime developers for review and advice on what we are doing.
  • @ekluzek and CTSM SE team will start working on unit/sys testing for CIME behavior. This might start in CTSM, but should be moved to CIME when appropriate.
  • @TeaganKing will possibly also look into adding behavior checking inside of python tools that use CIME as appropriate.
  • @ekluzek and CTSM SE team will also work with @TeaganKing on these sorts of things as well.
  • At some point we will create a PR for changes in cime
  • We'll need to update CTSM to use that PR

@wwieder
Copy link
Contributor Author

wwieder commented Apr 5, 2024

Thanks for working through all of this. let me know how I can help moving forward.

@TeaganKing
Copy link
Contributor

TeaganKing commented Apr 11, 2024

Hi @ekluzek , in testing various CIME versions to try to isolate this error, I'm running into other errors before the point at which we would see the calendar error.

In versions cime6.0174-195, I'm getting an error related to NODENAME_REGEX that occurs before the base case is created.
In versions cime6.0.196-197, I see an error No variable SMP_PRESENT found in case that also occurs before the base case is created.
Versions cime6.0.198-200 have the calendar error, so it is definitely introduced before cime6.0.198.

With this information, I'm not entirely sure if it is introduced at some point between 175 and 198. I think you had previously mentioned that the error was introduced at some point between cime6.0.175 and cime6.0.217. How did you initially determine that it was after 175? Do I need to check out other externals to a particular version in order to avoid the NODENAME_REGEX and SMP_PRESENT errors?

@ekluzek
Copy link
Collaborator

ekluzek commented Apr 11, 2024

@TeaganKing excellent work here. I was afraid this might be a difficult task. But, your work of going through this methodically is really helpful to figure this out. So the information you have is good, it's just not as helpful as it would be if you could easily find the exact version where it fails.

Do I need to check out other externals to a particular version in order to avoid the NODENAME_REGEX and SMP_PRESENT errors?

As I look at your results I see that we also need to pair appropriate ccs_config versions with the cime versions. So we can likely figure that out and get the pre-cime6.0.198 versions working. I think it's only ccs_config that's needed, but it's possible other's might be required as well. There's a couple of things I can do to look up what externals I think will work together. So let me look at that and get back to you.

How did you initially determine that it was after 175?

The external for cime used in ctsm5.1.dev171 was cime6.0.175. And then the next CTSM version ctsm5.1.dev172 jumps to a branch tag off of cime6.0.217. You can see this by looking at the Externals.cfg file at the top of CTSM.

I'm in today, so we could do a quick meeting if that would help. I could also show you how I check for externals that work together...

@TeaganKing
Copy link
Contributor

OK, thanks @ekluzek ! I found the externals.cfg file within cime, so I'll try updating the ccs_config to the tag specified when a particular cime tag is checked out. I'll let you know if I have questions, but hopefully I'll have more information soon!

@ekluzek
Copy link
Collaborator

ekluzek commented Apr 11, 2024

@TeaganKing actually you want to look at the one above cime for CTSM. The one within cime, you just leave like it is for each version you test with. Although that also likely means you need to update the overall externals, by running manage_externals each time. That's a good thing to do to make sure...

@TeaganKing
Copy link
Contributor

In discussion with @ekluzek , it seems that this is not necessarily an issue related to CIME and ccs_config compatibility. We are noticing that cmeps is also involved in this issue and includes some additional calendar information. Testing the most recent CTSM version with the following tags was successful: cime6.0.175, cmeps0.14.46, and ccs_config_cesm0.0.84 . I'll test various cime and cmeps versions, and do a diff on the cmeps version to see if we can isolate the error.

@TeaganKing
Copy link
Contributor

cime6.0.197 also is free of calendar issues when used with cmeps0.14.46, and ccs_config_cesm0.0.84. However, cime6.0.198 has this calendar error when used with cmeps0.14.46, and ccs_config_cesm0.0.84. So, I think that cime6.0.198 might have introduced this error. The comparison of cime 197 and 198 can be found here. In those files, I'm not finding any changes related to the CALENDAR...

@ekluzek
Copy link
Collaborator

ekluzek commented Apr 12, 2024

@TeaganKing I'm a little confused by your findings. I think cime6.0.198 needs a different ccs_config version than cime6.0.197, but above they are listed as the same. Can you verify which version of ccs_config you used in the cime6.0.198 test that was working?

@ekluzek
Copy link
Collaborator

ekluzek commented Apr 12, 2024

You are right most of the cime changes don't relate to CALENDAR, but there are some subtle things I wonder about. If the CMEPS version also changed that might give us another thing to look at. I'm wondering if that's where the problem really is here...

@TeaganKing
Copy link
Contributor

We plan to double check that these versions did really isolate the change. We also noticed the removal of a case_setup() command in this comparison, which could possibly be a source of error in run_neon; we could try adding this setup call within run_neon.

This was referenced Apr 12, 2024
@ekluzek
Copy link
Collaborator

ekluzek commented Apr 14, 2024

Based on what @TeaganKing figured out the change to try would be:

diff --git a/python/ctsm/site_and_regional/tower_site.py b/python/ctsm/site_and_regional/tower_site.py
index 1679df83e..4569fad10 100644
--- a/python/ctsm/site_and_regional/tower_site.py
+++ b/python/ctsm/site_and_regional/tower_site.py
@@ -364,6 +364,7 @@ def run_case(
                 # See https://github.com/ESCOMP/CTSM/pull/1872#pullrequestreview-1169407493
                 #
                 basecase.create_clone(case_root, keepexe=True, user_mods_dirs=user_mods_dirs)
+                basecase.setup()
 
         with Case(case_root, read_only=False) as case:
             if run_type != "transient":

@wwieder
Copy link
Contributor Author

wwieder commented Apr 15, 2024

Adding the line with basecase.setup() to tower_site.py produced the error below when trying use run_neon. Am I oversimplifying your suggestion @ekluzek. Were you able to get this to work @TeaganKing?

Traceback (most recent call last):
  File "./run_neon", line 48, in <module>
    main(__doc__)
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/run_neon.py", line 241, in main
    experiment,
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/neon_site.py", line 103, in run_case
    base_case_root, run_type, prism, run_length, user_version, tower_type, user_mods_dirs
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/tower_site.py", line 367, in run_case
    basecase.setup()
AttributeError: 'Case' object has no attribute 'setup'

BTW, I'm testing this is on my b4b-neon branch in ~/CESM (but I'm switching between branches a bunch...

@ekluzek
Copy link
Collaborator

ekluzek commented Apr 15, 2024

Hi @wwieder sorry for suggesting bad code. Neither of us tried this, which is likely why it doesn't work. Personally I was hoping to have someone else try it out and see what it does.

In looking at the actual change it should be

basecase.case_setup()

I missed the case_ part in the subroutine name. So try that and see what it does.

@wwieder
Copy link
Contributor Author

wwieder commented Apr 15, 2024

Thanks @ekluzek, that was enough to help me understand that it's not the basecase that needs a run directory, but the actual postad cases.

I suspect the same will be true of running a transient case from a postad run, so will make the following changes on the b4b-neon branch.

diff --git a/python/ctsm/site_and_regional/tower_site.py b/python/ctsm/site_and_regional/tower_site.py
index 1679df83e..1f0bb93b2 100644
--- a/python/ctsm/site_and_regional/tower_site.py
+++ b/python/ctsm/site_and_regional/tower_site.py
@@ -391,11 +391,13 @@ class TowerSite:
                 case.set_value("RUN_TYPE", "hybrid")
 
             if run_type == "postad":
+                case.case_setup()
                 self.set_ref_case(case)
                 case.set_value("STOP_N", run_length)
 
             # For transient cases STOP will be set in the user_mod_directory
             if run_type == "transient":
+                case.case_setup()
                 if self.finidat:
                     case.set_value("RUN_TYPE", "startup")
                 else:

@samsrabin samsrabin added bug something is working incorrectly science Enhancement to or bug impacting science labels Aug 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something is working incorrectly science Enhancement to or bug impacting science
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants