externals update broke NEON features #2437
Comments
SASU PR #640 has a fix for the missing /run directory, but now hits a missing /timing directory.
A big picture question:
This is a good question, @slevis-lmwg. CIME has internal testing, and if there is functionality that we can isolate and that needs to be tested, we could add it at that level. I don't think that applies here, but whenever it does, we should make sure the tests are in place. I also think CIME internals likely need more testing than they currently get. What we are doing here is using CIME internals in ways that aren't tested. More internal CIME testing could help with that, as could not using things at this internal level, though I think we have good reasons for working at this level here. The other kind of testing that could catch this in the CESM development process is CESM prealpha testing: the problem would then be caught when new CESM alpha tags are made following external updates. For that we need a test that runs neon from aux_clm. I thought we had an issue for that, but I can't find it. The closest thing we have for now is #2276, so I'll make a new issue for this.
@TeaganKing it looks like the checking that should be added here (in the latest CTSM it will go in tower_site.py rather than run_neon) is to check that reffile exists (or at least its parent directory), and also that rundir exists and can be written to. So a little bit of extra checking before the symlink_force. It's possible that reffile doesn't exist at the point this is done, in which case you just need to make sure its parent directory exists. This is applying what @slevis-lmwg did to run_neon....
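A minimal sketch of the kind of pre-check described above, assuming plain path strings for reffile and rundir. The function name check_before_symlink is hypothetical and is not part of tower_site.py or run_neon; the real code would call something like this just before symlink_force.

```python
import os
from pathlib import Path


def check_before_symlink(reffile, rundir):
    """Hypothetical helper: fail early with a clear message if the
    symlink from reffile into rundir could not be created."""
    # reffile itself may not exist yet, so only require its parent directory.
    ref_parent = Path(reffile).parent
    if not ref_parent.is_dir():
        raise FileNotFoundError(
            f"Parent directory of reference file is missing: {ref_parent}"
        )

    # The run directory must exist and be writable.
    run_path = Path(rundir)
    if not run_path.is_dir():
        raise FileNotFoundError(f"Run directory does not exist: {run_path}")
    if not os.access(run_path, os.W_OK):
        raise PermissionError(f"Run directory is not writable: {run_path}")
```

Raising instead of silently creating the directory keeps the failure visible; whether to create rundir instead is a separate design choice discussed later in this thread.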
Thanks for pointing this out, @ekluzek. I'll plan to make these changes on the next PLUMBER PR.
My most recent test suggests that the code change that I shared earlier today did not fix anything, so it was a false positive. I will continue looking.
The latest version of my failing test
@ekluzek this had been your original suggestion, i.e. to just make the run directory before calling symlink.
Here's my issue running a transient case with dev175 and dev171:
File /glade/derecho/scratch/wwieder/neon_AK/MLBS.transient/LockedFiles/env_build.xml has been modified
I confirmed that this works in dev171:
./run_neon --neon-sites MOAB --overwrite --experiment dev171_noRuntype --output-root /glade/derecho/scratch/wwieder/neon_AK
The base case uses NO_LEAP and the .transient case uses GREGORIAN.
Forcing the base case in dev175 works:
./run_neon --neon-sites TALL --run-type transient --experiment dev175 --output-root /glade/derecho/scratch/wwieder/neon_AK --overwrite
Now the base case and transient case are both GREGORIAN.
Now we still need to set up base cases for ad and postad that are NO_LEAP.
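To make the calendar mismatch above easier to spot before the clone step, here is a rough sketch of a comparison check. The function names are hypothetical, and it assumes the two calendar strings have already been read from the cases (for example via a CIME case's get_value("CALENDAR"), which is an assumption about where the value lives rather than something confirmed in this thread).

```python
def calendars_match(base_calendar, clone_calendar):
    """Return True if two calendar settings name the same calendar,
    ignoring case and surrounding whitespace."""
    return base_calendar.strip().upper() == clone_calendar.strip().upper()


def check_calendar(base_calendar, clone_calendar):
    """Hypothetical guard: raise a clear error when the base case and the
    cloned case would use different calendars (e.g. NO_LEAP vs GREGORIAN)."""
    if not calendars_match(base_calendar, clone_calendar):
        raise ValueError(
            f"Calendar mismatch: base case uses {base_calendar!r}, "
            f"clone expects {clone_calendar!r}"
        )
```

With the settings reported above, check_calendar("NO_LEAP", "GREGORIAN") raises, while two GREGORIAN cases pass.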
Some ideas that I have, that I'll go over with @TeaganKing:
We do know that it now aborts at the clone step because the basecase is NO_LEAP (I didn't see this change in behavior in the CIME diff).
In meeting with @TeaganKing we thought about next steps as:
Thanks for working through all of this. Let me know how I can help moving forward.
Hi @ekluzek , in testing various CIME versions to try to isolate this error, I'm running into other errors before the point at which we would see the calendar error. In versions With this information, I'm not entirely sure if it is introduced at some point between 175 and 198. I think you had previously mentioned that the error was introduced at some point between |
@TeaganKing excellent work here. I was afraid this might be a difficult task. But, your work of going through this methodically is really helpful to figure this out. So the information you have is good, it's just not as helpful as it would be if you could easily find the exact version where it fails.
As I look at your results I see that we also need to pair appropriate ccs_config versions with the cime versions. So we can likely figure that out and get the pre-cime6.0.198 versions working. I think it's only ccs_config that's needed, but it's possible others might be required as well. There are a couple of things I can do to look up which externals I think will work together. So let me look at that and get back to you.
The external for cime used in ctsm5.1.dev171 was cime6.0.175. The next CTSM version, ctsm5.1.dev172, then jumps to a branch tag off of cime6.0.217. You can see this by looking at the Externals.cfg file at the top level of CTSM. I'm in today, so we could do a quick meeting if that would help. I could also show you how I check for externals that work together...
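As a rough illustration of reading pinned versions out of an Externals.cfg, here is a sketch using Python's standard configparser. The SAMPLE text is a made-up, cut-down stand-in written for this example (only the cime6.0.175 tag comes from this thread), and external_tags is a hypothetical helper, not part of manage_externals.

```python
import configparser

# Illustrative stand-in for an Externals.cfg; not copied from any CTSM tag.
SAMPLE = """\
[cime]
local_path = cime
protocol = git
repo_url = https://github.com/ESMCI/cime
tag = cime6.0.175
required = True

[externals_description]
schema_version = 1.0.0
"""


def external_tags(cfg_text):
    """Map each external's section name to its pinned tag (or branch)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    pins = {}
    for section in parser.sections():
        # This section describes the file format, not an external.
        if section == "externals_description":
            continue
        entry = parser[section]
        pins[section] = entry.get("tag") or entry.get("branch")
    return pins


print(external_tags(SAMPLE))
```

Running the same helper on the Externals.cfg from two CTSM checkouts and comparing the resulting dictionaries is one way to see which externals changed between them.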
OK, thanks @ekluzek! I found the externals.cfg file within cime, so I'll try updating the ccs_config to the tag specified when a particular cime tag is checked out. I'll let you know if I have questions, but hopefully I'll have more information soon!
@TeaganKing actually you want to look at the Externals.cfg at the top level of CTSM, above cime. The one within cime you just leave as it is for each version you test with. That also likely means you need to update the overall externals by running manage_externals each time. That's a good thing to do to make sure...
In discussion with @ekluzek, it seems that this is not necessarily an issue related to CIME and ccs_config compatibility. We are noticing that cmeps is also involved in this issue and includes some additional calendar information. Testing the most recent CTSM version with the following tags was successful: cime6.0.175, cmeps0.14.46, and ccs_config_cesm0.0.84. I'll test various cime and cmeps versions, and do a diff on the cmeps version to see if we can isolate the error.
@TeaganKing I'm a little confused by your findings. I think cime6.0.198 needs a different ccs_config version than cime6.0.197, but above they are listed as the same. Can you verify which version of ccs_config you used in the cime6.0.198 test that was working?
You are right that most of the cime changes don't relate to CALENDAR, but there are some subtle things I wonder about. If the CMEPS version also changed, that might give us another thing to look at. I'm wondering if that's where the problem really is here...
We plan to double check that these versions did really isolate the change. We also noticed the removal of a case_setup() command in this comparison, which could possibly be a source of error in run_neon; we could try adding this setup call within run_neon.
Based on what @TeaganKing figured out, the change to try would be:
diff --git a/python/ctsm/site_and_regional/tower_site.py b/python/ctsm/site_and_regional/tower_site.py
index 1679df83e..4569fad10 100644
--- a/python/ctsm/site_and_regional/tower_site.py
+++ b/python/ctsm/site_and_regional/tower_site.py
@@ -364,6 +364,7 @@ def run_case(
# See https://github.com/ESCOMP/CTSM/pull/1872#pullrequestreview-1169407493
#
basecase.create_clone(case_root, keepexe=True, user_mods_dirs=user_mods_dirs)
+ basecase.setup()
with Case(case_root, read_only=False) as case:
if run_type != "transient":
Adding the line with
BTW, I'm testing this on my b4b-neon branch in ~/CESM (but I'm switching between branches a bunch)...
Hi @wwieder, sorry for suggesting bad code. Neither of us tried this, which is likely why it doesn't work. Personally I was hoping to have someone else try it out and see what it does. Looking at the actual change, it should be basecase.case_setup(). I missed the case_ part in the method name. So try that and see what it does.
Thanks @ekluzek, that was enough to help me understand that it's not the basecase that needs a run directory, but the actual postad cases. I suspect the same will be true of running a transient case from a postad run, so will make the following changes on the b4b-neon branch.
Brief summary of bug
Previously we were able to run postad cases using run_neon, but that functionality seems to have broken with the externals update.
General bug information
CTSM version you are using:
Feature works as intended with dev171, but is broken with dev172
Does this bug cause significantly incorrect results in the model's science? [Yes / No]
Yes, removes a helpful functionality in running neon cases
Configurations affected: [Fill this in if known.]
neon cases, but also potentially testing issues (which @slevis-lmwg is noting with the SASU PR)
Details of bug
the same command using dev171 works
Important output or errors that show the problem
Previously noted in #2433, but creating a new issue here, as it's unrelated to the issue @TeaganKing addressed in #2435.