Run time error for mpi-serial case on cheyenne_intel when created with aux_clm create_test #1793
Comments
Yep, redoing it from scratch allows it to work. The case that fails is in: /glade/p/work/erik/clm_chkimpexpndepunits/cime/scripts/ERS_D_Ld7_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropGs.cheyenne_intel.clm-decStart1851_noinitial.GC.clm4_5_16_r253intel And the case that works is in: |
The CISL ticket for this is: |
I seem to remember seeing issues similar to this when building on the share queue. I've gone back to building on the login nodes and that seems to clear up issues like this. |
I'd say this is almost certainly a system problem. If you agree, then we should close this cime issue. |
Was there ever a resolution to the CISL ticket above? I cannot seem to either search for that issue number or go to the URL (I get a 'timed-out' message even if I am logged into the system). |
Dick Valent tried to do some work on it, but didn't figure anything out. He closed it, because he didn't hear back from me, and then I opened it again. But, he's closed it a second time now. I figured I'd let him know it's still a problem, but I'm not sure if I should reopen the ticket. |
Tests which pass on a do-over can indicate either a race condition or a system problem. What is the evidence pointing to a system problem over a race condition? |
@ekluzek says this happens all the time for him on cheyenne and/or yellowstone. It doesn't happen for me when I run the clm test suite, although I'm using a slightly different CLM version with (I think) the same cime version. That makes me no longer suspect a one-time system problem. I wonder if there could be something in Erik's environment that is making this behave badly for him? |
After thinking about this and looking at the shared build directory structure, I suspect I may have found the reason I see this: I often send both cheyenne and yellowstone tests out with the same test id. Since cheyenne and yellowstone have shared file systems but slightly different compiler configurations, there's likely a race condition between the two builds that determines whether it works or fails. So my workaround is going to be to add an identifier for the machine on my test submissions. A more robust change to the system (if I can show that this is indeed the problem) would be to have the shared build add an identifier for the machine as well as the compiler to the shared build directories. Having shared file systems across several machines is a common situation, so it's not a problem unique to NCAR. And it's not obvious to users that the two builds may conflict (many/most wouldn't know a shared build is being done). But I'm willing to hear opinions from others on this @mvertens @jedwards4b @rljacob @gold2718 @jgfouca @fischer-ncar . The change I'm proposing is fundamentally pretty simple: the shared build would have a subdirectory named (machine)_(compiler) rather than just a subdirectory named by (compiler). I haven't looked into how hard that would be to do, but my guess is it can't be too hard. |
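As a rough illustration of the workaround described above (putting the machine name into the test id so that cheyenne and yellowstone submissions never share one), here is a minimal sketch. The helper name `make_test_id` and the id format are hypothetical and not part of cime; the resulting string would be passed to create_test as the test id.

```python
from datetime import datetime


def make_test_id(machine, label="", now=None):
    """Build a test id that embeds the machine name (hypothetical helper).

    Two submissions made at the same moment on different machines still get
    different ids, so they cannot end up in the same sharedlibroot.$TESTID.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # Skip the label if it is empty so the id has no leading underscore.
    parts = [p for p in (label, machine, stamp) if p]
    return "_".join(parts)


if __name__ == "__main__":
    when = datetime(2017, 9, 1, 12, 0, 0)
    print(make_test_id("cheyenne", "clm", when))
    print(make_test_id("yellowstone", "clm", when))
```

The timestamp alone is not enough when both suites are launched from one script, which is why the machine name is part of the id here.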
It would be easy to add machine to the shared library path - but in any case it seems a very risky practice to submit two test suites with the same test-id. |
I agree with @jedwards4b . I have no problem adding the machine to the sharedlib path if it is indeed easy. But at the same time, we should very strongly discourage people from ever submitting multiple runs of create_test with the same testid. See also the discussion in #582 - though it looks like we never added anything to the create_test documentation saying that testids need to be unique. |
Just opened #1933 which I'll take (addressing the documentation of testids). |
My only comment is a concern about adding to the length of the test name. Didn't we just have an issue (#1914) where long test names were causing problems? |
@gold2718 what I'm proposing would only affect the shared build directory structure. The test names already have the machine/compiler combination in them (which is part of why I didn't think they would interfere), so the testname directories won't change in length at all. But under the shared build you have directories that look like:
$CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/intel/mpi-serial/debug/nothreads/mct|pio|gptl
What I'm proposing is that this would change to:
$CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/cheyenne_intel/mpi-serial/debug/nothreads/mct|pio|gptl
There's nothing under those directories that has the $TESTID in it, so the paths are relatively short, and adding the machine name to them won't make much of a difference. |
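To make the proposed layout concrete, here is a minimal sketch that builds both the current and the proposed shared build directory from the same inputs. It is not cime's actual code: the function `sharedlib_dir` and its parameters are hypothetical, and the `nodebug` and `threads` directory names are assumptions (only `debug` and `nothreads` appear in the paths above).

```python
import posixpath


def sharedlib_dir(output_root, test_id, machine, compiler, mpilib,
                  debug=True, threaded=False, include_machine=True):
    """Return the shared build directory for one machine/compiler pair.

    include_machine=False reproduces the current layout (compiler only);
    include_machine=True gives the proposed (machine)_(compiler) layout.
    """
    top = f"{machine}_{compiler}" if include_machine else compiler
    return posixpath.join(
        output_root,
        f"sharedlibroot.{test_id}",
        top,
        mpilib,
        "debug" if debug else "nodebug",          # "nodebug" is an assumed name
        "threads" if threaded else "nothreads",   # "threads" is an assumed name
    )


if __name__ == "__main__":
    for machine in ("cheyenne", "yellowstone"):
        for include_machine in (False, True):
            print(sharedlib_dir("/scratch/out", "T1", machine, "intel",
                                "mpi-serial",
                                include_machine=include_machine))
```

With the compiler-only layout both machines resolve to the same directory when they share a test id and file system, which is the suspected collision; with the machine prefix the two paths differ.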
(From skimming back through this issue, @mvertens @jedwards4b and I felt this could be closed as a wontfix.) |
I saw this before, but thought it might be a system problem. And maybe it still is, so I'm also having CISL look into this. The time I saw it before was before cheyenne was taken down, and I thought that work might have fixed it. When I run create_test for aux_clm on cheyenne several of the mpi-serial tests fail first in the build and then I have to build and run again. One of them still fails:
ERS_D_Ld7_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropGs.cheyenne_intel.clm-decStart1851_noinitial
It gives the following runtime error in the cesm.log.
What worked before was to redo the test case from scratch, so I'm trying that now.