Some yellowstone tests that abort in model run do not exit #383
Comments
I noticed the same behavior recently, but thought it was a system problem. (In reply to Bill Sacks, Thu, Aug 11, 2016 at 9:31 AM.)
I tried to reproduce this using SMS_D with a shr_sys_abort call introduced in cesm_comp_mod.F90. The run exited correctly and did not wait for the wallclock limit.
@jedwards4b : If I'm looking in the right place, it looks like your run died with:
I did one ERS test last night from (I believe) cime5.0.7, and it also exited promptly, with that same error message. No other runs that I have done recently have had that error message, and no other runs I have done recently have exited promptly. I'm not sure if there's a connection.
Further testing seems to indicate that this problem is related to the gnu compiler on yellowstone.
@jedwards4b : Thank you very much for your investigation of this. I'm fine with this issue being closed, then, if you are. The exception error that I referenced above still seems weird, but that's probably a different issue.
I have seen this behavior using Intel. It seems to happen when only a subset of tasks calls endrun. I have had jobs crash after less than a minute but chew up the whole wallclock allotment (the danger of firing off jobs just before going to bed). I opened an issue with CISL who said:
@gold2718 I don't think that it's as simple as that. The test that I did above that hung with gnu but not intel was aborting on all tasks. I tried the following simple code with gnu and it does not hang:
@jedwards4b I agree that it is not simple in that different issues happen at different times with different compilers. In addition, the machine's behavior is not consistent (the same executable will sometimes exit and sometimes hang). What I was trying to say in my last message is that it seems to me (i.e., personal anecdotal evidence) that Intel jobs are more likely to hang when not all PEs call endrun.
I did bring it up with CISL but I couldn't reproduce the problem consistently.
Renaming from "yellowstone-gnu tests that abort in model run do not exit, and are reported as PEND" to simply "yellowstone-gnu tests that abort in model run do not exit"; the reporting as PEND has been moved to a new issue: #610
I just observed this for an Intel test: I'm renaming this issue accordingly. (Actually, I guess that's what Steve was saying all along.)
Won't fix since yellowstone is almost retired.
OK, I saw problems with this same thing for yellowstone_gnu for LII cases. The test runs two cases, and the first aborts with an error, but the job remains in the queue. We decided to close this since it seemed to just be yellowstone_gnu, so I'll ignore those for now. But I'm adding a note that this continues to be a problem: LII_D_Ld3.f09_g17.ICLM45BGC.yellowstone_gnu.clm-defaultf09IC with cime5.3.0-alpha.21.
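For readers unfamiliar with the "not all PEs call endrun" scenario, here is a toy analogy in Python. It is not MPI and not CESM code, and nothing in this thread establishes that this is the actual cause on yellowstone. It only sketches the hypothesis: if some ranks stop while the others sit in a collective operation, the survivors wait until something external (here a timeout standing in for the wallclock limit) ends them. The names run_ranks, aborting, and wallclock are made up for illustration.

```python
import threading


def run_ranks(n_ranks, aborting, wallclock=0.5):
    """Toy job with one collective step; returns {rank: outcome}.

    Hypothetical illustration only: threads stand in for MPI ranks and a
    threading.Barrier stands in for a collective call.
    """
    barrier = threading.Barrier(n_ranks)
    outcome = {}

    def rank_main(rank):
        if rank in aborting:
            # This rank stops early and never reaches the collective step.
            outcome[rank] = "aborted"
            return
        try:
            # Surviving ranks block here until every rank arrives,
            # or until the stand-in wallclock limit expires.
            barrier.wait(timeout=wallclock)
            outcome[rank] = "completed"
        except threading.BrokenBarrierError:
            outcome[rank] = "hung until limit"

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome
```

In this sketch, a job where every rank aborts finishes at once, while a job where only some ranks abort leaves the rest waiting for the full timeout. That matches the anecdote above but does not explain the gnu case, where the hang occurred with all tasks aborting.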
In doing some testing of my new test infrastructure, I introduced a call to endrun in CLM to make sure that an abort is reported correctly. The behavior was not what I expected. I have confirmed that the same behavior exists at the head of master (5df46a2).
Specifically: At least on yellowstone, when the model aborts, the job does not exit from the machine, but instead remains running until it hits its wallclock limit. In addition, the final status is reported as:
PEND LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop RUN
whereas I would have expected:
FAIL LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop RUN
I tested this with
LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop
with the following code modifications: Either (which forces an abort in the first run of the LII test)
or
(which forces an abort in the second run of the LII test).
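As a small aid for spotting this situation after the fact, the sketch below splits status lines of the shape shown above (status, test name, phase) and flags a RUN phase that is still PEND. The three-field layout is inferred only from the two example lines in this issue; this is not CIME's own parser, and parse_status and run_still_pending are hypothetical helper names.

```python
from typing import NamedTuple


class StatusLine(NamedTuple):
    status: str  # e.g. "PEND" or "FAIL"
    test: str    # full test name
    phase: str   # e.g. "RUN"


def parse_status(line):
    """Split a 'STATUS TESTNAME PHASE' line (format assumed from the examples above)."""
    status, test, phase = line.split()
    return StatusLine(status, test, phase)


def run_still_pending(entry):
    """True when the RUN phase is reported as PEND and may deserve a manual check."""
    return entry.status == "PEND" and entry.phase == "RUN"


observed = parse_status("PEND LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop RUN")
expected = parse_status("FAIL LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop RUN")
```

Here run_still_pending(observed) is true and run_still_pending(expected) is false. A PEND RUN entry for a job that has already left the queue would be the case worth inspecting by hand.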