Tests that are running or have hung are reported as PEND #610
cc @ekluzek
@billsacks A test should never be left in the PEND state if it's not running. A test that gets killed due to a hang should ideally be left in the FAIL state. I'm not an expert in batch systems... when a job exceeds its allocated time, what does the batch system do? Does it hit the submitted script with a SIGKILL?
I think so - easy to test, just add --walltime 00:01
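For what it's worth, a SIGKILL can't be caught by the script, so the status could only be updated from inside the job if the batch system delivers a catchable signal first. A minimal sketch, assuming a SIGTERM arrives before any SIGKILL; `install_fail_on_term` and the status file path are hypothetical names for illustration, not part of CIME:

```python
import os
import signal


def install_fail_on_term(status_path):
    """Record FAIL in status_path if this process receives SIGTERM.

    Assumption: the batch system sends SIGTERM before SIGKILL.
    SIGKILL itself cannot be handled, so nothing here runs for it.
    """
    def _handler(signum, frame):
        # Overwrite whatever status was there (e.g. PEND) with FAIL.
        with open(status_path, "w") as fd:
            fd.write("FAIL RUN (terminated by signal {})\n".format(signum))
        raise SystemExit(1)

    signal.signal(signal.SIGTERM, _handler)
```

If the job only ever receives SIGKILL, this handler never fires and the status has to be fixed up from outside the job.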
I guess there are two somewhat-related issues here:
According to this: http://slurm.schedmd.com/scancel.html
I will also try to address (1) if it looks like it won't add too much complexity.
Sounds good, thanks. If it turns out to be easier to address (2) than (1), then I'm fine with that. Or, to say it another way: Given these three possibilities:
a. Pending in the queue
b. Currently running
c. Job killed due to a hang, running out of wallclock time
I'd at least like (a) and (c) to be reported differently from each other. If all three can be reported differently from each other, then great - but if not, then I don't care much whether (b) is reported the same as (a) or (c).
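To make the three cases concrete, here is a rough sketch of how they could be told apart. It assumes three facts can be observed per test: whether the run phase started, whether it finished cleanly, and whether the batch job still exists. The names `RunState` and `classify` are made up for illustration and are not CIME's API:

```python
from enum import Enum


class RunState(Enum):
    PEND = "PEND"  # (a) still waiting in the queue
    RUN = "RUN"    # (b) started and still running
    FAIL = "FAIL"  # (c) job is gone but the run never finished
    PASS = "PASS"  # run finished cleanly


def classify(started, finished_ok, job_alive):
    """Map three observed booleans to a reported run status."""
    if finished_ok:
        return RunState.PASS
    if job_alive:
        # Job still exists: either queued or actively running.
        return RunState.RUN if started else RunState.PEND
    # Job no longer exists and the run never completed,
    # e.g. it hung and ran out of wallclock time.
    return RunState.FAIL
```

With this mapping, (a) and (c) can never collapse into the same status, which is the minimum asked for above.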
It appears that the status for the run phase is listed as PEND until the run completes. This is a departure from cime4, in which a run was given the status of RUN once it started running. I prefer the cime4 behavior: It's helpful to see which tests are truly pending in the queue and which are running. This is particularly helpful for tests that have exited due to hanging and running out of wallclock time: such tests currently have a final status of PEND (rather than RUN in cime4).
I have a nagging feeling that this was discussed at some point, but I can't remember the details.... There may have been an argument about not introducing more status codes, but I'd personally prefer to have one more status code that prevents this misleading PEND status.
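A sketch of the cime4-style behavior described above, with a plain dict standing in for the test status file. `run_phase` is a hypothetical helper, not existing CIME code. The status is set to RUN as soon as the phase starts, so a job that is killed mid-run is left showing RUN rather than PEND:

```python
import contextlib


@contextlib.contextmanager
def run_phase(status):
    """Mark the run phase as RUN on entry and PASS or FAIL on exit."""
    status["RUN"] = "RUN"  # written before the model starts
    try:
        yield
    except Exception:
        status["RUN"] = "FAIL"
        raise
    else:
        status["RUN"] = "PASS"


status = {"RUN": "PEND"}  # state while waiting in the queue
with run_phase(status):
    pass  # the model run would happen here
```

If the process is killed outright inside the `with` block, neither exit branch runs, so the recorded status stays RUN.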
An example test that currently hangs in CESM is
SMS_D_Ld1_P24x1.f10_f10.ICRUCLM45.hobart_nag.clm-af_bias_v5
(See also #383 for some initial discussion of this issue.)