Tests that are running or have hung are reported as PEND #610

Closed
billsacks opened this issue Sep 29, 2016 · 6 comments

@billsacks
Member

billsacks commented Sep 29, 2016

It appears that the status for the run phase is listed as PEND until the run completes. This is a departure from cime4, in which a run was given the status of RUN once it started running. I prefer the cime4 behavior: It's helpful to see which tests are truly pending in the queue and which are running. This is particularly helpful for tests that have exited due to hanging and running out of wallclock time: such tests currently have a final status of PEND (rather than RUN in cime4).

I have a nagging feeling that this was discussed at some point, but I can't remember the details.... There may have been an argument about not introducing more status codes, but I'd personally prefer to have one more status code that prevents this misleading PEND status.

An example test that currently hangs in CESM is SMS_D_Ld1_P24x1.f10_f10.ICRUCLM45.hobart_nag.clm-af_bias_v5

(See also #383 for some initial discussion of this issue.)

@billsacks
Member Author

cc @ekluzek

@jgfouca jgfouca self-assigned this Nov 2, 2016
@jgfouca
Contributor

jgfouca commented Nov 2, 2016

@billsacks A test should never be left in the PEND state if it's not running. A test that gets killed due to a hang should ideally be left in the FAIL state. I'm not an expert in batch systems... when a job exceeds its allocated time, what does the batch system do? Does it hit the submitted script with a SIGKILL?

@jedwards4b
Contributor

I think so - easy to test, just add --walltime 00:01

@billsacks
Member Author

I guess there are two somewhat-related issues here:

  1. What state does a test have while it is running? Currently this seems to be PEND; I'd prefer something different, like RUN.
  2. What state does a test have if it dies due to hanging and running out of wallclock time? Ideally this would be labeled as FAIL, but I realize that may not be easy. I'm okay with this being kept at whatever status code is used for (1)... mostly this issue is about renaming that status to something like RUN rather than PEND for greater clarity and distinction from tests that are still pending.

@jgfouca
Contributor

jgfouca commented Nov 2, 2016

According to this: http://slurm.schedmd.com/scancel.html
... at least for Slurm, (2) is very doable; we just need to handle SIGTERM.

I will also try to address (1) if it looks like it won't add too much complexity.
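
For illustration only, here is a minimal Python sketch of what handling SIGTERM for (2) could look like, with (1) folded in by writing RUN when the phase starts. It assumes the batch system delivers SIGTERM to the job script when wallclock time runs out. The names `STATUS_FILE`, `MODEL_RUN`, `write_status`, `read_status`, `JobKilled`, and `run_phase` are hypothetical and are not CIME's actual API or file layout.

```python
import os
import signal
import tempfile

# Hypothetical status file location; not CIME's actual layout.
STATUS_FILE = os.path.join(tempfile.gettempdir(), "TestStatus.sketch")
# Hypothetical phase label used only in this sketch.
PHASE = "MODEL_RUN"


def write_status(status, path=STATUS_FILE):
    """Overwrite the status file with a single 'STATUS PHASE' line."""
    with open(path, "w") as fd:
        fd.write("{} {}\n".format(status, PHASE))


def read_status(path=STATUS_FILE):
    """Return the status line without its trailing newline."""
    with open(path) as fd:
        return fd.read().strip()


class JobKilled(Exception):
    """Raised from the signal handler so the run phase can unwind."""


def _on_sigterm(signum, frame):
    # Assumption: the scheduler sends SIGTERM on wallclock expiry.
    raise JobKilled("received signal {}".format(signum))


def run_phase(model_run):
    """Record RUN while model_run() executes, then PASS or FAIL."""
    previous = signal.signal(signal.SIGTERM, _on_sigterm)
    write_status("RUN")  # the test has left the queue and is running
    try:
        model_run()
        write_status("PASS")
    except JobKilled:
        write_status("FAIL")  # killed mid-run, e.g. after a hang
    finally:
        # Restore whatever handler was installed before this phase.
        signal.signal(signal.SIGTERM, previous)
    return read_status()
```

Note that SIGKILL cannot be caught, so an approach like this only helps if the scheduler sends SIGTERM first. If the job dies without the handler running, the status file would still say RUN, which is at least distinguishable from PEND.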

@billsacks
Member Author

Sounds good, thanks. If it turns out to be easier to address (2) than (1), then I'm fine with that. Or, to say it another way: Given these three possibilities:

a. Pending in the queue

b. Currently running

c. Job killed due to a hang, running out of wallclock time

I'd at least like (a) and (c) to be reported differently from each other. If all three can be reported differently from each other, then great - but if not, then I don't care much whether (b) is reported the same as (a) or (c).
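
To make the three cases concrete, here is a small Python sketch of how a status reporter might tell them apart. It is only an illustration: the `RunState` enum, the `classify` function, and its three boolean inputs are hypothetical, and how those inputs would actually be obtained (for example, by querying the batch system) is left open.

```python
from enum import Enum


class RunState(Enum):
    PEND = "PEND"  # (a) still waiting in the queue
    RUN = "RUN"    # (b) currently running
    FAIL = "FAIL"  # (c) job gone without the run completing
    PASS = "PASS"  # run completed normally


def classify(started, completed, job_alive):
    """Map three hypothetical observations to a reported state.

    started   -- the run phase recorded that it began
    completed -- the run phase recorded a normal finish
    job_alive -- the batch job still exists (queued or running)
    """
    if completed:
        return RunState.PASS
    if not job_alive:
        # The job no longer exists but never finished: killed, e.g. after
        # a hang. This also covers a job that vanished before starting.
        return RunState.FAIL
    return RunState.RUN if started else RunState.PEND
```

With this split, (a) and (c) can never collapse into the same value, which is the minimum asked for above; (b) gets its own value only if the run phase records that it has started.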
