
Need ability to detect and kill hung jobs #386

Closed
rljacob opened this issue Aug 11, 2016 · 24 comments

Comments

rljacob (Member) commented Aug 11, 2016

Partly inspired by #383, but this has been a huge problem for ACME production runs at NERSC.

When a job is running fine, files in the $RUNDIR get updated at least every minute. If there are no updates after X minutes, CIME should assume the job is hung and kill it before core-hours are wasted waiting for the job to time out.
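
A minimal sketch of what such a watchdog could look like (hypothetical, not existing CIME code; it assumes a SLURM scheduler where `scancel <jobid>` cancels the job, and the watchdog itself would have to run outside the MPI executable, e.g. backgrounded from the batch script):

```python
#!/usr/bin/env python
# Hypothetical watchdog sketch (not part of CIME): cancel a batch job when
# nothing under $RUNDIR has been modified for a given number of minutes.
import os
import subprocess
import sys
import time


def newest_mtime(rundir):
    """Return the most recent modification time of any file under rundir (0 if none)."""
    newest = 0.0
    for root, _dirs, files in os.walk(rundir):
        for name in files:
            try:
                newest = max(newest, os.path.getmtime(os.path.join(root, name)))
            except OSError:
                pass  # file vanished between listing and stat; ignore it
    return newest


def watch(rundir, jobid, stale_minutes=30, poll_seconds=60):
    """Poll rundir; if no file has changed for stale_minutes, cancel jobid."""
    while True:
        latest = newest_mtime(rundir)
        if latest > 0.0:  # wait until the run has written something
            idle_minutes = (time.time() - latest) / 60.0
            if idle_minutes > stale_minutes:
                print("No RUNDIR activity for %.1f minutes; cancelling job %s"
                      % (idle_minutes, jobid))
                # scancel assumes a SLURM system (e.g. NERSC); use qdel, bkill, etc. elsewhere
                subprocess.call(["scancel", jobid])
                return
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # illustrative usage: ./rundir_watchdog.py $RUNDIR $SLURM_JOB_ID
    watch(sys.argv[1], sys.argv[2])
```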

jgfouca (Contributor) commented Aug 11, 2016

Why would a run become hung? Deadlock? Infinite loop?

rljacob (Member, Author) commented Aug 11, 2016

See #383. Some MPI tasks may die, which deadlocks the other tasks but doesn't result in the scheduler killing the job.

jedwards4b (Contributor) commented:

Typically deadlock, but what CIME process would monitor and kill a hung job? This is an exceedingly difficult requirement.

rljacob (Member, Author) commented Aug 11, 2016

I'm not expecting a quick, or any, solution but wanted to discuss the issue.

jedwards4b (Contributor) commented:

Okay, I believe that this is a function of the job scheduler and that we should open tickets with the HPC centers whose schedulers do not behave correctly.

worleyph (Contributor) commented:

A fusion project that I work with typically uses the whole system on Titan, and this issue of failures not aborting the job has serious consequences for 12-hour jobs. The OLCF came up with a solution for them. I'll send the code directly to Rob; perhaps there are some ideas there that would work.

There have been similar discussions on detecting slow performance and killing off the code if the progress is too slow to justify continuing. I have seen code in the driver that looks at whether a prescribed throughput rate is being achieved (and kills the run otherwise). I don't know the author of this code (Tony?) or whether anyone uses it.

jedwards4b (Contributor) commented:

I am the author of that driver code, but it doesn't work on a hang, only on a detectable slowdown.

worleyph (Contributor) commented:

> I am author of that driver code.

My apologies for the misattribution. I was referring to the second goal, identifying slowdowns; sorry to have confused the discussion.

I e-mailed the example script to @rljacob. It looks like it could be emulated on more than just Titan. I have no experience using it, though, so I don't know how robust it is. It looks like it might be worth trying.

rljacob (Member, Author) commented Aug 11, 2016

Thanks, Pat. A gist with the script is here: https://gist.github.com/rljacob/c51a1bdf49f7600cb54d0d32e3bed250

rljacob (Member, Author) commented Aug 11, 2016

Jim, is that code for detecting a slowdown always on? I don't think we've seen it tripped.

jedwards4b (Contributor) commented:

The drv_in namelist variable is max_cplstep_time. If it is > 0 and any coupler step takes longer than that time, the model will abort. If it is < 0, then -(max_cplstep_time)*cktime is used as the threshold, where cktime is the time of the first coupler timestep. The default value is 0, which means this feature is not used.
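
For illustration only (the numbers are made up; the variable is set in user_nl_cpl, as noted further down in this thread, and the time is assumed to be in seconds, matching the dt reported in cpl.log), the two modes would look roughly like:

```
! user_nl_cpl -- illustrative values only
! Positive value: abort if any coupler step takes longer than 1200 seconds.
max_cplstep_time = 1200.
! Negative value (alternative): abort if any coupler step takes longer than
! 5x the time of the first coupler step (cktime).
!max_cplstep_time = -5.
```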

worleyph (Contributor) commented Aug 11, 2016

Thanks @jedwards4b. @rljacob, maybe we should ask @ndkeen to oversee trying this on Edison, for some of the production jobs where we have expected performance metrics. We'd have to capture coupler step performance (in runs that are currently categorized as 'fast'); right now Noel primarily looks at the performance metrics output in cpl.log.XXX when he is monitoring throughput.

ndkeen (Contributor) commented Aug 11, 2016

Sure.
Not that you asked me, but I think this would be a great "feature" to have, though (a) it may not be easy, and (b) what if it kills a job you really didn't want killed?

worleyph (Contributor) commented Aug 11, 2016

@ndkeen - looking at a single timestep and deciding whether it is too slow is pretty fine-grained, and we could easily have false positives. A full simulated day seems like a more robust metric, but even this is not perfect. @jedwards4b, did you try different intervals when developing this logic? And maybe we would want to wait until violations had occurred over a longish interval of time, say a simulated day's worth of violations? Eventually this will run into the issue that some steps include more I/O than others?

@jedwards4b, I really know too little about what you have implemented to comment on it; I should shut up. If you'd like @ndkeen to exercise this, please tell Noel what he should do (and what he should expect), if more than what you indicated above. If this is too low a priority for you right now, which I assume it probably is, we can wait until later.

jedwards4b (Contributor) commented:

It checks at the coupler timestep interval. You should know beforehand what the longest expected step time is (it usually occurs on the 1st or 15th of each month), then set this variable to twice that value. You set the variable in user_nl_cpl.

jedwards4b (Contributor) commented:

The value that you want is the dt that is output to the cpl.log file.

worleyph (Contributor) commented:

Thanks Jim. ... Just thinking out loud here. The problem on Edison is primarily jobs that start slow and stay slow (Noel can correct me here). Occasional slow periods in the middle of a run are annoying, but they are not necessarily sufficient reason to kill the job (given how long it takes to get jobs rescheduled).
I'm rambling here ... @ndkeen, if this sounds interesting to you, please go ahead and give it a try.

This probably needs to transition to the "slow Edison" GitHub issue page, but we can summarize the capability that Jim has implemented over there if you (Noel) want to try it out, and then decide whether we need to customize it for this particular performance problem.

ndkeen (Contributor) commented Aug 11, 2016

Yes. There was one famous job that started fast and turned slow, but others are either slow or fast to begin with.

There are also jobs that "hang" for whatever reason. Not doing anything. Surely there is a way for ACME to detect that. These jobs clearly need to be killed.

worleyph (Contributor) commented Aug 12, 2016

@ndkeen, read from the top of the issue. There are two "technologies" being discussed here: (a) a job-script approach (developed at the OLCF for a fusion code) that monitors the lengths of output files as a means to track progress and kills the job if nothing happens for some period of time; (b) a CESM (and ACME; it predates the split) capability to define a maximum acceptable coupler timestep cost and to abort if this is exceeded.

(a) will work for hung jobs.

(b) focuses on performance slowdowns for jobs whose expected performance we know. It might be too sensitive to performance blips, though, for out-of-the-box use on Edison (based on our recent experiences).

jedwards4b (Contributor) commented:

With respect to sensitivity to performance blips: this is why I recommend a setting of 2x the longest timestep. If a model slows down to this extent, I believe it is unlikely to recover.

worleyph (Contributor) commented Aug 12, 2016

Yeah, but Edison has been behaving very poorly, and I am not sure what upper bound makes sense on that system at the moment. Maybe when they reformat/update the Lustre software in the scratch directories, everything will get better. That is happening soon.

Even on Titan (or maybe especially on Titan) there can be network "storms" (making up the term) that can impact MPI overhead and/or I/O rates for appreciable periods of time, but then clear up later. The frequency of these versus "always slow" runs is not something that we have taken the time to determine, though we may actually have the data to do this.

@jedwards4b , how widely used is the max timestep cost monitoring capability? Do you use this for production runs on Yellowstone or at NERSC or elsewhere? Any feedback from the community on this feature? Thanks.

worleyph (Contributor) commented Aug 12, 2016

@ndkeen , perhaps the first step is to look at the performance data (dt time, as @jedwards4b suggested) and determine what multiple of the slowest timestep in a fast run would be sufficient to identify slow runs. I'm fixating on the boundary situations. On Edison, things have been pretty binary (fast or 10X slower), and Jim's advice could be right on, though perhaps 3X or 4X would work equally well.
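
A rough sketch of that first step (a hypothetical helper, not existing tooling; the regex used to pull dt values out of cpl.log is an assumption and would need to be adjusted to the actual log format):

```python
#!/usr/bin/env python
# Hypothetical helper: scan a cpl.log from a known-fast run and report candidate
# max_cplstep_time values as multiples of the slowest observed coupler step.
import re
import sys

# Assumed pattern: per-step times reported as "dt = <seconds>" somewhere on a line.
# Adjust to the actual cpl.log format of the model version in use.
DT_PATTERN = re.compile(r"\bdt\s*=\s*([0-9]*\.?[0-9]+)")


def step_times(logfile):
    """Return all dt values (seconds) found in the coupler log."""
    times = []
    with open(logfile) as fh:
        for line in fh:
            match = DT_PATTERN.search(line)
            if match:
                times.append(float(match.group(1)))
    return times


if __name__ == "__main__":
    dts = step_times(sys.argv[1])
    if not dts:
        sys.exit("No dt values found; adjust DT_PATTERN for this cpl.log format.")
    slowest = max(dts)
    print("steps: %d   mean dt: %.2f s   slowest dt: %.2f s"
          % (len(dts), sum(dts) / len(dts), slowest))
    for mult in (2, 3, 4):
        print("candidate max_cplstep_time at %dx slowest step: %.1f s" % (mult, mult * slowest))
```

If Edison's slow runs really are ~10x slower than fast ones, any of those multiples should separate the two populations cleanly.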

worleyph (Contributor) commented:

Just an attribution: the script that @rljacob put at https://gist.github.com/rljacob/c51a1bdf49f7600cb54d0d32e3bed250 is due to Devesh Tiwari ([email protected]).

rljacob (Member, Author) commented Apr 7, 2017

Closing this because Edison is now behaving better. Also, we can't really detect a hung job, since the detector will also hang. We can detect jobs slowing down.

rljacob closed this as completed Apr 7, 2017
pesieber pushed a commit to pesieber/cime that referenced this issue Mar 15, 2023