
Need ability to detect and kill hung jobs #386

Closed
rljacob opened this issue Aug 11, 2016 · 24 comments

Comments

rljacob (Member) commented Aug 11, 2016

Partly inspired by #383, but this has been a huge problem for ACME production runs at NERSC.

When a job is running fine, files in the $RUNDIR get updated at least every minute. If there are no updates after X minutes, CIME should assume the job is hung and kill it before core-hours are wasted waiting for the job to time out.
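
A minimal sketch of what such a watchdog could look like (hypothetical, not existing CIME code; it assumes a SLURM scheduler where `scancel <jobid>` cancels the job, and the watchdog itself would have to run outside the MPI executable, e.g. backgrounded from the batch script):

```python
#!/usr/bin/env python
# Hypothetical watchdog sketch (not part of CIME): cancel a batch job when
# nothing under $RUNDIR has been modified for a given number of minutes.
import os
import subprocess
import sys
import time


def newest_mtime(rundir):
    """Return the most recent modification time of any file under rundir (0 if none)."""
    newest = 0.0
    for root, _dirs, files in os.walk(rundir):
        for name in files:
            try:
                newest = max(newest, os.path.getmtime(os.path.join(root, name)))
            except OSError:
                pass  # file vanished between listing and stat; ignore it
    return newest


def watch(rundir, jobid, stale_minutes=30, poll_seconds=60):
    """Poll rundir; if no file has changed for stale_minutes, cancel jobid."""
    while True:
        latest = newest_mtime(rundir)
        if latest > 0.0:  # wait until the run has written something
            idle_minutes = (time.time() - latest) / 60.0
            if idle_minutes > stale_minutes:
                print("No RUNDIR activity for %.1f minutes; cancelling job %s"
                      % (idle_minutes, jobid))
                # scancel assumes a SLURM system (e.g. NERSC); use qdel, bkill, etc. elsewhere
                subprocess.call(["scancel", jobid])
                return
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # illustrative usage: ./rundir_watchdog.py $RUNDIR $SLURM_JOB_ID
    watch(sys.argv[1], sys.argv[2])
```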

jgfouca (Contributor) commented Aug 11, 2016

Why would a run become hung? Deadlock? Infinite loop?

rljacob (Member, Author) commented Aug 11, 2016

See #383. Some MPI tasks may die, which deadlocks the other tasks but doesn't result in the scheduler killing the job.

jedwards4b (Contributor) commented:

Typically deadlock, but what CIME process would monitor and kill a hung job? This is an exceedingly difficult requirement.

rljacob (Member, Author) commented Aug 11, 2016

I'm not expecting a quick, or any, solution but wanted to discuss the issue.

jedwards4b (Contributor) commented:

Okay, I believe that this is a function of the job scheduler and that we should open tickets with the HPC centers whose schedulers do not behave correctly.

worleyph (Contributor) commented:

A fusion project that I work with typically uses the whole system on Titan, and this issue of failures not aborting the job has serious consequences for 12-hour jobs. The OLCF came up with a solution for them. I'll send the code directly to Rob; perhaps there are some ideas there that would work.

There have been similar discussions on detecting slow performance and killing off the code if the progress is too slow to justify continuing. I have seen code in the driver that looks at whether a prescribed throughput rate is being achieved (and kills the run otherwise). I don't know the author of this code (Tony?) or whether anyone uses it.

jedwards4b (Contributor) commented:

I am the author of that driver code, but it doesn't work on a hang, only on a detectable slowdown.

worleyph (Contributor) commented:

> I am author of that driver code.

My apologies for the misattribution. I was referring to the second goal, identifying slowdowns; sorry to have confused the discussion.

I e-mailed the example script to @rljacob. It looks like it could be emulated on more than just Titan. I have no experience using it, though, so I don't know how robust it is. It looks like it might be worth trying.

rljacob (Member, Author) commented Aug 11, 2016

Thanks, Pat. A gist with the script is here: https://gist.github.com/rljacob/c51a1bdf49f7600cb54d0d32e3bed250

rljacob (Member, Author) commented Aug 11, 2016

Jim, is that code for detecting a slowdown always on? I don't think we've seen it tripped.

jedwards4b (Contributor) commented:

The drv_in namelist variable is max_cplstep_time. If it is > 0 and any coupler step takes longer than that time, the model will abort. If it is < 0, then -(max_cplstep_time)*cktime is used as the threshold, where cktime is the time of the first coupler timestep. The default value is 0, which means this feature is not used.
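
For illustration only (the numbers are made up; the variable is set in user_nl_cpl, as noted further down in this thread, and the time is assumed to be in seconds, matching the dt reported in cpl.log), the two modes would look roughly like:

```
! user_nl_cpl -- illustrative values only
! Positive value: abort if any coupler step takes longer than 1200 seconds.
max_cplstep_time = 1200.
! Negative value (alternative): abort if any coupler step takes longer than
! 5x the time of the first coupler step (cktime).
!max_cplstep_time = -5.
```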

worleyph (Contributor) commented Aug 11, 2016

Thanks @jedwards4b. @rljacob, maybe we should ask @ndkeen to oversee trying this on Edison, for some of the production jobs where we have expected performance metrics. We'd have to capture coupler step performance (in runs that are currently categorized as 'fast'); right now Noel primarily looks at the performance metrics output in cpl.log.XXX when he is monitoring throughput.

ndkeen (Contributor) commented Aug 11, 2016

Sure.
Not that you asked me, but I think this would be a great "feature" to have, though (a) it may not be easy, and (b) what if it kills a job you really didn't want killed?

worleyph (Contributor) commented Aug 11, 2016

@ndkeen - looking at a single timestep and deciding whether it is too slow is pretty fine-grained, and we could easily have false positives. A full simulated day seems like a more robust metric, but even this is not perfect. @jedwards4b, did you try different intervals when developing this logic? And maybe we would want to wait until violations had occurred over a longish interval of time, say a simulated day's worth of violations? Eventually this will run into the issue that some steps include more I/O than others?

@jedwards4b, I really know too little about what you have implemented to comment on it; I should shut up. If you'd like @ndkeen to exercise this, please tell Noel what he should do (and what he should expect), if more than what you indicated above. If this is too low a priority for you right now, which I assume it probably is, we can wait until later.

jedwards4b (Contributor) commented:

It checks at the coupler timestep interval. You should know beforehand what the longest expected step time is (it usually occurs on the 1st or 15th of each month), then set this variable to twice that value. You set the variable in user_nl_cpl.

jedwards4b (Contributor) commented:

The value that you want is the dt that is output to the cpl.log file.

worleyph (Contributor) commented:

Thanks Jim. ... Just thinking out loud here. The problem on Edison is primarily jobs that start slow and stay slow (Noel can correct me here). Occasional slow periods in the middle of a run are annoying, but they are not necessarily sufficient reason to kill the job (given how long it takes to get jobs rescheduled).
I'm rambling here ... @ndkeen, if this sounds interesting to you, please go ahead and give it a try.

This probably needs to transition to the "slow Edison" GitHub issue page, but we can summarize the capability that Jim has implemented over there if you (Noel) want to try it out, and then decide whether we need to customize it for this particular performance problem.

ndkeen (Contributor) commented Aug 11, 2016

Yes. There was one famous job that started fast and turned slow, but others are either slow or fast to begin with.

There are also jobs that "hang" for whatever reason. Not doing anything. Surely there is a way for ACME to detect that. These jobs clearly need to be killed.

worleyph (Contributor) commented Aug 12, 2016

@ndkeen, read from the top of the issue. There are two "technologies" being discussed here: (a) a job-script approach (developed at the OLCF for a fusion code) that monitors the lengths of output files as a means to track progress and kills the job if nothing happens for some period of time; (b) a CESM (and ACME; it predates the split) capability to define a maximum acceptable coupler timestep cost and to abort if this is exceeded.

(a) will work for hung jobs.

(b) focuses on performance slowdowns for jobs whose expected performance we know. It might be too sensitive to performance blips, though, for out-of-the-box use on Edison (based on our recent experiences).

jedwards4b (Contributor) commented:

With respect to sensitivity to performance blips: this is why I recommend a setting of 2x the longest timestep. If a model slows down to this extent, I believe it is unlikely to recover.

worleyph (Contributor) commented Aug 12, 2016

Yeah, but Edison has been behaving very poorly, and I am not sure what upper bound makes sense on that system at the moment. Maybe when they reformat/update the Lustre software in the scratch directories, everything will get better. That is happening soon.

Even on Titan (or maybe especially on Titan) there can be network "storms" (making up the term) that can impact MPI overhead and/or I/O rates for appreciable periods of time, but then clear up later. The frequency of these versus "always slow" runs is not something that we have taken the time to determine, though we may actually have the data to do this.

@jedwards4b , how widely used is the max timestep cost monitoring capability? Do you use this for production runs on Yellowstone or at NERSC or elsewhere? Any feedback from the community on this feature? Thanks.

worleyph (Contributor) commented Aug 12, 2016

@ndkeen , perhaps the first step is to look at the performance data (dt time, as @jedwards4b suggested) and determine what multiple of the slowest timestep in a fast run would be sufficient to identify slow runs. I'm fixating on the boundary situations. On Edison, things have been pretty binary (fast or 10X slower), and Jim's advice could be right on, though perhaps 3X or 4X would work equally well.
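
A rough sketch of that first step (a hypothetical helper, not existing tooling; the regex used to pull dt values out of cpl.log is an assumption and would need to be adjusted to the actual log format):

```python
#!/usr/bin/env python
# Hypothetical helper: scan a cpl.log from a known-fast run and report candidate
# max_cplstep_time values as multiples of the slowest observed coupler step.
import re
import sys

# Assumed pattern: per-step times reported as "dt = <seconds>" somewhere on a line.
# Adjust to the actual cpl.log format of the model version in use.
DT_PATTERN = re.compile(r"\bdt\s*=\s*([0-9]*\.?[0-9]+)")


def step_times(logfile):
    """Return all dt values (seconds) found in the coupler log."""
    times = []
    with open(logfile) as fh:
        for line in fh:
            match = DT_PATTERN.search(line)
            if match:
                times.append(float(match.group(1)))
    return times


if __name__ == "__main__":
    dts = step_times(sys.argv[1])
    if not dts:
        sys.exit("No dt values found; adjust DT_PATTERN for this cpl.log format.")
    slowest = max(dts)
    print("steps: %d   mean dt: %.2f s   slowest dt: %.2f s"
          % (len(dts), sum(dts) / len(dts), slowest))
    for mult in (2, 3, 4):
        print("candidate max_cplstep_time at %dx slowest step: %.1f s" % (mult, mult * slowest))
```

If Edison's slow runs really are ~10x slower than fast ones, any of those multiples should separate the two populations cleanly.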

worleyph (Contributor) commented:

Just an attribution: the script that @rljacob put at https://gist.github.com/rljacob/c51a1bdf49f7600cb54d0d32e3bed250 is due to Devesh Tiwari ([email protected]).

rljacob (Member, Author) commented Apr 7, 2017

Closing this because Edison is now behaving better. Also, we can't really detect a hung job, since the detector will also hang. We can detect jobs slowing down.

rljacob closed this as completed Apr 7, 2017
pesieber pushed a commit to pesieber/cime that referenced this issue Mar 15, 2023