Need ability to detect and kill hung jobs #386
Comments
Why would a run become hung? Deadlock? Infinite loop?
See #383. Some MPI tasks may die, which deadlocks the other tasks but doesn't result in the scheduler killing the job.
Typically deadlock, but what CIME process would monitor and kill a hung job? This is an exceedingly difficult requirement.
I'm not expecting a quick, or any, solution, but wanted to discuss the issue.
Okay, I believe that this is a function of the job scheduler and that we should open tickets with HPC centers whose schedulers do not behave correctly.
A fusion project that I work with typically uses the whole system on Titan; this issue of failures not aborting the jobs has serious consequences for 12-hour jobs. The OLCF came up with a solution for them. I'll send the code directly to Rob - perhaps there are some ideas there that would work. There have been similar discussions on detecting slow performance and killing off the code if progress is too slow to justify continuing. I have seen code in the driver that looks at whether a prescribed throughput rate is being achieved (and kills the run otherwise). I don't know the author of this code (Tony?) or whether anyone uses it.
I am the author of that driver code, but it doesn't work on a hang, only on a detectable slowdown.
My apologies for the misattribution. I was referring to the second goal - identifying slowdowns - sorry to have confused the discussion. I e-mailed the example script to @rljacob. It looks like it could be emulated on more than just Titan. I have no experience using it, though, so I don't know how robust it is. It looks like it might be worth trying.
Thanks, Pat. A gist with the script is here: https://gist.github.com/rljacob/c51a1bdf49f7600cb54d0d32e3bed250 |
Jim, is that code for detecting a slowdown always on? I don't think we've seen it tripped. |
The drv_in namelist variable is max_cplstep_time. If it is > 0 and any coupler step takes longer than that time, the model will abort. If it is < 0, the threshold -(max_cplstep_time)*cktime is used, where cktime is the time of the first coupler timestep. The default value is 0, which means this feature is not used.
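For readers unfamiliar with that check, here is a minimal sketch of the threshold logic as described above, written in Python for illustration only; the actual check lives in the coupler/driver code, and the function and variable names below are assumptions, not the real ones.

```python
def cplstep_abort_check(step_time, first_step_time, max_cplstep_time):
    """Return True if the run should abort, per the rule described above.

    max_cplstep_time > 0 : abort when any coupler step exceeds that many seconds.
    max_cplstep_time < 0 : abort when a step exceeds -(max_cplstep_time) times
                           the first coupler step time (cktime).
    max_cplstep_time == 0: feature disabled.
    """
    if max_cplstep_time > 0:
        return step_time > max_cplstep_time
    if max_cplstep_time < 0:
        return step_time > (-max_cplstep_time) * first_step_time
    return False
```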
Thanks @jedwards4b . @rljacob , maybe we should ask @ndkeen to oversee trying this on Edison, for some of the production jobs where we have expected performance metrics. We'd have to capture coupler step performance (in runs that are currently categorized as 'fast') - right now Noel primarily looks at the performance metrics output in cpl.log.XXX when he is monitoring throughput. |
Sure. |
@ndkeen - looking at a single timestep and deciding whether it is too slow is pretty fine-grained, and we could easily have false positives. A full simulated day seems like a more robust metric, but even this is not perfect. @jedwards4b, did you try different intervals when developing this logic? And maybe we would want to wait until violations had occurred over a longish interval of time, say a simulated day's worth of violations? Eventually this will run into the issue that some steps include more I/O than others. @jedwards4b, I really know too little about what you have implemented to comment on it. I should shut up. If you'd like @ndkeen to exercise this, please tell Noel what he should do (and what he should expect), if more than what you indicated above. If this is too low a priority for you right now - which I assume it probably is - we can wait until later.
It checks at the coupler timestep interval. You should know beforehand what the longest expected step time is - it usually occurs on the 1st or 15th of each month - then set this variable to twice that value. You set the variable in user_nl_cpl.
The value that you want is the dt that is output to the cpl.log file. |
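As a concrete example, a hypothetical user_nl_cpl entry following the 2x recommendation might look like the snippet below. The 120-second value is purely illustrative; base the setting on roughly twice the largest dt reported in your own cpl.log.

```
! user_nl_cpl (illustrative values only)
! Abort if any coupler step takes longer than 120 s, assuming the slowest
! healthy step (e.g. on the 1st or 15th of the month) is about 60 s.
max_cplstep_time = 120.
```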
Thanks Jim. ... Just thinking out loud here. The problem on Edison is primarily jobs that start slow and stay slow (Noel can correct me here). Periodic slowdowns in the middle of a run are annoying, but are not necessarily sufficient reason to kill the job (given how long it takes to get jobs rescheduled). This probably needs to transition to the "slow Edison" GitHub issue page, but we can summarize the capability that Jim has implemented over there if you (Noel) want to try it out, and then decide if we need to customize it for this particular performance problem.
Yes. There was one famous job that started fast and turned slow, but others are either slow or fast to begin with. There are also jobs that "hang" for whatever reason. Not doing anything. Surely there is a way for ACME to detect that. These jobs clearly need to be killed. |
@ndkeen, read from the top of the issue. There are two "technologies" being discussed here: (a) a job script approach (developed at the OLCF for a fusion code) that monitors the lengths of output files as a means to monitor progress, and kills the job if nothing happens for some period of time; (b) a CESM (and ACME - this occurred before the split) capability to define a maximum acceptable coupled timestep cost and to abort if this is exceeded. (a) will work for hung jobs. (b) focuses on performance slowdowns for jobs where we know what performance to expect. It might be too sensitive to performance blips, though, for out-of-the-box use on Edison (based on our recent experiences).
With respect to sensitivity to performance blips - this is why I recommend a setting of 2x the longest time step. If a model slows down to this extent I believe that it is unlikely to recover. |
Yeah, but Edison has been behaving very poorly, and I am not sure what upper bound makes sense on that system at the moment. Maybe when they reformat/update the Lustre software in the scratch directories, everything will get better. That is happening soon. Even on Titan (or maybe especially on Titan) there can be network "storms" (making up the term) that can impact MPI overhead and/or I/O rates for appreciable periods of time, but then clear up later. The frequency of these versus "always slow" runs is not something that we have taken the time to determine, though we may actually have the data to do this. @jedwards4b, how widely used is the max timestep cost monitoring capability? Do you use this for production runs on Yellowstone or at NERSC or elsewhere? Any feedback from the community on this feature? Thanks.
@ndkeen , perhaps the first step is to look at the performance data (dt time, as @jedwards4b suggested) and determine what multiple of the slowest timestep in a fast run would be sufficient to identify slow runs. I'm fixating on the boundary situations. On Edison, things have been pretty binary (fast or 10X slower), and Jim's advice could be right on, though perhaps 3X or 4X would work equally well. |
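A small helper along these lines could pull the dt values out of a known-good run's cpl.log and suggest a threshold. This is only a sketch under assumptions: it presumes dt values appear on log lines as "dt = <seconds>", which may need adjusting to the actual cpl.log format on a given system.

```python
import re
import sys

# Hypothetical helper: scan a cpl.log from a known-good ("fast") run, find the
# slowest coupler step, and suggest a max_cplstep_time threshold (2x-4x that
# value, per the discussion above).  The regex assumes dt values are printed as
# "dt = <seconds>"; adjust it to the real cpl.log format if needed.
DT_RE = re.compile(r"\bdt\s*=\s*([0-9]*\.?[0-9]+)")

def slowest_step(logfile):
    dts = []
    with open(logfile) as f:
        for line in f:
            dts.extend(float(v) for v in DT_RE.findall(line))
    return max(dts) if dts else None

if __name__ == "__main__":
    dt_max = slowest_step(sys.argv[1])
    if dt_max is None:
        print("no dt values found; check the regex against your cpl.log")
    else:
        print("slowest step: %.2f s; 2x threshold: %.2f s" % (dt_max, 2 * dt_max))
```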
Just an attribution - the script that @rljacob put at https://gist.github.com/rljacob/c51a1bdf49f7600cb54d0d32e3bed250 is due to Devesh Tiwari [email protected].
Closing this because Edison is now behaving better. Also can't really detect a hung job since the detector will also hang. Can detect jobs slowing down. |
… isotope transient test run threaded, this fixes ESMCI#386
Partly inspired by #383 but this has been a huge problem for ACME production runs on NERSC.
When a job is running fine, files in the $RUNDIR get updated at least every minute. If there are no updates after X minutes, CIME should assume the job is hung and kill it before core-hours are wasted waiting for the job to time out.
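A minimal sketch of that behavior follows, assuming the watcher is launched in the background from the batch script (so it does not hang along with the MPI executable) and that the caller supplies the scheduler's kill command (e.g. scancel or qdel) and job id. The script name, the 30-minute default, and the command-line interface are illustrative assumptions, not part of CIME.

```python
import os
import subprocess
import sys
import time

def newest_mtime(rundir):
    """Most recent modification time of any file under rundir (0.0 if empty)."""
    latest = 0.0
    for root, _, files in os.walk(rundir):
        for name in files:
            try:
                latest = max(latest, os.path.getmtime(os.path.join(root, name)))
            except OSError:
                pass  # a file may vanish between listing and stat
    return latest

def watch(rundir, kill_cmd, stall_minutes=30, poll_seconds=60):
    """Run kill_cmd if no file under rundir has changed for stall_minutes."""
    while True:
        idle_min = (time.time() - newest_mtime(rundir)) / 60.0
        if idle_min > stall_minutes:
            print("No RUNDIR activity for %.1f minutes; assuming the job is hung" % idle_min)
            subprocess.call(kill_cmd)
            return
        time.sleep(poll_seconds)

if __name__ == "__main__":
    # Example (hypothetical): python watch_rundir.py $RUNDIR scancel $SLURM_JOB_ID &
    watch(sys.argv[1], sys.argv[2:])
```

Running the watcher as a separate process alongside the application is what sidesteps the "the detector will also hang" objection raised above: the application can deadlock, but the watcher keeps polling as long as the batch allocation itself is alive, similar in spirit to the OLCF output-file-monitoring script mentioned earlier.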