Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When using check_cmd_status, nhc leaks watchdog processes #104

Closed
ghost opened this issue Jul 14, 2021 · 4 comments · Fixed by #132
Closed

When using check_cmd_status, nhc leaks watchdog processes #104

ghost opened this issue Jul 14, 2021 · 4 comments · Fixed by #132
Assignees
Labels
bug PRIORITY Urgent, high-priority issue/PR for 1.4.4, post-1.4.3 hotfix branch refactor Requires rethinking and/or rewriting of a large swath of code
Milestone

Comments

@ghost
Copy link

ghost commented Jul 14, 2021

A simple test case:

nhc.conf:

* || TIMEOUT=120
* || check_cmd_status -t 100 -r 0 /bin/true

Now if nhc is run on a node, the main process returns quickly, but it leaks a background nhc watchdog process which still has a TTY and is running sleep. After the sleep expires, it will trigger the watchdog and it will try to kill a process. In theory, if the machine is sufficiently busy or the timeout is sufficiently long, this can end up killing a process using a reused PID number.

from ps axf right after running nhc:

240166 pts/0    S      0:00 /bin/bash /usr/sbin/nhc
240167 pts/0    S      0:00  \_ sleep 100

This is a bit of an annoyance because it also causes this (because of it holding on to a TTY):

$ time ssh <node> "time nhc"

real    0m0.037s
user    0m0.023s
sys     0m0.005s

real    1m40.472s
user    0m0.026s
sys     0m0.005s
$

The watchdog process is spawned at https://github.com/mej/nhc/blob/master/scripts/lbnl_cmd.nhc#L169, and as far as I can see, there is nothing that tries to kill it after the command runs, and there probably should be.

In case bash version is relevant, this happens on version 4.2.46(2)-release (x86_64-redhat-linux-gnu).

@basvandervlies
Copy link
Contributor

Is there a solution? We have the same problem I did not notice this till I installed slurm 22.05 and now the srun hangs till this process are finished. In older slurm versions it was no problem

@mej mej self-assigned this Mar 4, 2023
@mej mej added bug PRIORITY Urgent, high-priority issue/PR for 1.4.4, post-1.4.3 hotfix branch refactor Requires rethinking and/or rewriting of a large swath of code labels Mar 4, 2023
@mej mej added this to the 1.4.4 Release milestone Mar 4, 2023
@mej
Copy link
Owner

mej commented Mar 4, 2023

We have not seen this on 22.05, to the best of my knowledge, but I will look again. But for what it's worth, going over the code again, I believe I can see exactly what the problem is. In fact, I'm not sure if I overlooked something or just wasn't paying close enough attention, but I don't really know how I expected it to work correctly at all, what with everything being in local variables...

Anyhoo...I will prioritize this fix. The gist of it is that nhcmain_watchdog_timer() needs to use global arrays, not local variables, for keeping track of the NHC/subcommand PIDs and the sleep PIDs, and the trap calls need to be set for all the relevant PIDs.

mej added a commit that referenced this issue Apr 19, 2023
For some reason (probably simplicity), the `check_cmd_status()` check
function was using a different method of spawning a subprocess with a
timout than `check_cmd_status()` was.  The `nhc_cmd_with_timeout()`
function was written specifically to facilitate consistency in coding
that exact functionality, but only `check_cmd_output()` was using it.
The `check_cmd_status()` check function was launching the subcommand
directly and trying to use `nhcmain_watchdog_timer()` to create its
watchdog timer process.

As observed in #104, `check_cmd_status()` (but *not*
`check_cmd_output()`) was leaving behind watchdog timer and `sleep`
processes that should have been terminated when the subcommand
exited.  Unfortunately, `nhcmain_watchdog_timer()` was not written
with this use case in mind, nor was `kill_watchdog()` expecting to
have to clean up multiple child processes.

This will be addressed in 2 ways.  For the "fix-only" 1.4.4 tree, I've
switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since
it uses a different mechanism and has not displayed this behavior.
The longer term fix will be to refactor the watchdog code in `nhc`
itself and use that code whenever launching subcommands is needed.

This should address #104 for the 1.4.4 branch but will be applied to
both for now, once tested and verified.
mej added a commit that referenced this issue Apr 19, 2023
See commits 1dd668e and b08769b for further details.

As referenced in the above commits, the longer-term fix for the 1.5+
branch is a refactoring of all the watchdog timer code in `nhc` so
that multiple distinct timers can be managed simultaneously, including
their termination in case of successful subprocess/program exit.  Lack
of proper cleanup was ultimately the key cause of #104's leaked
shell+`sleep` processes.

**NOTE**:  The `nhc` script itself does *not* keep track of all the
PIDs for all the timers it has spawned off, only the main one (for the
top-level `nhc` process).  Any other timer PIDs must be tracked by
whatever spawned them.  In particular, `nhc_cmd_with_timeout()` tracks
both the task and the timer PIDs and ensures that both processes have
exited before it returns.
mej added a commit that referenced this issue Apr 19, 2023
For some reason (probably simplicity), the `check_cmd_status()` check
function was using a different method of spawning a subprocess with a
timout than `check_cmd_status()` was.  The `nhc_cmd_with_timeout()`
function was written specifically to facilitate consistency in coding
that exact functionality, but only `check_cmd_output()` was using it.
The `check_cmd_status()` check function was launching the subcommand
directly and trying to use `nhcmain_watchdog_timer()` to create its
watchdog timer process.

As observed in #104, `check_cmd_status()` (but *not*
`check_cmd_output()`) was leaving behind watchdog timer and `sleep`
processes that should have been terminated when the subcommand
exited.  Unfortunately, `nhcmain_watchdog_timer()` was not written
with this use case in mind, nor was `kill_watchdog()` expecting to
have to clean up multiple child processes.

This will be addressed in 2 ways.  For the "fix-only" 1.4.4 tree, I've
switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since
it uses a different mechanism and has not displayed this behavior.
The longer term fix will be to refactor the watchdog code in `nhc`
itself and use that code whenever launching subcommands is needed.

This should address #104 for the 1.4.4 branch but will be applied to
both for now, once tested and verified.

(cherry picked from commit 1dd668e)
@mej mej linked a pull request Apr 19, 2023 that will close this issue
@mej
Copy link
Owner

mej commented Apr 20, 2023

I would love to get your feedback, either on the b08769b workaround/fix on the 1.4.4-dev branch or the new 1.5 code on the PR #132 branch! I believe either will fix this problem.

mej added a commit that referenced this issue Apr 25, 2023
For some reason (probably simplicity), the `check_cmd_status()` check
function was using a different method of spawning a subprocess with a
timout than `check_cmd_status()` was.  The `nhc_cmd_with_timeout()`
function was written specifically to facilitate consistency in coding
that exact functionality, but only `check_cmd_output()` was using it.
The `check_cmd_status()` check function was launching the subcommand
directly and trying to use `nhcmain_watchdog_timer()` to create its
watchdog timer process.

As observed in #104, `check_cmd_status()` (but *not*
`check_cmd_output()`) was leaving behind watchdog timer and `sleep`
processes that should have been terminated when the subcommand
exited.  Unfortunately, `nhcmain_watchdog_timer()` was not written
with this use case in mind, nor was `kill_watchdog()` expecting to
have to clean up multiple child processes.

This will be addressed in 2 ways.  For the "fix-only" 1.4.4 tree, I've
switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since
it uses a different mechanism and has not displayed this behavior.
The longer term fix will be to refactor the watchdog code in `nhc`
itself and use that code whenever launching subcommands is needed.

This should address #104 for the 1.4.4 branch but will be applied to
both for now, once tested and verified.
mej added a commit that referenced this issue Apr 25, 2023
See commits b08769b and 8fe57c0 for further details.

As referenced in the above commits, the longer-term fix for the 1.5+
branch is a refactoring of all the watchdog timer code in `nhc` so
that multiple distinct timers can be managed simultaneously, including
their termination in case of successful subprocess/program exit.  Lack
of proper cleanup was ultimately the key cause of #104's leaked
shell+`sleep` processes.

**NOTE**:  The `nhc` script itself does *not* keep track of all the
PIDs for all the timers it has spawned off, only the main one (for the
top-level `nhc` process).  Any other timer PIDs must be tracked by
whatever spawned them.  In particular, `nhc_cmd_with_timeout()` tracks
both the task and the timer PIDs and ensures that both processes have
exited before it returns.
@mej
Copy link
Owner

mej commented Jul 26, 2023

In the absence of further information to the contrary, I believe this to be fixed, so I am closing this Issue. Please feel free to reopen if you're still seeing this behavior!

@mej mej closed this as completed Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug PRIORITY Urgent, high-priority issue/PR for 1.4.4, post-1.4.3 hotfix branch refactor Requires rethinking and/or rewriting of a large swath of code
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants