When using check_cmd_status, nhc leaks watchdog processes #104

ghost · 2021-07-14T13:21:46Z

A simple test case:

nhc.conf:

* || TIMEOUT=120
* || check_cmd_status -t 100 -r 0 /bin/true

Now if nhc is run on a node, the main process returns quickly, but it leaks a background nhc watchdog process which still has a TTY and is running sleep. After the sleep expires, it will trigger the watchdog and it will try to kill a process. In theory, if the machine is sufficiently busy or the timeout is sufficiently long, this can end up killing a process using a reused PID number.

from ps axf right after running nhc:

240166 pts/0    S      0:00 /bin/bash /usr/sbin/nhc
240167 pts/0    S      0:00  \_ sleep 100

This is a bit of an annoyance because it also causes this (because of it holding on to a TTY):

$ time ssh <node> "time nhc"

real    0m0.037s
user    0m0.023s
sys     0m0.005s

real    1m40.472s
user    0m0.026s
sys     0m0.005s
$

The watchdog process is spawned at https://github.com/mej/nhc/blob/master/scripts/lbnl_cmd.nhc#L169, and as far as I can see, there is nothing that tries to kill it after the command runs, and there probably should be.

In case bash version is relevant, this happens on version 4.2.46(2)-release (x86_64-redhat-linux-gnu).

The text was updated successfully, but these errors were encountered:

basvandervlies · 2022-08-05T07:33:20Z

Is there a solution? We have the same problem I did not notice this till I installed slurm 22.05 and now the srun hangs till this process are finished. In older slurm versions it was no problem

mej · 2023-03-04T06:43:47Z

We have not seen this on 22.05, to the best of my knowledge, but I will look again. But for what it's worth, going over the code again, I believe I can see exactly what the problem is. In fact, I'm not sure if I overlooked something or just wasn't paying close enough attention, but I don't really know how I expected it to work correctly at all, what with everything being in local variables...

Anyhoo...I will prioritize this fix. The gist of it is that nhcmain_watchdog_timer() needs to use global arrays, not local variables, for keeping track of the NHC/subcommand PIDs and the sleep PIDs, and the trap calls need to be set for all the relevant PIDs.

For some reason (probably simplicity), the `check_cmd_status()` check function was using a different method of spawning a subprocess with a timout than `check_cmd_status()` was. The `nhc_cmd_with_timeout()` function was written specifically to facilitate consistency in coding that exact functionality, but only `check_cmd_output()` was using it. The `check_cmd_status()` check function was launching the subcommand directly and trying to use `nhcmain_watchdog_timer()` to create its watchdog timer process. As observed in #104, `check_cmd_status()` (but *not* `check_cmd_output()`) was leaving behind watchdog timer and `sleep` processes that should have been terminated when the subcommand exited. Unfortunately, `nhcmain_watchdog_timer()` was not written with this use case in mind, nor was `kill_watchdog()` expecting to have to clean up multiple child processes. This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since it uses a different mechanism and has not displayed this behavior. The longer term fix will be to refactor the watchdog code in `nhc` itself and use that code whenever launching subcommands is needed. This should address #104 for the 1.4.4 branch but will be applied to both for now, once tested and verified.

See commits 1dd668e and b08769b for further details. As referenced in the above commits, the longer-term fix for the 1.5+ branch is a refactoring of all the watchdog timer code in `nhc` so that multiple distinct timers can be managed simultaneously, including their termination in case of successful subprocess/program exit. Lack of proper cleanup was ultimately the key cause of #104's leaked shell+`sleep` processes. **NOTE**: The `nhc` script itself does *not* keep track of all the PIDs for all the timers it has spawned off, only the main one (for the top-level `nhc` process). Any other timer PIDs must be tracked by whatever spawned them. In particular, `nhc_cmd_with_timeout()` tracks both the task and the timer PIDs and ensures that both processes have exited before it returns.

For some reason (probably simplicity), the `check_cmd_status()` check function was using a different method of spawning a subprocess with a timout than `check_cmd_status()` was. The `nhc_cmd_with_timeout()` function was written specifically to facilitate consistency in coding that exact functionality, but only `check_cmd_output()` was using it. The `check_cmd_status()` check function was launching the subcommand directly and trying to use `nhcmain_watchdog_timer()` to create its watchdog timer process. As observed in #104, `check_cmd_status()` (but *not* `check_cmd_output()`) was leaving behind watchdog timer and `sleep` processes that should have been terminated when the subcommand exited. Unfortunately, `nhcmain_watchdog_timer()` was not written with this use case in mind, nor was `kill_watchdog()` expecting to have to clean up multiple child processes. This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since it uses a different mechanism and has not displayed this behavior. The longer term fix will be to refactor the watchdog code in `nhc` itself and use that code whenever launching subcommands is needed. This should address #104 for the 1.4.4 branch but will be applied to both for now, once tested and verified. (cherry picked from commit 1dd668e)

mej · 2023-04-20T18:49:55Z

I would love to get your feedback, either on the b08769b workaround/fix on the 1.4.4-dev branch or the new 1.5 code on the PR #132 branch! I believe either will fix this problem.

For some reason (probably simplicity), the `check_cmd_status()` check function was using a different method of spawning a subprocess with a timout than `check_cmd_status()` was. The `nhc_cmd_with_timeout()` function was written specifically to facilitate consistency in coding that exact functionality, but only `check_cmd_output()` was using it. The `check_cmd_status()` check function was launching the subcommand directly and trying to use `nhcmain_watchdog_timer()` to create its watchdog timer process. As observed in #104, `check_cmd_status()` (but *not* `check_cmd_output()`) was leaving behind watchdog timer and `sleep` processes that should have been terminated when the subcommand exited. Unfortunately, `nhcmain_watchdog_timer()` was not written with this use case in mind, nor was `kill_watchdog()` expecting to have to clean up multiple child processes. This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since it uses a different mechanism and has not displayed this behavior. The longer term fix will be to refactor the watchdog code in `nhc` itself and use that code whenever launching subcommands is needed. This should address #104 for the 1.4.4 branch but will be applied to both for now, once tested and verified.

See commits b08769b and 8fe57c0 for further details. As referenced in the above commits, the longer-term fix for the 1.5+ branch is a refactoring of all the watchdog timer code in `nhc` so that multiple distinct timers can be managed simultaneously, including their termination in case of successful subprocess/program exit. Lack of proper cleanup was ultimately the key cause of #104's leaked shell+`sleep` processes. **NOTE**: The `nhc` script itself does *not* keep track of all the PIDs for all the timers it has spawned off, only the main one (for the top-level `nhc` process). Any other timer PIDs must be tracked by whatever spawned them. In particular, `nhc_cmd_with_timeout()` tracks both the task and the timer PIDs and ensures that both processes have exited before it returns.

mej · 2023-07-26T17:23:01Z

In the absence of further information to the contrary, I believe this to be fixed, so I am closing this Issue. Please feel free to reopen if you're still seeing this behavior!

mej self-assigned this Mar 4, 2023

mej added bug PRIORITY Urgent, high-priority issue/PR for 1.4.4, post-1.4.3 hotfix branch refactor Requires rethinking and/or rewriting of a large swath of code labels Mar 4, 2023

mej added this to the 1.4.4 Release milestone Mar 4, 2023

mej mentioned this issue Apr 19, 2023

nhc: Refactor watchdog timer code for reuse #132

Merged

mej linked a pull request Apr 19, 2023 that will close this issue

nhc: Refactor watchdog timer code for reuse #132

Merged

mej mentioned this issue May 1, 2023

Manually drained node got resumed after watchdog timer was unable to terminate hung NHC process #134

Open

mej closed this as completed Jul 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When using check_cmd_status, nhc leaks watchdog processes #104

When using check_cmd_status, nhc leaks watchdog processes #104

ghost commented Jul 14, 2021

basvandervlies commented Aug 5, 2022

mej commented Mar 4, 2023

mej commented Apr 20, 2023 •

edited

Loading

mej commented Jul 26, 2023

When using check_cmd_status, nhc leaks watchdog processes #104

When using check_cmd_status, nhc leaks watchdog processes #104

Comments

ghost commented Jul 14, 2021

basvandervlies commented Aug 5, 2022

mej commented Mar 4, 2023

mej commented Apr 20, 2023 • edited Loading

mej commented Jul 26, 2023

mej commented Apr 20, 2023 •

edited

Loading