-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
When using check_cmd_status, nhc leaks watchdog processes #104
Comments
Is there a solution? We have the same problem I did not notice this till I installed slurm 22.05 and now the |
We have not seen this on 22.05, to the best of my knowledge, but I will look again. But for what it's worth, going over the code again, I believe I can see exactly what the problem is. In fact, I'm not sure if I overlooked something or just wasn't paying close enough attention, but I don't really know how I expected it to work correctly at all, what with everything being in local variables... Anyhoo...I will prioritize this fix. The gist of it is that |
For some reason (probably simplicity), the `check_cmd_status()` check function was using a different method of spawning a subprocess with a timout than `check_cmd_status()` was. The `nhc_cmd_with_timeout()` function was written specifically to facilitate consistency in coding that exact functionality, but only `check_cmd_output()` was using it. The `check_cmd_status()` check function was launching the subcommand directly and trying to use `nhcmain_watchdog_timer()` to create its watchdog timer process. As observed in #104, `check_cmd_status()` (but *not* `check_cmd_output()`) was leaving behind watchdog timer and `sleep` processes that should have been terminated when the subcommand exited. Unfortunately, `nhcmain_watchdog_timer()` was not written with this use case in mind, nor was `kill_watchdog()` expecting to have to clean up multiple child processes. This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since it uses a different mechanism and has not displayed this behavior. The longer term fix will be to refactor the watchdog code in `nhc` itself and use that code whenever launching subcommands is needed. This should address #104 for the 1.4.4 branch but will be applied to both for now, once tested and verified.
See commits 1dd668e and b08769b for further details. As referenced in the above commits, the longer-term fix for the 1.5+ branch is a refactoring of all the watchdog timer code in `nhc` so that multiple distinct timers can be managed simultaneously, including their termination in case of successful subprocess/program exit. Lack of proper cleanup was ultimately the key cause of #104's leaked shell+`sleep` processes. **NOTE**: The `nhc` script itself does *not* keep track of all the PIDs for all the timers it has spawned off, only the main one (for the top-level `nhc` process). Any other timer PIDs must be tracked by whatever spawned them. In particular, `nhc_cmd_with_timeout()` tracks both the task and the timer PIDs and ensures that both processes have exited before it returns.
For some reason (probably simplicity), the `check_cmd_status()` check function was using a different method of spawning a subprocess with a timout than `check_cmd_status()` was. The `nhc_cmd_with_timeout()` function was written specifically to facilitate consistency in coding that exact functionality, but only `check_cmd_output()` was using it. The `check_cmd_status()` check function was launching the subcommand directly and trying to use `nhcmain_watchdog_timer()` to create its watchdog timer process. As observed in #104, `check_cmd_status()` (but *not* `check_cmd_output()`) was leaving behind watchdog timer and `sleep` processes that should have been terminated when the subcommand exited. Unfortunately, `nhcmain_watchdog_timer()` was not written with this use case in mind, nor was `kill_watchdog()` expecting to have to clean up multiple child processes. This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since it uses a different mechanism and has not displayed this behavior. The longer term fix will be to refactor the watchdog code in `nhc` itself and use that code whenever launching subcommands is needed. This should address #104 for the 1.4.4 branch but will be applied to both for now, once tested and verified. (cherry picked from commit 1dd668e)
I would love to get your feedback, either on the b08769b workaround/fix on the 1.4.4-dev branch or the new 1.5 code on the PR #132 branch! I believe either will fix this problem. |
For some reason (probably simplicity), the `check_cmd_status()` check function was using a different method of spawning a subprocess with a timout than `check_cmd_status()` was. The `nhc_cmd_with_timeout()` function was written specifically to facilitate consistency in coding that exact functionality, but only `check_cmd_output()` was using it. The `check_cmd_status()` check function was launching the subcommand directly and trying to use `nhcmain_watchdog_timer()` to create its watchdog timer process. As observed in #104, `check_cmd_status()` (but *not* `check_cmd_output()`) was leaving behind watchdog timer and `sleep` processes that should have been terminated when the subcommand exited. Unfortunately, `nhcmain_watchdog_timer()` was not written with this use case in mind, nor was `kill_watchdog()` expecting to have to clean up multiple child processes. This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since it uses a different mechanism and has not displayed this behavior. The longer term fix will be to refactor the watchdog code in `nhc` itself and use that code whenever launching subcommands is needed. This should address #104 for the 1.4.4 branch but will be applied to both for now, once tested and verified.
See commits b08769b and 8fe57c0 for further details. As referenced in the above commits, the longer-term fix for the 1.5+ branch is a refactoring of all the watchdog timer code in `nhc` so that multiple distinct timers can be managed simultaneously, including their termination in case of successful subprocess/program exit. Lack of proper cleanup was ultimately the key cause of #104's leaked shell+`sleep` processes. **NOTE**: The `nhc` script itself does *not* keep track of all the PIDs for all the timers it has spawned off, only the main one (for the top-level `nhc` process). Any other timer PIDs must be tracked by whatever spawned them. In particular, `nhc_cmd_with_timeout()` tracks both the task and the timer PIDs and ensures that both processes have exited before it returns.
In the absence of further information to the contrary, I believe this to be fixed, so I am closing this Issue. Please feel free to reopen if you're still seeing this behavior! |
A simple test case:
nhc.conf
:Now if nhc is run on a node, the main process returns quickly, but it leaks a background nhc watchdog process which still has a TTY and is running sleep. After the sleep expires, it will trigger the watchdog and it will try to kill a process. In theory, if the machine is sufficiently busy or the timeout is sufficiently long, this can end up killing a process using a reused PID number.
from
ps axf
right after running nhc:This is a bit of an annoyance because it also causes this (because of it holding on to a TTY):
The watchdog process is spawned at https://github.com/mej/nhc/blob/master/scripts/lbnl_cmd.nhc#L169, and as far as I can see, there is nothing that tries to kill it after the command runs, and there probably should be.
In case bash version is relevant, this happens on version 4.2.46(2)-release (x86_64-redhat-linux-gnu).
The text was updated successfully, but these errors were encountered: