Address watchdogd service problem with generic script > 1s runtime #39 #40

Merged


senthilnthangaraj
Contributor

@senthilnthangaraj senthilnthangaraj commented Oct 24, 2023

This commit addresses a problem in the watchdogd service. Activating a generic
monitoring script (in this example, /usr/sbin/my-script.sh) with a runtime exceeding one second
triggers an unintended system reboot.

Our configuration is as follows:

generic {
    enabled = true
    interval = 60
    timeout = 20
    warning = 1
    critical = 10
    monitor-script = "/usr/sbin/my-script.sh"
}

We see the following error messages, even though my-script.sh returns 0:

sv-isup my-service
watchdogd[661]: Monitor script PID 1056 still running after 20 sec
exit 0
watchdog: watchdog0: watchdog did not stop!

Further investigation showed that the problem arises because
'gs->script_runtime' is measured in milliseconds
while 'gs->script_runtime_max' is kept in seconds, as can be seen in the source code. This commit rectifies the issue.
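
To illustrate the unit mismatch, here is a minimal sketch. It is not the actual watchdogd code: only the field names 'script_runtime' and 'script_runtime_max' come from the description above, while the struct layout, the helper functions, and the 1000 ms polling period are assumptions made for the example. It shows one way such a comparison can misfire and one way to make the units agree.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the generic-script state. Only the two field
 * names are taken from the PR description; the rest is assumed. */
struct generic_script {
	int script_runtime;      /* elapsed runtime, in milliseconds */
	int script_runtime_max;  /* configured timeout, in seconds   */
};

/* Assumed polling period: the runtime grows by this much per check. */
#define POLL_INTERVAL_MS 1000

/* Advance the elapsed runtime by one polling period. */
static void poll_tick(struct generic_script *gs)
{
	gs->script_runtime += POLL_INTERVAL_MS;
}

/* Mismatched units: milliseconds compared directly against seconds. */
static bool timed_out_buggy(const struct generic_script *gs)
{
	return gs->script_runtime >= gs->script_runtime_max;
}

/* Consistent units: convert the limit to milliseconds before comparing. */
static bool timed_out_fixed(const struct generic_script *gs)
{
	return gs->script_runtime >= gs->script_runtime_max * 1000;
}
```

Under these assumptions, with timeout = 20 the mismatched check already reports a timeout after the first tick (1000 >= 20), whereas the consistent check only does so once 20000 ms have elapsed.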

Unit test results

With the fix in this PR, the watchdog service no longer fails and works as expected; see the trace below.

sv-isup my-service
exit 0

@troglobit
Owner

Please update the commit message to state what it does, and more importantly why. Also mention how you have tested it to verify that it fixes the problem. Don't assume access to the GitHub issue tracker for this.

This commit addresses a problem in the watchdogd service. Activating a
generic monitoring script with a runtime exceeding one second
triggers an unintended system reboot.

The error message reports:

"Monitor script PID <XXXX> is still running after <YY> seconds.
watchdog: watchdog0: watchdog did not stop!"

After thorough investigation, it was determined that the problem arises
from the fact that 'gs->script_runtime' is measured in milliseconds,
while 'gs->script_runtime_max' is maintained in seconds. This commit rectifies the issue.

Signed-off-by: [email protected]
@senthilnthangaraj senthilnthangaraj changed the title Implements the fix proposed in #39 Address watchdogd service problem with generic script > 1s runtime #39 Oct 26, 2023
@senthilnthangaraj
Contributor Author

> Please update the commit message to state what it does, and more importantly why. Also mention how you have tested it to verify that it fixes the problem. Don't assume access to the GitHub issue tracker for this.

@troglobit I have updated the commit message and added the UT results to the PR. Let me know if this looks good, thanks.

@troglobit troglobit merged commit 33e32f8 into troglobit:master Oct 29, 2023
1 of 3 checks passed