Improve process shutdown handling #462
If killing just the forked PID is not working, try to kill the entire process group. This should help in scenarios where a wrapper script is used, which doesn't pass the signal to all of its children.
db2b08e to 8d90c8b
This needed a bigger refactoring to deal with import cycles:
- Move exec_helpers to daemon package
- Extract some helper functions into new helper package
Some collectors might be started with a wrapper script, which runs the collector with elevated privileges (e.g. using sudo). The sidecar might not be able to kill such collectors. The restart func had a timeout loop to ignore that, but the logic was wrong and it would loop indefinitely once it passed the 5 second mark.

Furthermore, the existing goroutine that is calling Wait() on the process might mess up the running state of the restarted collector. Thus we provide a channel that will abort the goroutine.

Also improve the status of the backends reporting to Graylog.

The GetRotatedLog() writers were closed immediately after creating them. This had no effect and can be removed.

Fixes #463
Timeout after 30 seconds
And fix early review comments from @thll
Nice improvement! Worked as expected in my tests.
I left some comments which are up for discussion.
We should make the process and sidecar timeouts configurable to let users adjust them if they have processes that shut down slowly.
- Use SIGTERM instead of SIGHUP
- Immediately signal the process group instead of the process

Co-authored-by: Bernd Ahlers <[email protected]>
so we give the cmd.Wait() a chance to detect the finished process.
It's enough to have one wait loop in stop()
Otherwise we could only restart() hanging processes. stop() and start() would still fail.
`collector_shutdown_timeout: "10s"`
Looks good to me and works well 👍
We don't need a separate timeout here, because each runner is guaranteed to stop with their individual stop timeouts.
There are scenarios in which the sidecar handles stopping / restarting its forked collectors poorly:
This PR tries to improve handling this as much as possible.
Instead of killing just the forked PID, it also resorts to killing the entire process group.
This should cover case 1.
For case 2, the sidecar can only avoid waiting forever on a process it is not allowed to kill (lack of permissions). This is fixed by adjusting the timeout loop and introducing a terminate channel to abort the goroutine that is endlessly wait(2)ing.
The timeout between sending a `SIGTERM` and `SIGKILL` can be adjusted with a new configuration option: `collector_shutdown_timeout: "10s"`
In addition:
This had no effect and can be removed.