Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve process shutdown handling #462

Merged
merged 22 commits into from
Jan 12, 2023
Merged

Improve process shutdown handling #462

merged 22 commits into from
Jan 12, 2023

Conversation

mpfz0r
Copy link
Contributor

@mpfz0r mpfz0r commented Jan 3, 2023

There are scenarios in which the sidecar handles stopping / restarting it's forked collectors poorly:

  1. The collector is a wrapper script that forks additional processes
  2. The wrapper script runs the collector with elevated privileges (sudo)

This PR tries to improve handling this as much as possible.

Instead of killing just the forked PID, it also resorts to killing the entire process group.
This should cover case 1

For case 2 the sidecar can only avoid waiting forever on a process it is not allowed to kill. (lack of permissions)
This is fixed by adjusting the timeout loop and introducing a terminate channel to abort the goroutine that is endlessly wait(2)ing.

The timeout between sending a SIGTERM and SIGKILL can be adjusted with a new configration option:
collector_shutdown_timeout: "10s"

In addition:

  • improve the status of the backands reporting to Graylog
  • The GetRotatedLog() writers were closed immideatly after creating them.
    This had no effect and can be removed.
  • Don't prevent sidecar from shutting down with hanging collectors

If killing just the forked PID is not working,
try to kill the entire process group.

This should help in scenarios where a wrapper script
is used, which doesn't pass the signal to all of its children.
mpfz0r added 4 commits January 3, 2023 11:36
This needed a bigger refactoring to deal with import cycles:

 - Move exec_helpers to daemon package
 - Extract some helper functions into new helper package
Some collectors might be started wit a wrapper script,
which runs the collector with elevated privileges (e.g. using sudo).

Such collectors might not be killed from the sidecar.
The restart func had to timeout loop to ignore that, but
the logic was wrong and it would loop indefininatly once
it passed the 5 second mark.

Furthermore, the existing goroutine that is calling Wait()
on the process might mess up the running state of the restarted
collector. Thus we provide a channel that will abort the goroutine.

Also improve the status of the backands reporting to Graylog.

The GetRotatedLog() writers were closed immideatly after creating them.
This had no effect and can be removed.
@danotorrey
Copy link

Fixes #463

@mpfz0r mpfz0r marked this pull request as ready for review January 4, 2023 15:48
@mpfz0r mpfz0r requested review from thll and bernd January 4, 2023 15:48
@thll thll self-assigned this Jan 5, 2023
And fix early review comments from @thll
Copy link
Contributor

@thll thll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement! Worked as expected in my tests.

I left some comments which are up for discussion.

daemon/exec_runner.go Outdated Show resolved Hide resolved
daemon/exec_runner.go Outdated Show resolved Hide resolved
daemon/exec_helper.go Outdated Show resolved Hide resolved
daemon/exec_helper.go Outdated Show resolved Hide resolved
Copy link
Member

@bernd bernd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make the process and sidecar timeouts configurable to let users adjust them if they have processes that shut down slowly.

mpfz0r and others added 8 commits January 9, 2023 10:21
- Use SIGTERM instead of SIGHUP
- Immediatly signal the process group instead of the process

Co-authored-by: Bernd Ahlers <[email protected]>
so we give the cmd.Wait() a chance to detect the finished process.
It's enough to have one wait loop in stop()
Otherwise we could only restart() hanging processes.
stop() and start() would still fail.
`collector_shutdown_timeout: "10s"`
@mpfz0r
Copy link
Contributor Author

mpfz0r commented Jan 9, 2023

@bernd @thll I think I've addressed all your comments. Thank you 👍

@mpfz0r mpfz0r requested review from thll and bernd January 9, 2023 11:31
Copy link
Contributor

@thll thll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me and works well 👍

mpfz0r and others added 4 commits January 10, 2023 10:56
We don't need a separate timeout here, because
each runner is guaranteed to stop with their individual stop timeouts.
@mpfz0r mpfz0r merged commit 028e2f5 into master Jan 12, 2023
@mpfz0r mpfz0r deleted the fix/restart-loop branch January 12, 2023 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants