Improve process shutdown handling #462

mpfz0r · 2023-01-03T10:29:34Z

There are scenarios in which the sidecar handles stopping / restarting its forked collectors poorly:

1. The collector is a wrapper script that forks additional processes.
2. The wrapper script runs the collector with elevated privileges (sudo).

This PR tries to improve handling this as much as possible.

Instead of killing just the forked PID, it also resorts to killing the entire process group.
This should cover case 1
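To illustrate the process-group approach for case 1, here is a minimal Unix-only sketch. It is not the sidecar's actual code: the names `startInGroup`, `stopGroup` and `demoTimeout` are hypothetical, and it assumes the child is started in its own process group (`Setpgid`) so that a negative PID passed to `kill(2)` reaches the wrapper and everything it forked.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// demoTimeout is an illustrative grace period, not a sidecar default.
const demoTimeout = 2 * time.Second

// startInGroup starts the command as leader of a new process group and
// returns a channel that receives the result of cmd.Wait().
func startInGroup(name string, args ...string) (*exec.Cmd, <-chan error, error) {
	cmd := exec.Command(name, args...)
	// Assumption: the child gets its own group, so its PID is also the PGID.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return nil, nil, err
	}
	done := make(chan error, 1) // buffered: the waiter never blocks on send
	go func() { done <- cmd.Wait() }()
	return cmd, done, nil
}

// stopGroup sends SIGTERM to the whole group, waits up to timeout, then
// escalates to SIGKILL and waits once more.
func stopGroup(cmd *exec.Cmd, done <-chan error, timeout time.Duration) error {
	pgid := cmd.Process.Pid
	// A negative PID addresses every process in the group.
	if err := syscall.Kill(-pgid, syscall.SIGTERM); err != nil {
		return fmt.Errorf("sending SIGTERM: %w", err)
	}
	select {
	case <-done:
		return nil
	case <-time.After(timeout):
	}
	if err := syscall.Kill(-pgid, syscall.SIGKILL); err != nil {
		return fmt.Errorf("sending SIGKILL: %w", err)
	}
	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return errors.New("process group did not exit after SIGKILL")
	}
}

func main() {
	// The shell stands in for a wrapper script that forks a child.
	cmd, done, err := startInGroup("sh", "-c", "sleep 60 & wait")
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println("stop result:", stopGroup(cmd, done, demoTimeout))
}
```

Signalling only `cmd.Process.Pid` would stop the shell but could leave the `sleep` child running, which is the situation described for wrapper scripts.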

For case 2, the sidecar can only avoid waiting forever on a process that it lacks the permissions to kill.
This is fixed by adjusting the timeout loop and introducing a terminate channel to abort the goroutine that is endlessly wait(2)ing.
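The terminate-channel idea can be sketched roughly as follows. This is an assumption about the general pattern, not the code from this PR: `watch` and its parameters are hypothetical names. The blocking wait runs in an inner goroutine, and the watcher stops listening to it as soon as `terminate` is closed, so a late exit can no longer change any state.

```go
package main

import "fmt"

// watch calls onExit with the result of wait() unless terminate is closed
// first. After termination the result of wait() is ignored.
func watch(wait func() error, terminate <-chan struct{}, onExit func(error)) {
	result := make(chan error, 1) // buffered so the inner goroutine can always send
	go func() { result <- wait() }()

	select {
	case err := <-result:
		onExit(err)
	case <-terminate:
		// Abandon the process: the inner goroutine may stay blocked in wait(),
		// but it can no longer report into the runner's state.
	}
}

func main() {
	hung := make(chan struct{}) // never closed: stands in for a process we cannot kill
	terminate := make(chan struct{})
	finished := make(chan struct{})

	go func() {
		watch(func() error { <-hung; return nil }, terminate, func(err error) {
			fmt.Println("collector exited:", err)
		})
		close(finished)
	}()

	close(terminate) // what a stop() would do after its timeout expired
	<-finished
	fmt.Println("watcher aborted")
}
```

In this sketch the inner goroutine is not cancelled; it is only disconnected from the runner. That is enough to keep a stale `Wait()` result from overwriting the running state of a restarted collector.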

The timeout between sending a SIGTERM and SIGKILL can be adjusted with a new configuration option:
collector_shutdown_timeout: "10s"
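A value such as `"10s"` uses Go's duration syntax. The following sketch shows one way such a setting could be validated. `parseShutdownTimeout` and `defaultShutdownTimeout` are hypothetical names, and the fallback of 10 seconds is an assumption taken from the example value above, not a documented default.

```go
package main

import (
	"fmt"
	"time"
)

// defaultShutdownTimeout is an assumed fallback for an unset option.
const defaultShutdownTimeout = 10 * time.Second

// parseShutdownTimeout converts a configured string such as "10s" into a
// time.Duration and rejects malformed or non-positive values.
func parseShutdownTimeout(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultShutdownTimeout, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid collector_shutdown_timeout %q: %w", raw, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("collector_shutdown_timeout must be positive, got %s", d)
	}
	return d, nil
}

func main() {
	for _, raw := range []string{"10s", "1m30s", "soon"} {
		d, err := parseShutdownTimeout(raw)
		fmt.Println(raw, d, err)
	}
}
```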

In addition:

- Improve the status of the backends reporting to Graylog.
- The GetRotatedLog() writers were closed immediately after creating them. This had no effect and can be removed.
- Don't prevent sidecar from shutting down with hanging collectors.

If killing just the forked PID is not working, try to kill the entire process group. This should help in scenarios where a wrapper script is used, which doesn't pass the signal to all of its children.

This needed a bigger refactoring to deal with import cycles:
- Move exec_helpers to daemon package
- Extract some helper functions into new helper package

Some collectors might be started with a wrapper script, which runs the collector with elevated privileges (e.g. using sudo). Such collectors might not be killable from the sidecar. The restart func had a timeout loop to ignore that, but the logic was wrong and it would loop indefinitely once it passed the 5 second mark.

Furthermore, the existing goroutine that is calling Wait() on the process might mess up the running state of the restarted collector. Thus we provide a channel that will abort the goroutine.

Also improve the status of the backends reporting to Graylog.

The GetRotatedLog() writers were closed immediately after creating them. This had no effect and can be removed.
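The timeout loop mentioned in this commit message could be written so that it always terminates. The sketch below is illustrative only. `waitUntilStopped`, `isRunning` and `pollInterval` are hypothetical names, not the sidecar's own. It checks the deadline on every iteration, so it cannot keep spinning once the limit has passed.

```go
package main

import (
	"fmt"
	"time"
)

// pollInterval is an illustrative polling period.
const pollInterval = 10 * time.Millisecond

// waitUntilStopped polls isRunning until it reports false or the timeout
// elapses. It returns true if the process stopped in time.
func waitUntilStopped(isRunning func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if !isRunning() {
			return true
		}
		if !time.Now().Before(deadline) {
			return false // give up instead of looping forever
		}
		time.Sleep(interval)
	}
}

func main() {
	// A process that never stops, e.g. one the sidecar is not allowed to kill.
	stuck := func() bool { return true }
	fmt.Println("stopped in time:", waitUntilStopped(stuck, 5*pollInterval, pollInterval))
}
```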

danotorrey · 2023-01-04T15:27:42Z

Fixes #463

Timeout after 30 seconds

@thll

And fix early review comments from @thll

thll

Nice improvement! Worked as expected in my tests.

I left some comments which are up for discussion.

daemon/exec_runner.go

daemon/exec_helper.go

bernd

We should make the process and sidecar timeouts configurable to let users adjust them if they have processes that shut down slowly.

- Use SIGTERM instead of SIGHUP
- Immediately signal the process group instead of the process

Co-authored-by: Bernd Ahlers <[email protected]>

so we give the cmd.Wait() a chance to detect the finished process.

It's enough to have one wait loop in stop()

Otherwise we could only restart() hanging processes. stop() and start() would still fail.

`collector_shutdown_timeout: "10s"`

mpfz0r · 2023-01-09T11:31:49Z

@bernd @thll I think I've addressed all your comments. Thank you 👍

thll

Looks good to me and works well 👍

We don't need a separate timeout here, because each runner is guaranteed to stop with their individual stop timeouts.

Improve process shutdown handling

8d90c8b

If killing just the forked PID is not working, try to kill the entire process group. This should help in scenarios where a wrapper script is used, which doesn't pass the signal to all of its children.

mpfz0r force-pushed the fix/restart-loop branch from db2b08e to 8d90c8b January 3, 2023 10:30

mpfz0r added 4 commits January 3, 2023 11:36

remove stale code

9c66d67

Fix build for windows

ecc35a7

This needed a bigger refactoring to deal with import cycles:
- Move exec_helpers to daemon package
- Extract some helper functions into new helper package

Don't log wait errors as a warning

ef02ced

Don't prevent sidecar from shutting down with hanging collectors

8b00166

Timeout after 30 seconds

mpfz0r marked this pull request as ready for review January 4, 2023 15:48

mpfz0r requested review from thll and bernd January 4, 2023 15:48

thll self-assigned this Jan 5, 2023

Adjust timeouts

0a70658

And fix early review comments from @thll

thll requested changes Jan 6, 2023

daemon/exec_runner.go (outdated, resolved)

daemon/exec_runner.go (outdated, resolved)

daemon/exec_helper.go (outdated, resolved)

daemon/exec_helper.go (outdated, resolved)

bernd reviewed Jan 9, 2023

mpfz0r and others added 8 commits January 9, 2023 10:21

Refactor KillProcess

e37056b

- Use SIGTERM instead of SIGHUP
- Immediately signal the process group instead of the process

Co-authored-by: Bernd Ahlers <[email protected]>

be less verbose

c102f5e

Sleep a tiny bit after we send SIGKILL

049b101

so we give the cmd.Wait() a chance to detect the finished process.

log if we finally failed to stop the process

b38b85b

Remove second wait loop in restart

024980c

It's enough to have one wait loop in stop()

Move cmd.Wait() abortion into stop()

e9fc465

Otherwise we could only restart() hanging processes. stop() and start() would still fail.

Make collector shutdown timeout configurable

4aab929

`collector_shutdown_timeout: "10s"`

Determine backend stopped status in wait goroutine

3bbe855

mpfz0r requested review from thll and bernd January 9, 2023 11:31

read termination ack

be24a0d

thll approved these changes Jan 9, 2023

mpfz0r and others added 2 commits January 9, 2023 15:36

Add changelog

59e1cc7

Change log statements from Infof to Warnf

298bd7c

mpfz0r and others added 4 commits January 10, 2023 10:56

Moar changelogs

a8a1c39

add dots

8b8db9e

Improve log message formatting

cb9b378

Remove timeout from sidecar shutdown routine

9c588ac

We don't need a separate timeout here, because each runner is guaranteed to stop with their individual stop timeouts.

bernd approved these changes Jan 11, 2023

mpfz0r merged commit 028e2f5 into master Jan 12, 2023

mpfz0r deleted the fix/restart-loop branch January 12, 2023 12:51

mpfz0r mentioned this pull request Jan 12, 2023

Sidecar hangs during restart #463

Closed
