Wait on the watcher at startup instead of just releasing resources associated with it #4834
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
This pull request does not have a backport label. Could you fix it @cmacknz? 🙏
```go
log.Infow("Starting upgrade watcher", "path", cmd.Path, "args", cmd.Args, "env", cmd.Env, "dir", cmd.Dir)
if err := cmd.Start(); err != nil {
	return nil, fmt.Errorf("failed to start Upgrade Watcher: %w", err)
}

upgradeWatcherPID := cmd.Process.Pid
agentPID := os.Getpid()

go func() {
```
I haven't added a test for this, because there is a test in #4822 that catches this 100% of the time.
Specifically it is this block of fixture_install.go:
```go
f.t.Cleanup(func() {
	// check for running agents after uninstall had a chance to run
	processes := getElasticAgentProcesses(f.t)

	// there can be a single agent left when using --develop mode
	if f.installOpts != nil && f.installOpts.Develop {
		assert.LessOrEqual(f.t, len(processes), 1, "More than one agent left running at the end of the test when --develop was used: %v", processes)
		// The agent left running has to be the non-development agent. The development agent should be uninstalled first as a convention.
		if len(processes) > 0 {
			assert.NotContains(f.t, processes[0].Cmdline, paths.DevelopmentInstallDirName,
				"The agent installed with --develop was left running at the end of the test or was not uninstalled first: %v", processes)
		}
		return
	}

	assert.Empty(f.t, processes, "there should be no running agent at the end of the test")
})
```
The `assert.LessOrEqual(f.t, len(processes), 1, ...)` will fail without this change, because if there is an agent process running we always pick up 2: one is the main process, and the other is the zombie created from the watcher.
That test can't be extracted from that PR since it depends on the --develop option, and I didn't think it made sense to add another separate test, but let me know and I'll come up with something. Probably an integration test that calls `elastic-agent restart` and checks that there's only one process running afterwards.
This change makes the current situation no worse for the watcher launched during the upgrade, and avoids a zombie at agent startup if the watcher exits before the main elastic-agent process does.
I still believe that to completely solve the zombie process problem we will either end up with some C code (written by ourselves or provided by a dependency) to properly detach the watcher, or we will have some service manager (like systemd) do it for us.
Why would the watcher die if the parent dies? https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/upgrade/rollback_linux.go#L41 has
Agree this improves the situation but doesn't solve the re-exec problem.
changelog/fragments/1717185708-Stop-creating-a-zombie-process-on-each-restart.yaml
…on-each-restart.yaml Co-authored-by: Blake Rouse <[email protected]>
Today every time the agent starts the watcher it results in a zombie process on Linux. This is because we never wait on it. There are two places where this happens:
elastic-agent/internal/pkg/agent/application/upgrade/upgrade.go
Lines 302 to 308 in 7b81fea
elastic-agent/internal/pkg/agent/cmd/run.go
Lines 243 to 247 in 7b81fea
I don't believe the first case is addressed by this change, because the agent process re-execs itself before the `Wait` can complete. I still have to do some experimenting to see exactly what happens in this case.
In the second case, 99% of the time the watcher exits quickly because the restart is not after an upgrade, so simply waiting will prevent the zombie. If we are rolled back we are in a similar situation to case 1, as we may re-exec before the wait completes.
This PR made me read the implementation of `process.Release()`, which on Unix makes no system calls at all; it just cleans up Go-side resources:
https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/os/exec_unix.go;l=85-91
On Windows it makes a system call to close the handle:
https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/os/exec_windows.go;l=65-77
The reason `release()` not making a system call is interesting is that the watcher remains a child of the agent process, meaning it dies when the agent dies. This cannot be changed without a system call on Unix, so there is a window where the agent is dead and the watcher is dead with it, unable to roll back. I think we may need to use something like https://github.com/sevlyar/go-daemon?tab=readme-ov-file#how-it-works to fix this. I suspect our existing test for this case isn't precise enough: the watcher from the agent that started the upgrade would need to be killed, and the agent we upgraded to would have to exit before it launched its own watcher. That is not how the test works today. I don't think this PR makes anything worse, but it doesn't fix this problem.
elastic-agent/testing/integration/upgrade_rollback_test.go
Lines 201 to 231 in 7b81fea
On Windows, child processes do not share the lifetime of their parent process unless they are manually assigned to the same Job object, which we do when launching component sub-processes but do not do here. This means the change should work as intended under all circumstances on Windows.
elastic-agent/pkg/core/process/process.go
Lines 169 to 177 in 7b81fea
I am going to do some manual testing and open an issue if I can confirm there is a window where the watcher isn't running and the agent process can have exited. Fixing that requires more work.