Wait on the watcher at startup instead of just releasing resources associated with it #4834
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
This pull request does not have a backport label. Could you fix it @cmacknz? 🙏
```go
log.Infow("Starting upgrade watcher", "path", cmd.Path, "args", cmd.Args, "env", cmd.Env, "dir", cmd.Dir)
if err := cmd.Start(); err != nil {
	return nil, fmt.Errorf("failed to start Upgrade Watcher: %w", err)
}

upgradeWatcherPID := cmd.Process.Pid
agentPID := os.Getpid()

go func() {
```
I haven't added a test for this, because there is a test in #4822 that catches this 100% of the time.
Specifically it is this block of fixture_install.go:
```go
f.t.Cleanup(func() {
	// check for running agents after uninstall had a chance to run
	processes := getElasticAgentProcesses(f.t)

	// there can be a single agent left when using --develop mode
	if f.installOpts != nil && f.installOpts.Develop {
		assert.LessOrEqual(f.t, len(processes), 1, "More than one agent left running at the end of the test when --develop was used: %v", processes)
		// The agent left running has to be the non-development agent. The development agent should be uninstalled first as a convention.
		if len(processes) > 0 {
			assert.NotContains(f.t, processes[0].Cmdline, paths.DevelopmentInstallDirName,
				"The agent installed with --develop was left running at the end of the test or was not uninstalled first: %v", processes)
		}
		return
	}

	assert.Empty(f.t, processes, "there should be no running agent at the end of the test")
})
```
The `assert.LessOrEqual(f.t, len(processes), 1, ...)` will fail without this change, because if there is an agent process running we always pick up 2: one is the main process, and the other is the zombie created from the watcher.
That test can't be extracted from that PR since it depends on the --develop option, and I didn't think it made sense to add another separate test, but let me know and I'll come up with something. Probably an integration test that calls `elastic-agent restart` and checks that there's only one process running afterwards.
This change makes the current situation no worse for the watcher launched during the upgrade, and avoids a zombie at agent startup if the watcher exits before the main elastic-agent process does.
I still believe that to completely solve the zombie process problem we will either end up with some C code (written by ourselves or provided by a dependency) to properly detach the watcher, or we will have some service manager (like systemd) do it for us.
Why would the watcher die if the parent dies? https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/upgrade/rollback_linux.go#L41 has
Agree this improves the situation but doesn't solve the re-exec problem.
changelog/fragments/1717185708-Stop-creating-a-zombie-process-on-each-restart.yaml
…on-each-restart.yaml Co-authored-by: Blake Rouse <[email protected]>
Today every time the agent starts the watcher it results in a zombie process on Linux. This is because we never wait on it. There are two places where this happens:
elastic-agent/internal/pkg/agent/application/upgrade/upgrade.go
Lines 302 to 308 in 7b81fea
elastic-agent/internal/pkg/agent/cmd/run.go
Lines 243 to 247 in 7b81fea
I don't believe the first case is addressed by this change, because the agent process re-execs itself before the `Wait` can complete. I still have to do some experimenting to see exactly what happens in this case.
In the second case, 99% of the time the watcher exits quickly because the restart is not after an upgrade, so simply waiting will prevent the zombie. If we are rolled back we are in a similar situation to case 1, as we may re-exec before the wait completes.
This PR made me read the implementation of `process.Release()`, which on Unix makes no system calls at all; it just cleans up Go-side resources:
https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/os/exec_unix.go;l=85-91
On Windows it makes a system call to close the handle:
https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/os/exec_windows.go;l=65-77
The reason `release()` not making a system call is interesting is that the watcher remains a child of the agent process, meaning it dies when the agent dies. This cannot be changed without a system call on Unix, so there is a window where the agent is dead and the watcher is dead with it, unable to roll back. I think we may need to use something like https://github.com/sevlyar/go-daemon?tab=readme-ov-file#how-it-works to fix this. I suspect our existing test for this case isn't precise enough: the watcher from the agent that started the upgrade would need to be killed, and the agent we upgraded to would have to exit before it launched its own watcher. That is not how the test works today. I don't think this PR makes anything worse, but it doesn't fix this problem.
elastic-agent/testing/integration/upgrade_rollback_test.go
Lines 201 to 231 in 7b81fea
On Windows, child processes do not share the lifetime of their parent process unless they are manually assigned to the same Job object, which we do when launching component sub-processes but do not do here. This means the change should work as intended under all circumstances on Windows.
elastic-agent/pkg/core/process/process.go
Lines 169 to 177 in 7b81fea
I am going to do some manual testing and open an issue if I can confirm there is a window where the watcher isn't running and the agent process can have exited. Fixing that requires more work.