Elastic Agent: Child processes management issues, beats uncompleted uninstall, skipped/corrupted install. #30067

Closed
aleksmaus opened this issue Jan 27, 2022 · 4 comments · Fixed by #30388
Assignees
Labels
bug Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@aleksmaus
Contributor

aleksmaus commented Jan 27, 2022

We are seeing some cases in the field with osquerybeat where the install is corrupted on Windows.
https://discuss.elastic.co/t/osquery-manger-integration-wont-work-on-windows/295529/3

Osquerybeat runs a couple of child processes, so the whole chain looks like this:
agent->osquerybeat->osqueryd->osquery-extension

On Windows, it looks like the agent may kill the osquerybeat process when osquerybeat is deleted/uninstalled, leaving osqueryd.exe running as an orphan. The install directory then cannot be deleted, because on Windows a file that is in use cannot be removed.
The next time the agent installs osquerybeat, it skips the install step because the osquerybeat install directory is already there. The osquerybeat install ends up corrupted, and osquerybeat.exe can't be started because it doesn't exist on disk.

The osquerybeat implementation on Windows uses the following approach to kill the whole process tree if needed:

exec.Command("taskkill", "/F", "/T", "/PID", fmt.Sprint(cmd.Process.Pid)).Run()

Maybe the agent should do something similar, which would help in cases where the agent kills only the intermediate child?
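For illustration, the `taskkill` call above could be wrapped in a small helper along these lines. This is only a sketch, not the agent's or osquerybeat's actual code: the names `taskkillArgs` and `killProcessTree` are made up for this example, and the non-Windows fallback (killing only the direct child) is an assumption.

```go
package main

import (
	"errors"
	"os/exec"
	"runtime"
	"strconv"
)

// taskkillArgs builds the arguments for the Windows taskkill command:
// /F forces termination, /T also terminates child processes of the PID.
// (Hypothetical helper name, for illustration only.)
func taskkillArgs(pid int) []string {
	return []string{"/F", "/T", "/PID", strconv.Itoa(pid)}
}

// killProcessTree stops the process started by cmd.
// On Windows it shells out to taskkill so that descendants such as
// osqueryd.exe are terminated too. On other platforms this sketch
// falls back to killing only the direct child (an assumption).
func killProcessTree(cmd *exec.Cmd) error {
	if cmd == nil || cmd.Process == nil {
		return errors.New("process was not started")
	}
	if runtime.GOOS == "windows" {
		return exec.Command("taskkill", taskkillArgs(cmd.Process.Pid)...).Run()
	}
	return cmd.Process.Kill()
}
```

A caller would invoke `killProcessTree(cmd)` instead of `cmd.Process.Kill()` when stopping a beat that spawns its own children.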

It seems there are a couple of things that could be done to improve the situation:

  1. Better tracking of child processes and cleaner process tree kill.
  2. Maybe some install state metadata on disk that would allow the product to be properly reinstalled even when the install directory was not properly deleted/cleaned.
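As a rough sketch of the second idea, the installer could write a marker file as the last install step and treat a directory without a valid marker as needing a reinstall. Everything here is hypothetical (the marker file name, the `installState` type, and the function names); it is not how the agent currently works.

```go
package main

import (
	"encoding/json"
	"errors"
	"os"
	"path/filepath"
)

// markerName is a hypothetical file name. The marker is written last,
// so its presence implies the install ran to completion.
const markerName = ".install-complete.json"

// installState is the (assumed) metadata stored in the marker file.
type installState struct {
	Version string `json:"version"`
}

// prepareDir removes any leftovers and recreates the install directory.
// If a file is still in use (e.g. an orphaned osqueryd.exe on Windows),
// RemoveAll returns an error, which is surfaced instead of skipped.
func prepareDir(dir string) error {
	if err := os.RemoveAll(dir); err != nil {
		return err
	}
	return os.MkdirAll(dir, 0o755)
}

// markInstalled records a completed install of the given version.
func markInstalled(dir, version string) error {
	data, err := json.Marshal(installState{Version: version})
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, markerName), data, 0o644)
}

// needsInstall reports whether dir must be (re)installed for version.
func needsInstall(dir, version string) (bool, error) {
	data, err := os.ReadFile(filepath.Join(dir, markerName))
	if errors.Is(err, os.ErrNotExist) {
		// Missing directory, or a leftover directory without a marker.
		return true, nil
	}
	if err != nil {
		return false, err
	}
	var st installState
	if err := json.Unmarshal(data, &st); err != nil {
		return true, nil // unreadable marker: treat the install as corrupt
	}
	return st.Version != version, nil
}
```

With this approach, the "directory already exists" check is replaced by `needsInstall`, so a half-deleted directory no longer causes the install step to be skipped.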
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jan 27, 2022
@aleksmaus aleksmaus changed the title Elastic Agent: Child processes management issues Elastic Agent: Child processes management issues, beats uncompleted uninstall, skipped/corrupted install. Jan 27, 2022
@aleksmaus aleksmaus added the Team:Elastic-Agent Label for the Agent team label Jan 27, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jan 27, 2022
@cmacknz cmacknz added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team and removed Team:Elastic-Agent Label for the Agent team labels Jan 27, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@jlind23 jlind23 added the bug label Jan 28, 2022
@blakerouse
Contributor

@aleksmaus This is closed by the Windows Job work, correct?

@aleksmaus
Contributor Author

> @aleksmaus This is closed by the Windows Job work, correct?

The child process kills on Windows should now be handled with this merged:
#30254

There is one more improvement we could make, as mentioned above: have the beats install code detect corrupt installs.
I added that to my TODO list: learn how the install is done and see what we can do to improve it.
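One possible shape for such a check, purely as an illustration: instead of trusting that the install directory exists, verify that the files the beat needs are actually there. The function name `missingFiles` and the idea of passing a list of required files are assumptions for this sketch, not existing agent code.

```go
package main

import (
	"os"
	"path/filepath"
)

// missingFiles returns the entries of required (paths relative to dir)
// that are absent or are not regular files. A non-empty result means
// the install should be treated as corrupt and redone.
// (Hypothetical helper, for illustration only.)
func missingFiles(dir string, required []string) []string {
	var missing []string
	for _, rel := range required {
		info, err := os.Stat(filepath.Join(dir, rel))
		if err != nil || !info.Mode().IsRegular() {
			missing = append(missing, rel)
		}
	}
	return missing
}
```

For the case described in this issue, calling it with `[]string{"osquerybeat.exe"}` on the leftover install directory would report the missing executable, so the install would not be skipped.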

Before doing that, though, I'm looking at the slow agent shutdown issue; fixing it could potentially help with the cases where the system just kills the service process after a timeout.

So we can close this one and open another tracker for the install improvement, or keep it open, and I will start looking at the install code as soon as I can.
