[cmd/opampsupervisor] Supervisor may leak collector process #32189
Comments
Sounds like a valid bug to me. I'll defer to others regarding the proper approach here, but I've removed |
This is a good catch. The approach you've outlined here makes sense to me. I don't know enough about Windows to make any suggestions, but at least for Unix-like operating systems, I think having the extension handle this will be the most reliable and portable approach. |
Apparently Windows has something called a job that can help with this, but I think the general approach of polling the parent PID should work on Windows too. |
…been orphaned (#32564) **Description:** <Describe what has changed.> * Allows the process to monitor a passed-in ppid, which should be the parent process ID for the collector. When the parent process exits, the extension emits a fatal error event, which triggers a collector shutdown. **Link to tracking Issue:** This is part of #32189 - It does not resolve this issue because the supervisor still needs changes to pass its PID in. **Testing:** Added some unit tests. I've manually tested it on my MacBook with this PR: observIQ#4715 (running the supervisor, kill -9 the supervisor, and take a look at the agent logs to see it shut down). I've tried testing this out on Windows, but the supervisor doesn't get past bootstrapping (the Commander's Stop function does not work on Windows), so I wasn't able to fully test it. **Documentation:** Added new parameter to README
…sion (#32875) **Description:** <Describe what has changed.> * Configures the PPID of the opamp extension in the supervisor. This allows the collector to detect if the supervisor exits and shut itself down. **Link to tracking Issue:** Closes #32189 **Testing:** <Describe what testing was performed and which tests were added.> * Manually tested by starting the supervisor, then kill -9'ing the supervisor. The collector previously would have still been running, but now shuts itself down. Doing this you can also see the following log: ``` 2024-05-06T14:52:31.010-0400 error [email protected]/collector.go:278 Asynchronous error received, terminating process {"error": "collector was orphaned, process with pid 38908 does not exist"} ``` **Documentation:** <Describe the documentation added.> Added `orphan_detection_interval` to the spec as a configurable option --------- Co-authored-by: Tiffany Hrabusa <[email protected]> Co-authored-by: Andrzej Stencel <[email protected]>
Component(s)
cmd/opampsupervisor, extension/opamp
Describe the issue you're reporting
In its current form, the Supervisor may leak a collector process if it is unexpectedly killed and doesn't get a chance to stop the collector. You can see this by issuing a `kill -9` to the supervisor and observing that the collector process is still running. This will block subsequent startups of the supervisor, as multiple collectors will try to occupy the same ports (8888, plus any user-configured components with a port).
Under normal circumstances, we shouldn't leak the collector process, but I think we can make this more robust so that when the supervisor unexpectedly dies, the collector dies too.
I thought through a couple of ideas (a collector PID file, doing some per-OS black magic to have the OS auto-kill the children), but I think they all end up way more complex than just having the collector monitor its ppid and exit if it changes.
In one of our old managed agents, this is how we handled cases where the supervisor might shut down without properly killing the child process.
To fully outline the proposal, it would be something like:
supervisor_pid
If supervisor_pid is configured, the OpAMP extension will poll (maybe every ~5 seconds) and ensure that the value of os.Getppid() equals the value of supervisor_pid.
Note: I think the second step here might need different logic on Windows, as the orphaned process may not be re-parented like it is on other systems.