[extension/subprocess] Create subprocess extension (#6467) #6512
Conversation
* Created overall structure of the new component

**Link to tracking Issue:** open-telemetry#6467
A couple of quick notes before a more detailed code review. You only tested Unix values here; Windows process creation works very differently. Nowhere do I see anyone tracking SIGHUP / KILL and making sure subprocesses are stopped as well. Since they aren't forked and seem to just be commands that are exec'd, they won't die if the collector does. Where do their logs go? systemd just simplifies this with journald. How do I debug this? If that process dies and cannot restart, should the whole pipeline fail? There are lots of very important questions about bringing process execution and monitoring into OTel that this PR does not answer.

I have a high-level question about the need for this, but that's not really as big a deal; clearly you saw something and made it. I think as an architecture we would be better served by a domain-specific language (or just Lua) to provide scripting needs / extensions to pipelines. Also, we can control the execution context there as well. Maybe you can explain to me the problem you are trying to solve so I can understand why just executing a binary brings value. Also, if your goal is to launch and monitor a process, I believe that's outside the scope of OTel and belongs in systemd / init.d / s6 or dozens of other systems that are more robust and better suited.

Apologies if I sounded a bit "attacking". I appreciate your will to put in the work and contribute, and I could be completely wrong about a lot of this. It's the start of a discussion. cc @kbrockhoff

EDIT: After doing some digging, it looks like this solves a specific use case you have at Sentry. That is well and good, but I do believe that something like this leads our less experienced users down the path of peril when it comes to availability, decent architecture, and separation of concerns. An example: "I need to start a process to update our OAuth creds! Oh man, dealing with systemd or init is painful, I can just use the pipeline to execute it there!"

In short, OTel isn't a process manager; it's not core to what we do. Let's use the tools whose core job that is to do it.
@boostchicken thank you for your feedback; what you've said is very constructive. IMHO, I see things a little differently. I think a mechanism like this in OTel would let a less experienced user (or not) easily launch the subprocess without having to deal with external tools or with the system. Moreover, the user will not have to tune the execution of the subprocess on different platforms (Windows, Linux, and maybe other architectures), since the code of the extension should manage the execution on all platforms. Of course there are plenty of tools to achieve this, but having this in the pipeline will surely save users a lot of time, and it's a good idea to have a more complete solution.

From a technical point of view, I intend to launch a process via a command line and then manage its states (Starting, Started, Shutting Down, Errored, Restarting, Stopped) with a state machine. I also intend to let the user decide whether the extension should retry the execution of the process after an Errored state, and how many retries the extension attempts. The restart delay can also be configurable. We can terminate the process with Process.Signal, but if OTel stops abruptly, control will be lost; that is the same for any management system, unless it is running in HA mode and, when a node falls, another takes over to continue managing the subprocess. Other techniques exist; for example, the subprocess can detect when System.in is closed, in which case it terminates itself immediately. Let me know your thoughts.
The benefit of this extension is great: we could bundle inside an OpenTelemetry Collector any monitoring process capable of emitting OTLP signals. Currently, you can build a collector with Go code only. But with this extension, you can build a Collector that spawns a Java process (which itself uses the OTel SDK to emit OTLP traces, metrics and logs), for example. Or spawn Node.js, or Ruby, or Python, or... should I go on? 😅 Instead of having 2 separate processes or services running side by side, we can have the OpenTelemetry Collector be the main process, which orchestrates the monitoring pipeline, which starts with an external process (not written in Go) that emits metrics, logs and traces. It is also particularly useful when there's a strong coupling between the OpenTelemetry Collector and the external process (one shouldn't run without the other).
@bertysentry The idea of a polyglot is a good one. Like I said, a DSL or scripting language that we can include to help people easily extend is great. I feel that this doesn't achieve that, though. Also, the performance implications are significant. I just wrote a receiver, and believe me when I say, just plugging stuff in here can be very dangerous to the overall OTel Collector. I am all for adding some sort of polyglot system for customization, however not via process forking / management. If we want to make Rust, JS, C#, Java, C++ / whatever bindings, that is all well-treaded territory in Go. Or if we want to design a DSL. But that is something we need to discuss as the whole OTel community, with a proposal, an API, some review, some voting.

EDIT: Also, within CNCF, Lua has been accepted as the tool of choice for these kinds of tasks (see Envoy and its Lua integration)
Also, upon further thinking, what you just described is the OTel Collector: it takes data from one process, translates it, sends it to another, and does this many times and very well. Having another generic subprocess in that list that isn't typed or verified is dangerous. There are security implications for sure. I would not let that plugin into my binary in prod.
Putting aside the "should this be here" question: if we do this, I think you should rely on other mechanisms to start, stop, and manage the lifecycle of these binaries, and maybe focus more on the communication between them and OTel (named pipes, TCP, Unix sockets). Retry on error sounds good, but what happens if it doesn't error out and just brings a pipeline to a halt? How do we identify and view the state of whatever binary is running?

ALSO: How do you prevent injection of bad executables if someone gets access to modify the config? How do you prevent this from being used in nefarious ways? A lot of what I like about OTel is that the pipelines are tested and hook together in a defined way, and you can't just throw code in there.
@boostchicken Indeed it does, but as @NassimBtk just said, it's a use case that is already present in 2 important modules of the OpenTelemetry Collector Contrib: Each module has its own process management implementation and they are not consistent (I fixed an issue recently on prometheusexecreceiver and I suspect I'd need to duplicate the fix on fluentbitextension). With the subprocess extension, we could potentially deprecate both prometheusexecreceiver and fluentbitextension (at least, their process management part).
I fail to see how we could interact with Java code without spawning a JVM, or run JS without spawning Node.js (and I need to read more about Lua, it seems 😅). If you guys don't approve the addition of this extension, we will keep it private in our own distribution of the OpenTelemetry Collector. We were just thinking of giving back. 😉
Sure, but once again, that's not the core of what OTel is trying to do; we have to use some external binaries there. Fluent is that by its very nature. Also, those do a lot more than just manage processes. It's a small piece of what they do.
To be clear, I am newer around here. I really think I am gonna hold off on further comment until some more seasoned OTel contributors chime in. If the greater group feels this is an excellent pattern to go with, I of course yield to the group, and honestly I am not trying to even get rid of it. I just believe we should focus on secure, tested, native integrations that we can trust to run and perform in the biggest businesses in the world.
However, if we do keep it, I would like the ability to disable this plugin as a whole. This would not pass security muster at my place. If security saw a way for someone to add their own code to our telemetry pipeline with a config change, they would not allow it. We would need a way to 1) disable it altogether, or 2) enforce that only signed binaries can be run, with the signer configurable (we control the private keys, and only our private keys can sign binaries to be run here). Someone could just have it sub-exec bash and go nuts otherwise.

I wanted to be a bit more clear on the security concern. Let's say this is running on an AWS instance and the instance profile has some powers: someone deploys a config to call the aws binary and bypass their allowed security boundary. They had no other way to execute arbitrary code on the OTel box until this was there, or they just query the metadata service with this and leak some credentials.
@boostchicken No extension (nor receiver, processor, or exporter) from opentelemetry-collector-contrib is enabled by default. A collector distribution must be compiled and packaged specifically with the extension before it can be enabled. Then, it must be enabled in the pipeline.

Two other modules already have the ability to run arbitrary processes whose command line is specified in the config YAML file, so this subprocess extension is in no way an increased security threat.
You are right for fluentbitextension, but prometheusexecreceiver is really just about:
That's why I'm saying that a subprocess extension could then be leveraged by both fluentbitextension and prometheusexecreceiver, thus removing duplicate code and reducing the attack surface. But again, we're not going to fight over this. We'll bring this discussion to Slack.
Some clarifications: the contrib distribution contains everything. The core distribution contains only a few selected components. If you need anything between that, like selecting only a few components you trust to be part of your "golden" binary, you can use the builder. We have had the idea of having multiple official distributions in the future, like a "sidecar" distribution. Another idea that circulated is having "stable" distributions, including all the core+contrib components that are deemed stable. We are missing a definition of stable, though :-)
I believe the JMX receiver would be the third on this list. That said, I wouldn't classify any of them as "rock-solid" components. Both the JMX and the Prometheus one are common sources of flaky tests, and I don't think the problem is in the tests themselves: I believe this is a reflection of how fragile the general approach is. Given my minimal exposure to the Prometheus Exec Receiver, I'll abstain from commenting on its actual usefulness. In general though, as someone who managed systems before, I would certainly prefer to manage external processes via process managers and configure the two parts (Prometheus exporter and otelcol) to exchange data via standard formats, like OpenMetrics. At the same time, I do see the appeal of starting a JVM on behalf of the collector to get data that would otherwise be hard to get in Go, due to all the underlying protocols used only in the Java world. In general, I would say that executing an external binary is a last resort and should be weighed carefully, but as we have three components doing that already, I see a benefit in having a common facility for that. cc @rmfitzpatrick, as the owner of the JMX receiver.
@jpkrohling Thank you for the clarification! @boostchicken This means that you should absolutely avoid the "full" contrib distribution as it includes several modules that do allow spawning a process from the configuration file, and your security admins won't allow that. I'd recommend using the builder in your case.
Yeah, I am absolutely aware of the ability to build, and also there are multiple "certified" distros (AWS, Splunk, etc.). @jpkrohling I think your response was well put. It's not that we should not be executing binaries; it's that it should be the last resort. As to the discussion of JMX / Prometheus / Fluent, that's where something more of a polyglot becomes useful: we want to utilize code or libraries that are not available in Go. I personally think the future would be brighter if we had ways to integrate with libraries / processes outside of the Go ecosystem that are more robust than sub-processes.
Small feedback on the code.
```go
	"go.opentelemetry.io/collector/config/configtest"
)

func TestLoadConfig(t *testing.T) {
```
Could I ask you to add in `t.Parallel()`, just to increase the number of tests that can be executed concurrently and speed up the test suite?
extension/subprocess/config_test.go
Outdated
```go
func TestLoadConfig(t *testing.T) {
	factories, err := componenttest.NopFactories()
	assert.NoError(t, err)
```
```diff
-	assert.NoError(t, err)
+	require.NoError(t, err)
```
You want to use `require` here, since the test cannot progress if this fails.
extension/subprocess/process.go
Outdated
```go
	shutdownSignal chan struct{}
}

func newProcessManager(conf *Config, logger *zap.Logger) *processManager {
```
This should ideally return the interface for the extension instead of an unexported type from this package.
I suggest also adding a compile time check to see if the type matches the interface.
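The compile-time check suggested here is a common Go idiom: assign a typed nil pointer to a variable of the interface type so the build fails immediately if the type stops satisfying the interface. A self-contained sketch follows; `extension` here is a stand-in for `component.Extension` (the real interface lives in the collector module and has a different method set), and `processManager` is simplified:

```go
package main

import "fmt"

// extension stands in for component.Extension in this self-contained sketch.
type extension interface {
	Start() error
	Shutdown() error
}

type processManager struct{}

func (p *processManager) Start() error    { return nil }
func (p *processManager) Shutdown() error { return nil }

// Compile-time check: if *processManager ever stops satisfying the
// interface, the build fails here rather than at some distant call site.
var _ extension = (*processManager)(nil)

func main() {
	var ext extension = &processManager{}
	fmt.Println(ext.Start() == nil)
}
```

In the actual extension, the equivalent line would reference the real interface, e.g. `var _ component.Extension = (*processManager)(nil)`.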
OK, let's return the `component.Extension` interface.
So I discussed this with one of my other architects at SIE, just to get another opinion and check whether I am sort of off base here, and he brought up something interesting: in an increasingly containerized world, does it ever make sense to be doing any subprocess management? Shouldn't our dependencies be sidecars? If we do subprocess management, should we also be writing ways to manage container lifecycles? Obviously, outside of containers that doesn't help much, but I do believe it goes back to the point that there are already a lot of ways to manage processes. Should we be adding to that pool?
@NassimBtk A couple more technical thoughts. Does this draw distinctions between long-lived processes and short-lived ones? Also, what about applications that might need pre-warming or be shared across multiple pipelines? To define what I mean by short-lived: is this an execution per call, or an execution per collector lifecycle? (Is this a singleton or a factory? (: ) Also, we should provide ways to customize the execution environment variables, not just the args.
There are no certified distributions at the moment, as there's no certification process. They might be supported by the individual companies, though.
Is there something like that already for Go?
Yes, short- and long-lived processes will be handled. This PR doesn't define the full code but only the structure of the component, as recommended in the contributing guidelines; we intend to add other options to customize the execution of the process in the next PR.
* The `component.Extension` interface is now returned when the factory invokes `newProcessManager`
* Applied suggested fixes to config_test.go
@NassimBtk I think that what @boostchicken is referring to is the ability to execute the specified command periodically. But then we would need to introduce cron-like features, in which case I agree with @boostchicken that it's not the OTel Collector's role to handle that; users can use external scheduling tools for that.

The subprocess extension would however support the case where the process is spawned once (on collector startup), does its job, terminates normally, and doesn't need to be restarted. The subprocess extension can be configured not to restart the process once it has terminated.

@boostchicken When it comes to containers, such things are handled when you compose your container.
To get back to the real use case at hand: @sentrysoftware is building an OpenTelemetry Collector whose role is to collect hardware health metrics for servers, network switches, and storage systems using various protocols (typically SNMP, SSH, WMI, WBEM, IPMI, REST). The main engine code is written in Java. It used to be a Prometheus Exporter, and we combined it with the OTel Collector in the pipeline below:

prometheusexecreceiver > hardware-sentry (Java process) > prometheusexecreceiver (scrape) > groupbyattrs processor > any exporter (prometheusremotewrite, datadog, etc.)

Now, instead of exposing our metrics using the flat Prometheus format, converting them to OpenTelemetry format, and then converting them back to Prometheus, we're simplifying so that our Java code emits OpenTelemetry metrics using the gRPC protocol (through the OpenTelemetry Java SDK):

subprocess extension > hardware-sentry (Java process) > otlpreceiver (gRPC) > any exporter

So the use case is really about spawning a program not written in Go that emits metrics using the OpenTelemetry format and protocol. Note that the subprocess extension's role is not to communicate with the process through stdin/stdout, as we would have done in 1990 😉. How else would we be able to have a Java, Ruby, Python, or JS program send metrics in OpenTelemetry format and protocol?
OK, got it now: the subprocess extension is not a task scheduler. It runs an instance of a particular executable that either starts and stops, or keeps running; in the latter case it monitors the state of the subprocess to restart it if required.
I agree that subprocess management is a helpful capability that provides a means of linking external collection agents into a custom collector distribution, and I would be surprised if this pattern isn't repeated several times over as adoption increases. I can also see some of the major security implications of an exposed, top-level component for anyone given the ability to modify the config.

Perhaps it would be a viable compromise to implement a "blessed", exported subprocess manager that could be used in all applicable contrib and custom distribution components? This implementation could include a method of hard-coding checksums and refusing to exec if the target doesn't match one, or any similar security principles to be baked in.
I share the concerns expressed about this extension and am not sure it is the appropriate method of achieving the stated goal.
If this is the core of the use case and the goal is to have programs independent of the collector that can emit telemetry via OTLP that is received by the collector, why does that require the collector to manage the process lifecycle for those programs? How is it any different from an application running on another host sending OTLP over the wire? There are many systems that deal with process management and will do so far better than the collector ever will, since it is not a core competency of the collector. We should allow users the flexibility to choose one that works for them.

Is the completed code available for review somewhere? I may be in the minority among approvers here, but I find it hard to accept a new component without having an understanding of how it will operate, and preferably having seen the complete implementation.
Consider a program whose sole output is to send metrics over gRPC to a collector whose sole purpose is to get these metrics. One shouldn't run without the other; they are tightly coupled. Very much like the current implementation of prometheusexecreceiver spawns the Prometheus exporter process it then scrapes. This extension is exactly like prometheusexecreceiver... but using native OTLP metrics and protocol, rather than Prometheus. Now, I don't understand why this proposal raises so much concern while 3 other components already do equivalent things like spawning a process, but we will not force-feed it to the OpenTelemetry Collector and will keep this for ourselves. If at some point you guys change your mind, you'll know where to find us 😉 @NassimBtk I suggest we close this PR for now, as well as the corresponding issue (#6467).
I guess the main point is: while this feature is kept within the individual components, we can always remove them or classify them individually as "unstable". An extension that explicitly adds process management is a bigger commitment, as multiple components are going to depend on it.
While some of us are skeptical, I think all of us would be interested in reading a design doc and perhaps taking a peek at a PoC before committing to accepting this extension.
The full extension code is available here: |
This. Also, for my part, I wasn't involved in the decision to allow those other components into the repo and I'd have similar concerns if they were presented to me today.
I don't agree. The point of having a well-defined network interface is to decouple things. A program that outputs nothing but metrics over gRPC need not even live on the same host as the collector it sends metrics to in order to be useful. A program that emits metrics over gRPC may be useful in the temporary absence of a collector, as it may be able to buffer those metrics to send to a collector once it is available, so I wouldn't say that one shouldn't run without the other.

Let me rephrase, I was unclear: a software architect could decide that a given gRPC metric emitter must not run unless the receiver is running. It's just an example, I didn't mean it as a rule.
I think there is value in having the ability to execute external commands and capture their OTLP output in the Collector. It is certainly convenient. However, this can be very dangerous, since this particular implementation allows executing arbitrary code with arbitrary command-line parameters. If we couple this with the upcoming ability to supply configuration remotely, then this is a recipe for a malicious actor to perform arbitrary remote code execution via the Collector. Features like this need to be reviewed critically from a security perspective, with the necessary safeguards added that prevent or mitigate the problem, and/or best practices recommended from a security perspective. I have posted some of my thoughts on this topic earlier in this proposal: open-telemetry/opentelemetry-collector#2374

It is true that we already have similar capabilities in the prometheus exec receiver and fluentbit extension. I am going to submit an issue to review these 2 components and modify/restrict what they do to ensure we do not have glaring gaps in our security. For now, let's pause this PR until we have a clear understanding of how we can add this feature safely. cc @bogdandrutu
I am blocking this for now, let's continue the discussion in #6722 and make a general decision about all components that need to execute something.
Recording something we discussed during the SIG call but that doesn't seem to belong in the linked issue: the work on this PR here would be equivalent to implementing a plugin mechanism for the collector. Having a well-established interface in the form of a plugin is certainly a better approach, IMO. Something like hashicorp's go-plugin (via gRPC) would be a valid solution here.
Thank you for discussing the issue! 😊 Well, the go-plugin mechanism seems to be just a mechanism to spawn a subprocess and define interactions with it through gRPC. That's kinda exactly what we're doing here:
Our proposal limits the interaction with the Collector to OTLP-over-gRPC and doesn't require Go, which is an important point here (as there are multiple implementations of OTLP over gRPC: in Java, JS, Ruby, Python, etc.). So I totally agree with the analogy, but the only thing that hashicorp's go-plugin mechanism would help us with is the process management. We could re-use their work, but we already have a fully working implementation of process management (actually, we even have several of them, in 3 components 😉).
Neither does Hashicorp's Go gRPC plugin. We recently added support for remote plugins for Jaeger. And this is from the go-plugin readme:
I might be the minority here, but I'd rather use something which exists already and is well maintained by another community than rewrite it here.

That's a great point. I would be in favor of converting one of those to be the first "plugin" following this proposal.

Oh, I missed that part! So the question that remains is only whether integrating this plugin mechanism requires less or more work than this PR, and whether it solves the security concerns that have been raised. We'll let you guys ponder and decide!
I have been doing some internal prototyping around a plugin receiver based on hashicorp/go-plugin, and I think it has an advantage: we need a bit more of a protocol around external processes. While launching the process is similar in both cases, we are going to need to be able to do some things like health checks/heartbeats and graceful shutdown. It's not that we can't do that with an arbitrary process, but go-plugin gives a clearer framework, as a plugin is essentially a gRPC server implementation. For the plugin case I've been investigating, it's also easier to perform some other tasks, like configuration validation.
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
The guidelines are necessary to explain how we want to generally approach external process execution. This was recently brought up in open-telemetry/opentelemetry-collector-contrib#6512. If we accept these guidelines, the following should happen:

- The prometheusexecreceiver should be modified to allow only a hard-coded list of exporters.
- The fluentbitextension should be either removed or significantly limited in terms of what locations and what executable file names it can allow.
- open-telemetry/opentelemetry-collector-contrib#6512 will be rejected, possibly substituted by a plugin system that @zenmoto referred to in open-telemetry/opentelemetry-collector-contrib#6512 (comment), if we find it useful to have such a plugin system.
Signed-off-by: Bogdan <[email protected]>
Description:
Adding a feature - New extension (subprocess)
Creating the component structure including readme, configuration, and factory implementation
Link to tracking Issue: Issue #6467
Testing:
Tested config.go and factory.go
Documentation:
The goal is to add a mechanism in the OpenTelemetry Collector to launch and monitor a subprocess.
The subprocess is configured with an executable path, arguments, and a working directory.
For more information, see issue #6467