-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eliminate Host.ReportFatalError(), replace by Component Health Reporting #6344
Comments
@open-telemetry/collector-approvers @open-telemetry/collector-contrib-approvers please let me know what you think. |
cc @portertech |
Does not work like that, "Start" is blocking call, so not sure how and when you decide to "return", welcome to golang world. Unless you provide a viable solution for this, your proposal from my opinion is not possible to implement. |
You should also look into #5304, which you may have to implement and use an Extension for your use-case. |
Yes, I know "Start" is a blocking call. It must block until the server is up, i.e. it can listen on the port. It doesn't need to block after that and any error that happen after that do not need to be fatal errors. I mean this: func (c *MyComponent) Start(_ context.Context, host component.Host) error {
server := http.Server{...}
ln, err := c.config.TCPAddr.Listen()
if err != nil {
return err
}
go func() {
err :=server.Serve(ln)
if err != nil && !errors.Is(err, http.ErrServerClosed) {
// Something happened that was not supposed to happen. Listen succeeded so Serve() should work.
c.logger.Error(err) // Just log. There is no need to call ReportFatalError() here like we often do
host.ReportBadHealth(c, err) // Use the new component health reporting feature.
}
}()
} There are some paths that http.Server.Serve() may still fail (e.g. HTTP/2 config) after Listen() is successful. We may need to check if it matters for our purposes. |
So your proposal, is not necessary to remove the "ReportFatalError" but to try to report synchronously as many failures as possible, and we still need a "heath/status" channel. |
Yes, exactly. I mean instead of ReportFatalError we should report "bad health". That's it. |
In some components, it may be useful to refactor the code to move certain parts that currently happen asynchronously after |
**Description:** Remove `host.ReportFatalError`. It has been deprecated since 0.87.0. **Link to tracking Issue:** #6344 --------- Co-authored-by: Pablo Baeyens <[email protected]> Co-authored-by: Pablo Baeyens <[email protected]>
host.ReportFatalError() was deprecated and removed. Update Windows named pipe startup code to use ReportStatus() instead. open-telemetry/opentelemetry-collector#6344 open-telemetry/opentelemetry-collector#9506 Signed-off-by: Jeff Hostetler <[email protected]>
The Problem
Currently many components use Host.ReportFatalError() to indicate problems during the startup.
ReportFatalError() is called asynchronously after the component's Start() function returns.
Unfortunately this creates a problem for anyone who wants to know whether the Collector has started successfully or no. It is currently simply impossible to know. You may have a Collector that starts all the pipelines, you see nice output in the log with zero error messages and assume all is good. However, arbitrary time after that the Collector can fail with a fatal error.
There is no point in time when you can be sure that you have a running Collector that is not going to crash the next second due some component calling ReportFatalError() when it pleases so.
This makes the following capability difficult or impossible: to know if the configuration that the Collector is using a good one. This is necessary for a notion of "last known good config" that we want to use when the Collector is reconfigured by config.Provider watchers.
Proposal
I suggest to get rid of Host.ReportFatalError() altogether.
The Start() function must block until it is certain that the component is up and running.
I looked at our usage of Host.ReportFatalError(). Vast majority of the calls are from the failures to start an HTTP Server. This is totally unnecessary. The errors to start an HTTP Server can be reported synchronously from Start().
Of course Start() function should not block the startup indefinitely. However, problems that can happen at an unknown time after Start() is invoked are not startup problems. They are health problems. Such problems should not result in failing the entire Collector. They should result in the component reporting that it is unhealthly.
My proposal is the following:
Related to #6226
The text was updated successfully, but these errors were encountered: