Eliminate Host.ReportFatalError(), replace by Component Health Reporting #6344

tigrannajaryan · 2022-10-18T19:39:22Z

The Problem

Currently many components use Host.ReportFatalError() to indicate problems during startup.

ReportFatalError() is called asynchronously after the component's Start() function returns.

Unfortunately this creates a problem for anyone who wants to know whether the Collector has started successfully or not. It is currently simply impossible to know. You may have a Collector that starts all the pipelines, you see nice output in the log with zero error messages and assume all is good. However, at an arbitrary time after that the Collector can fail with a fatal error.

There is no point in time when you can be sure that you have a running Collector that is not going to crash the next second due to some component calling ReportFatalError() whenever it pleases.

This makes the following capability difficult or impossible: knowing whether the configuration that the Collector is using is a good one. This is necessary for a notion of "last known good config" that we want to use when the Collector is reconfigured by config.Provider watchers.

Proposal

I suggest getting rid of Host.ReportFatalError() altogether.

The Start() function must block until it is certain that the component is up and running.

I looked at our usage of Host.ReportFatalError(). The vast majority of the calls come from failures to start an HTTP Server. This is totally unnecessary. Errors starting an HTTP Server can be reported synchronously from Start().

Of course the Start() function should not block startup indefinitely. However, problems that can happen at an unknown time after Start() is invoked are not startup problems. They are health problems. Such problems should not result in failing the entire Collector. They should result in the component reporting that it is unhealthy.

My proposal is the following:

Deprecate Host.ReportFatalError().
In all components which call Host.ReportFatalError() to report HTTP Server start failure, replace it with a proper synchronous return of the error from Start(). This is the vast majority of uses.
In components which call Host.ReportFatalError() for other reasons, carefully analyze the usage. If it is clearly a startup failure that can happen within a known limited time (e.g. 10 seconds), then make sure it blocks Start() and returns an error from Start(). Otherwise report it as an error in the log and indicate the component's bad health (using the proposed component health reporting capability).
Change Host.ReportFatalError() to log an error instead of terminating the Collector.
After some grace period, remove Host.ReportFatalError().
Optionally: modify the Collector startup to call Start() functions concurrently to avoid serializing the blocking operations. We still need to honour the startup sequence of pipelines (exporters->processors->receivers).

Related to #6226

tigrannajaryan · 2022-10-18T19:39:49Z

@open-telemetry/collector-approvers @open-telemetry/collector-contrib-approvers please let me know what you think.

tigrannajaryan · 2022-10-18T19:40:46Z

cc @portertech

bogdandrutu · 2022-10-18T21:02:49Z

> In all components which call Host.ReportFatalError() to report HTTP Server start failure, replace it with a proper synchronous return of the error from Start(). This is the vast majority of uses.

It does not work like that. "Start" is a blocking call, so I am not sure how and when you decide to "return". Welcome to the golang world.

Unless you provide a viable solution for this, your proposal is, in my opinion, not possible to implement.

bogdandrutu · 2022-10-18T21:26:28Z

You should also look into #5304, which you may have to implement and use an Extension for your use-case.

tigrannajaryan · 2022-10-19T19:11:29Z

> > In all components which call Host.ReportFatalError() to report HTTP Server start failure, replace it with a proper synchronous return of the error from Start(). This is the vast majority of uses.
>
> It does not work like that. "Start" is a blocking call, so I am not sure how and when you decide to "return". Welcome to the golang world.
>
> Unless you provide a viable solution for this, your proposal is, in my opinion, not possible to implement.

Yes, I know "Start" is a blocking call. It must block until the server is up, i.e. until it can listen on the port. It doesn't need to block after that, and any errors that happen after that do not need to be fatal errors.

I mean this:

func (c *MyComponent) Start(_ context.Context, host component.Host) error {
  server := http.Server{...}
  // Bind synchronously so that a listen failure is returned from Start().
  ln, err := c.config.TCPAddr.Listen()
  if err != nil {
    return err
  }
  go func() {
    err := server.Serve(ln)
    if err != nil && !errors.Is(err, http.ErrServerClosed) {
      // Something happened that was not supposed to happen. Listen succeeded so Serve() should work.
      // Just log. There is no need to call ReportFatalError() here like we often do.
      // (Assumes c.logger is a *zap.Logger.)
      c.logger.Error("HTTP server failed", zap.Error(err))
      host.ReportBadHealth(c, err) // Use the new (proposed) component health reporting feature.
    }
  }()
  return nil
}

There are some paths where http.Server.Serve() may still fail (e.g. HTTP/2 config) after Listen() is successful. We may need to check if it matters for our purposes.

bogdandrutu · 2022-10-19T21:07:51Z

So your proposal is not necessarily to remove "ReportFatalError", but to try to report as many failures as possible synchronously, and we still need a "health/status" channel.

tigrannajaryan · 2022-10-19T21:33:18Z

> So your proposal is not necessarily to remove "ReportFatalError", but to try to report as many failures as possible synchronously, and we still need a "health/status" channel.

Yes, exactly. I mean instead of ReportFatalError we should report "bad health". That's it.

tigrannajaryan · 2022-10-19T22:58:07Z

> > So your proposal is not necessarily to remove "ReportFatalError", but to try to report as many failures as possible synchronously, and we still need a "health/status" channel.
>
> Yes, exactly. I mean instead of ReportFatalError we should report "bad health". That's it.

In some components, it may be useful to refactor the code so that certain parts that currently happen asynchronously after Start() returns happen within Start() instead, if we believe they are a critically necessary condition for starting the Collector. I don't know if such code exists, I am just speculating at the moment. Again, I believe the vast majority of existing ReportFatalError() calls are because of a failed Server.Serve(), which is unnecessary.

**Description:** Remove `host.ReportFatalError`. It has been deprecated since 0.87.0.

**Link to tracking Issue:** #6344

Co-authored-by: Pablo Baeyens <[email protected]>

host.ReportFatalError() was deprecated and removed. Update Windows named pipe startup code to use ReportStatus() instead.

open-telemetry/opentelemetry-collector#6344
open-telemetry/opentelemetry-collector#9506

Signed-off-by: Jeff Hostetler <[email protected]>

tigrannajaryan mentioned this issue Oct 18, 2022

Configuration reloading needs rethinking from error handling perspective #6226

Closed

tigrannajaryan mentioned this issue Oct 24, 2022

New component: OpAMP Configuration Provider. open-telemetry/opentelemetry-collector-contrib#15295

Closed


tigrannajaryan changed the title ~~Eliminate Host.ReportFatalError()~~ Eliminate Host.ReportFatalError(), replace by Component Health Reporting Nov 2, 2022

tigrannajaryan added this to the OpAMP milestone Nov 21, 2022

djaglowski mentioned this issue Apr 8, 2023

New component: Failover Connector open-telemetry/opentelemetry-collector-contrib#20766

Closed


atoulme mentioned this issue Feb 7, 2024

[component] Remove host.ReportFatalError #9506

Merged

bogdandrutu closed this as completed Mar 22, 2024
