Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[poc] status reporting #5158

Closed
wants to merge 9 commits into from
Closed

Conversation

mwear
Copy link
Member

@mwear mwear commented Apr 6, 2022

Description:
I am trying to improve the startup behavior of components so that when the service they monitor is not up, they do not crash the collector. This work has been ongoing in the colletor-contrib repo. See:

A proposal was suggested in a contrib issue: open-telemetry/opentelemetry-collector-contrib#8816 (comment)

The proposal is then to change the component.Host to allow components to report its state (and continue reporting it), adding another function allowing components to register for callbacks, to be called whenever there's a report. This way, components such as the load balancing exporter and the health check extension can register for callbacks, whereas scrapers and components like the OTLP exporter can report their connection failures in real time.

This PR is an attempt at that proposal. I added a HealthNotifications subsystem with a reference to it on the Host interface. See below:

type HealthNotifications interface {
      Subscribe() <-chan (HealthEvent)
      Unsubscribe(subscription <-chan (HealthEvent))
      Send(event HealthEvent)
      Stop()
}
type Host interface {
     ...

     HealthNotifications() HealthNotifications
     
     ...
}

I also added two new error types permanent, and recoverable to the componenterror package. These changes allow components, such as the healthcheck extension to subscribe to health notifications, and other components, such as receivers to send health eventws as needed. Below are snippets of how this would look in each case.

Subscribe

sub := host.HealthNotifications().Subscribe()

go func() {
    for {
        event, ok := <-sub
        if !ok {
            return
        }
        switch err := event.Error; {
        case componenterror.IsPermanent(err):
            //handle permanent
        case componenterror.IsRecoverable(err):
            //handle recoverable
        default:
            //handle unknown
        }
    }
}()

Unsubscribe

// pass-in previously obtained subscription
host.HealthNotifications().Unsubscribe(sub)

Send health event

if err := s.connect(); err != nil {
    host.HealthNotifications().Send(component.HealthEvent{
        ComponentID: s.config.ID(),
        Error:       componenterror.NewRecoverable(err),
    })
}

Link to tracking Issue:
This comment is the closest thing to a tracking issue at this time: open-telemetry/opentelemetry-collector-contrib#8816 (comment)

Testing:

  • Unit tests
  • Manual test building collector contrib

Documentation:
Docs are currently incomplete. I'm looking for feedback to validate the approach first.

Checklist

In order to make sure we're on the same page as to what work is being done, and what direction this is heading, I made this check list. Right now it is based on PR comments and previous discussions. If anything is missing, or something looks off, let me know. I'll keep it up to date with completed and future tasks.

  • Improve naming (take 1)
  • Add unsubscribe
  • Make HealthNotifications an interface in the host package; move the implementation to service
  • Deprecate and consolidate ReportFatalError
  • Consolidate duplication of error types in the consumer / component packages
  • Consolidate PipelineWatcher functionality

@codecov
Copy link

codecov bot commented Apr 6, 2022

Codecov Report

Merging #5158 (6113594) into main (71eed9e) will decrease coverage by 33.58%.
The diff coverage is 97.29%.

@@             Coverage Diff             @@
##             main    #5158       +/-   ##
===========================================
- Coverage   90.34%   56.75%   -33.59%     
===========================================
  Files         182      200       +18     
  Lines       11031    23501    +12470     
===========================================
+ Hits         9966    13339     +3373     
- Misses        840     9107     +8267     
- Partials      225     1055      +830     
Impacted Files Coverage Δ
component/componenttest/nop_host.go 81.81% <0.00%> (-18.19%) ⬇️
component/componenterror/permanent.go 100.00% <100.00%> (ø)
component/componenterror/recoverable.go 100.00% <100.00%> (ø)
service/health.go 100.00% <100.00%> (ø)
service/service.go 56.17% <100.00%> (+3.16%) ⬆️
service/config_provider.go 90.22% <0.00%> (-4.12%) ⬇️
service/command.go 80.00% <0.00%> (-2.36%) ⬇️
consumer/consumertest/err.go 100.00% <0.00%> (ø)
consumer/consumertest/nop.go 100.00% <0.00%> (ø)
internal/otlptext/metrics.go 100.00% <0.00%> (ø)
... and 81 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 71eed9e...6113594. Read the comment docs.

@mwear mwear force-pushed the status_reporting branch from c9ff557 to 8bdaf50 Compare April 6, 2022 01:35
@pmm-sumo
Copy link
Contributor

pmm-sumo commented Apr 6, 2022

I like this idea. To add on what was already brought, we had a related discussion a few months ago (in context of better collector status reporting) and this would largely solve that one as well

@@ -40,3 +40,7 @@ func (nh *nopHost) GetExtensions() map[config.ComponentID]component.Extension {
func (nh *nopHost) GetExporters() map[config.DataType]map[config.ComponentID]component.Exporter {
return nil
}

func (nh *nopHost) RegisterStatusReporter(reporter component.StatusReportFunc) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name StatusReporter does not sound right to me. This is not a reporter of the status, it is a consumer or receiver of the status, isn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Reporter/Listener as a possible alternative?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To recap what we discussed during the SIG call: this here would be a sync reporter, calling the listeners whenever new events are received.

If I understood @bogdandrutu right, he mentioned as well that he'd like to get a function that returns a list of last status for each component. So, if "otlp/1" reported "fail" and then "success", calling a "GetLastStatus() ComponentState" would return only the last status. The ComponentState would probably be a struct with component.Component and the status from the component.StatusReport.

@@ -29,6 +29,10 @@ type Host interface {
// before Component.Shutdown() begins.
ReportFatalError(err error)

ReportStatus(report StatusReport)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an expectation that the Host should react in a specific way to the StatusReport besides propagating to the registered consumers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. At most, it would store the last known state for the components, but I don't see any further action than notifying listeners.

@@ -29,6 +29,10 @@ type Host interface {
// before Component.Shutdown() begins.
ReportFatalError(err error)

ReportStatus(report StatusReport)

RegisterStatusReporter(reporter StatusReportFunc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is there going to be an "unregister" as well?
  2. What happens during restarts? Are these registrations erased automatically and the components need to re-register?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be an unregister func option, good point.

About restarts: component restarts, or host restarts? Host restarts should indeed clean up the state of the host, including the removal of any listeners there might be. Component restarts should not have any effect here, I believe.

Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great start. During the SIG call, the PipelineWatcher was mentioned, but I'm not sure it's related to this or whether this here should deprecate and eventually replace that.

// See the License for the specific language governing permissions and
// limitations under the License.

package componenterror // import "go.opentelemetry.io/collector/component/componenterror"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would those errors here replaced the ones in the consumererror package?

https://github.com/open-telemetry/opentelemetry-collector/tree/main/consumer/consumererror

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good, common place we can put them for use in the component and consumer packages? If so, we should try to consolidate them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does permanent error mean here? I think the discussion was to have this more or less like an error with a "code/state" rather than wrapping error.

@@ -40,3 +40,7 @@ func (nh *nopHost) GetExtensions() map[config.ComponentID]component.Extension {
func (nh *nopHost) GetExporters() map[config.DataType]map[config.ComponentID]component.Exporter {
return nil
}

func (nh *nopHost) RegisterStatusReporter(reporter component.StatusReportFunc) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Reporter/Listener as a possible alternative?

@@ -40,3 +40,7 @@ func (nh *nopHost) GetExtensions() map[config.ComponentID]component.Extension {
func (nh *nopHost) GetExporters() map[config.DataType]map[config.ComponentID]component.Exporter {
return nil
}

func (nh *nopHost) RegisterStatusReporter(reporter component.StatusReportFunc) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To recap what we discussed during the SIG call: this here would be a sync reporter, calling the listeners whenever new events are received.

If I understood @bogdandrutu right, he mentioned as well that he'd like to get a function that returns a list of last status for each component. So, if "otlp/1" reported "fail" and then "success", calling a "GetLastStatus() ComponentState" would return only the last status. The ComponentState would probably be a struct with component.Component and the status from the component.StatusReport.

@@ -29,6 +29,10 @@ type Host interface {
// before Component.Shutdown() begins.
ReportFatalError(err error)

ReportStatus(report StatusReport)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. At most, it would store the last known state for the components, but I don't see any further action than notifying listeners.

@@ -29,6 +29,10 @@ type Host interface {
// before Component.Shutdown() begins.
ReportFatalError(err error)

ReportStatus(report StatusReport)

RegisterStatusReporter(reporter StatusReportFunc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be an unregister func option, good point.

About restarts: component restarts, or host restarts? Host restarts should indeed clean up the state of the host, including the removal of any listeners there might be. Component restarts should not have any effect here, I believe.

@@ -29,6 +29,10 @@ type Host interface {
// before Component.Shutdown() begins.
ReportFatalError(err error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR, @bogdandrutu mentioned that this should be deprecated in favor of the new functions.

@@ -34,6 +34,7 @@ type service struct {
config *config.Config
telemetry component.TelemetrySettings
zPagesSpanProcessor *zpages.SpanProcessor
statusReporters *component.StatusReporters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as a single status reporter, reporting the state down to all the registered listeners.


type StatusReportFunc func(status StatusReport)

type StatusReporters struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of having this as a public API. Perhaps it should be moved to the service package and kept local to it?

@bogdandrutu
Copy link
Member

This looks like a great start. During the SIG call, the PipelineWatcher was mentioned, but I'm not sure it's related to this or whether this here should deprecate and eventually replace that.

The only user for that is the Health check extension, which based on that determines the status of the collector. We should probably change Health to use status from here, and then because no users remove PipelineWatcher

@mwear also we mentioned a consolidation between https://github.com/open-telemetry/opentelemetry-collector/blob/main/component/host.go#L30 and ReportStatus.

@mwear
Copy link
Member Author

mwear commented Apr 11, 2022

I made some changes based on the feedback I've received so far and a handful other improvements as I was working through it. Here's a summary or what changed, let me know if any are headed in the wrong direction.

  • I renamed most things. After seeing this google doc about heath event reporting, I renamed things to align better with what was there.
    • This subsystem is now called HealthNotifications instead of StatusReporters.
    • The the data that flows between components are HealthEvents instead of StatusReports
    • I could also see calling this StatusNotifications and StatusEvents if that would be preferable. I'm open to other names too.
  • Instead of registering a callback, a caller will subscribe to HealthNotifications and receive a channel that HealthEvents are streamed over
    • I added unsubscribe functionality
    • Right now, a subscriber receives HealthEvents for all components. We can consider whether or not we would like to extend this for per-component subscriptions
  • Instead of having methods on host that delegate to HealthNotifications, HealthNotifications is exposed directly
    • For example, call host.HealthNotifications.Subscribe() or host.HealthNotifications.Send(...) instead of host.SubcribeToHealthNotifications or host.SendHealthEvent(...)
  • Right now a HealthEvent only has a ComponentID and an Error. I assume we'll want to expand HealthEvents with additional fields, if anyone has suggestions, let me know and I will start to incorporate them. I'd imagine we'd like some convenience methods to help construct them if they become more complex

I'll update the description to reflect the current changes and add a checklist to capture current and future work.

@tigrannajaryan
Copy link
Member

One general comment: please keep in mind the possibility of 2 kinds of restarts:

  • Entire service restarts because a new config is available. All pipelines are shutdown and recreated. How does this affect the health status?
  • Service restarts particular because a new config is available and the Service knows which pipelines/components changed and which did not. As a result of this some pipelines may not be touched at all, some pipelines may need to be fully shutdown and recreated (or new pipelines created) and some pipelines may need to be partially shutdown/recreated (e.g. only a receiver config is changed, which means we do not need to restart everything else in that pipeline).

The status reporting should not prohibit these scenarios.

@mwear
Copy link
Member Author

mwear commented Apr 12, 2022

@tigrannajaryan, I wanted to double check that I understand your concerns so that I can make sure to address them. So far, this is mainly a mechanism for components to publish status notifications between each other. The primary use case is for the health check extension to be a subscriber of the notifications, but the design allows for more others. I would expect any component that subscribes to notifcations to unsubscribe in their Shutdown methods. If the entire service is restarted (which I assume is a Shutdown followed by a Start), the HealthNotifications will unsubscribe all components during its Stop method. The restarted components should subscribe during the startup (in the same way they originally subscribed). If it turns out that we need to preserve any state between restarts, we can always move the HealthNotifications up from the service to the collector. We could move it now if it's a better fit there. Are there any immediate concerns with this design or any alternatives I should consider?

@mwear mwear force-pushed the status_reporting branch from 105cd64 to 6113594 Compare April 12, 2022 01:55
@tigrannajaryan
Copy link
Member

Are there any immediate concerns with this design or any alternatives I should consider?

I don't yet see how this will be consolidated with PipelineWatcher, so hard to tell how this is going to work.

The Service is currently the authority that decides that the pipelines are ready and signals this readiness to the healthcheck extension via PipelineWatcher interface.

It is not clear to me how the new approach will work. So far I only see a mechanism to sending/receiving notifications about the health of individual components, which is fine for just knowing the state of individual components, but it is not a replacement for the existing service healthcheck functionality yet. Once you show the full picture it will be clearer if the design works or no, hard to tell at the moment.

// See the License for the specific language governing permissions and
// limitations under the License.

package componenterror // import "go.opentelemetry.io/collector/component/componenterror"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does permanent error mean here? I think the discussion was to have this more or less like an error with a "code/state" rather than wrapping error.

Comment on lines +26 to +31
type HealthNotifications interface {
Subscribe() <-chan (HealthEvent)
Unsubscribe(subscription <-chan (HealthEvent))
Send(event HealthEvent)
Stop()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions:

  1. Why is this an interface instead of a struct? Should these be on the host directly?
  2. I know we discussed about this, but I am not convinced that we need a push systems for this. Majority of the cases are just "tell me the health for foo". The argument I heard that the LB exporter may use this, I think we said that for that case ConsumerError is enough to know (maybe I am wrong, but want to make sure we don't overextend and complicate the API)

@mwear mwear mentioned this pull request May 3, 2022
@mwear
Copy link
Member Author

mwear commented May 3, 2022

Thanks for the feedback and discussion on this POC. I am closing this PR in favor of #5304, which addresses many of the comments and suggestions from this PR, but does so by taking a slightly different approach. I thought a new PR would help facilitate further discussion.

@mwear mwear closed this May 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants