[poc] status reporting #5158
this code was borrowed from the consumererror package
Codecov Report
@@ Coverage Diff @@
## main #5158 +/- ##
===========================================
- Coverage 90.34% 56.75% -33.59%
===========================================
Files 182 200 +18
Lines 11031 23501 +12470
===========================================
+ Hits 9966 13339 +3373
- Misses 840 9107 +8267
- Partials 225 1055 +830
I like this idea. To add to what was already brought up, we had a related discussion a few months ago (in the context of better collector status reporting), and this would largely solve that one as well.
component/componenttest/nop_host.go
Outdated
@@ -40,3 +40,7 @@ func (nh *nopHost) GetExtensions() map[config.ComponentID]component.Extension {
func (nh *nopHost) GetExporters() map[config.DataType]map[config.ComponentID]component.Exporter {
	return nil
}

func (nh *nopHost) RegisterStatusReporter(reporter component.StatusReportFunc) {}
The name StatusReporter does not sound right to me. This is not a reporter of the status, it is a consumer or receiver of the status, isn't it?
s/Reporter/Listener as a possible alternative?
To recap what we discussed during the SIG call: this would be a synchronous reporter, calling the listeners whenever new events are received.
If I understood @bogdandrutu right, he also mentioned that he'd like a function that returns the last status for each component. So, if "otlp/1" reported "fail" and then "success", calling a "GetLastStatus() ComponentState" would return only the last status. The ComponentState would probably be a struct with the component.Component and the status from the component.StatusReport.
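As a rough illustration of the recap above, here is a minimal sketch of a synchronous reporter that calls listeners on every report and remembers only the last status per component. All names (`statusReporter`, `StatusListener`, `GetLastStatus` taking a component ID, the string-based `Status`) are assumptions made for this sketch and are not the PR's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// Status and StatusReport are hypothetical stand-ins for the PR's types.
type Status string

const (
	StatusOK   Status = "success"
	StatusFail Status = "fail"
)

type StatusReport struct {
	ComponentID string // e.g. "otlp/1"
	Status      Status
}

// StatusListener is a hypothetical callback type for registered listeners.
type StatusListener func(StatusReport)

type statusReporter struct {
	mu        sync.Mutex
	listeners []StatusListener
	last      map[string]Status // last known status per component ID
}

func newStatusReporter() *statusReporter {
	return &statusReporter{last: make(map[string]Status)}
}

func (r *statusReporter) RegisterListener(l StatusListener) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.listeners = append(r.listeners, l)
}

// ReportStatus stores the report as the component's last status and then
// calls every listener synchronously, outside the lock.
func (r *statusReporter) ReportStatus(report StatusReport) {
	r.mu.Lock()
	r.last[report.ComponentID] = report.Status
	listeners := append([]StatusListener(nil), r.listeners...)
	r.mu.Unlock()

	for _, l := range listeners {
		l(report)
	}
}

// GetLastStatus returns only the most recent status for one component.
func (r *statusReporter) GetLastStatus(id string) (Status, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.last[id]
	return s, ok
}

func main() {
	r := newStatusReporter()
	r.RegisterListener(func(sr StatusReport) {
		fmt.Printf("%s -> %s\n", sr.ComponentID, sr.Status)
	})
	r.ReportStatus(StatusReport{ComponentID: "otlp/1", Status: StatusFail})
	r.ReportStatus(StatusReport{ComponentID: "otlp/1", Status: StatusOK})

	last, _ := r.GetLastStatus("otlp/1")
	fmt.Println("last:", last)
}
```

Copying the listener slice before calling out keeps a listener from deadlocking if it calls back into the reporter; whether that matters depends on how the real host would be used.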
component/host.go
Outdated
@@ -29,6 +29,10 @@ type Host interface {
	// before Component.Shutdown() begins.
	ReportFatalError(err error)

	ReportStatus(report StatusReport)
Is there an expectation that the Host should react in a specific way to the StatusReport besides propagating to the registered consumers?
I don't think so. At most, it would store the last known state for the components, but I don't see any further action than notifying listeners.
component/host.go
Outdated
@@ -29,6 +29,10 @@ type Host interface {
	// before Component.Shutdown() begins.
	ReportFatalError(err error)

	ReportStatus(report StatusReport)

	RegisterStatusReporter(reporter StatusReportFunc)
- Is there going to be an "unregister" as well?
- What happens during restarts? Are these registrations erased automatically and the components need to re-register?
There should be an unregister func option, good point.
About restarts: component restarts, or host restarts? Host restarts should indeed clean up the state of the host, including the removal of any listeners there might be. Component restarts should not have any effect here, I believe.
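One way the "unregister func option" could look is for registration to return a function that removes the listener again. The sketch below assumes that shape; `listenerRegistry`, `Register`, `Notify`, and `Reset` are hypothetical names for illustration and do not come from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// StatusReport is a hypothetical stand-in for the PR's report type.
type StatusReport struct {
	ComponentID string
	Message     string
}

type StatusListener func(StatusReport)

type listenerRegistry struct {
	mu        sync.Mutex
	nextID    int
	listeners map[int]StatusListener
}

func newListenerRegistry() *listenerRegistry {
	return &listenerRegistry{listeners: make(map[int]StatusListener)}
}

// Register adds a listener and returns a function that removes it again.
func (r *listenerRegistry) Register(l StatusListener) (unregister func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	id := r.nextID
	r.nextID++
	r.listeners[id] = l
	return func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		delete(r.listeners, id)
	}
}

// Notify calls every currently registered listener.
func (r *listenerRegistry) Notify(report StatusReport) {
	r.mu.Lock()
	snapshot := make([]StatusListener, 0, len(r.listeners))
	for _, l := range r.listeners {
		snapshot = append(snapshot, l)
	}
	r.mu.Unlock()

	for _, l := range snapshot {
		l(report)
	}
}

// Reset drops all listeners, as a host restart would in this sketch.
func (r *listenerRegistry) Reset() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.listeners = make(map[int]StatusListener)
}

func main() {
	reg := newListenerRegistry()
	unregister := reg.Register(func(sr StatusReport) {
		fmt.Println("got:", sr.ComponentID, sr.Message)
	})

	reg.Notify(StatusReport{ComponentID: "otlp/1", Message: "started"})
	unregister()
	// No listener remains, so this report is not delivered to anyone.
	reg.Notify(StatusReport{ComponentID: "otlp/1", Message: "ignored"})
}
```

Calling the returned function more than once is harmless here because deleting a missing map key is a no-op.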
This looks like a great start. During the SIG call, the PipelineWatcher was mentioned, but I'm not sure it's related to this or whether this here should deprecate and eventually replace that.
// See the License for the specific language governing permissions and
// limitations under the License.

package componenterror // import "go.opentelemetry.io/collector/component/componenterror"
Would these errors replace the ones in the consumererror package?
https://github.com/open-telemetry/opentelemetry-collector/tree/main/consumer/consumererror
Is there a good, common place we can put them for use in the component and consumer packages? If so, we should try to consolidate them.
What does permanent error mean here? I think the discussion was to have this more or less like an error with a "code/state" rather than wrapping error.
@@ -29,6 +29,10 @@ type Host interface {
	// before Component.Shutdown() begins.
	ReportFatalError(err error)
AFAIR, @bogdandrutu mentioned that this should be deprecated in favor of the new functions.
service/service.go
Outdated
@@ -34,6 +34,7 @@ type service struct {
	config              *config.Config
	telemetry           component.TelemetrySettings
	zPagesSpanProcessor *zpages.SpanProcessor
	statusReporters     *component.StatusReporters
I see this as a single status reporter, reporting the state down to all the registered listeners.
component/status.go
Outdated

type StatusReportFunc func(status StatusReport)

type StatusReporters struct {
I'm not a big fan of having this as a public API. Perhaps it should be moved to the service package and kept local to it?
The only user of that is the health check extension, which uses it to determine the status of the collector. We should probably change the health check extension to use the status from here and then, since it would have no remaining users, remove PipelineWatcher. @mwear, we also mentioned a consolidation between https://github.com/open-telemetry/opentelemetry-collector/blob/main/component/host.go#L30 and ReportStatus.
I made some changes based on the feedback I've received so far, plus a handful of other improvements as I was working through it. Here's a summary of what changed; let me know if any of it is headed in the wrong direction.
I'll update the description to reflect the current changes and add a checklist to capture current and future work. |
One general comment: please keep in mind the possibility of 2 kinds of restarts:
The status reporting should not prohibit these scenarios. |
@tigrannajaryan, I wanted to double check that I understand your concerns so that I can make sure to address them. So far, this is mainly a mechanism for components to publish status notifications to each other. The primary use case is for the health check extension to be a subscriber of the notifications, but the design allows for others. I would expect any component that subscribes to notifications to unsubscribe in their
I don't yet see how this will be consolidated with PipelineWatcher, so it is hard to tell how this is going to work. The Service is currently the authority that decides that the pipelines are ready and signals this readiness to the healthcheck extension via the PipelineWatcher interface. It is not clear to me how the new approach will work. So far I only see a mechanism for sending and receiving notifications about the health of individual components, which is fine for knowing the state of individual components, but it is not yet a replacement for the existing service healthcheck functionality. Once you show the full picture it will be clearer whether the design works or not; it is hard to tell at the moment.
type HealthNotifications interface {
	Subscribe() <-chan (HealthEvent)
	Unsubscribe(subscription <-chan (HealthEvent))
	Send(event HealthEvent)
	Stop()
}
Some questions:
- Why is this an interface instead of a struct? Should these be on the host directly?
- I know we discussed this, but I am not convinced that we need a push system for this. The majority of cases are just "tell me the health for foo". The argument I heard was that the LB exporter may use this, but I think we said that for that case ConsumerError is enough to know (maybe I am wrong, but I want to make sure we don't overextend and complicate the API).
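For readers weighing the push model in question, the following is a minimal sketch of how an interface with these four methods might be backed by channels. The `HealthEvent` fields, the buffer size, and the choice to drop events for slow subscribers are all assumptions of this sketch, not details taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// HealthEvent is a hypothetical event payload; the PR's fields may differ.
type HealthEvent struct {
	ComponentID string
	Err         error
}

// HealthNotifications mirrors the interface shown in the diff.
type HealthNotifications interface {
	Subscribe() <-chan HealthEvent
	Unsubscribe(subscription <-chan HealthEvent)
	Send(event HealthEvent)
	Stop()
}

type healthNotifications struct {
	mu      sync.Mutex
	subs    map[<-chan HealthEvent]chan HealthEvent
	stopped bool
}

// Compile-time check that the sketch satisfies the interface.
var _ HealthNotifications = (*healthNotifications)(nil)

func newHealthNotifications() *healthNotifications {
	return &healthNotifications{subs: make(map[<-chan HealthEvent]chan HealthEvent)}
}

// Subscribe returns a buffered channel that receives future events.
func (h *healthNotifications) Subscribe() <-chan HealthEvent {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan HealthEvent, 8) // buffer size is an arbitrary choice
	if h.stopped {
		close(ch)
		return ch
	}
	h.subs[ch] = ch
	return ch
}

// Unsubscribe removes the subscription and closes its channel.
func (h *healthNotifications) Unsubscribe(subscription <-chan HealthEvent) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if ch, ok := h.subs[subscription]; ok {
		delete(h.subs, subscription)
		close(ch)
	}
}

// Send delivers the event to every subscriber without blocking; an event is
// dropped for a subscriber whose buffer is full.
func (h *healthNotifications) Send(event HealthEvent) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs {
		select {
		case ch <- event:
		default:
		}
	}
}

// Stop closes all subscriptions and rejects new ones.
func (h *healthNotifications) Stop() {
	h.mu.Lock()
	defer h.mu.Unlock()
	for key, ch := range h.subs {
		delete(h.subs, key)
		close(ch)
	}
	h.stopped = true
}

func main() {
	h := newHealthNotifications()
	sub := h.Subscribe()

	h.Send(HealthEvent{ComponentID: "otlp/1", Err: errors.New("connection refused")})
	ev := <-sub
	fmt.Println(ev.ComponentID, ev.Err)

	h.Unsubscribe(sub)
	h.Stop()
}
```

A pull-style alternative, as the comment suggests, would replace the channels with a lookup of the last known health per component.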
Thanks for the feedback and discussion on this POC. I am closing this PR in favor of #5304, which addresses many of the comments and suggestions from this PR, but does so by taking a slightly different approach. I thought a new PR would help facilitate further discussion. |
Description:
I am trying to improve the startup behavior of components so that when the service they monitor is not up, they do not crash the collector. This work has been ongoing in the collector-contrib repo. See:
A proposal was suggested in a contrib issue: open-telemetry/opentelemetry-collector-contrib#8816 (comment)
This PR is an attempt at that proposal. I added a HealthNotifications subsystem with a reference to it on the Host interface. See below:
I also added two new error types, permanent and recoverable, to the componenterror package. These changes allow components such as the healthcheck extension to subscribe to health notifications, and other components, such as receivers, to send health events as needed. Below are snippets of how this would look in each case.
Subscribe
Unsubscribe
Send health event
Link to tracking Issue:
This comment is the closest thing to a tracking issue at this time: open-telemetry/opentelemetry-collector-contrib#8816 (comment)
Testing:
Documentation:
Docs are currently incomplete. I'm looking for feedback to validate the approach first.
Checklist
In order to make sure we're on the same page as to what work is being done, and what direction this is heading, I made this check list. Right now it is based on PR comments and previous discussions. If anything is missing, or something looks off, let me know. I'll keep it up to date with completed and future tasks.