-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[healthcheckextensionv2] Introduce health check extension based on component status reporting #30673
Conversation
05ee467
to
75fa8da
Compare
18b4a68
to
c24a335
Compare
8f4ed60
to
fcc2b7c
Compare
Thanks for all this work @mwear!
Can you elaborate on why that's needed? My understanding of the current healthcheck extension is that it doesn't actually work for checking the health of pipelines (see issue: #11780 and the readme warning: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension#health-check). Is the implementation in this PR not already suitable as a replacement in that case? I'm asking to better understanding if the PRs listed are necessary before moving this work forward. |
As you point out, there are currently some issues with the accuracy of the current health check extension. Beyond this, we now have component status reporting as a collector feature, which allows us to report status of individual components. The overall collector health and pipeline health can be derived by looking at the health of the underlying components. All components have a basic level of automated status reporting, but there are nuanced failure and recovery cases that need to be handled on a case by case basis. That is, some components will need to do some level of manual status reporting for their health to be observable. The two PRs I mentioned add manual status reporting to the core exporters based on error conditions they can encounter. They attempt to identify the RecoverableError conditions, when and if they have recovered, and the PermanentError conditions that will require human intervention to fix. Without this information, the health check extension will be able to tell you very little about the health of your exporters. You will be able to know if they are running, or not, at that's all. tl;dr: we need to add status reporting to components that we care about for purposes of collector health. Exporters are the highest priority, then receivers and processors |
.github/CODEOWNERS
Outdated
@@ -93,6 +93,7 @@ extension/encoding/textencodingextension/ @open-telemetry/collect | |||
extension/encoding/zipkinencodingextension/ @open-telemetry/collector-contrib-approvers @MovieStoreGuy @dao-jun | |||
extension/headerssetterextension/ @open-telemetry/collector-contrib-approvers @jpkrohling | |||
extension/healthcheckextension/ @open-telemetry/collector-contrib-approvers @jpkrohling | |||
extension/healthcheckextensionv2/ @open-telemetry/collector-contrib-approvers @mwear |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd much rather this replaces the existing extension, than to add a new extension :)
But i appreciate your comment about making it more difficult to review that way
Fair enough, though I would say at this stage, it's more than the current healthcheck extension's capabilities |
I agree. I suppose we can classify those PRs as nice to have in the short term and must have in the longer term. |
@mwear i've tried your healhcheck-v2 code to build my own otel collector (v0.92.0) build failing with following errors:
here is my build-config.yaml
i am using below builder version help.. |
@vynu just updated your comment to make the error messages more readable :) |
thanks @codeboten i've been trying that .. but for some reason i am seeing error while editing .. |
.github/CODEOWNERS
Outdated
|
||
|
||
## UNMAINTAINED components | ||
|
||
receiver/googlecloudspannerreceiver/ @open-telemetry/collector-contrib-approvers | ||
receiver/googlecloudspannerreceiver/ @open-telemetry/collector-contrib-approvers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert those white space changes, this file is generated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. FYI, this PR is just a reference. I've been keeping it up to date as I split off smaller PRs and they are integrated.
… service (#34049) **Description:** <Describe what has changed.> The PR is the fifth in a series to decompose #30673 into more manageable pieces for review. This PR builds on #34028 and completes the gRPC service by adding support for the streaming Watch RPC. For reference, the gRPC service is an implementation of [grpc_health_v1 service](https://github.com/grpc/grpc-proto/blob/master/grpc/health/v1/health.proto). **Link to tracking Issue:** #26661 **Testing:** Units / manual **Documentation:** Comments, etc.
**Description:** <Describe what has changed.> The PR is the sixth and final in a series to decompose #30673 into more manageable pieces for review. This PR adds code to wire up the extension as a status watcher and to manage the HTTP and gRPC services as independent subcomponents. After this merges the extension will be usable for evaluation purposes and we will likely make some refinements based on feedback from early adopters. **Link to tracking Issue:** #26661 **Testing:** Units / manual **Documentation:** Comments, etc.
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was broken down into smaller pieces, all of which have been reviewed and merged. healthcheckv2extension is available for anyone who wants to build their own collector, but not yet available in contrib. There is some refactoring of component status happening in core currently. I will aim to keep this extension working as that work is being done, but there could be some churn in the sort term. Currently the readme says the extension is not ready for use. However, it is ready for experimentation with the caveats I've just discussed. I will update the readme once the refactor in core is finished. I'm going to close this now that everything exists on main. |
Description:
This PR introduces a candidate to replace the current health check extension. This extension is based on component status reporting and provides insight into collector health at the pipeline and component level. It includes an HTTP service and a gRPC service. The HTTP service has a status endpoint that provides the ability to probe the overall collector health and the health of individual pipelines. It also provides a config endpoint to obtain the config of the currently running collector.
The gRPC service is an implementation of the grpc_health_v1 service. While not as detailed as the HTTP service, it provides the ability to check the health of the collector overall and the health of individual pipelines via a unary or streaming RPC.
Looking at the README in this PR is a good place to start to understand the changes.
I have temporarily named this extension healthcheckv2. We can discuss the best way to integrate this work going forward, but my reasoning for using a different name is to facilitate the code review (the diff would be hard to follow if we were to overwrite the current extension). If we decide to move forward with this extension and replace the existing one, we can handle the deletion of the old extension and the renaming in a followup PR.
Ultimately this extension will need some form of one of the following before it will be a suitable replacement:
The more components that adopt status reporting, the more useful this extension will be.
Link to tracking Issue: #26661
Testing: Units, Manual
Documentation: README