Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[healthcheckextensionv2] Introduce health check extension based on component status reporting #30673

Closed
wants to merge 28 commits into from

Conversation

mwear
Copy link
Member

@mwear mwear commented Jan 19, 2024

Description:
This PR introduces a candidate to replace the current health check extension. This extension is based on component status reporting and provides insight into collector health at the pipeline and component level. It includes an HTTP service and a gRPC service. The HTTP service has a status endpoint that provides the ability to probe the overall collector health and the health of individual pipelines. It also provides a config endpoint to obtain the config of the currently running collector.

The gRPC service is an implementation of the grpc_health_v1 service. While not as detailed as the HTTP service, it provides the ability to check the health of the collector overall and the health of individual pipelines via a unary or streaming RPC.

Looking at the README in this PR is a good place to start to understand the changes.

I have temporarily named this extension healthcheckv2. We can discuss the best way to integrate this work going forward, but my reasoning for using a different name is to facilitate the code review (the diff would be hard to follow if we were to overwrite the current extension). If we decide to move forward with this extension and replace the existing one, we can handle the deletion of the old extension and the renaming in a followup PR.

Ultimately this extension will need some form of one of the following before it will be a suitable replacement:

The more components that adopt status reporting, the more useful this extension will be.

Link to tracking Issue: #26661

Testing: Units, Manual

Documentation: README

@github-actions github-actions bot added cmd/otelcontribcol otelcontribcol command extension/healthcheck Health Check Extension labels Jan 19, 2024
@github-actions github-actions bot requested a review from jpkrohling January 19, 2024 03:51
@mwear mwear force-pushed the healthcheck-v2 branch 4 times, most recently from 05ee467 to 75fa8da Compare January 19, 2024 04:29
@mwear mwear force-pushed the healthcheck-v2 branch 6 times, most recently from 18b4a68 to c24a335 Compare January 19, 2024 06:31
@mwear mwear marked this pull request as ready for review January 19, 2024 07:14
@mwear mwear requested a review from a team January 19, 2024 07:14
@mwear mwear force-pushed the healthcheck-v2 branch 2 times, most recently from 8f4ed60 to fcc2b7c Compare January 19, 2024 15:51
@codeboten
Copy link
Contributor

Thanks for all this work @mwear!

Ultimately this extension will need some form of one of the following before it will be a suitable replacement:

Can you elaborate on why that's needed? My understanding of the current healthcheck extension is that it doesn't actually work for checking the health of pipelines (see issue: #11780 and the readme warning: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension#health-check). Is the implementation in this PR not already suitable as a replacement in that case?

I'm asking to better understanding if the PRs listed are necessary before moving this work forward.

@mwear
Copy link
Member Author

mwear commented Jan 19, 2024

As you point out, there are currently some issues with the accuracy of the current health check extension. Beyond this, we now have component status reporting as a collector feature, which allows us to report status of individual components. The overall collector health and pipeline health can be derived by looking at the health of the underlying components. All components have a basic level of automated status reporting, but there are nuanced failure and recovery cases that need to be handled on a case by case basis. That is, some components will need to do some level of manual status reporting for their health to be observable. The two PRs I mentioned add manual status reporting to the core exporters based on error conditions they can encounter. They attempt to identify the RecoverableError conditions, when and if they have recovered, and the PermanentError conditions that will require human intervention to fix. Without this information, the health check extension will be able to tell you very little about the health of your exporters. You will be able to know if they are running, or not, at that's all. tl;dr: we need to add status reporting to components that we care about for purposes of collector health. Exporters are the highest priority, then receivers and processors

@@ -93,6 +93,7 @@ extension/encoding/textencodingextension/ @open-telemetry/collect
extension/encoding/zipkinencodingextension/ @open-telemetry/collector-contrib-approvers @MovieStoreGuy @dao-jun
extension/headerssetterextension/ @open-telemetry/collector-contrib-approvers @jpkrohling
extension/healthcheckextension/ @open-telemetry/collector-contrib-approvers @jpkrohling
extension/healthcheckextensionv2/ @open-telemetry/collector-contrib-approvers @mwear
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd much rather this replaces the existing extension, than to add a new extension :)

But i appreciate your comment about making it more difficult to review that way

@codeboten
Copy link
Contributor

You will be able to know if they are running, or not, at that's all.

Fair enough, though I would say at this stage, it's more than the current healthcheck extension's capabilities

@mwear
Copy link
Member Author

mwear commented Jan 19, 2024

You will be able to know if they are running, or not, at that's all.

Fair enough, though I would say at this stage, it's more than the current healthcheck extension's capabilities

I agree. I suppose we can classify those PRs as nice to have in the short term and must have in the longer term.

@vynu
Copy link

vynu commented Jan 24, 2024

@mwear i've tried your healhcheck-v2 code to build my own otel collector (v0.92.0) build failing with following errors:

280.3 Error: failed to compile the OpenTelemetry Collector distribution: go subcommand failed with args '[build -trimpath -o otelcol -ldflags=-s -w]': exit status 1. Output:
280.3 # go.opentelemetry.io/collector/exporter/exporterhelper
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:65:55: undefined: obsmetrics.TagKeyExporter
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:231:28: undefined: obsmetrics.ExporterSentSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:232:30: undefined: obsmetrics.ExporterFailedToSendSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:234:28: undefined: obsmetrics.ExporterSentMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:235:30: undefined: obsmetrics.ExporterFailedToSendMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:237:28: undefined: obsmetrics.ExporterSentLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:238:30: undefined: obsmetrics.ExporterFailedToSendLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:289:30: undefined: obsmetrics.ExporterFailedToEnqueueSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:291:30: undefined: obsmetrics.ExporterFailedToEnqueueMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:293:30: undefined: obsmetrics.ExporterFailedToEnqueueLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:293:30: too many errors
280.3 # go.opentelemetry.io/collector/processor/processorhelper
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:77:58: undefined: obsmetrics.TagKeyProcessor
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:192:32: undefined: obsmetrics.ProcessorAcceptedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:193:31: undefined: obsmetrics.ProcessorRefusedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:194:31: undefined: obsmetrics.ProcessorDroppedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:196:32: undefined: obsmetrics.ProcessorAcceptedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:197:31: undefined: obsmetrics.ProcessorRefusedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:198:31: undefined: obsmetrics.ProcessorDroppedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:200:32: undefined: obsmetrics.ProcessorAcceptedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:201:31: undefined: obsmetrics.ProcessorRefusedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:202:31: undefined: obsmetrics.ProcessorDroppedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:202:31: too many errors
280.3 # go.opentelemetry.io/collector/receiver/receiverhelper
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:76:26: undefined: obsmetrics.TagKeyReceiver
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:77:26: undefined: obsmetrics.TagKeyTransport
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:306:32: undefined: obsmetrics.ReceiverAcceptedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:307:31: undefined: obsmetrics.ReceiverRefusedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:309:32: undefined: obsmetrics.ReceiverAcceptedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:310:31: undefined: obsmetrics.ReceiverRefusedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:312:32: undefined: obsmetrics.ReceiverAcceptedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:313:31: undefined: obsmetrics.ReceiverRefusedLogRecords
280.3 
------
failed to solve: process "/bin/sh -c builder --config builder-config.yaml" did not complete successfully: exit code: 1

here is my build-config.yaml

dist:
  name: otelcol
  description: Custom OTel Collector distribution
  output_path: ./otelcol-dev
  otelcol_version: 0.92.0

extensions:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextensionv2 v0.92.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.92.0

  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.92.0

  - gomod: go.opentelemetry.io/collector/exporter/debugexporter v0.92.0

  - gomod: go.opentelemetry.io/collector/exporter/loggingexporter v0.92.0

  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/fileexporter v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/splunkhecexporter v0.92.0

processors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/filterprocessor v0.92.0
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.92.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/filelogreceiver v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.92.0

replaces:
  - github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextensionv2 => ../healthcheckextensionv2

i am using below builder version
go.opentelemetry.io/collector/cmd/[email protected]

help..

@codeboten
Copy link
Contributor

@vynu just updated your comment to make the error messages more readable :)

@vynu
Copy link

vynu commented Jan 24, 2024

thanks @codeboten i've been trying that .. but for some reason i am seeing error while editing ..



## UNMAINTAINED components

receiver/googlecloudspannerreceiver/ @open-telemetry/collector-contrib-approvers
receiver/googlecloudspannerreceiver/ @open-telemetry/collector-contrib-approvers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert those white space changes, this file is generated

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. FYI, this PR is just a reference. I've been keeping it up to date as I split off smaller PRs and they are integrated.

.github/CODEOWNERS Outdated Show resolved Hide resolved
.github/CODEOWNERS Outdated Show resolved Hide resolved
.github/CODEOWNERS Outdated Show resolved Hide resolved
jpkrohling pushed a commit that referenced this pull request Jul 16, 2024
… service (#34049)

**Description:** <Describe what has changed.>
The PR is the fifth in a series to decompose #30673 into more manageable
pieces for review. This PR builds on #34028 and completes the gRPC
service by adding support for the streaming Watch RPC. For reference,
the
gRPC service is an implementation of [grpc_health_v1
service](https://github.com/grpc/grpc-proto/blob/master/grpc/health/v1/health.proto).

**Link to tracking Issue:** #26661

**Testing:** Units / manual

**Documentation:** Comments, etc.
jpkrohling pushed a commit that referenced this pull request Jul 18, 2024
**Description:** <Describe what has changed.>
The PR is the sixth and final in a series to decompose #30673 into more
manageable pieces for review. This PR adds code to wire up the extension
as a status watcher and to manage the HTTP and gRPC services as
independent subcomponents. After this merges the extension will be
usable for evaluation purposes and we will likely make some refinements
based on feedback from early adopters.

**Link to tracking Issue:** #26661

**Testing:** Units / manual

**Documentation:** Comments, etc.
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jul 26, 2024
@mwear
Copy link
Member Author

mwear commented Aug 2, 2024

This PR was broken down into smaller pieces, all of which have been reviewed and merged. healthcheckv2extension is available for anyone who wants to build their own collector, but not yet available in contrib. There is some refactoring of component status happening in core currently. I will aim to keep this extension working as that work is being done, but there could be some churn in the sort term. Currently the readme says the extension is not ready for use. However, it is ready for experimentation with the caveats I've just discussed. I will update the readme once the refactor in core is finished. I'm going to close this now that everything exists on main.

@mwear mwear closed this Aug 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.