This is an experimental extension that is intended to replace the existing
health check extension. As the stability level is currently development, users
wishing to experiment with this extension will have to build a custom collector
binary using the OpenTelemetry Collector Builder.
Health Check Extension V2 has new functionality that can be opted in to, and also supports the original healthcheck extension functionality, with the exception of the `check_collector_pipeline` feature. See the warning below.
⚠️ Warning ⚠️: The `check_collector_pipeline` feature of this extension was not working as expected and has been removed. The config remains for backwards compatibility, but it too will be removed in the future. Users wishing to monitor pipeline health should use the V2 functionality described below and opt in to component health as described in the component health configuration section.
Status | |
---|---|
Stability | development |
Distributions | [] |
Issues | |
Code Owners | @jpkrohling, @mwear |
Health Check Extension V1 enables an HTTP URL that can be probed to check the status of the OpenTelemetry Collector. This extension can be used as a liveness and/or readiness probe on Kubernetes.
The following settings are required:
- `endpoint` (default = `localhost:13133`): Address to publish the health check status. For a full list of `ServerConfig` options, refer here. See our security best practices doc to understand how to set the endpoint in different environments.
- `path` (default = `"/"`): Specifies the path to be configured for the health check server.
- `response_body` (default = `""`): Specifies a static body that overrides the default response returned by the health check service.
- `check_collector_pipeline` (deprecated and ignored): Settings of the collector pipeline health check:
  - `enabled` (default = `false`): Whether or not to enable the collector pipeline check.
  - `interval` (default = `"5m"`): Time interval to check the number of failures.
  - `exporter_failure_threshold` (default = 5): The failure number threshold to mark containers as healthy.
Example:
```yaml
extensions:
  health_check:
  health_check/1:
    endpoint: "localhost:13"
    tls:
      ca_file: "/path/to/ca.crt"
      cert_file: "/path/to/cert.crt"
      key_file: "/path/to/key.key"
    path: "/health/status"
```
Health Check Extension V2 provides HTTP and gRPC healthcheck services. The services can be used separately or together depending on your needs. The source of health for both services is component status reporting, a collector feature that allows individual components to report their health via `StatusEvent`s. The health check extension aggregates the component `StatusEvent`s into overall collector health and pipeline health and exposes this data through its services.
Below is a table enumerating component statuses and their meanings. These will be mapped to appropriate status codes for the protocol.
Status | Meaning |
---|---|
Starting | The component is starting. |
OK | The component is running without issue. |
RecoverableError | The component has experienced a transient error and may recover. |
PermanentError | The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. |
FatalError | The collector has experienced a fatal runtime error and will shut down. |
Stopping | The component is in the process of shutting down. |
Stopped | The component has completed shutdown. |
Note: Adoption of status reporting by collector components is still a work in progress. The accuracy of this extension will improve as more components participate.
Below is sample configuration for both the HTTP and gRPC services with component health opt-in.
Note: the `use_v2: true` setting is necessary during the interim while V1 functionality is incrementally phased out.
```yaml
extensions:
  healthcheckv2:
    use_v2: true
    component_health:
      include_permanent_errors: false
      include_recoverable_errors: true
      recovery_duration: 5m
    http:
      endpoint: "localhost:13133"
      status:
        enabled: true
        path: "/health/status"
      config:
        enabled: true
        path: "/health/config"
    grpc:
      endpoint: "localhost:13132"
      transport: "tcp"
```
By default the Health Check Extension will not consider component error statuses as unhealthy. That is, an error status will not be reflected in the response code of the health check, but it will be available in the response body regardless of configuration. This behavior can be changed by opting in to include recoverable and/or permanent errors.
To opt in to permanent errors, set `include_permanent_errors: true`. When true, a permanent error will result in a non-ok return status. By definition, this is a permanent state, and one that will require human intervention to fix. The collector is running, albeit in a degraded state, and restarting is unlikely to fix the problem. Thus, use caution when enabling this setting while using the extension as a liveness or readiness probe in Kubernetes.
To opt in to recoverable errors, set `include_recoverable_errors: true`. This setting works in tandem with the `recovery_duration` option. When true, the Health Check Extension will consider a recoverable error to be healthy until the recovery duration elapses, and unhealthy afterwards. During the recovery duration an ok status will be returned. If the collector does not recover in that time, a non-ok status will be returned. If the collector subsequently recovers, it will resume reporting an ok status.
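To make the timing rule concrete, here is a minimal Go sketch of the decision described above. It is illustrative only, not the extension's actual implementation, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// healthyDuringRecoverableError illustrates the rule above: with
// include_recoverable_errors enabled, a recoverable error is still treated as
// healthy until recovery_duration has elapsed since the error was first reported.
func healthyDuringRecoverableError(errorSince time.Time, recoveryDuration time.Duration, now time.Time) bool {
	return now.Sub(errorSince) < recoveryDuration
}

func main() {
	errorSince := time.Now().Add(-2 * time.Minute) // error first reported 2m ago
	fmt.Println(healthyDuringRecoverableError(errorSince, 5*time.Minute, time.Now())) // true: within recovery_duration
	fmt.Println(healthyDuringRecoverableError(errorSince, 1*time.Minute, time.Now())) // false: recovery_duration exceeded
}
```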
The HTTP service provides a status endpoint that can be probed for overall collector status and per-pipeline status. The endpoint is located at `/status` by default, but can be configured using the `http.status.path` setting. Requests to `/status` will return the overall collector status. To probe pipeline status, pass the pipeline name as a query parameter, e.g. `/status?pipeline=traces`.
The HTTP status code returned maps to the overall collector or pipeline status, with the mapping
described below.
Component statuses are aggregated into overall collector status and overall pipeline status. In each case, you can consider the aggregated status to be the sum of its parts. The mapping from component status to HTTP status is as follows:
Status | HTTP Status Code |
---|---|
Starting | 503 - Service Unavailable |
OK | 200 - OK |
RecoverableError | 200 - OK [1] |
PermanentError | 200 - OK [2] |
FatalError | 500 - Internal Server Error |
Stopping | 503 - Service Unavailable |
Stopped | 503 - Service Unavailable |

1. If `include_recoverable_errors: true`: 200 - OK when elapsed time < recovery duration; 500 - Internal Server Error otherwise.
2. If `include_permanent_errors: true`: 500 - Internal Server Error.
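As an illustration of probing the status endpoint, here is a minimal Go sketch that checks overall collector health and the health of a single pipeline. It assumes the HTTP service is listening on `localhost:13133` with the default `/status` path and that a pipeline named `traces` exists; adjust for your configuration.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// probe issues a GET against the health check extension's status endpoint and
// reports whether the HTTP status code indicates a healthy state (200 - OK).
func probe(url string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Printf("%s -> %d (healthy: %t)\n", url, resp.StatusCode, resp.StatusCode == http.StatusOK)
}

func main() {
	// Overall collector status.
	probe("http://localhost:13133/status")
	// Status for a single pipeline, passed as a query parameter.
	probe("http://localhost:13133/status?pipeline=traces")
}
```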
The response body contains either a detailed or non-detailed view into collector or pipeline health in JSON format. The level of detail applies to the contents of the response body and is controlled by passing `verbose` as a query parameter.
The response body contains either a partial or complete aggregate status in JSON format. The aggregation process functions similarly to a priority queue, where the most relevant status bubbles to the top. By default, FatalError > PermanentError > RecoverableError; however, the priority of RecoverableError and PermanentError will be reversed if `include_permanent_errors` is `false` and `include_recoverable_errors` is `true`, as this configuration makes RecoverableErrors more relevant.
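The following Go sketch illustrates that ordering. It is not the extension's actual aggregation code; the type, constants, and function are hypothetical stand-ins for the statuses listed earlier.

```go
package main

import "fmt"

// Status mirrors the component statuses listed earlier (hypothetical type for illustration).
type Status string

const (
	StatusOK               Status = "StatusOK"
	StatusRecoverableError Status = "StatusRecoverableError"
	StatusPermanentError   Status = "StatusPermanentError"
	StatusFatalError       Status = "StatusFatalError"
)

// priority sketches the aggregation ordering described above: higher values bubble to
// the top. With include_recoverable_errors=true and include_permanent_errors=false,
// RecoverableError outranks PermanentError; otherwise the default ordering applies.
func priority(s Status, includeRecoverable, includePermanent bool) int {
	recoverableFirst := includeRecoverable && !includePermanent
	switch s {
	case StatusFatalError:
		return 3
	case StatusPermanentError:
		if recoverableFirst {
			return 1
		}
		return 2
	case StatusRecoverableError:
		if recoverableFirst {
			return 2
		}
		return 1
	default:
		return 0
	}
}

func main() {
	// Default ordering: PermanentError is more relevant than RecoverableError.
	fmt.Println(priority(StatusPermanentError, false, false) > priority(StatusRecoverableError, false, false)) // true
	// Recoverable errors opted in, permanent errors not: the ordering is reversed.
	fmt.Println(priority(StatusRecoverableError, true, false) > priority(StatusPermanentError, true, false)) // true
}
```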
The detailed response body for collector health will include the overall status for the collector, the overall status for each pipeline in the collector, and the statuses for the individual components in each pipeline. The non-detailed response will only contain the overall collector health.
Verbose Example
Assuming the health check extension is configured with `http.status.endpoint` set to `localhost:13133`, a request to `http://localhost:13133/status?verbose` will have a response body such as:
```json
{
  "start_time": "2024-01-18T17:27:12.570394-08:00",
  "healthy": true,
  "status": "StatusRecoverableError",
  "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
  "status_time": "2024-01-18T17:27:32.572301-08:00",
  "components": {
    "extensions": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.570428-08:00",
      "components": {
        "extension:healthcheckv2": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.570428-08:00"
        }
      }
    },
    "pipeline:metrics/grpc": {
      "healthy": true,
      "status": "StatusRecoverableError",
      "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
      "status_time": "2024-01-18T17:27:32.572301-08:00",
      "components": {
        "exporter:otlp/staging": {
          "healthy": true,
          "status": "StatusRecoverableError",
          "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
          "status_time": "2024-01-18T17:27:32.572301-08:00"
        },
        "processor:batch": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571132-08:00"
        },
        "receiver:otlp": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571576-08:00"
        }
      }
    },
    "pipeline:traces/http": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571625-08:00",
      "components": {
        "exporter:otlphttp/staging": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571615-08:00"
        },
        "processor:batch": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571621-08:00"
        },
        "receiver:otlp": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571625-08:00"
        }
      }
    }
  }
}
```
Note the following based on this response:
- The overall status is `StatusRecoverableError`, but the status is healthy because `include_recoverable_errors` is set to `false`, or it is `true` and the recovery duration has not yet passed.
- `pipeline:metrics/grpc` has a matching status, as does `exporter:otlp/staging`. This implicates the exporter as the root cause for the pipeline and overall collector status.
- `pipeline:traces/http` is completely healthy.
Non-verbose Response Example
If the same request is made to a collector without setting the verbose flag, only the overall status will be returned. The pipeline and component level statuses will be omitted.
```json
{
  "start_time": "2024-01-18T17:39:15.87324-08:00",
  "healthy": true,
  "status": "StatusRecoverableError",
  "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
  "status_time": "2024-01-18T17:39:35.875024-08:00"
}
```
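For clients that want to act on the body rather than just the status code, here is a minimal Go sketch that decodes the non-verbose response shown above. The struct is illustrative and derived from the JSON keys in the example; it is not a published client type, and the endpoint is assumed to be the default `localhost:13133/status`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// collectorStatus mirrors the non-verbose response body shown above
// (field names are taken from the example JSON; hypothetical client-side type).
type collectorStatus struct {
	StartTime  time.Time `json:"start_time"`
	Healthy    bool      `json:"healthy"`
	Status     string    `json:"status"`
	Error      string    `json:"error,omitempty"`
	StatusTime time.Time `json:"status_time"`
}

func main() {
	resp, err := http.Get("http://localhost:13133/status")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var s collectorStatus
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%s healthy=%t error=%q\n", s.Status, s.Healthy, s.Error)
}
```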
The detailed response body for pipeline health is essentially a zoomed in version of the detailed collector response. It contains the overall status for the pipeline and the statuses of the individual components. The non-detailed response body contains only the overall status for the pipeline.
Verbose Response Example
Assuming the health check extension is configured with `http.status.endpoint` set to `localhost:13133`, a request to `http://localhost:13133/status?pipeline=traces/http&verbose` will have a response body such as:
```json
{
  "start_time": "2024-01-18T17:27:12.570394-08:00",
  "healthy": true,
  "status": "StatusOK",
  "status_time": "2024-01-18T17:27:12.571625-08:00",
  "components": {
    "exporter:otlphttp/staging": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571615-08:00"
    },
    "processor:batch": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571621-08:00"
    },
    "receiver:otlp": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571625-08:00"
    }
  }
}
```
Non-detailed Response Example
If the same request is made without the verbose flag, only the overall pipeline status will be returned. The component level statuses will be omitted.
```json
{
  "start_time": "2024-01-18T17:39:15.87324-08:00",
  "healthy": true,
  "status": "StatusOK",
  "status_time": "2024-01-18T17:39:15.874236-08:00"
}
```
The HTTP service optionally exposes an endpoint that provides the collector configuration. Note, the configuration returned is unfiltered and may contain sensitive information. As such, this endpoint is disabled by default. Enable it using the `http.config.enabled` setting. By default the path will be `/config`, but it can be changed using the `http.config.path` setting.
The health check extension provides an implementation of the grpc_health_v1 service. The service was chosen for compatibility with existing gRPC health checks; however, it does not provide the additional detail available with the HTTP service. Additionally, the gRPC service has a less nuanced view of the world, with only two reportable statuses: `HealthCheckResponse_SERVING` and `HealthCheckResponse_NOT_SERVING`.
The HTTP and gRPC services use the same method of component status aggregation to derive overall collector health and pipeline health from individual status events. The component statuses map to the following `HealthCheckResponse_ServingStatus`es.
Status | HealthCheckResponse_ServingStatus |
---|---|
Starting | NOT_SERVING |
OK | SERVING |
RecoverableError | SERVING [1] |
PermanentError | SERVING [2] |
FatalError | NOT_SERVING |
Stopping | NOT_SERVING |
Stopped | NOT_SERVING |

1. If `include_recoverable_errors: true`: SERVING when elapsed time < recovery duration; NOT_SERVING otherwise.
2. If `include_permanent_errors: true`: NOT_SERVING.
The gRPC service exposes two RPCs: `Check` and `Watch` (more about those below). Each takes a `HealthCheckRequest` argument. The `HealthCheckRequest` message is defined as:

```protobuf
message HealthCheckRequest {
  string service = 1;
}
```
To query for overall collector health, use the empty string `""` as the `service` name. To query for pipeline health, use the pipeline name as the `service`.
The `Check` RPC is defined as:

```protobuf
rpc Check(HealthCheckRequest) returns (HealthCheckResponse)
```
If the service is unknown, the RPC will return an error with status `NotFound`. Otherwise it will return a `HealthCheckResponse` with the serving status as mapped in the table above.
The `Watch` RPC is defined as:

```protobuf
rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse)
```

The `Watch` RPC will initiate a stream for the given `service`. If the service is known at the time the RPC is made, its current status will be sent, and changes in status will be sent thereafter. If the service is unknown, a response with a status of `HealthCheckResponse_SERVICE_UNKNOWN` will be sent. The stream will remain open, and if and when the service starts reporting, its status will begin streaming.
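To make the two RPCs concrete, here is a minimal Go client sketch using the standard `grpc_health_v1` package. It assumes the gRPC service is listening on `localhost:13132` without TLS, as in the sample configuration above, and that a pipeline named `traces/http` exists; both are assumptions for illustration.

```go
package main

import (
	"context"
	"io"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Assumes the extension's gRPC service is reachable at localhost:13132 without TLS.
	conn, err := grpc.Dial("localhost:13132", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := healthpb.NewHealthClient(conn)

	// Check: an empty service name queries overall collector health; a pipeline
	// name (e.g. "traces/http") queries that pipeline. Unknown services return NotFound.
	resp, err := client.Check(context.Background(), &healthpb.HealthCheckRequest{Service: ""})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("collector: %s", resp.Status) // SERVING or NOT_SERVING

	// Watch: streams the current status and subsequent changes for a service.
	// SERVICE_UNKNOWN is sent first if the service is not yet known.
	stream, err := client.Watch(context.Background(), &healthpb.HealthCheckRequest{Service: "traces/http"})
	if err != nil {
		log.Fatal(err)
	}
	for {
		update, err := stream.Recv()
		if err == io.EOF {
			return
		}
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("pipeline traces/http: %s", update.Status)
	}
}
```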
There are plans to provide the ability to export status events as OTLP logs adhering to the event semantic conventions.