This is an experimental extension that is intended to replace the existing
health check extension. As the stability level is currently development, users
wishing to experiment with this extension will have to build a custom collector
binary using the OpenTelemetry Collector Builder.
Health Check Extension V2 has new functionality that can be opted in to, and also supports the original healthcheck extension functionality, with the exception of the `check_collector_pipeline` feature. See the warning below.
⚠️ Warning ⚠️: The `check_collector_pipeline` feature of this extension was not working as expected and has been removed. The config remains for backwards compatibility, but it too will be removed in the future. Users wishing to monitor pipeline health should use the V2 functionality described below and opt in to component health as described in the component health configuration section.
Status | |
---|---|
Stability | development |
Distributions | [] |
Issues | |
Code Owners | @jpkrohling, @mwear |
Health Check Extension V1 enables an HTTP URL that can be probed to check the status of the OpenTelemetry Collector. This extension can be used as a liveness and/or readiness probe on Kubernetes.
The following settings are required:
- `endpoint` (default = `localhost:13133`): Address to publish the health check status. For a full list of `ServerConfig` options, refer here. See our security best practices doc to understand how to set the endpoint in different environments.
- `path` (default = `"/"`): Specifies the path to be configured for the health check server.
- `response_body` (default = `""`): Specifies a static body that overrides the default response returned by the health check service.
- `check_collector_pipeline` (deprecated and ignored): Settings of the collector pipeline health check:
  - `enabled` (default = `false`): Whether or not to enable the collector pipeline check.
  - `interval` (default = `"5m"`): Time interval to check the number of failures.
  - `exporter_failure_threshold` (default = 5): The failure number threshold to mark containers as healthy.
Example:
```yaml
extensions:
  health_check:
  health_check/1:
    endpoint: "localhost:13"
    tls:
      ca_file: "/path/to/ca.crt"
      cert_file: "/path/to/cert.crt"
      key_file: "/path/to/key.key"
    path: "/health/status"
```
Health Check Extension V2 provides HTTP and gRPC healthcheck services. The services can be used separately or together depending on your needs. The source of health for both services is component status reporting, a collector feature that allows individual components to report their health via `StatusEvent`s. The health check extension aggregates the component `StatusEvent`s into overall collector health and pipeline health and exposes this data through its services.
Below is a table enumerating component statuses and their meanings. These will be mapped to appropriate status codes for the protocol.
Status | Meaning |
---|---|
Starting | The component is starting. |
OK | The component is running without issue. |
RecoverableError | The component has experienced a transient error and may recover. |
PermanentError | The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. |
FatalError | The collector has experienced a fatal runtime error and will shut down. |
Stopping | The component is in the process of shutting down. |
Stopped | The component has completed shutdown. |
Note: Adoption of status reporting by collector components is still a work in progress. The accuracy of this extension will improve as more components participate.
Below is sample configuration for both the HTTP and gRPC services with component health opt-in.
Note: the `use_v2: true` setting is necessary during the interim while V1 functionality is incrementally phased out.
```yaml
extensions:
  healthcheckv2:
    use_v2: true
    component_health:
      include_permanent_errors: false
      include_recoverable_errors: true
      recovery_duration: 5m
    http:
      endpoint: "localhost:13133"
      status:
        enabled: true
        path: "/health/status"
      config:
        enabled: true
        path: "/health/config"
    grpc:
      endpoint: "localhost:13132"
      transport: "tcp"
```
By default the Health Check Extension will not consider component error statuses as unhealthy. That is, an error status will not be reflected in the response code of the health check, but it will be available in the response body regardless of configuration. This behavior can be changed by opting in to include recoverable and/or permanent errors.
To opt in to permanent errors, set `include_permanent_errors: true`. When true, a permanent error will result in a non-ok return status. By definition, this is a permanent state, and one that will require human intervention to fix. The collector is running, albeit in a degraded state, and restarting is unlikely to fix the problem. Thus, use caution when enabling this setting while using the extension as a liveness or readiness probe in Kubernetes.
To opt in to recoverable errors, set `include_recoverable_errors: true`. This setting works in tandem with the `recovery_duration` option. When true, the Health Check Extension will consider a recoverable error to be healthy until the recovery duration elapses, and unhealthy afterwards. During the recovery duration an ok status will be returned. If the collector does not recover in that time, a non-ok status will be returned. If the collector subsequently recovers, it will resume reporting an ok status.
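To make the timing rule concrete, here is a minimal Go sketch of the decision described above. It is illustrative only, not the extension's actual implementation, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// healthyDuringRecoverableError illustrates the rule above: with
// include_recoverable_errors enabled, a recoverable error is still treated as
// healthy until recovery_duration has elapsed since the error was first reported.
func healthyDuringRecoverableError(errorSince time.Time, recoveryDuration time.Duration, now time.Time) bool {
	return now.Sub(errorSince) < recoveryDuration
}

func main() {
	errorSince := time.Now().Add(-2 * time.Minute) // error first reported 2m ago
	fmt.Println(healthyDuringRecoverableError(errorSince, 5*time.Minute, time.Now())) // true: within recovery_duration
	fmt.Println(healthyDuringRecoverableError(errorSince, 1*time.Minute, time.Now())) // false: recovery_duration exceeded
}
```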
The HTTP service provides a status endpoint that can be probed for overall collector status and per-pipeline status. The endpoint is located at `/status` by default, but can be configured using the `http.status.path` setting. Requests to `/status` will return the overall collector status. To probe pipeline status, pass the pipeline name as a query parameter, e.g. `/status?pipeline=traces`.
The HTTP status code returned maps to the overall collector or pipeline status, with the mapping
described below.
Component statuses are aggregated into overall collector status and overall pipeline status. In each case, you can consider the aggregated status to be the sum of its parts. The mapping from component status to HTTP status is as follows:
Status | HTTP Status Code |
---|---|
Starting | 503 - Service Unavailable |
OK | 200 - OK |
RecoverableError | 200 - OK [1] |
PermanentError | 200 - OK [2] |
FatalError | 500 - Internal Server Error |
Stopping | 503 - Service Unavailable |
Stopped | 503 - Service Unavailable |

1. If `include_recoverable_errors: true`: 200 - OK when elapsed time < recovery duration; 500 - Internal Server Error otherwise.
2. If `include_permanent_errors: true`: 500 - Internal Server Error.
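As an illustration of probing the status endpoint, here is a minimal Go sketch that checks overall collector health and the health of a single pipeline. It assumes the HTTP service is listening on `localhost:13133` with the default `/status` path and that a pipeline named `traces` exists; adjust for your configuration.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// probe issues a GET against the health check extension's status endpoint and
// reports whether the HTTP status code indicates a healthy state (200 - OK).
func probe(url string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Printf("%s -> %d (healthy: %t)\n", url, resp.StatusCode, resp.StatusCode == http.StatusOK)
}

func main() {
	// Overall collector status.
	probe("http://localhost:13133/status")
	// Status for a single pipeline, passed as a query parameter.
	probe("http://localhost:13133/status?pipeline=traces")
}
```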
The response body contains either a detailed or non-detailed view into collector or pipeline health in JSON format. The level of detail applies to the contents of the response body and is controlled by passing `verbose` as a query parameter.
The response body contains either a partial or complete aggregate status in JSON format. The aggregation process functions similarly to a priority queue, where the most relevant status bubbles to the top. By default, FatalError > PermanentError > RecoverableError; however, the priority of RecoverableError and PermanentError will be reversed if `include_permanent_errors` is `false` and `include_recoverable_errors` is `true`, as this configuration makes RecoverableErrors more relevant.
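The following Go sketch illustrates that ordering. It is not the extension's actual aggregation code; the type, constants, and function are hypothetical stand-ins for the statuses listed earlier.

```go
package main

import "fmt"

// Status mirrors the component statuses listed earlier (hypothetical type for illustration).
type Status string

const (
	StatusOK               Status = "StatusOK"
	StatusRecoverableError Status = "StatusRecoverableError"
	StatusPermanentError   Status = "StatusPermanentError"
	StatusFatalError       Status = "StatusFatalError"
)

// priority sketches the aggregation ordering described above: higher values bubble to
// the top. With include_recoverable_errors=true and include_permanent_errors=false,
// RecoverableError outranks PermanentError; otherwise the default ordering applies.
func priority(s Status, includeRecoverable, includePermanent bool) int {
	recoverableFirst := includeRecoverable && !includePermanent
	switch s {
	case StatusFatalError:
		return 3
	case StatusPermanentError:
		if recoverableFirst {
			return 1
		}
		return 2
	case StatusRecoverableError:
		if recoverableFirst {
			return 2
		}
		return 1
	default:
		return 0
	}
}

func main() {
	// Default ordering: PermanentError is more relevant than RecoverableError.
	fmt.Println(priority(StatusPermanentError, false, false) > priority(StatusRecoverableError, false, false)) // true
	// Recoverable errors opted in, permanent errors not: the ordering is reversed.
	fmt.Println(priority(StatusRecoverableError, true, false) > priority(StatusPermanentError, true, false)) // true
}
```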
The detailed response body for collector health will include the overall status for the collector, the overall status for each pipeline in the collector, and the statuses for the individual components in each pipeline. The non-detailed response will only contain the overall collector health.
Verbose Example
Assuming the health check extension is configured with `http.status.endpoint` set to `localhost:13133`, a request to `http://localhost:13133/status?verbose` will have a response body such as:
```json
{
  "start_time": "2024-01-18T17:27:12.570394-08:00",
  "healthy": true,
  "status": "StatusRecoverableError",
  "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
  "status_time": "2024-01-18T17:27:32.572301-08:00",
  "components": {
    "extensions": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.570428-08:00",
      "components": {
        "extension:healthcheckv2": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.570428-08:00"
        }
      }
    },
    "pipeline:metrics/grpc": {
      "healthy": true,
      "status": "StatusRecoverableError",
      "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
      "status_time": "2024-01-18T17:27:32.572301-08:00",
      "components": {
        "exporter:otlp/staging": {
          "healthy": true,
          "status": "StatusRecoverableError",
          "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
          "status_time": "2024-01-18T17:27:32.572301-08:00"
        },
        "processor:batch": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571132-08:00"
        },
        "receiver:otlp": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571576-08:00"
        }
      }
    },
    "pipeline:traces/http": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571625-08:00",
      "components": {
        "exporter:otlphttp/staging": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571615-08:00"
        },
        "processor:batch": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571621-08:00"
        },
        "receiver:otlp": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571625-08:00"
        }
      }
    }
  }
}
```
Note the following based on this response:
- The overall status is `StatusRecoverableError`, but the status is healthy because `include_recoverable_errors` is set to `false`, or it is `true` and the recovery duration has not yet passed.
- `pipeline:metrics/grpc` has a matching status, as does `exporter:otlp/staging`. This implicates the exporter as the root cause for the pipeline and overall collector status.
- `pipeline:traces/http` is completely healthy.
Non-verbose Response Example
If the same request is made to a collector without setting the verbose flag, only the overall status will be returned. The pipeline and component level statuses will be omitted.
```json
{
  "start_time": "2024-01-18T17:39:15.87324-08:00",
  "healthy": true,
  "status": "StatusRecoverableError",
  "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
  "status_time": "2024-01-18T17:39:35.875024-08:00"
}
```
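For clients that want to act on the body rather than just the status code, here is a minimal Go sketch that decodes the non-verbose response shown above. The struct is illustrative and derived from the JSON keys in the example; it is not a published client type, and the endpoint is assumed to be the default `localhost:13133/status`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// collectorStatus mirrors the non-verbose response body shown above
// (field names are taken from the example JSON; hypothetical client-side type).
type collectorStatus struct {
	StartTime  time.Time `json:"start_time"`
	Healthy    bool      `json:"healthy"`
	Status     string    `json:"status"`
	Error      string    `json:"error,omitempty"`
	StatusTime time.Time `json:"status_time"`
}

func main() {
	resp, err := http.Get("http://localhost:13133/status")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var s collectorStatus
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%s healthy=%t error=%q\n", s.Status, s.Healthy, s.Error)
}
```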
The detailed response body for pipeline health is essentially a zoomed in version of the detailed collector response. It contains the overall status for the pipeline and the statuses of the individual components. The non-detailed response body contains only the overall status for the pipeline.
Verbose Response Example
Assuming the health check extension is configured with `http.status.endpoint` set to `localhost:13133`, a request to `http://localhost:13133/status?pipeline=traces/http&verbose` will have a response body such as:
```json
{
  "start_time": "2024-01-18T17:27:12.570394-08:00",
  "healthy": true,
  "status": "StatusOK",
  "status_time": "2024-01-18T17:27:12.571625-08:00",
  "components": {
    "exporter:otlphttp/staging": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571615-08:00"
    },
    "processor:batch": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571621-08:00"
    },
    "receiver:otlp": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571625-08:00"
    }
  }
}
```
Non-detailed Response Example
If the same request is made without the verbose flag, only the overall pipeline status will be returned. The component level statuses will be omitted.
```json
{
  "start_time": "2024-01-18T17:39:15.87324-08:00",
  "healthy": true,
  "status": "StatusOK",
  "status_time": "2024-01-18T17:39:15.874236-08:00"
}
```
The HTTP service optionally exposes an endpoint that provides the collector configuration. Note, the configuration returned is unfiltered and may contain sensitive information. As such, this endpoint is disabled by default. Enable it using the `http.config.enabled` setting. By default the path will be `/config`, but it can be changed using the `http.config.path` setting.
The health check extension provides an implementation of the grpc_health_v1 service. The service was chosen for compatibility with existing gRPC health checks; however, it does not provide the additional detail available with the HTTP service. Additionally, the gRPC service has a less nuanced view of the world, with only two reportable statuses: `HealthCheckResponse_SERVING` and `HealthCheckResponse_NOT_SERVING`.
The HTTP and gRPC services use the same method of component status aggregation to derive overall collector health and pipeline health from individual status events. The component statuses map to the following `HealthCheckResponse_ServingStatus`es.
Status | HealthCheckResponse_ServingStatus |
---|---|
Starting | NOT_SERVING |
OK | SERVING |
RecoverableError | SERVING [1] |
PermanentError | SERVING [2] |
FatalError | NOT_SERVING |
Stopping | NOT_SERVING |
Stopped | NOT_SERVING |

1. If `include_recoverable_errors: true`: SERVING when elapsed time < recovery duration; NOT_SERVING otherwise.
2. If `include_permanent_errors: true`: NOT_SERVING.
The gRPC service exposes two RPCs: `Check` and `Watch` (more about those below). Each takes a `HealthCheckRequest` argument. The `HealthCheckRequest` message is defined as:

```protobuf
message HealthCheckRequest {
  string service = 1;
}
```
To query for overall collector health, use the empty string `""` as the `service` name. To query for pipeline health, use the pipeline name as the `service`.
The `Check` RPC is defined as:

```protobuf
rpc Check(HealthCheckRequest) returns (HealthCheckResponse)
```
If the service is unknown, the RPC will return an error with status `NotFound`. Otherwise it will return a `HealthCheckResponse` with the serving status as mapped in the table above.
The `Watch` RPC is defined as:

```protobuf
rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse)
```

The `Watch` RPC will initiate a stream for the given `service`. If the service is known at the time the RPC is made, its current status will be sent, and changes in status will be sent thereafter. If the service is unknown, a response with a status of `HealthCheckResponse_SERVICE_UNKNOWN` will be sent. The stream will remain open, and if and when the service starts reporting, its status will begin streaming.
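To make the two RPCs concrete, here is a minimal Go client sketch using the standard `grpc_health_v1` package. It assumes the gRPC service is listening on `localhost:13132` without TLS, as in the sample configuration above, and that a pipeline named `traces/http` exists; both are assumptions for illustration.

```go
package main

import (
	"context"
	"io"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Assumes the extension's gRPC service is reachable at localhost:13132 without TLS.
	conn, err := grpc.Dial("localhost:13132", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := healthpb.NewHealthClient(conn)

	// Check: an empty service name queries overall collector health; a pipeline
	// name (e.g. "traces/http") queries that pipeline. Unknown services return NotFound.
	resp, err := client.Check(context.Background(), &healthpb.HealthCheckRequest{Service: ""})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("collector: %s", resp.Status) // SERVING or NOT_SERVING

	// Watch: streams the current status and subsequent changes for a service.
	// SERVICE_UNKNOWN is sent first if the service is not yet known.
	stream, err := client.Watch(context.Background(), &healthpb.HealthCheckRequest{Service: "traces/http"})
	if err != nil {
		log.Fatal(err)
	}
	for {
		update, err := stream.Recv()
		if err == io.EOF {
			return
		}
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("pipeline traces/http: %s", update.Status)
	}
}
```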
There are plans to provide the ability to export status events as OTLP logs adhering to the event semantic conventions.