This repository contains the Strimzi canary tool implementation. It acts as an indicator of whether Kafka clusters are operating correctly. This is achieved by creating a canary topic and periodically producing and consuming events on the topic and getting metrics out of these exchanges.
When running the Strimzi canary tool, it is possible to configure different aspects by using the following environment variables.
* `KAFKA_BOOTSTRAP_SERVERS`: comma-separated bootstrap servers of the Kafka cluster to connect to. Default `localhost:9092`.
* `KAFKA_BOOTSTRAP_BACKOFF_MAX_ATTEMPTS`: maximum number of attempts for connecting to the Kafka cluster if it is not ready yet. Default `10`.
* `KAFKA_BOOTSTRAP_BACKOFF_SCALE`: the scale used to delay between attempts to connect to the Kafka cluster (in ms). Default `5000`.
* `TOPIC`: the name of the topic used by the tool to send and receive messages. Default `__strimzi_canary`.
* `TOPIC_CONFIG`: topic configuration defined as a list of semicolon-separated `key=value` pairs (e.g. `retention.ms=600000;segment.bytes=16384`). Default empty.
* `RECONCILE_INTERVAL_MS`: how often the tool sends and receives messages (in ms). Default `30000`.
* `CLIENT_ID`: the client id used for configuring producer and consumer. Default `strimzi-canary-client`.
* `CONSUMER_GROUP_ID`: group id for the consumer group joined by the canary consumer. Default `strimzi-canary-group`.
* `PRODUCER_LATENCY_BUCKETS`: buckets of the histogram related to the producer latency metric (in ms). Default `100,200,400,800,1600`.
* `ENDTOEND_LATENCY_BUCKETS`: buckets of the histogram related to the end-to-end latency metric between producer and consumer (in ms). Default `100,200,400,800,1600`.
* `EXPECTED_CLUSTER_SIZE`: expected number of brokers in the Kafka cluster the canary connects to. This parameter prevents the tool from running repeated partition reassignments of the topic while the Kafka cluster is starting up and the brokers come up one by one. Default `-1`, which means "dynamic" reassignment as described above. When greater than 0, the canary waits until the Kafka cluster has the expected number of brokers running before creating the topic and assigning the partitions.
* `KAFKA_VERSION`: version of the Kafka cluster. Default `2.8.0`.
* `SARAMA_LOG_ENABLED`: enables the Sarama client logging. Default `false`.
* `VERBOSITY_LOG_LEVEL`: verbosity of the tool logging. Default `0`. Allowed values: 0 = INFO, 1 = DEBUG, 2 = TRACE.
* `TLS_ENABLED`: whether the canary uses TLS to connect to the Kafka cluster. Default `false`.
* `TLS_CA_CERT`: TLS CA certificate, in PEM format, to use to connect to the Kafka cluster. When this parameter is empty (default behaviour) and the TLS connection is enabled, the canary uses the system certificates trust store. When a TLS CA certificate is specified, it is added to the system certificates trust store.
* `TLS_CLIENT_CERT`: TLS client certificate, in PEM format, to use for enabling TLS client authentication against the Kafka cluster. Default empty.
* `TLS_CLIENT_KEY`: TLS client private key, in PEM format, to use for enabling TLS client authentication against the Kafka cluster. Default empty.
* `TLS_INSECURE_SKIP_VERIFY`: whether the underlying Sarama client skips verification of the server's certificate chain and host name. Default `false`.
* `SASL_MECHANISM`: mechanism to use for SASL authentication against the Kafka cluster. Supported values are `PLAIN`, `SCRAM-SHA-256` and `SCRAM-SHA-512`. Default empty.
* `SASL_USER`: username for SASL authentication against the Kafka cluster when one of `PLAIN`, `SCRAM-SHA-256` or `SCRAM-SHA-512` is used. Default empty.
* `SASL_PASSWORD`: password for SASL authentication against the Kafka cluster when one of `PLAIN`, `SCRAM-SHA-256` or `SCRAM-SHA-512` is used. Default empty.
* `CONNECTION_CHECK_INTERVAL_MS`: how often the tool checks the connection with brokers (in ms). Default `120000`.
* `CONNECTION_CHECK_LATENCY_BUCKETS`: buckets of the histogram related to the broker's connection latency metric (in ms). Default `100,200,400,800,1600`.
* `STATUS_CHECK_INTERVAL_MS`: how often (in ms) the tool updates internal status information (e.g. percentage of consumed messages) to expose on the corresponding HTTP endpoint. Default `30000`.
* `STATUS_TIME_WINDOW_MS`: the size of the sliding time window (in ms) in which status information is sampled. Default `300000`.
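To make the value formats above concrete, here is a minimal Python sketch that parses a `TOPIC_CONFIG` string and a latency buckets string. The helper names `parse_topic_config` and `parse_buckets` are hypothetical and not part of the canary; tolerating trailing semicolons and rejecting non-ascending buckets are assumptions of this sketch, not documented canary behaviour.

```python
from typing import Dict, List


def parse_topic_config(raw: str) -> Dict[str, str]:
    """Split 'k1=v1;k2=v2' into a dict; an empty string yields {}."""
    config: Dict[str, str] = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # assumption: empty input and trailing semicolons are ignored
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed TOPIC_CONFIG entry: {pair!r}")
        config[key.strip()] = value.strip()
    return config


def parse_buckets(raw: str) -> List[float]:
    """Split '100,200,400' into histogram bucket upper bounds in ms."""
    buckets = [float(item) for item in raw.split(",") if item.strip()]
    if buckets != sorted(buckets):
        # assumption: bounds are expected in ascending order
        raise ValueError(f"buckets must be ascending: {raw!r}")
    return buckets
```

Checking values this way before deploying can catch a malformed pair or a mistyped bucket list early.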
The canary exposes some HTTP endpoints, on port 8080, to provide information about status, health and metrics.
The `/liveness` and `/readiness` endpoints report whether the canary is live and ready by providing just an `OK` HTTP body.
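As an illustration, the probes could be checked from a script along these lines. This is a sketch under assumptions: the base URL `http://localhost:8080` supposes the canary is reachable locally, the 200 status code is assumed rather than documented here, and the function names `is_ok` and `probe` are hypothetical.

```python
import urllib.request

# Assumption: the canary is reachable locally on its documented port.
BASE_URL = "http://localhost:8080"


def is_ok(status_code: int, body: str) -> bool:
    """A probe passes when the status is 200 (assumed) and the body is just 'OK'."""
    return status_code == 200 and body.strip() == "OK"


def probe(path: str, base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Query /liveness or /readiness; connection failures and HTTP error statuses count as failed."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_ok(resp.status, body)
    except OSError:  # URLError and HTTPError both derive from OSError
        return False
```

For example, `probe("/readiness")` returns `True` only when the endpoint answers with the expected body.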
The `/metrics` endpoint provides useful metrics in Prometheus format.
The `/status` endpoint provides status information through a JSON object structured with different sections.
The `Consuming` field provides information about the `Percentage` of messages correctly consumed in a sliding `TimeWindow` (in ms), whose maximum size is configured via the `STATUS_TIME_WINDOW_MS` environment variable; until that size is reached, the `TimeWindow` field reports the current covered time window with gathered samples.
    {
      "Consuming": {
        "TimeWindow": 150000,
        "Percentage": 100
      }
    }
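A client could interpret this JSON roughly as follows. This is a minimal sketch: the function names and the `min_percentage` threshold are hypothetical choices, and `configured_window_ms` simply mirrors the documented `STATUS_TIME_WINDOW_MS` default.

```python
import json


def consuming_healthy(status_body: str, min_percentage: float = 100.0) -> bool:
    """True when the percentage of consumed messages meets the chosen threshold."""
    consuming = json.loads(status_body)["Consuming"]
    return consuming["Percentage"] >= min_percentage


def window_is_full(status_body: str, configured_window_ms: int = 300000) -> bool:
    """True once the reported TimeWindow has grown to the configured maximum size."""
    consuming = json.loads(status_body)["Consuming"]
    return consuming["TimeWindow"] >= configured_window_ms
```

With the example body above, `consuming_healthy` returns `True`, while `window_is_full` returns `False` because only 150000 ms of the default 300000 ms window is covered so far.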
In order to check how your Apache Kafka cluster is behaving, the Canary provides the following metrics on the corresponding HTTP endpoint.
| Name | Description |
|---|---|
| `client_creation_error_total` | Total number of errors while creating Sarama client |
| `expected_cluster_size_error_total` | Total number of errors while waiting the Kafka cluster having the expected size |
| `topic_creation_failed_total` | Total number of errors while creating the canary topic |
| `topic_describe_cluster_error_total` | Total number of errors while describing cluster |
| `topic_describe_error_total` | Total number of errors while getting canary topic metadata |
| `topic_alter_assignments_error_total` | Total number of errors while altering partitions assignments for the canary topic |
| `topic_alter_configuration_error_total` | Total number of errors while altering configuration for the canary topic |
| `records_produced_total` | The total number of records produced |
| `records_produced_failed_total` | The total number of records failed to produce |
| `producer_refresh_metadata_error_total` | Total number of errors while refreshing producer metadata |
| `records_produced_latency` | Records produced latency in milliseconds |
| `records_consumed_total` | The total number of records consumed |
| `consumer_error_total` | Total number of errors reported by the consumer |
| `consumer_timeout_join_group_total` | The total number of consumers not joining the group within the timeout |
| `records_consumed_latency` | Records end-to-end latency in milliseconds |
| `connection_error_total` | Total number of errors while checking the connection to Kafka brokers |
| `connection_latency` | Latency in milliseconds for established or failed connections |
The following is an example of the metrics output.
# HELP strimzi_canary_records_produced_total The total number of records produced
# TYPE strimzi_canary_records_produced_total counter
strimzi_canary_records_produced_total{clientid="strimzi-canary-client",partition="0"} 1
strimzi_canary_records_produced_total{clientid="strimzi-canary-client",partition="1"} 1
strimzi_canary_records_produced_total{clientid="strimzi-canary-client",partition="2"} 1
# HELP strimzi_canary_records_consumed_total The total number of records consumed
# TYPE strimzi_canary_records_consumed_total counter
strimzi_canary_records_consumed_total{clientid="strimzi-canary-client",partition="0"} 1
strimzi_canary_records_consumed_total{clientid="strimzi-canary-client",partition="1"} 1
strimzi_canary_records_consumed_total{clientid="strimzi-canary-client",partition="2"} 1
# HELP strimzi_canary_records_produced_latency Records produced latency in milliseconds
# TYPE strimzi_canary_records_produced_latency histogram
strimzi_canary_records_produced_latency_bucket{clientid="strimzi-canary-client",partition="0",le="50"} 0
strimzi_canary_records_produced_latency_bucket{clientid="strimzi-canary-client",partition="0",le="100"} 0
...
strimzi_canary_records_produced_latency_bucket{clientid="strimzi-canary-client",partition="0",le="+Inf"} 1
strimzi_canary_records_produced_latency_sum{clientid="strimzi-canary-client",partition="0"} 151
strimzi_canary_records_produced_latency_count{clientid="strimzi-canary-client",partition="0"} 1
strimzi_canary_records_produced_latency_bucket{clientid="strimzi-canary-client",partition="1",le="50"} 0
...
strimzi_canary_records_produced_latency_bucket{clientid="strimzi-canary-client",partition="1",le="+Inf"} 1
strimzi_canary_records_produced_latency_sum{clientid="strimzi-canary-client",partition="1"} 125
strimzi_canary_records_produced_latency_count{clientid="strimzi-canary-client",partition="1"} 1
strimzi_canary_records_produced_latency_bucket{clientid="strimzi-canary-client",partition="2",le="50"} 0
strimzi_canary_records_produced_latency_bucket{clientid="strimzi-canary-client",partition="2",le="100"} 0
...
strimzi_canary_records_produced_latency_bucket{clientid="strimzi-canary-client",partition="2",le="+Inf"} 1
strimzi_canary_records_produced_latency_sum{clientid="strimzi-canary-client",partition="2"} 263
strimzi_canary_records_produced_latency_count{clientid="strimzi-canary-client",partition="2"} 1
# HELP strimzi_canary_records_consumed_latency Records end-to-end latency in milliseconds
# TYPE strimzi_canary_records_consumed_latency histogram
strimzi_canary_records_consumed_latency_bucket{clientid="strimzi-canary-client",partition="0",le="100"} 0
strimzi_canary_records_consumed_latency_bucket{clientid="strimzi-canary-client",partition="0",le="200"} 1
...
strimzi_canary_records_consumed_latency_bucket{clientid="strimzi-canary-client",partition="0",le="+Inf"} 1
strimzi_canary_records_consumed_latency_sum{clientid="strimzi-canary-client",partition="0"} 161
strimzi_canary_records_consumed_latency_count{clientid="strimzi-canary-client",partition="0"} 1
strimzi_canary_records_consumed_latency_bucket{clientid="strimzi-canary-client",partition="1",le="100"} 0
strimzi_canary_records_consumed_latency_bucket{clientid="strimzi-canary-client",partition="1",le="200"} 1
...
strimzi_canary_records_consumed_latency_bucket{clientid="strimzi-canary-client",partition="1",le="+Inf"} 1
strimzi_canary_records_consumed_latency_sum{clientid="strimzi-canary-client",partition="1"} 133
strimzi_canary_records_consumed_latency_count{clientid="strimzi-canary-client",partition="1"} 1
strimzi_canary_records_consumed_latency_bucket{clientid="strimzi-canary-client",partition="2",le="100"} 0
strimzi_canary_records_consumed_latency_bucket{clientid="strimzi-canary-client",partition="2",le="200"} 0
...
strimzi_canary_records_consumed_latency_bucket{clientid="strimzi-canary-client",partition="2",le="+Inf"} 1
strimzi_canary_records_consumed_latency_sum{clientid="strimzi-canary-client",partition="2"} 266
strimzi_canary_records_consumed_latency_count{clientid="strimzi-canary-client",partition="2"} 1
# HELP strimzi_canary_connection_latency Latency in milliseconds for established or failed connections
# TYPE strimzi_canary_connection_latency histogram
strimzi_canary_connection_latency_bucket{brokerid="0",connected="true",le="100"} 1
strimzi_canary_connection_latency_bucket{brokerid="0",connected="true",le="200"} 1
...
strimzi_canary_connection_latency_bucket{brokerid="0",connected="true",le="+Inf"} 1
strimzi_canary_connection_latency_sum{brokerid="0",connected="true"} 23
strimzi_canary_connection_latency_count{brokerid="0",connected="true"} 1
strimzi_canary_connection_latency_bucket{brokerid="1",connected="true",le="100"} 1
strimzi_canary_connection_latency_bucket{brokerid="1",connected="true",le="200"} 1
...
strimzi_canary_connection_latency_bucket{brokerid="1",connected="true",le="+Inf"} 1
strimzi_canary_connection_latency_sum{brokerid="1",connected="true"} 8
strimzi_canary_connection_latency_count{brokerid="1",connected="true"} 1
strimzi_canary_connection_latency_bucket{brokerid="2",connected="true",le="100"} 1
strimzi_canary_connection_latency_bucket{brokerid="2",connected="true",le="200"} 1
...
strimzi_canary_connection_latency_bucket{brokerid="2",connected="true",le="+Inf"} 1
strimzi_canary_connection_latency_sum{brokerid="2",connected="true"} 6
strimzi_canary_connection_latency_count{brokerid="2",connected="true"} 1
# HELP strimzi_canary_client_creation_error_total Total number of errors while creating Sarama client
# TYPE strimzi_canary_client_creation_error_total counter
strimzi_canary_client_creation_error_total 4
# HELP strimzi_canary_connection_error_total Total number of errors while checking the connection to Kafka brokers
# TYPE strimzi_canary_connection_error_total counter
strimzi_canary_connection_error_total{brokerid="1",connected="false"} 1
strimzi_canary_connection_error_total{brokerid="2",connected="false"} 1
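To show how the histogram series in this output relate to each other, the sketch below derives a mean latency per partition by dividing each `_sum` sample by the matching `_count` sample. It is a deliberately simplified parser written for illustration: it skips comment lines and the `...` elisions, and it does not handle escaped quotes in label values or trailing timestamps. The function names are hypothetical; a real setup would more likely let Prometheus scrape the endpoint.

```python
import re
from collections import defaultdict

# name, optional {labels}, then a single numeric value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) per sample line; skip comments and unparsable lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # e.g. the "..." elisions in the example output
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, value


def mean_latency_ms(text, metric="strimzi_canary_records_consumed_latency", label="partition"):
    """Return {label value: sum / count} for the given histogram metric."""
    sums, counts = defaultdict(float), defaultdict(float)
    for name, labels, value in parse_samples(text):
        key = labels.get(label)
        if key is None:
            continue
        if name == metric + "_sum":
            sums[key] += value
        elif name == metric + "_count":
            counts[key] += value
    return {key: sums[key] / counts[key] for key in counts if counts[key] > 0}
```

Passing `label="brokerid"` and `metric="strimzi_canary_connection_latency"` applies the same calculation to the connection latency histogram.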
If you encounter any issues while using Strimzi, you can get help using:
You can contribute by raising any issues you find and/or fixing issues by opening Pull Requests. All bugs, tasks or enhancements are tracked as GitHub issues.
The development documentation describes how to build, test and release Strimzi Canary.
Strimzi is licensed under the Apache License, Version 2.0