Commit
Signed-off-by: Kemal Akkoyun <[email protected]>
Showing 4 changed files with 207 additions and 49 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,180 @@ | ||
[//]: # "TODO(kakkoyun): Generate this file using embedmd." | ||
|
||
# Alerts | ||
|
||
Here are some example alerts configured for a Kubernetes environment. | ||
|
||
## Compaction | ||
|
||
```yaml | ||
- alert: ThanosCompactHalted | ||
  expr: thanos_compactor_halted{app="thanos-compact"} == 1 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos compaction has failed to run and now is halted | ||
    impact: Long term storage queries will be slower | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: COMPACTION_URL | ||
- alert: ThanosCompactCompactionsFailed | ||
  expr: rate(prometheus_tsdb_compactions_failed_total{app="thanos-compact"}[5m]) > 0 | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Compact is failing compaction | ||
    impact: Long term storage queries will be slower | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: COMPACTION_URL | ||
- alert: ThanosCompactBucketOperationsFailed | ||
  expr: rate(thanos_objstore_bucket_operation_failures_total{app="thanos-compact"}[5m]) > 0 | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Compact bucket operations are failing | ||
    impact: Long term storage queries will be slower | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: COMPACTION_URL | ||
- alert: ThanosCompactNotRunIn24Hours | ||
  expr: (time() - max(thanos_objstore_bucket_last_successful_upload_time{app="thanos-compact"})) / 60 / 60 > 24 | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Compaction has not been run in 24 hours | ||
    impact: Long term storage queries will be slower | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: COMPACTION_URL | ||
- alert: ThanosCompactionIsNotRunning | ||
  expr: up{app="thanos-compact"} == 0 or absent(up{app="thanos-compact"}) | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Compaction is not running | ||
    impact: Long term storage queries will be slower | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: COMPACTION_URL | ||
- alert: ThanosCompactionMultipleCompactionsAreRunning | ||
  expr: sum(up{app="thanos-compact"}) > 1 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Multiple replicas of Thanos compaction shouldn't be running. | ||
    impact: Metrics in long term storage may be corrupted | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: COMPACTION_URL | ||
|
||
``` | ||
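
The snippets in this document are bare lists of rules. To be loaded by Prometheus (or Thanos Rule), a rule file must wrap them in a named group. A minimal sketch of that wrapper follows, reusing the first alert from above; the group name `thanos-compact.rules` is a placeholder chosen for illustration, not something defined by this commit.

```yaml
groups:
  - name: thanos-compact.rules  # hypothetical group name, pick your own
    rules:
      # Paste the alert entries from the block above under `rules:`.
      - alert: ThanosCompactHalted
        expr: thanos_compactor_halted{app="thanos-compact"} == 1
        for: 5m
        labels:
          team: TEAM
        annotations:
          summary: Thanos compaction has failed to run and now is halted
          impact: Long term storage queries will be slower
```

The same wrapper applies to the Ruler, Store Gateway, Sidecar and Query snippets; one group per component keeps the files easy to navigate.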
|
||
## Ruler | ||
|
||
For Thanos Rule, we run some alerts in the local Prometheus to make sure that Thanos Rule itself is working: | ||
|
||
```yaml | ||
- alert: ThanosRuleIsDown | ||
  expr: up{app="thanos-rule"} == 0 or absent(up{app="thanos-rule"}) | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Rule is down | ||
    impact: Alerts are not working | ||
    action: 'Check {{ $labels.kubernetes_pod_name }} pod in {{ $labels.kubernetes_namespace }} namespace' | ||
    dashboard: RULE_DASHBOARD | ||
- alert: ThanosRuleIsDroppingAlerts | ||
  expr: rate(thanos_alert_queue_alerts_dropped_total{app="thanos-rule"}[5m]) > 0 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Rule is dropping alerts | ||
    impact: Alerts are not working | ||
    action: 'Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace' | ||
    dashboard: RULE_DASHBOARD | ||
- alert: ThanosRuleGrpcErrorRate | ||
  expr: rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable",app="thanos-rule"}[5m]) > 0 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Rule is returning Internal/Unavailable errors | ||
    impact: Recording Rules are not working | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: RULE_DASHBOARD | ||
``` | ||
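
Because these alerts watch Thanos Rule itself, they have to be evaluated by the local Prometheus rather than by Thanos Rule. A minimal sketch of the Prometheus side is shown here, assuming the rules are saved in a file of their own; the path is a placeholder for illustration only.

```yaml
# prometheus.yml (fragment)
rule_files:
  # Hypothetical path: point this at wherever the Thanos Rule alerts are mounted.
  - /etc/prometheus/rules/thanos-rule.rules.yaml
```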
## Store Gateway | ||
```yaml | ||
- alert: ThanosStoreGrpcErrorRate | ||
  expr: rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable",app="thanos-store"}[5m]) > 0 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Store is returning Internal/Unavailable errors | ||
    impact: Long Term Storage Prometheus queries are failing | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: GATEWAY_URL | ||
- alert: ThanosStoreBucketOperationsFailed | ||
  expr: rate(thanos_objstore_bucket_operation_failures_total{app="thanos-store"}[5m]) > 0 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Store is failing to do bucket operations | ||
    impact: Long Term Storage Prometheus queries are failing | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: GATEWAY_URL | ||
``` | ||
## Sidecar | ||
```yaml | ||
- alert: ThanosSidecarPrometheusDown | ||
  expr: thanos_sidecar_prometheus_up{name="prometheus"} == 0 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Sidecar cannot connect to Prometheus | ||
    impact: Prometheus configuration is not being refreshed | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: SIDECAR_URL | ||
- alert: ThanosSidecarBucketOperationsFailed | ||
  expr: rate(thanos_objstore_bucket_operation_failures_total{name="prometheus"}[5m]) > 0 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Sidecar bucket operations are failing | ||
    impact: We will lose metrics data if not fixed in 24h | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: SIDECAR_URL | ||
- alert: ThanosSidecarGrpcErrorRate | ||
  expr: rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable",name="prometheus"}[5m]) > 0 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Sidecar is returning Internal/Unavailable errors | ||
    impact: Prometheus queries are failing | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: SIDECAR_URL | ||
``` | ||
## Query | ||
```yaml | ||
- alert: ThanosQueryGrpcErrorRate | ||
  expr: rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable",name="prometheus"}[5m]) > 0 | ||
  for: 5m | ||
  labels: | ||
    team: TEAM | ||
  annotations: | ||
    summary: Thanos Query is returning Internal/Unavailable errors | ||
    impact: Grafana is not showing metrics | ||
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace }} namespace | ||
    dashboard: QUERY_URL | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
[//]: # "TODO(kakkoyun): Improve documentation." | ||
|
||
# Dashboards | ||
|
||
There are Grafana dashboards for each component (not all of them are complete), targeted at environments running Kubernetes: | ||
|
||
- [Thanos Overview](thanos-overview.json) | ||
- [Thanos Compact](thanos-compact.json) | ||
- [Thanos Query](thanos-querier.json) | ||
- [Thanos Store](thanos-store.json) | ||
- [Thanos Receive](thanos-receive.json) | ||
- [Thanos Sidecar](thanos-sidecar.json) | ||
- [Thanos Rule](thanos-rule.json) | ||
|
||
You can import them via `Import -> Paste JSON` in Grafana. | ||
These dashboards require Grafana 5 or above; importing them into older versions is known not to work. | ||
|
||
## Configuration | ||
|
||
All dashboards are generated using [`thanos-mixin`](../../jsonnet/thanos-mixin) and can be configured by editing the [jsonnet configuration file](../../jsonnet/thanos-mixin/config.libsonnet), which is used to pinpoint Thanos components. |
This file was deleted.