Merge pull request grafana/cortex-jsonnet#406 from grafana/alert-on-consul-failures

Added CortexFailingToTalkToConsul alert
pracucci authored Oct 14, 2021
2 parents fd975db + d02bb6b commit 859efc9
Showing 2 changed files with 35 additions and 0 deletions.
21 changes: 21 additions & 0 deletions jsonnet/mimir-mixin/alerts/alerts.libsonnet
@@ -235,6 +235,27 @@
    |||,
  },
},
{
  alert: 'CortexKVStoreFailure',
  expr: |||
    (
      sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
      /
      sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
    )
    # We want to get alerted only in case there's a constant failure.
    == 1
  ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
  'for': '5m',
  labels: {
    severity: 'warning',
  },
  annotations: {
    message: |||
      Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to the KV store {{ $labels.kv_name }}.
    ||| % $._config,
  },
},
{
  alert: 'CortexMemoryMapAreasTooHigh',
  expr: |||
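The alert above builds its PromQL expression and message with Jsonnet's `%` operator, which follows Python-style string formatting. As a rough illustration of how the placeholders expand, here is a minimal Python sketch. The `CONFIG` values are hypothetical stand-ins for the mixin's `$._config` and are not taken from this commit; `render_expr` and `render_message` are illustrative helper names.

```python
# Hypothetical config values; the real ones live in the mixin's $._config.
CONFIG = {
    "alert_aggregation_labels": "cluster, namespace",
    "alert_aggregation_variables": "{{ $labels.cluster }}/{{ $labels.namespace }}",
}

# Same text as the alert's `expr` block, with two positional %s placeholders.
EXPR_TEMPLATE = """\
(
  sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
  /
  sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
)
# We want to get alerted only in case there's a constant failure.
== 1
"""

# Same text as the alert's `message` annotation, with one named placeholder.
MESSAGE_TEMPLATE = (
    "Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s "
    "is failing to talk to the KV store {{ $labels.kv_name }}."
)


def render_expr(config):
    """Fill both positional placeholders with the aggregation labels."""
    labels = config["alert_aggregation_labels"]
    return EXPR_TEMPLATE % (labels, labels)


def render_message(config):
    """Fill the named placeholder from the config mapping."""
    return MESSAGE_TEMPLATE % config


if __name__ == "__main__":
    print(render_expr(CONFIG))
    print(render_message(CONFIG))
```

The `{{ $labels.* }}` parts are left untouched by the formatting step; they are expanded later by Prometheus when the alert is rendered.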
14 changes: 14 additions & 0 deletions jsonnet/mimir-mixin/docs/playbooks.md
@@ -734,6 +734,20 @@ How to **investigate**:
- Ensure no pod is `NotReady` (the number of ready containers should match the total number of containers, e.g. `1/1` or `2/2`)
- Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information
### CortexKVStoreFailure

This alert fires if a Cortex instance is failing to run any operation on a KV store (e.g. Consul or etcd).

How it **works**:
- Consul is typically used to store the hash ring state.
- Etcd is typically used by the HA tracker (distributor) to deduplicate samples.
- If an instance is failing operations on the **hash ring**, either the instance can't update its heartbeat in the ring or it is failing to receive ring updates.
- If an instance is failing operations on the **HA tracker** backend, either the instance can't update the authoritative replica or it is failing to receive updates.

How to **investigate**:
- Ensure Consul/Etcd is up and running.
- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul/Etcd.

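
To make the "constant failure" condition concrete: the alert rule compares a failure ratio against `1` and sets `for: 5m`, so the condition must hold without interruption for five minutes before the alert fires. The following Python sketch is a simplified model of that behavior, not Prometheus code. The `alert_fires` function and its input format (timestamp in seconds, ratio or `None` when no sample exists) are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple


def alert_fires(
    samples: Iterable[Tuple[float, Optional[float]]],
    hold_seconds: float = 300.0,
) -> bool:
    """Return True if the ratio stays at 1 for at least `hold_seconds`.

    `samples` is a time-ordered series of (timestamp_seconds, failure_ratio)
    pairs, one per rule evaluation. A ratio of None models an evaluation
    where the expression returned no sample.
    """
    active_since = None
    for ts, ratio in samples:
        if ratio is not None and ratio == 1.0:
            if active_since is None:
                active_since = ts  # condition just became true (pending)
            if ts - active_since >= hold_seconds:
                return True  # held long enough: the alert fires
        else:
            active_since = None  # any interruption resets the pending state
    return False


# One evaluation per minute, from t=0s to t=300s.
constant_failure = [(t * 60.0, 1.0) for t in range(6)]
# Same series, but one evaluation sees some successful requests.
intermittent = [(t * 60.0, 0.5 if t == 3 else 1.0) for t in range(6)]
```

In this model `alert_fires(constant_failure)` is true, while `alert_fires(intermittent)` is false because the single healthy evaluation resets the pending period.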
## Cortex routes by path
**Write path**: