Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update runbooks to mention possibility to investigate memberlist KV store in various alerts #2158

Merged
merged 6 commits into from
Jun 24, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -102,6 +102,7 @@
* [ENHANCEMENT] Clarify "Set rule group" API specification. #1869
* [ENHANCEMENT] Published Mimir jsonnet documentation. #2024
* [ENHANCEMENT] Documented required scrape interval for using alerting and recording rules from Mimir jsonnet. #2147
* [ENHANCEMENT] Runbooks: Mention memberlist as possible source of problems for various alerts. #2158
* [ENHANCEMENT] Documented how to configure queriers’ autoscaling with Jsonnet. #2128
* [BUGFIX] Fixed ruler configuration used in the getting started guide. #2052
* [BUGFIX] Fixed Mimir Alertmanager datasource in Grafana used by "Play with Grafana Mimir" tutorial. #2115
Expand Down
19 changes: 13 additions & 6 deletions docs/sources/operators-guide/mimir-runbooks/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -265,6 +265,7 @@ How to **investigate**:
- If the failing service is going OOM (`OOMKilled`): scale up or increase the memory
- If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there
- If crashing service is query-frontend, querier or store-gateway, and you have "activity tracker" feature enabled, look for `found unfinished activities from previous run` message and subsequent `activity` messages in the log file to see which queries caused the crash.
- When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.

#### Alertmanager

Expand Down Expand Up @@ -296,6 +297,8 @@ More information:

This alert occurs when a ruler is unable to validate whether or not it should claim ownership over the evaluation of a rule group. The most likely cause is that one of the rule ring entries is unhealthy. If this is the case proceed to the ring admin http page and forget the unhealth ruler. The other possible cause would be an error returned the ring client. If this is the case look into debugging the ring based on the in-use backend implementation.

When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.

### MimirRulerTooManyFailedPushes

This alert fires when rulers cannot push new samples (result of rule evaluation) to ingesters.
Expand All @@ -306,6 +309,7 @@ This alert fires only for first kind of problems, and not for problems caused by
How to **fix** it:

- Investigate the ruler logs to find out the reason why ruler cannot write samples. Note that ruler logs all push errors, including "user errors", but those are not causing the alert to fire. Focus on problems with ingesters.
- When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.

### MimirRulerTooManyFailedQueries

Expand All @@ -319,6 +323,7 @@ How to **fix** it:

- Investigate the ruler logs to find out the reason why ruler cannot evaluate queries. Note that ruler logs rule evaluation errors even for "user errors", but those are not causing the alert to fire. Focus on problems with ingesters or store-gateways.
- In case remote operational mode is enabled the problem could be at any of the ruler query path components (ruler-query-frontend, ruler-query-scheduler and ruler-querier). Check the `Mimir / Remote ruler reads` and `Mimir / Remote ruler reads resources` dashboards to find out in which Mimir service the error is being originated.
- When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.

### MimirRulerMissedEvaluations

Expand Down Expand Up @@ -761,12 +766,12 @@ This alert fires when any instance does not register all other instances as memb

How it **works**:

- This alert applies when memberlist is used for the ring backing store.
- This alert applies when memberlist is used as KV store for hash rings.
- All Mimir instances using the ring, regardless of type, join a single memberlist cluster.
- Each instance (=memberlist cluster member) should be able to see all others.
- Each instance (ie. memberlist cluster member) should see all memberlist cluster members.
- Therefore the following should be equal for every instance:
- The reported number of cluster members (`memberlist_client_cluster_members_count`)
- The total number of currently responsive instances.
- The total number of currently responsive instances that use memberlist KV store for hash ring.

How to **investigate**:

Expand All @@ -783,7 +788,7 @@ How to **investigate**:
- `memberlist_tcp_transport_packets_sent_errors_total`
- `memberlist_tcp_transport_packets_received_errors_total`
- These errors (and others) can be found by searching for messages prefixed with `TCPTransport:`.
- Logs coming directly from memberlist are also logged by Mimir; they may indicate where to investigate further. These can be identified as such due to being tagged with `caller=memberlist_logger.go:xyz`.
- Logs coming directly from memberlist are also logged by Mimir; they may indicate where to investigate further. These can be identified as such due to being tagged with `caller=memberlist_logger.go:<line>`.

### EtcdAllocatingTooMuchMemory

Expand Down Expand Up @@ -831,11 +836,12 @@ This alert is fired when the multi-tenant alertmanager has been unable to check

When the alertmanager loads its configuration on start up, when it polls for config changes or when there is a ring change it must check the ring to see if the tenant is still owned on this shard. To prevent one error from causing the loading of all configurations to fail we assume that on error the tenant is NOT owned for this shard. If checking the ring continues to fail then some tenants might not be assigned an alertmanager and might not be able to receive notifications for their alerts.

The metric for this alert is cortex_alertmanager_ring_check_errors_total.
The metric for this alert is `cortex_alertmanager_ring_check_errors_total`.

How to **investigate**:

Look at the error message that is logged and attempt to understand what is causing the failure. In most cases the error will be encountered when attempting to read from the ring, which can fail if there is an issue with in-use backend implementation.
- Look at the error message that is logged and attempt to understand what is causing the failure. In most cases the error will be encountered when attempting to read from the ring, which can fail if there is an issue with in-use backend implementation.
- When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.

### MimirAlertmanagerPartialStateMergeFailing

Expand Down Expand Up @@ -920,6 +926,7 @@ How to **investigate**:
### MimirKVStoreFailure

This alert fires if a Mimir instance is failing to run any operation on a KV store (eg. consul or etcd).
When using Memberlist as KV store for hash rings, all read and update operations work on a local copy of the hash ring, and will never fail and raise this alert.

How it **works**:

Expand Down