grafana · pstibrany · Jun 24, 2022 · Jun 21, 2022 · Jun 21, 2022 · Jun 22, 2022
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -102,6 +102,7 @@
 * [ENHANCEMENT] Clarify "Set rule group" API specification. #1869
 * [ENHANCEMENT] Published Mimir jsonnet documentation. #2024
 * [ENHANCEMENT] Documented required scrape interval for using alerting and recording rules from Mimir jsonnet. #2147
+* [ENHANCEMENT] Runbooks: Mention memberlist as possible source of problems for various alerts. #2158
 * [ENHANCEMENT] Documented how to configure queriers’ autoscaling with Jsonnet. #2128
 * [BUGFIX] Fixed ruler configuration used in the getting started guide. #2052
 * [BUGFIX] Fixed Mimir Alertmanager datasource in Grafana used by "Play with Grafana Mimir" tutorial. #2115

@@ -265,6 +265,7 @@ How to **investigate**:
 - If the failing service is going OOM (`OOMKilled`): scale up or increase the memory
 - If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there
   - If crashing service is query-frontend, querier or store-gateway, and you have "activity tracker" feature enabled, look for `found unfinished activities from previous run` message and subsequent `activity` messages in the log file to see which queries caused the crash.
+- When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.
 
 #### Alertmanager
 
@@ -296,6 +297,8 @@ More information:
 
 This alert occurs when a ruler is unable to validate whether or not it should claim ownership over the evaluation of a rule group. The most likely cause is that one of the rule ring entries is unhealthy. If this is the case, proceed to the ring admin HTTP page and forget the unhealthy ruler. The other possible cause would be an error returned by the ring client. If this is the case, look into debugging the ring based on the in-use backend implementation.
 
+When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.
+
 ### MimirRulerTooManyFailedPushes
 
 This alert fires when rulers cannot push new samples (result of rule evaluation) to ingesters.
@@ -306,6 +309,7 @@ This alert fires only for first kind of problems, and not for problems caused by
 How to **fix** it:
 
 - Investigate the ruler logs to find out the reason why ruler cannot write samples. Note that ruler logs all push errors, including "user errors", but those are not causing the alert to fire. Focus on problems with ingesters.
+- When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.
 
 ### MimirRulerTooManyFailedQueries
 
@@ -319,6 +323,7 @@ How to **fix** it:
 
 - Investigate the ruler logs to find out the reason why ruler cannot evaluate queries. Note that ruler logs rule evaluation errors even for "user errors", but those are not causing the alert to fire. Focus on problems with ingesters or store-gateways.
 - In case remote operational mode is enabled the problem could be at any of the ruler query path components (ruler-query-frontend, ruler-query-scheduler and ruler-querier). Check the `Mimir / Remote ruler reads` and `Mimir / Remote ruler reads resources` dashboards to find out in which Mimir service the error is being originated.
+- When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.
 
 ### MimirRulerMissedEvaluations
 
@@ -761,12 +766,12 @@ This alert fires when any instance does not register all other instances as memb
 
 How it **works**:
 
-- This alert applies when memberlist is used for the ring backing store.
+- This alert applies when memberlist is used as KV store for hash rings.
 - All Mimir instances using the ring, regardless of type, join a single memberlist cluster.
-- Each instance (=memberlist cluster member) should be able to see all others.
+- Each instance (i.e. memberlist cluster member) should see all memberlist cluster members.
 - Therefore the following should be equal for every instance:
   - The reported number of cluster members (`memberlist_client_cluster_members_count`)
-  - The total number of currently responsive instances.
+  - The total number of currently responsive instances that use memberlist KV store for hash ring.
 
 How to **investigate**:
 
@@ -783,7 +788,7 @@ How to **investigate**:
   - `memberlist_tcp_transport_packets_sent_errors_total`
   - `memberlist_tcp_transport_packets_received_errors_total`
   - These errors (and others) can be found by searching for messages prefixed with `TCPTransport:`.
-- Logs coming directly from memberlist are also logged by Mimir; they may indicate where to investigate further. These can be identified as such due to being tagged with `caller=memberlist_logger.go:xyz`.
+- Logs coming directly from memberlist are also logged by Mimir; they may indicate where to investigate further. These can be identified as such due to being tagged with `caller=memberlist_logger.go:<line>`.
 
 ### EtcdAllocatingTooMuchMemory
 
@@ -831,11 +836,12 @@ This alert is fired when the multi-tenant alertmanager has been unable to check
 
 When the alertmanager loads its configuration on start up, when it polls for config changes or when there is a ring change it must check the ring to see if the tenant is still owned on this shard. To prevent one error from causing the loading of all configurations to fail we assume that on error the tenant is NOT owned for this shard. If checking the ring continues to fail then some tenants might not be assigned an alertmanager and might not be able to receive notifications for their alerts.
 
-The metric for this alert is cortex_alertmanager_ring_check_errors_total.
+The metric for this alert is `cortex_alertmanager_ring_check_errors_total`.
 
 How to **investigate**:
 
-Look at the error message that is logged and attempt to understand what is causing the failure. In most cases the error will be encountered when attempting to read from the ring, which can fail if there is an issue with in-use backend implementation.
+- Look at the error message that is logged and attempt to understand what is causing the failure. In most cases the error will be encountered when attempting to read from the ring, which can fail if there is an issue with in-use backend implementation.
+- When using Memberlist as KV store for hash rings, ensure that Memberlist is working correctly. See instructions for [`MimirGossipMembersMismatch`](#MimirGossipMembersMismatch) alert.
 
 ### MimirAlertmanagerPartialStateMergeFailing
 
@@ -920,6 +926,7 @@ How to **investigate**:
 ### MimirKVStoreFailure
 
 This alert fires if a Mimir instance is failing to run any operation on a KV store (eg. consul or etcd).
+When using Memberlist as KV store for hash rings, all read and update operations work on a local copy of the hash ring, so they never fail and never raise this alert.
 
 How it **works**: