[Stack Monitoring] Diagnostic query docs (#127572)

elastic · Mar 14, 2022 · f071726
1 parent 5410626
commit f071726
Showing 4 changed files with 121 additions and 3 deletions.
diff --git a/x-pack/plugins/monitoring/dev_docs/how_to/cloud_setup.md b/x-pack/plugins/monitoring/dev_docs/how_to/cloud_setup.md
@@ -38,6 +38,7 @@ elasticsearch.hosts: ${ELASTICSEARCH_ENDPOINT}
 elasticsearch.username: kibana_dev
 elasticsearch.password: ${ELASTIC_PASSWORD}
 elasticsearch.ignoreVersionMismatch: true
+monitoring.ui.container.elasticsearch.enabled: true
 YAML
 ```
 
@@ -47,4 +48,4 @@ And start kibana with that config:
 yarn start --config config/kibana.cloud.yml
 ```
 
-Note that your local kibana will run data migrations and probably render the cloud created kibana unusable after your local kibana starts up.
+Note that your local kibana will run data migrations and probably render the cloud created kibana unusable after your local kibana starts up.
diff --git a/x-pack/plugins/monitoring/dev_docs/runbook/cpu_metrics.md b/x-pack/plugins/monitoring/dev_docs/runbook/cpu_metrics.md
@@ -0,0 +1,21 @@
+CPU Utilization is a metric that seems like a simple question: How hard are my CPUs working?
+
+But the way CPU resources get managed can get interesting, especially when [cgroups](https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt) and [CFS](https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html) are used.
+
+When trying to debug why a CPU metric doesn't look the way you expect it to in a Stack Monitoring graph, this information may be helpful.
+
+At the time of writing, the code path to get from a system level CPU metric to a utilization percentage looks like this:
+
+1. `node_cpu_metric` set to `node_cgroup_quota_as_cpu_utilization` when cgroup is enabled: [node_detail.js](/x-pack/plugins/monitoring/server/routes/api/v1/elasticsearch/node_detail.js#L61-65)
+1. `node_cgroup_quota_as_cpu_utilization` defined as a `QuotaMetric` against `cpu.cfs_quota_micros`: [metrics.ts](/x-pack/plugins/monitoring/server/lib/metrics/elasticsearch/metrics.ts#L798-801)
+1. `QuotaMetric` tries to produce a ratio of usage to quota, but returns null when quota isn't a positive number: [quota_metric.ts](/x-pack/plugins/monitoring/server/lib/metrics/classes/quota_metric.ts#L79-80)
+
+So it's important to be aware of the `monitoring.ui.container.elasticsearch.enabled` setting, which defaults to `true` on cloud.elastic.co.
+
+Some values of `cfs_quota_micros` could produce unexpected results. For example, if cgroups are enabled but no quota is set, you'll get an "N/A" in the Stack Monitoring UI since Elasticsearch can't directly see how much of the CPU it's using.
+
+You can confirm a point-in-time value of `cfs_quota_micros` for Elasticsearch by using the [node stats API](https://www.elastic.co/guide/en/elasticsearch/reference/master/cluster-nodes-stats.html).
+
+The CPU available on Elastic Cloud is based on the memory size of the instance, and smaller instance sizes get an additional boost via direct adjustments to the `cfs_quota_us` cgroup setting.
+
+For self-hosted deployments, the cgroup configuration will likely need to be checked via `docker inspect`.
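
The `QuotaMetric` behaviour described in `cpu_metrics.md` (a ratio of usage to quota, or null when the quota isn't a positive number) can be sketched roughly as follows. This is a minimal illustration, not the Kibana implementation: the function name `quota_utilization_pct`, its parameters, and the assumption that usage is measured in nanoseconds per CFS period while the quota is in microseconds are all hypothetical choices made for the example.

```python
from typing import Optional


def quota_utilization_pct(
    usage_delta_nanos: Optional[float],
    periods_delta: Optional[float],
    quota_micros: Optional[float],
) -> Optional[float]:
    """Return CPU usage as a percentage of the cgroup quota, or None.

    Hypothetical sketch: all names and units here are assumptions.
    """
    # A missing, zero, or negative quota cannot be used as a denominator.
    # With cgroup v1, a cfs_quota_us of -1 means "no limit", so this is the
    # case that would surface as "N/A" in the UI.
    if quota_micros is None or quota_micros <= 0:
        return None
    # Without usage or elapsed periods there is nothing to compute either.
    if not usage_delta_nanos or not periods_delta:
        return None
    # CPU time allowed over the elapsed periods, converted from micros to nanos.
    allowed_nanos = periods_delta * quota_micros * 1000
    return usage_delta_nanos / allowed_nanos * 100
```

For instance, 50 ms of CPU time used during one period with a 100 ms quota yields 50 percent, while a quota of `-1` yields no value at all rather than a misleading number.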
diff --git a/x-pack/plugins/monitoring/dev_docs/runbook/diagnostic_queries.md b/x-pack/plugins/monitoring/dev_docs/runbook/diagnostic_queries.md
@@ -0,0 +1,96 @@
+If the stack monitoring UI isn't showing data for any cluster, it may first be useful to survey the available data using a query like this:
+
+```Kibana Dev Tools
+POST .monitoring-*/_search
+{
+  "size": 0,
+  "query": {
+    "range": {
+      "timestamp": {
+        "gte": "now-1h",
+        "lte": "now"
+      }
+    }
+  },
+  "aggs": {
+    "clusters": {
+      "terms": {
+        "field": "cluster_uuid",
+        "size": 1000
+      },
+      "aggs": {
+        "indices": {
+          "terms": {
+            "field": "_index",
+            "size": 1000
+          },
+          "aggs": {
+            "documentTypes": {
+              "terms": {
+                "field": "type",
+                "size": 1000
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+This will show what document types are available in each index for each cluster UUID in the last hour.
+
+The main cluster list requires ES cluster stats to be available. You can use this query to check for the presence of cluster stats for a given cluster (replace the `<CLUSTER UUID>` placeholder in the query with that cluster's UUID).
+
+```Kibana Dev Tools
+POST .monitoring-*,*:.monitoring-*,metrics-*,*:metrics-*/_search
+{
+  "size": 10,
+  "query": {
+    "bool": {
+      "filter": [
+        {
+          "bool": {
+            "should": [
+              {
+                "term": {
+                  "type": "cluster_stats"
+                }
+              },
+              {
+                "term": {
+                  "metricset.name": "cluster_stats"
+                }
+              }
+            ]
+          }
+        },
+        {
+          "term": {
+            "cluster_uuid": "<CLUSTER UUID>"
+          }
+        },
+        {
+          "range": {
+            "timestamp": {
+              "format": "epoch_millis",
+              "gte": "now-7d",
+              "lte": "now"
+            }
+          }
+        }
+      ]
+    }
+  },
+  "collapse": {
+    "field": "cluster_uuid"
+  },
+  "sort": {
+    "timestamp": {
+      "order": "desc",
+      "unmapped_type": "long"
+    }
+  }
+}
+```
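
The nested `terms` aggregations in the first survey query of `diagnostic_queries.md` return buckets three levels deep, which can be tedious to read by eye. The sketch below flattens such a response into one row per cluster, index, and document type. It assumes the response has already been parsed into a Python dict and uses the aggregation names from that query (`clusters`, `indices`, `documentTypes`); the function name `summarize_survey` is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def summarize_survey(response: Dict[str, Any]) -> List[Tuple[str, str, str, int]]:
    """Flatten the survey aggregation into (cluster_uuid, index, type, doc_count) rows."""
    rows: List[Tuple[str, str, str, int]] = []
    clusters = response.get("aggregations", {}).get("clusters", {}).get("buckets", [])
    for cluster in clusters:
        for index in cluster.get("indices", {}).get("buckets", []):
            for doc_type in index.get("documentTypes", {}).get("buckets", []):
                rows.append(
                    (cluster["key"], index["key"], doc_type["key"], doc_type["doc_count"])
                )
    return rows


# Hand-written sample in the shape of a terms-aggregation response; the
# UUID, index name, and counts are made up for illustration.
sample = {
    "aggregations": {
        "clusters": {
            "buckets": [
                {
                    "key": "example-uuid",
                    "doc_count": 3,
                    "indices": {
                        "buckets": [
                            {
                                "key": ".monitoring-es-example",
                                "doc_count": 3,
                                "documentTypes": {
                                    "buckets": [
                                        {"key": "cluster_stats", "doc_count": 1},
                                        {"key": "node_stats", "doc_count": 2},
                                    ]
                                },
                            }
                        ]
                    },
                }
            ]
        }
    }
}

for row in summarize_survey(sample):
    print(row)
```

An empty result means no monitoring documents matched the time range. A cluster UUID with no `cluster_stats` row would be a candidate for the second query.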
diff --git a/x-pack/plugins/monitoring/readme.md b/x-pack/plugins/monitoring/readme.md
@@ -18,5 +18,5 @@ This plugin provides the Stack Monitoring kibana application.
 - [APM tracing](dev_docs/how_to/apm_tracing.md) (WIP)
 
 ## Troubleshooting
-- [Diagnostic queries](dev_docs/runbook/diagnostic_queries.md) (WIP)
-- [CPU metrics](dev_docs/runbook/cpu_metrics.md) (WIP)
+- [Diagnostic queries](dev_docs/runbook/diagnostic_queries.md)
+- [CPU metrics](dev_docs/runbook/cpu_metrics.md)