[Stack Monitoring] Diagnostic query docs (#127572)
matschaffer authored Mar 14, 2022
1 parent 5410626 commit f071726
Showing 4 changed files with 121 additions and 3 deletions.
3 changes: 2 additions & 1 deletion x-pack/plugins/monitoring/dev_docs/how_to/cloud_setup.md
@@ -38,6 +38,7 @@ elasticsearch.hosts: ${ELASTICSEARCH_ENDPOINT}
elasticsearch.username: kibana_dev
elasticsearch.password: ${ELASTIC_PASSWORD}
elasticsearch.ignoreVersionMismatch: true
monitoring.ui.container.elasticsearch.enabled: true
YAML
```

@@ -47,4 +48,4 @@ And start kibana with that config:
yarn start --config config/kibana.cloud.yml
```

Note that your local kibana will run data migrations and probably render the cloud created kibana unusable after your local kibana starts up.
Note that your local kibana will run data migrations and probably render the cloud created kibana unusable after your local kibana starts up.
21 changes: 21 additions & 0 deletions x-pack/plugins/monitoring/dev_docs/runbook/cpu_metrics.md
@@ -0,0 +1,21 @@
CPU utilization seems like a simple question: how hard are my CPUs working?

But the way CPU resources get managed can get interesting, especially when [cgroups](https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt) and [CFS](https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html) are used.

When trying to debug why a CPU metric doesn't look the way you expect it to in a Stack Monitoring graph, this information may be helpful.

At the time of writing, the code path to get from a system level CPU metric to a utilization percentage looks like this:

1. `node_cpu_metric` set to `node_cgroup_quota_as_cpu_utilization` when cgroup is enabled: [node_detail.js](/x-pack/plugins/monitoring/server/routes/api/v1/elasticsearch/node_detail.js#L61-65)
1. `node_cgroup_quota_as_cpu_utilization` defined as a `QuotaMetric` against `cpu.cfs_quota_micros`: [metrics.ts](/x-pack/plugins/monitoring/server/lib/metrics/elasticsearch/metrics.ts#L798-801)
1. `QuotaMetric` tries to produce a ratio of usage to quota, but returns null when quota isn't a positive number: [quota_metric.ts](/x-pack/plugins/monitoring/server/lib/metrics/classes/quota_metric.ts#L79-80)
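
The ratio in the last step can be sketched roughly as follows. This is a simplified illustration, not the code in `quota_metric.ts`: the `QuotaSample` shape and the `quotaUtilizationPercent` name are hypothetical, and it assumes usage is measured in nanoseconds while the quota is in microseconds per CFS period.

```typescript
// Hypothetical input shape, for illustration only (not Kibana's bucket format).
interface QuotaSample {
  usageNanosDelta: number; // CPU time consumed over the interval, in nanoseconds
  elapsedPeriodsDelta: number; // CFS periods elapsed over the same interval
  quotaMicros: number; // cfs_quota_micros; not positive when no quota is set
}

// Returns usage as a percentage of the CPU time the quota allowed,
// or null when there is no positive quota to compare against.
function quotaUtilizationPercent(s: QuotaSample): number | null {
  if (!(s.quotaMicros > 0) || !(s.elapsedPeriodsDelta > 0)) {
    return null; // this is the case that shows up as "N/A" in the UI
  }
  // Convert the quota from microseconds to nanoseconds before dividing.
  const allowedNanos = s.elapsedPeriodsDelta * s.quotaMicros * 1000;
  return (s.usageNanosDelta / allowedNanos) * 100;
}
```

For example, 10 elapsed periods with a 100000 microsecond quota allow 1 second of CPU time, so 0.5 seconds of usage works out to 50%.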

So it's important to be aware of the `monitoring.ui.container.elasticsearch.enabled` setting, which defaults to `true` on cloud.elastic.co.

Some values of `cfs_quota_micros` could produce unexpected results. For example, if cgroups are enabled but no quota is set, you'll get an "N/A" in the Stack Monitoring UI since Elasticsearch can't directly see how much of the CPU it's using.

You can confirm a point-in-time value of `cfs_quota_micros` for Elasticsearch by using the [node stats API](https://www.elastic.co/guide/en/elasticsearch/reference/master/cluster-nodes-stats.html).
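
As a rough sketch, the relevant fields can be pulled out of a node stats response (for example from `GET _nodes/stats/os`) like this. The `NodeStatsResponse` type models only the fields used here, and `summarizeCgroupQuota` is a hypothetical helper name; the cgroup section is assumed to be present only on Linux nodes.

```typescript
// Only the fields this sketch reads; the real response contains much more.
interface NodeStatsResponse {
  nodes: Record<
    string,
    {
      name?: string;
      os?: {
        cgroup?: {
          cpu?: { cfs_period_micros?: number; cfs_quota_micros?: number };
        };
      };
    }
  >;
}

interface QuotaSummary {
  node: string;
  quotaMicros: number | null;
  periodMicros: number | null;
  cpus: number | null; // quota / period, or null when no positive quota is set
}

// Hypothetical helper: one summary row per node in the response.
function summarizeCgroupQuota(resp: NodeStatsResponse): QuotaSummary[] {
  return Object.keys(resp.nodes).map((id) => {
    const node = resp.nodes[id];
    const cpu = node.os?.cgroup?.cpu;
    const quota = cpu?.cfs_quota_micros ?? null;
    const period = cpu?.cfs_period_micros ?? null;
    const hasQuota = quota !== null && period !== null && quota > 0 && period > 0;
    return {
      node: node.name ?? id,
      quotaMicros: quota,
      periodMicros: period,
      cpus: hasQuota ? (quota as number) / (period as number) : null,
    };
  });
}
```

A node whose `cpus` comes back as `null` is one where the quota-based CPU graph would be expected to show "N/A".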

The CPU available on Elastic Cloud is based on the memory size of the instance, and smaller instance sizes get an additional boost via direct adjustments to the `cfs_quota_us` cgroup setting.

For self-hosted deployments, the cgroup configuration will likely need to be checked via `docker inspect`.
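
For Docker specifically, the output of `docker inspect` includes `HostConfig.CpuQuota`, `HostConfig.CpuPeriod` and `HostConfig.NanoCpus`. A minimal sketch of turning those into an effective CPU limit might look like the following; `cpuLimitFromHostConfig` is a hypothetical helper, and the 100000 microsecond default period is an assumption based on the usual CFS default.

```typescript
// Only the HostConfig fields this sketch reads from `docker inspect` output.
interface DockerHostConfig {
  CpuQuota?: number; // microseconds per period, 0 when unset
  CpuPeriod?: number; // microseconds, 0 when left at the default
  NanoCpus?: number; // set by `--cpus`, in units of 1e-9 CPUs
}

// Assumed default CFS period when CpuPeriod is reported as 0.
const DEFAULT_CFS_PERIOD_MICROS = 100000;

// Hypothetical helper: number of CPUs the container is limited to,
// or null when no limit is configured.
function cpuLimitFromHostConfig(hc: DockerHostConfig): number | null {
  if (hc.NanoCpus && hc.NanoCpus > 0) {
    return hc.NanoCpus / 1e9;
  }
  if (hc.CpuQuota && hc.CpuQuota > 0) {
    const period = hc.CpuPeriod && hc.CpuPeriod > 0 ? hc.CpuPeriod : DEFAULT_CFS_PERIOD_MICROS;
    return hc.CpuQuota / period;
  }
  return null;
}
```

A `null` result here corresponds to the "cgroups enabled but no quota set" case described above.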
96 changes: 96 additions & 0 deletions x-pack/plugins/monitoring/dev_docs/runbook/diagnostic_queries.md
@@ -0,0 +1,96 @@
If the stack monitoring UI isn't showing data for any cluster, it may first be useful to survey the available data using a query like this:

```Kibana Dev Tools
POST .monitoring-*/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-1h",
        "lte": "now"
      }
    }
  },
  "aggs": {
    "clusters": {
      "terms": {
        "field": "cluster_uuid",
        "size": 1000
      },
      "aggs": {
        "indices": {
          "terms": {
            "field": "_index",
            "size": 1000
          },
          "aggs": {
            "documentTypes": {
              "terms": {
                "field": "type",
                "size": 1000
              }
            }
          }
        }
      }
    }
  }
}
```

This will show what document types are available in each index for each cluster UUID in the last hour.
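
The nested buckets in the response can be tedious to read. As a sketch, they can be flattened into one row per cluster, index and document type. The `SurveyResponse` type covers only the aggregation names used in the query above, and `flattenSurvey` is a hypothetical helper name.

```typescript
interface TermsBucket {
  key: string;
  doc_count: number;
}
interface IndexBucket extends TermsBucket {
  documentTypes: { buckets: TermsBucket[] };
}
interface ClusterBucket extends TermsBucket {
  indices: { buckets: IndexBucket[] };
}
// Only the part of the search response this sketch reads.
interface SurveyResponse {
  aggregations?: { clusters: { buckets: ClusterBucket[] } };
}

interface SurveyRow {
  clusterUuid: string;
  index: string;
  type: string;
  docCount: number;
}

// Hypothetical helper: one row per (cluster UUID, index, document type).
// Indices whose documents have no `type` value produce no rows.
function flattenSurvey(resp: SurveyResponse): SurveyRow[] {
  const rows: SurveyRow[] = [];
  for (const cluster of resp.aggregations?.clusters.buckets ?? []) {
    for (const index of cluster.indices.buckets) {
      for (const docType of index.documentTypes.buckets) {
        rows.push({
          clusterUuid: cluster.key,
          index: index.key,
          type: docType.key,
          docCount: docType.doc_count,
        });
      }
    }
  }
  return rows;
}
```

An empty result means no monitoring documents with a `type` field were found in the last hour for any cluster.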

The main cluster list requires ES cluster stats to be available. You can use this query to check for the presence of cluster stats for a given cluster UUID (replace the `<CLUSTER UUID>` placeholder in the query with the actual value).

```Kibana Dev Tools
POST .monitoring-*,*:.monitoring-*,metrics-*,*:metrics-*/_search
{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "type": "cluster_stats"
                }
              },
              {
                "term": {
                  "metricset.name": "cluster_stats"
                }
              }
            ]
          }
        },
        {
          "term": {
            "cluster_uuid": "<CLUSTER UUID>"
          }
        },
        {
          "range": {
            "timestamp": {
              "format": "epoch_millis",
              "gte": "now-7d",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "cluster_uuid"
  },
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  }
}
```
4 changes: 2 additions & 2 deletions x-pack/plugins/monitoring/readme.md
@@ -18,5 +18,5 @@ This plugin provides the Stack Monitoring kibana application.
- [APM tracing](dev_docs/how_to/apm_tracing.md) (WIP)

## Troubleshooting
- [Diagnostic queries](dev_docs/runbook/diagnostic_queries.md) (WIP)
- [CPU metrics](dev_docs/runbook/cpu_metrics.md) (WIP)
- [Diagnostic queries](dev_docs/runbook/diagnostic_queries.md)
- [CPU metrics](dev_docs/runbook/cpu_metrics.md)
