Enhance diagnostics docs

The diagnostics docs all lived on one page (`setup/reference/metrics.mdx`), which made things hard to find and wasn't conducive to adding new content. Diagnostics information is now located at `setup/diagnostics`, with each section from the original `metrics.mdx` doc having its own dedicated page. The section now shows up in the navbar at "Setup" > "Monitoring Your Cluster". The profiling page is expanded to include information from https://github.com/gravitational/teleport/blob/740d184d1cfc69ae2e96c50ee738b13884fb232b/assets/monitoring/README.md#low-level-monitoring to illustrate what the different profile types are, what information they capture, and how to retrieve them. A new Distributed Tracing page is also added to instruct users on how to set up the `tracing_service` to collect and export spans (the last open item for #12241).
gravitational · Sep 29, 2022 · 8c60a6a
1 parent 7f1f8ec

Showing 14 changed files with 455 additions and 224 deletions.
diff --git a/docs/config.json b/docs/config.json
@@ -469,6 +469,32 @@
               "slug": "/management/guides/ssh-key-extensions/"
             }
           ]
+        },
+        {
+          "title": "Diagnostics",
+          "slug": "/management/diagnostics/",
+          "entries": [
+            {
+              "title": "Health Monitoring",
+              "slug": "/management/diagnostics/monitoring/",
+              "forScopes": ["oss", "enterprise", "cloud"]
+            },
+            {
+              "title": "Metrics",
+              "slug": "/management/diagnostics/metrics/",
+              "forScopes": ["oss", "enterprise", "cloud"]
+            },
+            {
+              "title": "Collecting Profiles",
+              "slug": "/management/diagnostics/profiles/",
+              "forScopes": ["oss", "enterprise", "cloud"]
+            },
+            {
+              "title": "Distributed Tracing",
+              "slug": "/management/diagnostics/tracing/",
+              "forScopes": ["oss", "enterprise", "cloud"]
+            }
+          ]
         }
       ]
     },
@@ -1161,7 +1187,7 @@
     },
     {
       "source": "/metrics-logs-reference/",
-      "destination": "/reference/metrics/",
+      "destination": "/management/diagnostics/metrics/",
       "permanent": true
     },
     {

diff --git a/docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx b/docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx
@@ -0,0 +1,28 @@
+Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them via:
+
+<Tabs>
+    <TabItem label="Command line">
+        ```code
+        $ sudo teleport start --diag-addr=127.0.0.1:3000
+        ```
+    </TabItem>
+    <TabItem label="Config file">
+        ```yaml
+        teleport:
+            diag_addr: 127.0.0.1:3000
+        ```
+    </TabItem>
+</Tabs>
+
+
+<Details
+    title="Ensure you can connect to the diagnostic endpoint"
+    opened={false}
+>
+
+    Verify that Teleport is now serving the diagnostics endpoint:
+
+    ```code
+    $ curl http://127.0.0.1:3000
+    ```
+</Details>
diff --git a/docs/pages/includes/diagnostics/tracing-config.yaml b/docs/pages/includes/diagnostics/tracing-config.yaml
@@ -0,0 +1,25 @@
+tracing_service:
+  # Turns tracing on. Default is 'no'
+  enabled: yes
+  # The OTLP exporter to send traces to. Possible values:
+  #    "grpc://collector.example.com"        : Traces will be exported via gRPC to the provided URL.
+  #    "http(s)://collector.example.com"     : Traces will be exported via HTTP to the provided URL.
+  #    "file:///var/lib/teleport/traces"     : Traces will be saved to files within the provided directory. Each file
+  #                                            will contain one proto encoded span per line. Files are rotated after
+  #                                            reaching 100MB. To override the rotation limit add
+  #                                            ?limit=<desired_file_size_in_bytes> to the
+  #                                            URL (e.g. file:///var/lib/teleport/traces?limit=100)
+  exporter_url: grpc://collector.example.com:4317
+  # The number of samples to collect per million spans.
+  # 1000000 will sample **all** spans generated by Teleport
+  # 500000 will sample 50% of spans generated by Teleport
+  # 10000 will sample 1% of spans generated by Teleport
+  # 0 will not sample any spans generated by Teleport but will respect any parent span's sampling.
+  sampling_rate_per_million: 1000000
+  # Optional CA certificates are used to validate the exporter.
+  ca_certs:
+    - /var/lib/teleport/exporter_ca.pem
+  # Optional TLS certificates are used to enable mTLS for the exporter
+  https_keypairs:
+    - key_file: /var/lib/teleport/exporter_key.pem
+      cert_file: /var/lib/teleport/exporter_cert.pem
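
The `sampling_rate_per_million` comments in the config above reduce to simple arithmetic. A minimal sketch of that conversion follows; the helper names are hypothetical and are not part of Teleport, and the range check is an assumption made for illustration rather than documented Teleport behavior.

```python
# Illustrative helpers only: these names are hypothetical, not a Teleport API.
SPANS_PER_MILLION = 1_000_000


def sampling_fraction(sampling_rate_per_million: int) -> float:
    """Return the fraction of spans sampled for a given config value."""
    # Assumption: values outside 0..1000000 are treated as invalid here.
    if not 0 <= sampling_rate_per_million <= SPANS_PER_MILLION:
        raise ValueError("sampling_rate_per_million must be in 0..1000000")
    return sampling_rate_per_million / SPANS_PER_MILLION


def rate_for_percentage(percent: float) -> int:
    """Return the sampling_rate_per_million value for a target percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in 0..100")
    return round(percent * SPANS_PER_MILLION / 100)


if __name__ == "__main__":
    # The same values listed in the config comments.
    for rate in (1_000_000, 500_000, 10_000, 0):
        print(f"{rate:>7} per million = {sampling_fraction(rate):.0%} of spans")
```

A value of 0 still allows spans whose parent was already sampled, as the config comment notes, so the fraction computed here only describes spans that Teleport decides on itself.
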
diff --git a/docs/pages/includes/metrics.mdx b/docs/pages/includes/metrics.mdx
@@ -0,0 +1,139 @@
+## Auth Service and backends
+
+| Name | Type | Component | Description |
+| - | - | - | - |
+| `audit_failed_disk_monitoring` | counter | Teleport Audit Log | Number of times disk monitoring failed. |
+| `audit_failed_emit_events` | counter | Teleport Audit Log | Number of times emitting audit events failed. |
+| `audit_percentage_disk_space_used` | gauge | Teleport Audit Log | Percentage of disk space used. |
+| `audit_server_open_files` | gauge | Teleport Audit Log | Number of open audit files. |
+| `auth_generate_requests_throttled_total` | counter | Teleport Auth | Number of throttled requests to generate new server keys. |
+| `auth_generate_requests_total` | counter | Teleport Auth | Number of requests to generate new server keys. |
+| `auth_generate_requests` | gauge | Teleport Auth | Number of current generate requests. |
+| `auth_generate_seconds` | histogram | Teleport Auth | Latency for generate requests. |
+| `backend_batch_read_requests_total` | counter | cache | Number of batch read requests to the backend. |
+| `backend_batch_read_seconds` | histogram | cache | Latency for batch read operations. |
+| `backend_batch_write_requests_total` | counter | cache | Number of batch write requests to the backend. |
+| `backend_batch_write_seconds` | histogram | cache | Latency for backend batch write operations. |
+| `backend_read_requests_total` | counter | cache | Number of read requests to the backend. |
+| `backend_read_seconds` | histogram | cache | Latency for read operations. |
+| `backend_write_requests_total` | counter | cache | Number of write requests to the backend. |
+| `backend_write_seconds` | histogram | cache | Latency for backend write operations. |
+| `cluster_name_not_found_total` | counter | Teleport Auth | Number of times a cluster was not found. |
+| `dynamo_requests_total` | counter | dynamodb | Number of requests to the DynamoDB API. |
+| `dynamo_requests` | counter | dynamodb | Number of failed requests to the DynamoDB API. |
+| `dynamo_requests_seconds` | histogram | dynamodb | Latency of DynamoDB API requests. |
+| `etcd_backend_batch_read_requests` | counter | etcd | Number of batch read requests to the etcd database. |
+| `etcd_backend_batch_read_seconds` | histogram | etcd | Latency for etcd batch read operations. |
+| `etcd_backend_read_requests` | counter | etcd | Number of read requests to the etcd database. |
+| `etcd_backend_read_seconds` | histogram | etcd | Latency for etcd read operations. |
+| `etcd_backend_tx_requests` | counter | etcd | Number of transaction requests to the database. |
+| `etcd_backend_tx_seconds` | histogram | etcd | Latency for etcd transaction operations. |
+| `etcd_backend_write_requests` | counter | etcd | Number of write requests to the database. |
+| `etcd_backend_write_seconds` | histogram | etcd | Latency for etcd write operations. |
+| `firestore_events_backend_batch_read_requests` | counter | GCP Cloud Firestore | Number of batch read requests to Cloud Firestore events. |
+| `firestore_events_backend_batch_read_seconds` | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch read operations. |
+| `firestore_events_backend_batch_write_requests` | counter | GCP Cloud Firestore | Number of batch write requests to Cloud Firestore events. |
+| `firestore_events_backend_batch_write_seconds` | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch write operations. |
+| `gcs_event_storage_downloads_seconds` | histogram | GCP GCS | Latency for GCS download operations. |
+| `gcs_event_storage_downloads` | counter | GCP GCS | Number of downloads from the GCS backend. |
+| `gcs_event_storage_uploads_seconds` | histogram | GCP GCS | Latency for GCS upload operations. |
+| `gcs_event_storage_uploads` | counter | GCP GCS | Number of uploads to the GCS backend. |
+| `grpc_server_started_total` | counter | Teleport Auth | Total number of RPCs started on the server. |
+| `grpc_server_handled_total` | counter | Teleport Auth | Total number of RPCs completed on the server, regardless of success or failure. |
+| `grpc_server_msg_received_total` | counter | Teleport Auth | Total number of RPC stream messages received on the server. |
+| `grpc_server_msg_sent_total` | counter | Teleport Auth | Total number of gRPC stream messages sent by the server. |
+| `heartbeat_connections_missed_total` | counter | Teleport Auth | Number of times the Auth Service did not receive a heartbeat from a Node. |
+| `heartbeat_connections_received_total` | counter | Teleport Auth | Number of times the Auth Service received a heartbeat connection. |
+| `s3_requests_total` | counter | Amazon S3 | Total number of requests to the S3 API. |
+| `s3_requests` | counter | Amazon S3 | Number of requests to the S3 API by result. |
+| `s3_requests_seconds` | histogram | Amazon S3 | Request latency for the S3 API. |
+| `teleport_audit_emit_events` | counter | Teleport Audit Log | Number of audit events emitted. |
+| `teleport_connected_resources` | gauge | Teleport Auth | Number and type of resources connected via keepalives. |
+| `teleport_registered_servers` | gauge | Teleport Auth | The number of Teleport services that are connected to an Auth Service instance grouped by version. |
+| `user_login_total` | counter | Teleport Auth | Number of user logins. |
+| `watcher_event_sizes` | histogram | cache | Overall size of events emitted. |
+| `watcher_events` | histogram | cache | Per resource size of events emitted. |
+
+
+## Proxy Service
+
+| Name | Type | Component | Description |
+| - | - | - | - |
+| `failed_connect_to_node_attempts_total` | counter | Teleport Proxy | Number of failed SSH connection attempts to a node. Use with `teleport_connect_to_node_attempts_total` to get the failure rate. |
+| `failed_login_attempts_total` | counter | Teleport Proxy | Number of failed `tsh login` or `tsh ssh` logins. |
+| `grpc_client_started_total` | counter | Teleport Proxy | Total number of RPCs started on the client. |
+| `grpc_client_handled_total` | counter | Teleport Proxy | Total number of RPCs completed on the client, regardless of success or failure. |
+| `grpc_client_msg_received_total` | counter | Teleport Proxy | Total number of RPC stream messages received on the client. |
+| `grpc_client_msg_sent_total` | counter | Teleport Proxy | Total number of gRPC stream messages sent by the client. |
+| `proxy_connection_limit_exceeded_total` | counter | Teleport Proxy | Number of connections that exceeded the proxy connection limit. |
+| `proxy_missing_ssh_tunnels` | gauge | Teleport Proxy | Number of missing SSH tunnels. Used to debug if nodes have discovered all proxies. |
+| `teleport_connect_to_node_attempts_total` | counter | Teleport Proxy | Number of SSH connection attempts to a node. Use with `failed_connect_to_node_attempts_total` to get the failure rate. |
+| `teleport_reverse_tunnels_connected` | gauge | Teleport Proxy | Number of reverse SSH tunnels connected to the Teleport Proxy Service by Teleport instances. |
+
+## Teleport Nodes
+
+| Name | Type | Component | Description |
+| - | - | - | - |
+| `user_max_concurrent_sessions_hit_total` | counter | Teleport Node | Number of times a user exceeded their concurrent session limit. |
+
+## All Teleport instances
+
+| Name | Type | Component | Description |
+| - | - | - | - |
+| `certificate_mismatch_total` | counter | Teleport | Number of SSH server login failures due to a certificate mismatch. |
+| `reversetunnel_connected_proxies` | gauge | Teleport | Number of known proxies being sought. |
+| `rx` | counter | Teleport | Number of bytes received during an SSH connection. |
+| `server_interactive_sessions_total` | gauge | Teleport | Number of active sessions. |
+| `teleport_build_info` | gauge | Teleport | Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1. |
+| `teleport_cache_events` | counter | Teleport | Number of events received by a Teleport service cache. Teleport's Auth Service, Proxy Service, and other services cache incoming events related to their service. |
+| `teleport_cache_stale_events` | counter | Teleport | Number of stale events received by a Teleport service cache. A high percentage of stale events can indicate a degraded backend. |
+| `trusted_clusters` | gauge | Teleport | Number of tunnels per state. |
+| `tx` | counter | Teleport | Number of bytes transmitted during an SSH connection. |
+
+
+## Golang runtime metrics
+
+| Name | Type | Component | Description |
+| - | - | - | - |
+| `go_gc_duration_seconds` | summary | Internal Golang | A summary of GC invocation durations. |
+| `go_goroutines` | gauge | Internal Golang | Number of goroutines that currently exist. |
+| `go_info` | gauge | Internal Golang | Information about the Go environment. |
+| `go_memstats_alloc_bytes_total` | counter | Internal Golang | Total number of bytes allocated, even if freed. |
+| `go_memstats_alloc_bytes` | gauge | Internal Golang | Number of bytes allocated and still in use. |
+| `go_memstats_buck_hash_sys_bytes` | gauge | Internal Golang | Number of bytes used by the profiling bucket hash table. |
+| `go_memstats_frees_total` | counter | Internal Golang | Total number of frees. |
+| `go_memstats_gc_cpu_fraction` | gauge | Internal Golang | The fraction of this program's available CPU time used by the GC since the program started. |
+| `go_memstats_gc_sys_bytes` | gauge | Internal Golang | Number of bytes used for garbage collection system metadata. |
+| `go_memstats_heap_alloc_bytes` | gauge | Internal Golang | Number of heap bytes allocated and still in use. |
+| `go_memstats_heap_idle_bytes` | gauge | Internal Golang | Number of heap bytes waiting to be used. |
+| `go_memstats_heap_inuse_bytes` | gauge | Internal Golang | Number of heap bytes that are in use. |
+| `go_memstats_heap_objects` | gauge | Internal Golang | Number of allocated objects. |
+| `go_memstats_heap_released_bytes` | gauge | Internal Golang | Number of heap bytes released to the OS. |
+| `go_memstats_heap_sys_bytes` | gauge | Internal Golang | Number of heap bytes obtained from the system. |
+| `go_memstats_last_gc_time_seconds` | gauge | Internal Golang | Number of seconds since the Unix epoch of the last garbage collection. |
+| `go_memstats_lookups_total` | counter | Internal Golang | Total number of pointer lookups. |
+| `go_memstats_mallocs_total` | counter | Internal Golang | Total number of mallocs. |
+| `go_memstats_mcache_inuse_bytes` | gauge | Internal Golang | Number of bytes in use by mcache structures. |
+| `go_memstats_mcache_sys_bytes` | gauge | Internal Golang | Number of bytes used for mcache structures obtained from system. |
+| `go_memstats_mspan_inuse_bytes` | gauge | Internal Golang | Number of bytes in use by mspan structures. |
+| `go_memstats_mspan_sys_bytes` | gauge | Internal Golang | Number of bytes used for mspan structures obtained from system. |
+| `go_memstats_next_gc_bytes` | gauge | Internal Golang | Number of heap bytes at which the next garbage collection will take place. |
+| `go_memstats_other_sys_bytes` | gauge | Internal Golang | Number of bytes used for other system allocations. |
+| `go_memstats_stack_inuse_bytes` | gauge | Internal Golang | Number of bytes in use by the stack allocator. |
+| `go_memstats_stack_sys_bytes` | gauge | Internal Golang | Number of bytes obtained from the system for stack allocator. |
+| `go_memstats_sys_bytes` | gauge | Internal Golang | Number of bytes obtained from the system. |
+| `go_threads` | gauge | Internal Golang | Number of OS threads created. |
+| `process_cpu_seconds_total` | counter | Internal Golang | Total user and system CPU time spent in seconds. |
+| `process_max_fds` | gauge | Internal Golang | Maximum number of open file descriptors. |
+| `process_open_fds` | gauge | Internal Golang | Number of open file descriptors. |
+| `process_resident_memory_bytes` | gauge | Internal Golang | Resident memory size in bytes. |
+| `process_start_time_seconds` | gauge | Internal Golang | Start time of the process since the Unix epoch in seconds. |
+| `process_virtual_memory_bytes` | gauge | Internal Golang | Virtual memory size in bytes. |
+| `process_virtual_memory_max_bytes` | gauge | Internal Golang | Maximum amount of virtual memory available in bytes. |
+
+## Prometheus
+
+| Name | Type | Component | Description |
+| - | - | - | - |
+| `promhttp_metric_handler_requests_in_flight` | gauge | prometheus | Current number of scrapes being served. |
+| `promhttp_metric_handler_requests_total` | counter | prometheus | Total number of scrapes by HTTP status code. |
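
The Proxy Service table above suggests pairing `failed_connect_to_node_attempts_total` with `teleport_connect_to_node_attempts_total` to get a failure rate. A minimal sketch of that calculation over Prometheus text-format output might look like the following. The parser is deliberately simplified, the function names are hypothetical, and the sample values are made up for illustration.

```python
from typing import Dict, Optional

# Made-up sample in the Prometheus text exposition format, for illustration only.
SAMPLE = """\
# TYPE failed_connect_to_node_attempts_total counter
failed_connect_to_node_attempts_total 3
# TYPE teleport_connect_to_node_attempts_total counter
teleport_connect_to_node_attempts_total 60
"""


def parse_totals(exposition_text: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comments."""
    totals: Dict[str, float] = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{label="value",...} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            # name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue  # malformed line with no value
        totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


def connect_failure_rate(totals: Dict[str, float]) -> Optional[float]:
    """Return failed / attempted node connections, or None with no attempts."""
    attempts = totals.get("teleport_connect_to_node_attempts_total", 0.0)
    if attempts == 0:
        return None
    failed = totals.get("failed_connect_to_node_attempts_total", 0.0)
    return failed / attempts


if __name__ == "__main__":
    print(connect_failure_rate(parse_totals(SAMPLE)))
```

Because counters accumulate from process start, this ratio covers the lifetime of the process; a rate over a recent window would need two scrapes and the difference between them.
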
diff --git a/docs/pages/management/admin/troubleshooting.mdx b/docs/pages/management/admin/troubleshooting.mdx
@@ -156,7 +156,7 @@ For more information about custom features, or to try our [Enterprise edition](.
 
 This guide showed how to investigate issues with the `teleport` process. To see
 how you can monitor more general health and performance data from your Teleport
-cluster, read our [Teleport Diagnostics](../../reference/metrics.mdx) guide.
+cluster, read our [Teleport Diagnostics](../diagnostics/monitoring.mdx) guides.
 
 For additional sources of Teleport support, please see the
 [Teleport Support and Education Center](https://goteleport.com/support/).
@@ -171,19 +171,19 @@ purposes and seeing it within your logs is not necessarily an indication that
 anything is incorrect.
 
 Firstly, Teleport uses this value within certificates (as a DNS Subject
-Alternative Name) issued to the Auth and Proxy Service. Teleport clients can 
-then use this value to validate the service's certificates during the TLS 
-handshake regardless of the service address as long as the client already has a 
+Alternative Name) issued to the Auth and Proxy Service. Teleport clients can
+then use this value to validate the service's certificates during the TLS
+handshake regardless of the service address as long as the client already has a
 copy of the cluster's certificate authorities. This is important as there are
 often multiple different ways that a client can connect to the Auth Service and
 these are not always via the same address.
 
-Secondly, this value is used by clients as part of the URL when making gRPC or 
+Secondly, this value is used by clients as part of the URL when making gRPC or
 HTTP requests to the Teleport API. This is because the Teleport API client uses
 special logic to open the connection to the Auth Service to make the request,
-rather than connecting to a single address as a typical client may do. This 
-special logic is necessary for the client to be able to support connecting to a 
-list of Auth Services or to be able to connect to the Auth Service through a 
-tunnel via the Proxy Service. This means that `teleport.cluster.local` appears 
-in log messages that show the URL of a request made to the Auth Service, and 
+rather than connecting to a single address as a typical client may do. This
+special logic is necessary for the client to be able to support connecting to a
+list of Auth Services or to be able to connect to the Auth Service through a
+tunnel via the Proxy Service. This means that `teleport.cluster.local` appears
+in log messages that show the URL of a request made to the Auth Service, and
 does not explicitly indicate that something is misconfigured.
diff --git a/docs/pages/management/diagnostics.mdx b/docs/pages/management/diagnostics.mdx
@@ -0,0 +1,10 @@
+---
+title: Monitoring your Cluster
+description: Monitoring your Teleport deployment
+layout: tocless-doc
+---
+
+- [Health Monitoring](./diagnostics/monitoring.mdx): How to monitor the health of a Teleport instance.
+- [Metrics](./diagnostics/metrics.mdx): How to enable exporting Prometheus metrics.
+- [Collecting Profiles](./diagnostics/profiles.mdx): How to collect runtime profiling data from a Teleport instance.
+- [Distributed Tracing](./diagnostics/tracing.mdx): How to enable Distributed Tracing for a Teleport instance.
diff --git a/docs/pages/management/diagnostics/metrics.mdx b/docs/pages/management/diagnostics/metrics.mdx
@@ -0,0 +1,21 @@
+---
+title: Metrics
+description: How to enable and consume metrics
+---
+
+## Prerequisites
+
+(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)
+
+This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the
+metrics that Teleport tracks. It is compatible with [Prometheus](https://prometheus.io/) collectors.
+
+The following metrics are available:
+
+<Notice scope={["cloud"]} type="tip">
+
+    Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.
+
+</Notice>
+
+(!docs/pages/includes/metrics.mdx!)