Enhance diagnostics docs

The diagnostics docs were all in one page (`setup/reference/metrics.mdx`), which made things a bit hard to find and wasn't conducive to adding new content. Diagnostics information is now located at `setup/diagnostics`, with each section from the original `metrics.mdx` doc now having its own dedicated page. The section will now show up in the navbar at "Setup" > "Monitoring Your Cluster". The content of the profiling page is expanded to include information from https://github.com/gravitational/teleport/blob/740d184d1cfc69ae2e96c50ee738b13884fb232b/assets/monitoring/README.md#low-level-monitoring to illustrate what the different profile types are, the information that they capture, and how to retrieve them. A new Distributed Tracing page is also added to instruct users on how to set up the `tracing_service` to collect and export spans (the last open item for #12241).
gravitational · Oct 14, 2022 · 30ea89a · 30ea89a
1 parent 4410a2a
commit 30ea89a
Showing 14 changed files with 466 additions and 236 deletions.
diff --git a/docs/config.json b/docs/config.json
@@ -484,6 +484,32 @@
               "slug": "/management/guides/ssh-key-extensions/"
             }
           ]
+        },
+        {
+          "title": "Diagnostics",
+          "slug": "/management/diagnostics/",
+          "entries": [
+            {
+              "title": "Health Monitoring",
+              "slug": "/management/diagnostics/monitoring/",
+              "forScopes": ["oss", "enterprise", "cloud"]
+            },
+            {
+              "title": "Metrics",
+              "slug": "/management/diagnostics/metrics/",
+              "forScopes": ["oss", "enterprise", "cloud"]
+            },
+            {
+              "title": "Collecting Profiles",
+              "slug": "/management/diagnostics/profiles/",
+              "forScopes": ["oss", "enterprise", "cloud"]
+            },
+            {
+              "title": "Distributed Tracing",
+              "slug": "/management/diagnostics/tracing/",
+              "forScopes": ["oss", "enterprise", "cloud"]
+            }
+          ]
         }
       ]
     },
@@ -1188,7 +1214,7 @@
     },
     {
       "source": "/metrics-logs-reference/",
-      "destination": "/reference/metrics/",
+      "destination": "/management/diagnostics/metrics/",
       "permanent": true
     },
     {

diff --git a/docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx b/docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx
@@ -0,0 +1,28 @@
+Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them via:
+
+<Tabs>
+    <TabItem label="Command line">
+        ```code
+        $ sudo teleport start --diag-addr=127.0.0.1:3000
+        ```
+    </TabItem>
+    <TabItem label="Config file">
+        ```yaml
+        teleport:
+            diag_addr: 127.0.0.1:3000
+        ```
+    </TabItem>
+</Tabs>
+
+
+<Details
+    title="Ensure you can connect to the diagnostic endpoint"
+    opened={false}
+>
+
+    Verify that Teleport is now serving the diagnostics endpoint:
+
+    ```code
+    $ curl http://127.0.0.1:3000
+    ```
+</Details>
diff --git a/docs/pages/includes/diagnostics/tracing-config.yaml b/docs/pages/includes/diagnostics/tracing-config.yaml
@@ -0,0 +1,25 @@
+tracing_service:
+  # Turns tracing on. Default is 'no'
+  enabled: yes
+  # The OTLP exporter to send traces to. Possible values:
+  #    "grpc://collector.example.com"        : Traces will be exported via gRPC to the provided URL.
+  #    "http(s)://collector.example.com"     : Traces will be exported via HTTP to the provided URL.
+  #    "file:///var/lib/teleport/traces"     : Traces will be saved to files within the provided directory. Each file
+  #                                            will contain one proto encoded span per line. Files are rotated after
+  #                                            reaching 100MB. To override the rotation limit add
+  #                                            ?limit=<desired_file_size_in_bytes> to the
+  #                                            URL (e.g. file:///var/lib/teleport/traces?limit=100)
+  exporter_url: grpc://collector.example.com:4317
+  # The number of samples to collect per million spans.
+  # 1000000 will sample **all** spans generated by Teleport
+  # 500000 will sample 50% of spans generated by Teleport
+  # 10000 will sample 1% of spans generated by Teleport
+  # 0 will not sample any spans generated by Teleport but will respect any parent span's sampling.
+  sampling_rate_per_million: 1000000
+  # Optional CA certificates are used to validate the exporter.
+  ca_certs:
+    - /var/lib/teleport/exporter_ca.pem
+  # Optional TLS certificates are used to enable mTLS for the exporter
+  https_keypairs:
+    - key_file: /var/lib/teleport/exporter_key.pem
+      cert_file: /var/lib/teleport/exporter_cert.pem
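
The `sampling_rate_per_million` comments above imply a simple conversion from a percentage: multiply by 10,000. As a minimal sketch of that arithmetic, the shell helper below handles whole-number percentages. The name `rate_per_million` is hypothetical and not part of Teleport; it only illustrates how the example values (1000000, 500000, 10000) relate to 100%, 50%, and 1%.

```shell
# Hypothetical helper (not part of Teleport): convert a whole-number
# sampling percentage (0-100) into a sampling_rate_per_million value.
rate_per_million() {
    pct="$1"
    # Reject empty input and anything that is not a plain non-negative integer.
    case "$pct" in
        ''|*[!0-9]*)
            echo "expected an integer percentage, got: $pct" >&2
            return 1
            ;;
    esac
    if [ "$pct" -gt 100 ]; then
        echo "percentage must be between 0 and 100" >&2
        return 1
    fi
    # 1% of one million spans is 10000 spans.
    echo $((pct * 10000))
}
```

For example, `rate_per_million 50` prints the value to place in `sampling_rate_per_million` for 50% sampling.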
diff --git a/docs/pages/includes/metrics.mdx b/docs/pages/includes/metrics.mdx
diff --git a/docs/pages/management/admin/troubleshooting.mdx b/docs/pages/management/admin/troubleshooting.mdx
@@ -156,7 +156,7 @@ For more information about custom features, or to try our [Enterprise edition](.
 
 This guide showed how to investigate issues with the `teleport` process. To see
 how you can monitor more general health and performance data from your Teleport
-cluster, read our [Teleport Diagnostics](../../reference/metrics.mdx) guide.
+cluster, read our [Teleport Diagnostics](../diagnostics/monitoring.mdx) guides.
 
 For additional sources of Teleport support, please see the
 [Teleport Support and Education Center](https://goteleport.com/support/).
@@ -171,19 +171,19 @@ purposes and seeing it within your logs is not necessarily an indication that
 anything is incorrect.
 
 Firstly, Teleport uses this value within certificates (as a DNS Subject
-Alternative Name) issued to the Auth and Proxy Service. Teleport clients can 
-then use this value to validate the service's certificates during the TLS 
-handshake regardless of the service address as long as the client already has a 
+Alternative Name) issued to the Auth and Proxy Service. Teleport clients can
+then use this value to validate the service's certificates during the TLS
+handshake regardless of the service address as long as the client already has a
 copy of the cluster's certificate authorities. This is important as there are
 often multiple different ways that a client can connect to the Auth Service and
 these are not always via the same address.
 
-Secondly, this value is used by clients as part of the URL when making gRPC or 
+Secondly, this value is used by clients as part of the URL when making gRPC or
 HTTP requests to the Teleport API. This is because the Teleport API client uses
 special logic to open the connection to the Auth Service to make the request,
-rather than connecting to a single address as a typical client may do. This 
-special logic is necessary for the client to be able to support connecting to a 
-list of Auth Services or to be able to connect to the Auth Service through a 
-tunnel via the Proxy Service. This means that `teleport.cluster.local` appears 
-in log messages that show the URL of a request made to the Auth Service, and 
+rather than connecting to a single address as a typical client may do. This
+special logic is necessary for the client to be able to support connecting to a
+list of Auth Services or to be able to connect to the Auth Service through a
+tunnel via the Proxy Service. This means that `teleport.cluster.local` appears
+in log messages that show the URL of a request made to the Auth Service, and
 does not explicitly indicate that something is misconfigured.
diff --git a/docs/pages/management/diagnostics.mdx b/docs/pages/management/diagnostics.mdx
@@ -0,0 +1,10 @@
+---
+title: Monitoring your Cluster
+description: Monitoring your Teleport deployment
+layout: tocless-doc
+---
+
+- [Health Monitoring](./diagnostics/monitoring.mdx): How to monitor the health of a Teleport instance.
+- [Metrics](./diagnostics/metrics.mdx): How to enable exporting Prometheus metrics.
+- [Collecting Profiles](./diagnostics/profiles.mdx): How to collect runtime profiling data from a Teleport instance.
+- [Distributed Tracing](./diagnostics/tracing.mdx): How to enable Distributed Tracing for a Teleport instance.
diff --git a/docs/pages/management/diagnostics/metrics.mdx b/docs/pages/management/diagnostics/metrics.mdx
@@ -0,0 +1,21 @@
+---
+title: Metrics
+description: How to enable and consume metrics
+---
+
+## Prerequisites
+
+(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)
+
+This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the
+metrics that Teleport tracks. It is compatible with [Prometheus](https://prometheus.io/) collectors.
+
+The following metrics are available:
+
+<Notice scope={["cloud"]} type="tip">
+
+    Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.
+
+</Notice>
+
+(!docs/pages/includes/metrics.mdx!)
diff --git a/docs/pages/management/diagnostics/monitoring.mdx b/docs/pages/management/diagnostics/monitoring.mdx
@@ -0,0 +1,58 @@
+---
+title: Health Monitoring
+description: Monitoring health and readiness.
+---
+
+## Prerequisites
+
+(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)
+
+Now you can collect monitoring information from several endpoints.
+
+## `/healthz`
+
+The `http://127.0.0.1:3000/healthz` endpoint responds with a body of
+`{"status":"ok"}` and an HTTP 200 OK status code if the process is running.
+
+This is a simple check, suitable for determining if the Teleport process is
+still running.
+
+## `/readyz`
+
+The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its
+response includes information about the state of the process.
+
+The response body is a JSON object of the form:
+
+```
+{ "status": "a status message here"}
+```
+
+### `/readyz` and heartbeats
+
+If a Teleport component fails to execute its heartbeat procedure, it will enter
+a degraded state. Teleport will begin recovering from this state when a
+heartbeat completes successfully.
+
+The first successful heartbeat will transition Teleport into a recovering state.
+
+A second consecutive successful heartbeat will cause Teleport to transition to
+the OK state.
+
+Teleport heartbeats run approximately every 60 seconds when healthy, and failed
+heartbeats are retried approximately every 5 seconds. This means that depending
+on the timing of heartbeats, it can take 60-70 seconds after connectivity is
+restored for `/readyz` to start reporting healthy again.
+
+### Status codes
+
+The status code of the response can be one of:
+
+- HTTP 200 OK: Teleport is operating normally
+- HTTP 503 Service Unavailable: Teleport has encountered a connection error and
+  is running in a degraded state. This happens when a Teleport heartbeat fails.
+- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or
+  has begun recovering from a degraded state.
+
+The same state information is also available via the `process_state` metric
+under the `/metrics` endpoint.
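
The status codes listed above can be turned into a small health probe. The following is a rough sketch, assuming the diagnostics endpoint is served at `127.0.0.1:3000` as in the prerequisites. The function names `readyz_state` and `check_readyz` and the state labels they print are illustrative choices, not Teleport output.

```shell
# Hypothetical helper: map a /readyz HTTP status code to the state
# described in the "Status codes" list. The labels are illustrative.
readyz_state() {
    case "$1" in
        200) echo "ok" ;;
        503) echo "degraded" ;;
        400) echo "starting-or-recovering" ;;
        *)   echo "unknown" ;;
    esac
}

# Query /readyz and print the mapped state.
# Assumes Teleport was started with --diag-addr=127.0.0.1:3000 unless
# another address is passed as the first argument.
check_readyz() {
    addr="${1:-127.0.0.1:3000}"
    # -w '%{http_code}' prints only the status code; curl reports 000
    # when it cannot connect, which maps to "unknown" above.
    code=$(curl -s -o /dev/null -w '%{http_code}' "http://${addr}/readyz")
    readyz_state "$code"
}
```

Running `check_readyz` on a host with the diagnostics endpoint enabled prints one of the labels; a load balancer or cron job could act on anything other than `ok`.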
diff --git a/docs/pages/management/diagnostics/profiles.mdx b/docs/pages/management/diagnostics/profiles.mdx
@@ -0,0 +1,74 @@
+---
+title: Profiling
+description: Collecting pprof profiles.
+---
+
+Teleport leverages Go's diagnostic capabilities to collect and export
+profiling data. Profiles can help identify the cause of CPU spikes,
+the source of memory leaks, or the reason for a deadlock.
+
+## Prerequisites
+
+The profiling endpoint is only enabled if the `--debug` flag is supplied.
+
+(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)
+
+## Collecting Profiles
+
+Go's standard [pprof](https://golang.org/pkg/net/http/pprof/) endpoints are
+served at `http://127.0.0.1:3000/debug/pprof/`. Retrieving a profile requires
+sending a request to the endpoint corresponding to the desired profile type.
+When debugging an issue it is helpful to collect a series of profiles over
+a period of time.
+
+### CPU
+A CPU profile shows execution statistics gathered over a 30-second period:
+
+```code
+# Download the profile into a file:
+$ curl -o cpu.profile http://127.0.0.1:3000/debug/pprof/profile
+
+# Visualize the profile
+$ go tool pprof -http : cpu.profile
+```
+
+### Goroutine
+
+A goroutine profile shows the stack traces of all current goroutines in the system:
+
+```code
+# Download the profile into a file:
+$ curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine
+
+# Visualize the profile
+$ go tool pprof -http : goroutine.profile
+```
+
+### Heap
+
+A heap profile shows allocated objects in the system:
+
+```code
+# Download the profile into a file:
+$ curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap
+
+# Visualize the profile
+$ go tool pprof -http : heap.profile
+```
+
+### Trace
+
+A trace captures scheduling, system call, garbage collection, heap size, and other events that are collected by the Go runtime:
+
+```code
+# Download the profile into a file:
+$ curl -o trace.out http://127.0.0.1:3000/debug/pprof/trace
+
+# Visualize the profile
+$ go tool trace trace.out
+```
+
+## Further Reading
+
+- More information about diagnostics in the Go ecosystem: https://go.dev/doc/diagnostics
+- A deep dive on profiling Go programs: https://go.dev/blog/pprof
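
The profiling page recommends collecting a series of profiles over a period of time. One way to do that is a small loop around the `curl` commands shown above. This is only a sketch: `profile_url`, `collect_series`, the `DIAG_ADDR` variable, and the output file naming are all hypothetical, and it assumes the diagnostics endpoint is enabled and Teleport is running with `--debug`.

```shell
# Hypothetical helper: build the pprof URL for a profile type
# (profile, goroutine, heap, or trace). DIAG_ADDR is an assumed
# override for the default diagnostics address.
profile_url() {
    echo "http://${DIAG_ADDR:-127.0.0.1:3000}/debug/pprof/$1"
}

# Hypothetical helper: collect COUNT snapshots of one profile type,
# INTERVAL seconds apart, each into its own timestamped file.
# Usage: collect_series heap 5 60
collect_series() {
    kind="$1"
    count="${2:-5}"
    interval="${3:-60}"
    i=1
    while [ "$i" -le "$count" ]; do
        out="${kind}-$(date +%Y%m%dT%H%M%S)-${i}.profile"
        # -f makes curl fail on HTTP errors instead of saving an error page.
        curl -s -f -o "$out" "$(profile_url "$kind")" || echo "failed: $out" >&2
        i=$((i + 1))
        # Sleep between snapshots, but not after the last one.
        if [ "$i" -le "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Each resulting file can then be opened with `go tool pprof` as shown in the sections above, which makes it possible to compare snapshots taken before and during an incident.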
diff --git a/docs/pages/management/diagnostics/reference/configuration.mdx b/docs/pages/management/diagnostics/reference/configuration.mdx
@@ -0,0 +1,16 @@
+---
+title: Distributed Tracing Configuration Reference
+description: Configuration reference for Distributed Tracing.
+---
+
+# Tracing Service configuration
+
+`teleport.yaml` fields related to Distributed Tracing:
+
+```yaml
+# Main service responsible for Distributed Tracing.
+#
+# You must enable the Tracing Service once per teleport.yaml for all
+# agents that you wish to capture traces from.
+(!docs/pages/includes/diagnostics/tracing-config.yaml!)
+```
diff --git a/docs/pages/management/diagnostics/tracing.mdx b/docs/pages/management/diagnostics/tracing.mdx
@@ -0,0 +1,43 @@
+---
+title: Distributed Tracing
+description: How to enable tracing within Teleport.
+---
+
+Teleport leverages [OpenTelemetry](https://opentelemetry.io/) to generate traces
+and export them to any [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/otlp/)
+capable exporter. In the event that your telemetry backend doesn't support receiving OTLP traces, you may be able to
+leverage the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) to proxy traces from OTLP
+to a format that your telemetry backend accepts.
+
+## Configure Teleport
+
+In order to enable tracing in Teleport, add the following section in `teleport.yaml`. For a detailed description of
+these configuration fields, see the [configuration reference](./reference/configuration.mdx) page.
+
+<Details title="Tracing Service Configuration" min="8.3.18,9.3.15,10.1" scopeOnly={true} scope={["oss", "enterprise"]}>
+```yaml
+tracing_service:
+   enabled: yes
+   exporter_url: grpc://collector.example.com:4317
+   sampling_rate_per_million: 1000000
+```
+</Details>
+
+<Admonition type="warning" title="Sampling Rate">
+It is important to choose the sampling rate wisely. Sampling at a rate of 100% could have a negative impact on the
+performance of your cluster. Teleport honors the parent span's sampling rate, which means that even when the
+`tracing_service` is enabled with a sampling rate of 0, if Teleport receives a request that has a sampled span,
+then Teleport will sample and export all spans that are generated in response to that request.
+</Admonition>
+
+After updating `teleport.yaml`, start Teleport as usual using `teleport start`.
+
+## tsh
+
+To capture traces from `tsh`, add the `--trace` flag to your command. All traces generated by `tsh --trace` will be
+proxied to the `exporter_url` defined for the Auth Service of the cluster the command is being run on.
+
+```code
+$ tsh --trace ssh root@myserver
+$ tsh --trace ls
+```
diff --git a/docs/pages/reference/backends.mdx b/docs/pages/reference/backends.mdx
@@ -71,7 +71,7 @@ Teleport will handle its own SSL on top of that with its own certificates.
 </Admonition>
 
 If your load balancer supports HTTP health checks, configure it to hit the
-`/readyz` [diagnostics endpoint](../reference/metrics.mdx) on machines running Teleport. This endpoint
+`/readyz` [diagnostics endpoint](../diagnostics/monitoring.mdx) on machines running Teleport. This endpoint
 must be enabled by using the `--diag-addr` flag to teleport start: `teleport start --diag-addr=127.0.0.1:3000`
 The [http://127.0.0.1:3000/readyz](http://127.0.0.1:3000/readyz) endpoint will reply `{"status":"ok"}` if the Teleport service
 is running without problems.

diff --git a/docs/pages/reference/config.mdx b/docs/pages/reference/config.mdx
@@ -135,7 +135,7 @@ teleport:
     # Teleport provides HTTP endpoints for monitoring purposes. They are
     # disabled by default but you can enable them using the diagnosis address.
     # See the Teleport metrics reference:
-    # https://goteleport.com/docs/setup/reference/metrics/
+    # https://goteleport.com/docs/setup/diagnostics/metrics
     diag_addr: "127.0.0.1:3000"