The diagnostics docs all lived on one page (`setup/reference/metrics.mdx`), which made things hard to find and wasn't conducive to adding new content. Diagnostics information is now located at `setup/diagnostics`, with each section of the original `metrics.mdx` doc on its own dedicated page. The section now shows up in the navbar at "Setup" > "Monitoring Your Cluster". The profiling page is expanded to include information from https://github.com/gravitational/teleport/blob/740d184d1cfc69ae2e96c50ee738b13884fb232b/assets/monitoring/README.md#low-level-monitoring to illustrate what the different profile types are, what information they capture, and how to retrieve them. A new Distributed Tracing page is also added to instruct users on how to set up the `tracing_service` to collect and export spans (the last open item for #12241).
1 parent 4410a2a, commit 30ea89a
Showing 14 changed files with 466 additions and 236 deletions.
28 changes: 28 additions & 0 deletions
docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them via: | ||
|
||
<Tabs> | ||
<TabItem label="Command line"> | ||
```code | ||
$ sudo teleport start --diag-addr=127.0.0.1:3000 | ||
``` | ||
</TabItem> | ||
<TabItem label="Config file"> | ||
```yaml | ||
teleport: | ||
  diag_addr: 127.0.0.1:3000 | ||
``` | ||
</TabItem> | ||
</Tabs> | ||
<Details | ||
title="Ensure you can connect to the diagnostic endpoint" | ||
opened={false} | ||
> | ||
Verify that Teleport is now serving the diagnostics endpoint: | ||
```code | ||
$ curl http://127.0.0.1:3000 | ||
``` | ||
</Details> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
tracing_service: | ||
  # Turns tracing on. Default is 'no' | ||
  enabled: yes | ||
  # The OTLP exporter to send traces to. Possible values: | ||
  # "grpc://collector.example.com" : Traces will be exported via gRPC to the provided URL. | ||
  # "http(s)://collector.example.com" : Traces will be exported via HTTP to the provided URL. | ||
  # "file:///var/lib/teleport/traces" : Traces will be saved to files within the provided directory. Each file | ||
  # will contain one proto encoded span per line. Files are rotated after | ||
  # reaching 100MB. To override the rotation limit add | ||
  # ?limit=<desired_file_size_in_bytes> to the | ||
  # url (e.g. file:///var/lib/teleport/traces?limit=100) | ||
  exporter_url: grpc://collector.example.com:4317 | ||
  # The number of samples to collect per million spans. | ||
  # 1000000 will sample **all** spans generated by Teleport | ||
  # 500000 will sample 50% of spans generated by Teleport | ||
  # 10000 will sample 1% of spans generated by Teleport | ||
  # 0 will not sample any spans generated by Teleport but will respect any parent span's sampling. | ||
  sampling_rate_per_million: 1000000 | ||
  # Optional CA certificates used to validate the exporter. | ||
  ca_certs: | ||
    - /var/lib/teleport/exporter_ca.pem | ||
  # Optional TLS certificates used to enable mTLS for the exporter. | ||
  https_keypairs: | ||
    - key_file: /var/lib/teleport/exporter_key.pem | ||
      cert_file: /var/lib/teleport/exporter_cert.pem |
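The `sampling_rate_per_million` comments above imply a simple conversion: 1% of spans corresponds to 10000 samples per million. As a minimal sketch of that arithmetic, the helper below turns a whole-number percentage into the value to put in the config. `percent_to_per_million` is a hypothetical name used only for illustration and is not part of Teleport.

```shell
# Convert a whole-number percentage (0-100) into a value suitable for
# sampling_rate_per_million. Hypothetical helper, not a Teleport command.
percent_to_per_million() {
  percent="$1"
  case "$percent" in
    # Reject empty input, non-digits, and leading zeros (which the shell
    # would otherwise read as octal inside arithmetic expansion).
    ''|*[!0-9]*|0?*)
      echo "percent must be a whole number without leading zeros" >&2
      return 1
      ;;
  esac
  if [ "$percent" -gt 100 ]; then
    echo "percent must be between 0 and 100" >&2
    return 1
  fi
  # 1% of one million spans is 10000 spans.
  echo $((percent * 10000))
}

percent_to_per_million 50   # prints 500000
```

Only whole percentages are handled here because POSIX shell arithmetic is integer-only; a fractional rate would need to be computed by other means.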
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
--- | ||
title: Monitoring your Cluster | ||
description: Monitoring your Teleport deployment | ||
layout: tocless-doc | ||
--- | ||
|
||
- [Health Monitoring](./diagnostics/monitoring.mdx): How to monitor the health of a Teleport instance. | ||
- [Metrics](./diagnostics/metrics.mdx): How to enable exporting Prometheus metrics. | ||
- [Collecting Profiles](./diagnostics/profiles.mdx): How to collect runtime profiling data from a Teleport instance. | ||
- [Distributed Tracing](./diagnostics/tracing.mdx): How to enable Distributed Tracing for a Teleport instance. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
--- | ||
title: Metrics | ||
description: How to enable and consume metrics | ||
--- | ||
|
||
## Prerequisites | ||
|
||
(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!) | ||
|
||
This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the | ||
metrics that Teleport tracks. It is compatible with [Prometheus](https://prometheus.io/) collectors. | ||
|
||
The following metrics are available: | ||
|
||
<Notice scope={["cloud"]} type="tip"> | ||
|
||
Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service. | ||
|
||
</Notice> | ||
|
||
(!docs/pages/includes/metrics.mdx!) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
--- | ||
title: Health Monitoring | ||
description: Monitoring health and readiness. | ||
--- | ||
|
||
## Prerequisites | ||
|
||
(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!) | ||
|
||
Now you can collect monitoring information from several endpoints. | ||
|
||
## `/healthz` | ||
|
||
The `http://127.0.0.1:3000/healthz` endpoint responds with a body of | ||
`{"status":"ok"}` and an HTTP 200 OK status code if the process is running. | ||
|
||
This is a simple check, suitable for determining if the Teleport process is | ||
still running. | ||
|
||
## `/readyz` | ||
|
||
The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its | ||
response includes information about the state of the process. | ||
|
||
The response body is a JSON object of the form: | ||
|
||
``` | ||
{ "status": "a status message here"} | ||
``` | ||
|
||
### `/readyz` and heartbeats | ||
|
||
If a Teleport component fails to execute its heartbeat procedure, it will enter | ||
a degraded state. Teleport will begin recovering from this state when a | ||
heartbeat completes successfully. | ||
|
||
The first successful heartbeat will transition Teleport into a recovering state. | ||
|
||
A second consecutive successful heartbeat will cause Teleport to transition to | ||
the OK state. | ||
|
||
Teleport heartbeats run approximately every 60 seconds when healthy, and failed | ||
heartbeats are retried approximately every 5 seconds. This means that depending | ||
on the timing of heartbeats, it can take 60-70 seconds after connectivity is | ||
restored for `/readyz` to start reporting healthy again. | ||
|
||
### Status codes | ||
|
||
The status code of the response can be one of: | ||
|
||
- HTTP 200 OK: Teleport is operating normally | ||
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and | ||
is running in a degraded state. This happens when a Teleport heartbeat fails. | ||
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or | ||
has begun recovering from a degraded state. | ||
|
||
The same state information is also available via the `process_state` metric | ||
under the `/metrics` endpoint. |
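The status codes listed above can be turned into a small readiness check. The following is a minimal sketch, assuming the diagnostic address `127.0.0.1:3000` used on this page; `readyz_state`, `check_readyz`, and the `DIAG_ADDR` variable are hypothetical names chosen for illustration, and the state labels are this sketch's own, not strings returned by Teleport.

```shell
# Map an HTTP status code from /readyz to the state described above.
readyz_state() {
  case "$1" in
    200) echo "ok" ;;
    503) echo "degraded" ;;
    400) echo "starting-or-recovering" ;;
    *)   echo "unknown" ;;
  esac
}

# Query /readyz once and print the corresponding state.
# curl discards the body and writes only the HTTP status code.
check_readyz() {
  code="$(curl -s -o /dev/null -w '%{http_code}' \
    "http://${DIAG_ADDR:-127.0.0.1:3000}/readyz")"
  readyz_state "$code"
}
```

Running `check_readyz` in a loop during a restart or network interruption should show the transition from degraded through recovering to OK that the heartbeat section describes.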
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
--- | ||
title: Profiling | ||
description: Collecting pprof profiles. | ||
--- | ||
|
||
Teleport leverages Go's diagnostic capabilities to collect and export | ||
profiling data. Profiles can help identify the cause of CPU spikes, | ||
the source of memory leaks, or the reason for a deadlock. | ||
|
||
## Prerequisites | ||
|
||
The profiling endpoint is only enabled if the `--debug` flag is supplied. | ||
|
||
(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!) | ||
|
||
## Collecting Profiles | ||
|
||
Go's standard [pprof](https://golang.org/pkg/net/http/pprof/) endpoints are | ||
served at `http://127.0.0.1:3000/debug/pprof/`. Retrieving a profile requires | ||
sending a request to the endpoint corresponding to the desired profile type. | ||
When debugging an issue it is helpful to collect a series of profiles over | ||
a period of time. | ||
|
||
### CPU | ||
A CPU profile shows execution statistics gathered over a 30-second period: | ||
|
||
```code | ||
# Download the profile into a file: | ||
$ curl -o cpu.profile http://127.0.0.1:3000/debug/pprof/profile | ||
# Visualize the profile | ||
$ go tool pprof -http : cpu.profile | ||
``` | ||
|
||
### Goroutine | ||
|
||
A goroutine profile shows the stack traces for all running goroutines in the system: | ||
|
||
```code | ||
# Download the profile into a file: | ||
$ curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine | ||
# Visualize the profile | ||
$ go tool pprof -http : goroutine.profile | ||
``` | ||
|
||
### Heap | ||
|
||
A heap profile shows allocated objects in the system: | ||
|
||
```code | ||
# Download the profile into a file: | ||
$ curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap | ||
# Visualize the profile | ||
$ go tool pprof -http : heap.profile | ||
``` | ||
|
||
### Trace | ||
|
||
A trace captures scheduling, system calls, garbage collection, heap size, and other events that are collected by the Go runtime: | ||
|
||
```code | ||
# Download the trace into a file: | ||
$ curl -o trace.out http://127.0.0.1:3000/debug/pprof/trace | ||
# Visualize the trace | ||
$ go tool trace trace.out | ||
``` | ||
|
||
## Further Reading | ||
|
||
- More information about diagnostics in the Go ecosystem: https://go.dev/doc/diagnostics | ||
- A deep dive on profiling Go programs: https://go.dev/blog/pprof |
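Since collecting a series of profiles over time is recommended when debugging, the individual `curl` commands above can be wrapped in a loop. This is a minimal sketch, assuming the diagnostic address `127.0.0.1:3000` used on this page; `profile_url`, `collect_series`, and the `DIAG_ADDR` variable are hypothetical names for illustration only.

```shell
# Address of the diagnostic endpoint; override via the environment if needed.
DIAG_ADDR="${DIAG_ADDR:-127.0.0.1:3000}"

# Build the pprof URL for a profile type such as heap or goroutine.
profile_url() {
  printf 'http://%s/debug/pprof/%s\n' "$DIAG_ADDR" "$1"
}

# collect_series KIND COUNT INTERVAL
# Download COUNT profiles of type KIND, waiting INTERVAL seconds between
# requests, into numbered files such as heap-1.profile, heap-2.profile, ...
collect_series() {
  kind="$1"; count="$2"; interval="$3"
  i=1
  while [ "$i" -le "$count" ]; do
    curl -sS -o "${kind}-${i}.profile" "$(profile_url "$kind")" || return 1
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For example, `collect_series heap 5 60` would save five heap profiles one minute apart, which can then be compared to see whether allocations keep growing.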
16 changes: 16 additions & 0 deletions
docs/pages/management/diagnostics/reference/configuration.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
--- | ||
title: Distributed Tracing Configuration Reference | ||
description: Configuration reference for Distributed Tracing. | ||
--- | ||
|
||
# Tracing Service configuration | ||
|
||
`teleport.yaml` fields related to Distributed Tracing: | ||
|
||
```yaml | ||
# Main service responsible for Distributed Tracing. | ||
# | ||
# You must enable the Tracing Service once per teleport.yaml for all | ||
# agents that you wish to capture traces from. | ||
(!docs/pages/includes/diagnostics/tracing-config.yaml!) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
--- | ||
title: Distributed Tracing | ||
description: How to enable tracing within Teleport. | ||
--- | ||
|
||
Teleport leverages [OpenTelemetry](https://opentelemetry.io/) to generate traces | ||
and export them to any [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/otlp/) | ||
capable exporter. If your telemetry backend doesn't support receiving OTLP traces, you may be able to | ||
leverage the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) to proxy traces from OTLP | ||
to a format that your telemetry backend accepts. | ||
|
||
## Configure Teleport | ||
|
||
To enable tracing in Teleport, add the following section to `teleport.yaml`. For a detailed description of | ||
these configuration fields, see the [configuration reference](./reference/configuration.mdx) page. | ||
|
||
<Details title="Tracing Service Configuration" min="8.3.18,9.3.15,10.1" scopeOnly={true} scope={["oss", "enterprise"]}> | ||
```yaml | ||
tracing_service: | ||
  enabled: yes | ||
  exporter_url: grpc://collector.example.com:4317 | ||
  sampling_rate_per_million: 1000000 | ||
``` | ||
</Details> | ||
<Admonition type="warning" title="Sampling Rate"> | ||
It is important to choose the sampling rate wisely. Sampling at a rate of 100% could have a negative impact on the | ||
performance of your cluster. Teleport honors the parent span's sampling, which means that even when the | ||
`tracing_service` is enabled with a sampling rate of 0, if Teleport receives a request that has a span which is | ||
sampled, then Teleport will sample and export all spans that are generated in response to that request. | ||
</Admonition> | ||
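The interaction between the configured rate and a parent span can be sketched as a small decision function. This is an illustration of the behavior described in the warning above, not Teleport's actual implementation. `should_sample` is a hypothetical name, the random draw is passed in as an argument so the logic is deterministic, and the handling of an unsampled parent is an assumption based on the usual parent-based sampling convention rather than something stated here.

```shell
# should_sample PARENT RATE DRAW
#   PARENT: "sampled", "unsampled", or "none" (request carries no parent span)
#   RATE:   the configured sampling_rate_per_million (0-1000000)
#   DRAW:   a number in 0..999999 standing in for a random draw
# Returns 0 (true) when the span would be sampled.
should_sample() {
  parent="$1"; rate="$2"; draw="$3"
  case "$parent" in
    # A sampled parent wins, even when RATE is 0.
    sampled)   return 0 ;;
    # Assumption: an unsampled parent is honored in the same way.
    unsampled) return 1 ;;
  esac
  # No parent: sample RATE out of every million draws.
  [ "$draw" -lt "$rate" ]
}
```

With a rate of 0, `should_sample none 0 0` is false while `should_sample sampled 0 0` is true, which is why an upstream caller that samples aggressively can still generate export traffic from a cluster configured with a low rate.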
|
||
After updating `teleport.yaml`, start Teleport as usual using `teleport start`. | ||
|
||
## tsh | ||
|
||
To capture traces from `tsh`, add the `--trace` flag to your command. All traces generated by `tsh --trace` are | ||
proxied to the `exporter_url` defined for the Auth Service of the cluster the command is run against. | ||
|
||
```code | ||
$ tsh --trace ssh root@myserver | ||
$ tsh --trace ls | ||
``` |