Enhance diagnostics docs
The diagnostics docs were all on one page (`setup/reference/metrics.mdx`),
which made things a bit hard to find and wasn't conducive to adding
new content.

Diagnostics information is now located at `setup/diagnostics` with each
section from the original `metrics.mdx` doc now having its own dedicated
page. The section will now show up in the navbar at
"Setup" > "Monitoring Your Cluster".

The content of the profiling page is expanded to include information from
https://github.com/gravitational/teleport/blob/740d184d1cfc69ae2e96c50ee738b13884fb232b/assets/monitoring/README.md#low-level-monitoring
to illustrate what the different profile types are, the information that
they capture, and how to retrieve them.

A new Distributed Tracing page is also added to instruct users on how to set up
the `tracing_service` to collect and export spans (the last open item
for #12241).
rosstimothy authored and github-actions committed Oct 14, 2022
1 parent 4410a2a commit 30ea89a
Showing 14 changed files with 466 additions and 236 deletions.
28 changes: 27 additions & 1 deletion docs/config.json
Original file line number Diff line number Diff line change
@@ -484,6 +484,32 @@
"slug": "/management/guides/ssh-key-extensions/"
}
]
},
{
"title": "Diagnostics",
"slug": "/management/diagnostics/",
"entries": [
{
"title": "Health Monitoring",
"slug": "/management/diagnostics/monitoring/",
"forScopes": ["oss", "enterprise", "cloud"]
},
{
"title": "Metrics",
"slug": "/management/diagnostics/metrics/",
"forScopes": ["oss", "enterprise", "cloud"]
},
{
"title": "Collecting Profiles",
"slug": "/management/diagnostics/profiles/",
"forScopes": ["oss", "enterprise", "cloud"]
},
{
"title": "Distributed Tracing",
"slug": "/management/diagnostics/tracing/",
"forScopes": ["oss", "enterprise", "cloud"]
}
]
}
]
},
@@ -1188,7 +1214,7 @@
},
{
"source": "/metrics-logs-reference/",
"destination": "/reference/metrics/",
"destination": "/management/diagnostics/metrics/",
"permanent": true
},
{
28 changes: 28 additions & 0 deletions docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx
@@ -0,0 +1,28 @@
Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them via:

<Tabs>
<TabItem label="Command line">
```code
$ sudo teleport start --diag-addr=127.0.0.1:3000
```
</TabItem>
<TabItem label="Config file">
```yaml
teleport:
  diag_addr: 127.0.0.1:3000
```
</TabItem>
</Tabs>
<Details
title="Ensure you can connect to the diagnostic endpoint"
opened={false}
>
Verify that Teleport is now serving the diagnostics endpoint:
```code
$ curl http://127.0.0.1:3000
```
</Details>
25 changes: 25 additions & 0 deletions docs/pages/includes/diagnostics/tracing-config.yaml
@@ -0,0 +1,25 @@
tracing_service:
  # Turns tracing on. Default is 'no'
  enabled: yes
  # The OTLP exporter to send traces to. Possible values:
  #   "grpc://collector.example.com"    : Traces will be exported via gRPC to the provided URL.
  #   "http(s)://collector.example.com" : Traces will be exported via HTTP to the provided URL.
  #   "file:///var/lib/teleport/traces" : Traces will be saved to files within the provided directory. Each file
  #                                       will contain one proto encoded span per line. Files are rotated after
  #                                       reaching 100MB. To override the rotation limit add
  #                                       ?limit=<desired_file_size_in_bytes> to the
  #                                       url (e.g. file:///var/lib/teleport/traces?limit=100)
  exporter_url: grpc://collector.example.com:4317
  # The number of samples to collect per million spans.
  # 1000000 will sample **all** spans generated by Teleport
  # 500000 will sample 50% of spans generated by Teleport
  # 10000 will sample 1% of spans generated by Teleport
  # 0 will not sample any spans generated by Teleport but will respect any parent span's sampling.
  sampling_rate_per_million: 1000000
  # Optional CA certificates are used to validate the exporter.
  ca_certs:
    - /var/lib/teleport/exporter_ca.pem
  # Optional TLS certificates are used to enable mTLS for the exporter.
  https_keypairs:
    - key_file: /var/lib/teleport/exporter_key.pem
      cert_file: /var/lib/teleport/exporter_cert.pem
149 changes: 149 additions & 0 deletions docs/pages/includes/metrics.mdx

Large diffs are not rendered by default.

20 changes: 10 additions & 10 deletions docs/pages/management/admin/troubleshooting.mdx
@@ -156,7 +156,7 @@ For more information about custom features, or to try our [Enterprise edition](.

This guide showed how to investigate issues with the `teleport` process. To see
how you can monitor more general health and performance data from your Teleport
cluster, read our [Teleport Diagnostics](../../reference/metrics.mdx) guide.
cluster, read our [Teleport Diagnostics](../diagnostics/monitoring.mdx) guides.

For additional sources of Teleport support, please see the
[Teleport Support and Education Center](https://goteleport.com/support/).
@@ -171,19 +171,19 @@ purposes and seeing it within your logs is not necessarily an indication that
anything is incorrect.

Firstly, Teleport uses this value within certificates (as a DNS Subject
Alternative Name) issued to the Auth and Proxy Service. Teleport clients can
then use this value to validate the service's certificates during the TLS
handshake regardless of the service address as long as the client already has a
copy of the cluster's certificate authorities. This is important as there are
often multiple different ways that a client can connect to the Auth Service and
these are not always via the same address.

Secondly, this value is used by clients as part of the URL when making gRPC or
HTTP requests to the Teleport API. This is because the Teleport API client uses
special logic to open the connection to the Auth Service to make the request,
rather than connecting to a single address as a typical client may do. This
special logic is necessary for the client to be able to support connecting to a
list of Auth Services or to be able to connect to the Auth Service through a
tunnel via the Proxy Service. This means that `teleport.cluster.local` appears
in log messages that show the URL of a request made to the Auth Service, and
does not explicitly indicate that something is misconfigured.
10 changes: 10 additions & 0 deletions docs/pages/management/diagnostics.mdx
@@ -0,0 +1,10 @@
---
title: Monitoring your Cluster
description: Monitoring your Teleport deployment
layout: tocless-doc
---

- [Health Monitoring](./diagnostics/monitoring.mdx): How to monitor the health of a Teleport instance.
- [Metrics](./diagnostics/metrics.mdx): How to enable exporting Prometheus metrics.
- [Collecting Profiles](./diagnostics/profiles.mdx): How to collect runtime profiling data from a Teleport instance.
- [Distributed Tracing](./diagnostics/tracing.mdx): How to enable Distributed Tracing for a Teleport instance.
21 changes: 21 additions & 0 deletions docs/pages/management/diagnostics/metrics.mdx
@@ -0,0 +1,21 @@
---
title: Metrics
description: How to enable and consume metrics
---

## Prerequisites

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)

This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the
metrics that Teleport tracks. It is compatible with [Prometheus](https://prometheus.io/) collectors.
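
As a rough illustration of reading this endpoint without a full Prometheus server, the sketch below parses the Prometheus text exposition format into a dictionary. The helper names `parse_metrics` and `fetch_metrics` and the `SAMPLE` payload are assumptions made for this example. Only the `process_state` metric name comes from these docs; the help text, the second metric, and all values are placeholders rather than real Teleport output.

```python
import urllib.request

# Made-up payload in the Prometheus text format, for illustration only.
SAMPLE = """\
# HELP process_state Placeholder help text for this example
# TYPE process_state gauge
process_state 0
example_requests_total{code="200"} 12
"""


def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: value}.

    Deliberately simplified: comment lines (HELP/TYPE) are skipped, any
    label set stays part of the series key, and optional trailing
    timestamps are not supported.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, _, value = line.rpartition(" ")
        if not series:
            continue  # malformed line without a value
        samples[series.strip()] = float(value)
    return samples


def fetch_metrics(url="http://127.0.0.1:3000/metrics", timeout=5):
    """Download and parse the metrics page from the diagnostic address."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `parse_metrics(SAMPLE)["process_state"]` returns the sample value as a float. For anything beyond a quick check, a Prometheus client library's parser is likely the better choice.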

The following metrics are available:

<Notice scope={["cloud"]} type="tip">

Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.

</Notice>

(!docs/pages/includes/metrics.mdx!)
58 changes: 58 additions & 0 deletions docs/pages/management/diagnostics/monitoring.mdx
@@ -0,0 +1,58 @@
---
title: Health Monitoring
description: Monitoring health and readiness.
---

## Prerequisites

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)

Now you can collect monitoring information from several endpoints.

## `/healthz`

The `http://127.0.0.1:3000/healthz` endpoint responds with a body of
`{"status":"ok"}` and an HTTP 200 OK status code if the process is running.

This is a simple check, suitable for determining if the Teleport process is
still running.

## `/readyz`

The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its
response includes information about the state of the process.

The response body is a JSON object of the form:

```
{ "status": "a status message here"}
```

### `/readyz` and heartbeats

If a Teleport component fails to execute its heartbeat procedure, it will enter
a degraded state. Teleport will begin recovering from this state when a
heartbeat completes successfully.

The first successful heartbeat will transition Teleport into a recovering state.

A second consecutive successful heartbeat will cause Teleport to transition to
the OK state.

Teleport heartbeats run approximately every 60 seconds when healthy, and failed
heartbeats are retried approximately every 5 seconds. This means that depending
on the timing of heartbeats, it can take 60-70 seconds after connectivity is
restored for `/readyz` to start reporting healthy again.
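
The transitions described above can be modeled as a small state machine. This is only a sketch of one reading of the text, not Teleport's implementation: the state names and the `next_state` function are hypothetical, and it assumes that a failed heartbeat in any state (including recovering) returns the process to the degraded state.

```python
OK = "ok"
DEGRADED = "degraded"
RECOVERING = "recovering"


def next_state(state, heartbeat_ok):
    """Return the readiness state after one heartbeat result."""
    if not heartbeat_ok:
        # Assumption: any failed heartbeat degrades the process.
        return DEGRADED
    if state == DEGRADED:
        # First successful heartbeat after a failure.
        return RECOVERING
    # A second consecutive success (or a success while already OK).
    return OK


def replay(heartbeats, state=OK):
    """Apply a sequence of heartbeat results and return every state visited."""
    history = []
    for result in heartbeats:
        state = next_state(state, result)
        history.append(state)
    return history
```

For example, `replay([False, True, True])` walks through degraded, recovering, and ok in turn, which matches the two-heartbeat recovery described above.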

### Status codes

The status code of the response can be one of:

- HTTP 200 OK: Teleport is operating normally
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and
is running in a degraded state. This happens when a Teleport heartbeat fails.
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or
has begun recovering from a degraded state.

The same state information is also available via the `process_state` metric
under the `/metrics` endpoint.
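
A health-check script could translate these status codes as follows. This is a minimal sketch that assumes the diagnostic address used throughout this page; `interpret_readyz` and `check_readyz` are hypothetical helper names, and the short state labels are this example's own wording rather than values returned by Teleport.

```python
import json
import urllib.error
import urllib.request

# Meaning of each /readyz status code, as listed above.
STATES = {
    200: "ok",
    503: "degraded",
    400: "starting or recovering",
}


def interpret_readyz(status_code):
    """Map a /readyz HTTP status code to a short state label."""
    return STATES.get(status_code, "unknown")


def check_readyz(url="http://127.0.0.1:3000/readyz", timeout=5):
    """Query /readyz and return (state label, status message from the body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the body is still readable.
        code, body = err.code, err.read()
    return interpret_readyz(code), json.loads(body).get("status")
```

A connection failure (for example, when the diagnostic endpoint is not enabled) surfaces as `urllib.error.URLError`, which this sketch leaves to the caller to handle.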
74 changes: 74 additions & 0 deletions docs/pages/management/diagnostics/profiles.mdx
@@ -0,0 +1,74 @@
---
title: Profiling
description: Collecting pprof profiles.
---

Teleport leverages Go's diagnostic capabilities to collect and export
profiling data. Profiles can help identify the cause of spikes in CPU,
source of memory leaks, or the reason for a deadlock.

## Prerequisites

The profiling endpoint is only enabled if the `--debug` flag is supplied.

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)

## Collecting Profiles

Go's standard [pprof](https://golang.org/pkg/net/http/pprof/) endpoints are
served at `http://127.0.0.1:3000/debug/pprof/`. Retrieving a profile requires
sending a request to the endpoint corresponding to the desired profile type.
When debugging an issue it is helpful to collect a series of profiles over
a period of time.
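
One way to collect such a series is a small script that requests the same pprof endpoint repeatedly. The sketch below assumes the diagnostic address used on this page. The helper names (`profile_url`, `output_name`, `collect_series`) and the output file naming scheme are made up for this example.

```python
import time
import urllib.request

BASE_URL = "http://127.0.0.1:3000/debug/pprof/"
# Profile types covered on this page.
KINDS = ("profile", "goroutine", "heap", "trace")


def profile_url(kind, base_url=BASE_URL):
    """Build the endpoint URL for one profile type."""
    if kind not in KINDS:
        raise ValueError(f"unsupported profile type: {kind}")
    return base_url.rstrip("/") + "/" + kind


def output_name(kind, index):
    """File name for the index-th capture, e.g. heap-002.out."""
    return f"{kind}-{index:03d}.out"


def collect_series(kind, count, interval_seconds):
    """Download `count` profiles, pausing between captures."""
    paths = []
    for i in range(count):
        path = output_name(kind, i)
        urllib.request.urlretrieve(profile_url(kind), path)
        paths.append(path)
        if i < count - 1:
            time.sleep(interval_seconds)
    return paths
```

For instance, `collect_series("heap", 5, 60)` would save five heap profiles roughly a minute apart. Keep in mind that each CPU profile request takes about 30 seconds to return, so the effective spacing for that profile type is longer than the interval.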

### CPU
A CPU profile shows execution statistics gathered over a 30-second period:

``` code
# Download the profile into a file:
$ curl -o cpu.profile http://127.0.0.1:3000/debug/pprof/profile
# Visualize the profile
$ go tool pprof -http : cpu.profile
```

### Goroutine

A goroutine profile shows the stack traces for all running goroutines in the system:

``` code
# Download the profile into a file:
$ curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine
# Visualize the profile
$ go tool pprof -http : goroutine.profile
```

### Heap

A heap profile shows allocated objects in the system:

```code
# Download the profile into a file:
$ curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap
# Visualize the profile
$ go tool pprof -http : heap.profile
```

### Trace

A trace captures scheduling, syscall, garbage collection, heap size, and other events that are collected by the Go runtime:

```code
# Download the profile into a file:
$ curl -o trace.out http://127.0.0.1:3000/debug/pprof/trace
# Visualize the profile
$ go tool trace trace.out
```

## Further Reading

- More information about diagnostics in the Go ecosystem: https://go.dev/doc/diagnostics
- A deep dive on profiling Go programs: https://go.dev/blog/pprof
16 changes: 16 additions & 0 deletions docs/pages/management/diagnostics/reference/configuration.mdx
@@ -0,0 +1,16 @@
---
title: Distributed Tracing Configuration Reference
description: Configuration reference for Distributed Tracing.
---

# Tracing Service configuration

`teleport.yaml` fields related to Distributed Tracing:

```yaml
# Main service responsible for Distributed Tracing.
#
# You must enable the Tracing Service once per teleport.yaml for all
# agents that you wish to capture traces from.
(!docs/pages/includes/diagnostics/tracing-config.yaml!)
```
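
To make the `exporter_url` forms concrete, the sketch below classifies a URL by scheme and extracts the optional `limit` query parameter for file exporters. It is an illustration of the URL shapes listed in the configuration comments, not how Teleport parses the field. `describe_exporter` and the keys of the returned dictionary are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def describe_exporter(url):
    """Summarize an exporter_url value by transport."""
    parts = urlsplit(url)
    if parts.scheme == "grpc":
        return {"transport": "grpc", "target": parts.netloc}
    if parts.scheme in ("http", "https"):
        return {"transport": "http", "target": parts.netloc}
    if parts.scheme == "file":
        limit = parse_qs(parts.query).get("limit")
        return {
            "transport": "file",
            "directory": parts.path,
            # None means no override was given in the URL.
            "limit_bytes": int(limit[0]) if limit else None,
        }
    raise ValueError(f"unsupported exporter scheme: {parts.scheme!r}")
```

With the example values from the configuration, `describe_exporter("file:///var/lib/teleport/traces?limit=100")` reports the directory `/var/lib/teleport/traces` and a rotation limit of 100 bytes.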
43 changes: 43 additions & 0 deletions docs/pages/management/diagnostics/tracing.mdx
@@ -0,0 +1,43 @@
---
title: Distributed Tracing
description: How to enable tracing within Teleport.
---

Teleport leverages [OpenTelemetry](https://opentelemetry.io/) to generate traces
and export them to any [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/otlp/)
capable exporter. If your telemetry backend doesn't support receiving OTLP traces, you may be able to
use the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) to proxy traces from OTLP
to a format that your telemetry backend accepts.

## Configure Teleport

To enable tracing in Teleport, add the following section to `teleport.yaml`. For a detailed description of
these configuration fields, see the [configuration reference](./reference/configuration.mdx) page.

<Details title="Tracing Service Configuration" min="8.3.18,9.3.15,10.1" scopeOnly={true} scope={["oss", "enterprise"]}>
```yaml
tracing_service:
  enabled: yes
  exporter_url: grpc://collector.example.com:4317
  sampling_rate_per_million: 1000000
```
</Details>
<Admonition type="warning" title="Sampling Rate">
It is important to choose the sampling rate wisely. Sampling at a rate of 100% could have a negative impact on the
performance of your cluster. Teleport honors the parent span's sampling decision, which means that even when the
`tracing_service` is enabled with a sampling rate of 0, if Teleport receives a request that has a span which is
sampled, then Teleport will sample and export all spans that are generated in response to that request.
</Admonition>
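
The relationship between `sampling_rate_per_million` and the parent span's decision can be sketched as below. This is a simplified model, not Teleport's sampler: the function names are hypothetical, and treating an unsampled parent as "do not sample" is an assumption that goes beyond what this page states (it only confirms that a sampled parent is honored).

```python
import random


def sampling_fraction(rate_per_million):
    """Convert sampling_rate_per_million into a fraction between 0 and 1."""
    if not 0 <= rate_per_million <= 1_000_000:
        raise ValueError("rate must be between 0 and 1000000")
    return rate_per_million / 1_000_000


def should_sample(rate_per_million, parent_sampled=None, draw=None):
    """Decide whether to sample a span.

    parent_sampled: None when there is no parent span, otherwise the
        parent's sampling decision, which takes precedence over the rate.
    draw: a number in [0, 1); a random one is used when omitted.
    """
    if parent_sampled is not None:
        # Assumption: an unsampled parent also suppresses sampling.
        return parent_sampled
    if draw is None:
        draw = random.random()
    return draw < sampling_fraction(rate_per_million)
```

Under this model, `should_sample(0, parent_sampled=True)` is true even though the configured rate is 0, which is the behavior the warning above describes.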

After updating `teleport.yaml`, start Teleport as usual using `teleport start`.

## tsh

To capture traces from `tsh`, add the `--trace` flag to your command. All traces generated by `tsh --trace` will be
proxied to the `exporter_url` defined for the Auth Service of the cluster the command is being run on.

```code
$ tsh --trace ssh root@myserver
$ tsh --trace ls
```
2 changes: 1 addition & 1 deletion docs/pages/reference/backends.mdx
@@ -71,7 +71,7 @@ Teleport will handle its own SSL on top of that with its own certificates.
</Admonition>

If your load balancer supports HTTP health checks, configure it to hit the
`/readyz` [diagnostics endpoint](../reference/metrics.mdx) on machines running Teleport. This endpoint
`/readyz` [diagnostics endpoint](../management/diagnostics/monitoring.mdx) on machines running Teleport. This endpoint
must be enabled by using the `--diag-addr` flag to `teleport start`: `teleport start --diag-addr=127.0.0.1:3000`.
The [http://127.0.0.1:3000/readyz](http://127.0.0.1:3000/readyz) endpoint will reply `{"status":"ok"}` if the Teleport service
is running without problems.
2 changes: 1 addition & 1 deletion docs/pages/reference/config.mdx
@@ -135,7 +135,7 @@ teleport:
# Teleport provides HTTP endpoints for monitoring purposes. They are
# disabled by default but you can enable them using the diagnosis address.
# See the Teleport metrics reference:
# https://goteleport.com/docs/setup/reference/metrics/
# https://goteleport.com/docs/setup/diagnostics/metrics
diag_addr: "127.0.0.1:3000"

