docs: clarify /healthz and /readyz (#11085)
- Rename the page, since it's about diagnostics rather than metrics
  alone
- Change major section headings to H2s so they appear in the table of
  contents
- Move information about heartbeats and recovery to an H3 so it's
  more visible

Updates #10799

Co-authored-by: Paul Gottschling <[email protected]>
zmb3 and ptgott committed Mar 18, 2022
1 parent ad55671 commit 3514ab9
Showing 3 changed files with 73 additions and 42 deletions.
24 changes: 3 additions & 21 deletions docs/pages/setup/admin/troubleshooting.mdx
@@ -15,27 +15,9 @@ run with verbose logging enabled by passing it `-d` flag.
It is not recommended to run Teleport in production with verbose logging as it generates a substantial amount of data.
</Admonition>

Sometimes you may want to reset [`teleport`](../reference/cli.mdx#teleport) to a clean
state. This can be accomplished by erasing everything under `"data_dir"`
directory. Assuming the default location, `rm -rf /var/lib/teleport/*` will do.

Teleport also supports HTTP endpoints for monitoring purposes. They are disabled
by default, but you can enable them:

```code
$ sudo teleport start --diag-addr=127.0.0.1:3000
```

Now you can see the monitoring information by visiting several endpoints:

- `http://127.0.0.1:3000/metrics` is the list of internal metrics Teleport is tracking. It is compatible with [Prometheus](https://prometheus.io/)
collectors. For a full list of metrics review our [metrics reference](../reference/metrics.mdx).
- `http://127.0.0.1:3000/healthz` returns "OK" if the process is healthy or
`503` otherwise.
- `http://127.0.0.1:3000/readyz` is similar to `/healthz`, but it returns "OK"
*only after* the node successfully joined the cluster, i.e.it draws the difference between "healthy" and "ready".
- `http://127.0.0.1:3000/debug/pprof/` is Golang's standard profiler. It's only
available when `-d` flag is given in addition to `--diag-addr`
Sometimes you may want to reset [`teleport`](../reference/cli.mdx#teleport) to a
clean state. This can be accomplished by erasing everything under the `data_dir`
directory, which defaults to `/var/lib/teleport/`.

## Debug dump

87 changes: 68 additions & 19 deletions docs/pages/setup/reference/metrics.mdx
@@ -1,30 +1,79 @@
---
title: Teleport Metrics
description: How to set up Prometheus to monitor Teleport for SSH and Kubernetes access
h1: Metrics
title: Teleport Diagnostics
description: How to use Teleport's health, readiness, profiling, and monitoring endpoints.
---

## Teleport Prometheus endpoint

Teleport provides HTTP endpoints for monitoring purposes. They are disabled
by default, but you can enable them using the `--diag-addr` flag to `teleport start`:
Teleport provides HTTP endpoints for monitoring purposes. They are disabled by
default, but you can enable them using the `--diag-addr` flag when running
`teleport start`:

```code
$ sudo teleport start --diag-addr=127.0.0.1:3000
```

Now you can see the monitoring information by visiting several endpoints:

- `http://127.0.0.1:3000/metrics` is the list of internal metrics Teleport is
tracking. It is compatible with [Prometheus](https://prometheus.io/)
collectors.
- `http://127.0.0.1:3000/healthz` returns "OK" if the process is healthy or
`503` otherwise.
- `http://127.0.0.1:3000/readyz` is similar to `/healthz`, but it returns "OK"
*only after* the node successfully joined the cluster, i.e.it draws the
difference between "healthy" and "ready".
- `http://127.0.0.1:3000/debug/pprof/` is Golang's standard profiler. It's only
available when `-d` flag is given in addition to `--diag-addr`
Now you can collect monitoring information from several endpoints.

## `/healthz`

The `http://127.0.0.1:3000/healthz` endpoint responds with a body of
`{"status":"ok"}` and an HTTP 200 OK status code if the process is running.

This is a simple check, suitable for determining if the Teleport process is
still running.

## `/readyz`

The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its
response includes information about the state of the process.

The response body is a JSON object of the form:

```
{ "status": "a status message here"}
```

### `/readyz` and heartbeats

If a Teleport component fails to execute its heartbeat procedure, it will enter
a degraded state. Teleport will begin recovering from this state when a
heartbeat completes successfully.

The first successful heartbeat will transition Teleport into a recovering state.

A second consecutive successful heartbeat will cause Teleport to transition to
the OK state, so long as at least 10 seconds have elapsed since the
first successful heartbeat.

Teleport heartbeats run every 5 seconds. This means that depending on the timing
of heartbeats, it can take 10-20 seconds after connectivity is restored for
`/readyz` to start reporting healthy again.

### Status codes

The status code of the response can be one of:

- HTTP 200 OK: Teleport is operating normally
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and
is running in a degraded state. This happens when a Teleport heartbeat fails.
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or
has begun recovering from a degraded state.

The same state information is also available via the `process_state` metric
under the `/metrics` endpoint.
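
As an illustration of the status codes above, here is a minimal Go sketch of a probe that queries `/readyz` and maps the response code to a state. It assumes the diagnostics address `127.0.0.1:3000` used throughout this page. The `describeReadyz` helper and its labels are hypothetical names chosen for this example and are not part of Teleport.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// describeReadyz is a hypothetical helper that maps a /readyz status code
// to the state described in the list above.
func describeReadyz(code int) string {
	switch code {
	case http.StatusOK:
		return "ok"
	case http.StatusServiceUnavailable:
		return "degraded"
	case http.StatusBadRequest:
		return "starting or recovering"
	default:
		return "unknown"
	}
}

func main() {
	// Assumes Teleport was started with --diag-addr=127.0.0.1:3000.
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://127.0.0.1:3000/readyz")
	if err != nil {
		fmt.Fprintln(os.Stderr, "probe failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Printf("%d: %s\n", resp.StatusCode, describeReadyz(resp.StatusCode))
}
```

A monitoring script built along these lines would likely treat a 400 as "not ready yet" rather than as a failure, since it also covers the recovery window after a degraded state.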

## `/debug/pprof`

The `http://127.0.0.1:3000/debug/pprof/` endpoint is Go's standard pprof
profiler. This endpoint is only available if the `--debug` (or `-d`) flag is
supplied (in addition to `--diag-addr`).

## `/metrics`

The `http://127.0.0.1:3000/metrics` endpoint serves the internal metrics
Teleport is tracking. It is compatible with
[Prometheus](https://prometheus.io/) collectors.

The following metrics are available:

| Name | Type | Component | Description |
| - | - | - | - |
4 changes: 2 additions & 2 deletions lib/service/state.go
@@ -88,7 +88,7 @@ func (f *processState) update(event Event) {

component, ok := event.Payload.(string)
if !ok {
f.process.log.Errorf("TeleportDegradedEvent broadcasted without component name, this is a bug!")
f.process.log.Errorf("%v broadcasted without component name, this is a bug!", event.Name)
return
}
s, ok := f.states[component]
@@ -118,7 +118,7 @@ func (f *processState) update(event Event) {
s.recoveryTime = f.process.Clock.Now()
f.process.log.Infof("Teleport component %q is recovering from a degraded state.", component)
case stateRecovering:
if f.process.Clock.Now().Sub(s.recoveryTime) > defaults.HeartbeatCheckPeriod*2 {
if f.process.Clock.Since(s.recoveryTime) > defaults.HeartbeatCheckPeriod*2 {
s.state = stateOK
f.process.log.Infof("Teleport component %q has recovered from a degraded state.", component)
}