docs: clarify /healthz and /readyz #11085
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,30 +1,79 @@ | ||
--- | ||
title: Teleport Metrics | ||
description: How to set up Prometheus to monitor Teleport for SSH and Kubernetes access | ||
h1: Metrics | ||
title: Teleport Diagnostics | ||
description: How to use Teleport's health, readiness, profiling, and monitoring endpoints. | ||
--- | ||
|
||
## Teleport Prometheus endpoint | ||
|
||
Teleport provides HTTP endpoints for monitoring purposes. They are disabled | ||
by default, but you can enable them using the `--diag-addr` flag to `teleport start`: | ||
Teleport provides HTTP endpoints for monitoring purposes. They are disabled by | ||
default, but you can enable them using the `--diag-addr` flag when running | ||
`teleport start`: | ||
|
||
```code | ||
$ sudo teleport start --diag-addr=127.0.0.1:3000 | ||
``` | ||
|
||
Now you can see the monitoring information by visiting several endpoints: | ||
|
||
- `http://127.0.0.1:3000/metrics` is the list of internal metrics Teleport is | ||
tracking. It is compatible with [Prometheus](https://prometheus.io/) | ||
collectors. | ||
- `http://127.0.0.1:3000/healthz` returns "OK" if the process is healthy or | ||
`503` otherwise. | ||
- `http://127.0.0.1:3000/readyz` is similar to `/healthz`, but it returns "OK" | ||
*only after* the node successfully joined the cluster, i.e.it draws the | ||
difference between "healthy" and "ready". | ||
- `http://127.0.0.1:3000/debug/pprof/` is Golang's standard profiler. It's only | ||
available when `-d` flag is given in addition to `--diag-addr` | ||
Now you can collect monitoring information from several endpoints. | ||
|
||
## `/healthz` | ||
|
||
The `http://127.0.0.1:3000/healthz` endpoint responds with a body of | ||
`{"status":"ok"}` and an HTTP 200 OK status code if the process is running. | ||
|
||
This is a simple check, suitable for determining if the Teleport process is | ||
still running. | ||
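A liveness probe built on this endpoint could look like the sketch below. It assumes the diagnostics address from the example above (`127.0.0.1:3000`); the function names `healthz_ok` and `is_alive` are hypothetical helpers for illustration, not part of Teleport.

```python
import json
import urllib.request


def healthz_ok(status_code: int, body: str) -> bool:
    """True only for HTTP 200 with a JSON body whose "status" is "ok"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # body was not valid JSON
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def is_alive(base_url: str = "http://127.0.0.1:3000", timeout: float = 2.0) -> bool:
    """Query /healthz and treat connection or HTTP errors as 'not alive'."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return healthz_ok(resp.status, body)
    except OSError:  # covers urllib.error.URLError and HTTPError
        return False


if __name__ == "__main__":
    print("alive" if is_alive() else "not alive")
```

Keeping the response check (`healthz_ok`) separate from the network call makes the logic easy to unit test without a running Teleport process.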
|
||
## `/readyz` | ||
|
||
The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its | ||
response includes information about the state of the process. | ||
|
||
The response body is a JSON object of the form: | ||
|
||
``` | ||
{ "status": "a status message here"} | ||
``` | ||
|
||
### `/readyz` and heartbeats | ||
|
||
If a Teleport component fails to execute its heartbeat procedure, it will enter | ||
a degraded state. Teleport will begin recovering from this state when a | ||
heartbeat completes successfully. | ||
|
||
The first successful heartbeat will transition Teleport into a recovering state. | ||
|
||
A second consecutive successful heartbeat will cause Teleport to transition to | ||
the OK state, so long as at least 10 seconds have elapsed since the | ||
first successful heartbeat. | ||
|
||
Teleport heartbeats run every 5 seconds. This means that depending on the timing | ||
of heartbeats, it can take 10-20 seconds after connectivity is restored for | ||
`/readyz` to start reporting healthy again. | ||
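The transitions described above can be modeled as a small state machine. This is only an illustrative sketch of the rules as stated in this section, not Teleport's actual implementation; the class name `ReadinessModel`, the state strings, and the choice to start in the `ok` state are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Minimum time that must pass after the first successful heartbeat
# before the model returns to "ok" (taken from the text above).
RECOVERY_SECONDS = 10.0


@dataclass
class ReadinessModel:
    state: str = "ok"  # assumption: start healthy, startup phase not modeled
    recovering_since: Optional[float] = None

    def on_heartbeat(self, success: bool, now: float) -> str:
        """Apply one heartbeat result at time `now` (seconds) and return the new state."""
        if not success:
            # Any failed heartbeat puts the process into the degraded state.
            self.state = "degraded"
            self.recovering_since = None
        elif self.state == "degraded":
            # First successful heartbeat after a failure: start recovering.
            self.state = "recovering"
            self.recovering_since = now
        elif self.state == "recovering" and now - self.recovering_since >= RECOVERY_SECONDS:
            # A later success, at least 10 seconds after the first one.
            self.state = "ok"
            self.recovering_since = None
        return self.state


if __name__ == "__main__":
    model = ReadinessModel()
    # One failed heartbeat followed by successes, 5 seconds apart.
    for t, ok in [(0.0, False), (5.0, True), (10.0, True), (15.0, True)]:
        print(t, model.on_heartbeat(ok, t))
```

In this model, a success that arrives only 5 seconds after the first one leaves the state at `recovering`, because the 10-second requirement has not yet been met.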
|
||
### Status codes | ||
|
||
The status code of the response can be one of: | ||
|
||
- HTTP 200 OK: Teleport is operating normally | ||
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and | ||
is running in a degraded state. This happens when a Teleport heartbeat fails. | ||
Since (as far as I can tell) this is the first time we discuss the heartbeat in the docs, should we introduce the heartbeat at the beginning of the " We cover some of this in the "Recovery" Admonition, but I think it makes sense to introduce this before we first mention the |
||
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or | ||
has begun recovering from a degraded state. | ||
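A client that polls `/readyz` might translate these status codes as in the sketch below. The function names and the returned labels are hypothetical; only the three status codes and their meanings come from the list above. Note that Python's `urllib` raises `HTTPError` for 4xx and 5xx responses, so the code reads the status from the exception in those cases.

```python
import urllib.error
import urllib.request


def describe_readyz(status_code: int) -> str:
    """Map a /readyz HTTP status code to the state described in the docs."""
    if status_code == 200:
        return "ok"
    if status_code == 503:
        return "degraded"
    if status_code == 400:
        return "starting or recovering"
    return "unknown"


def fetch_readyz_code(base_url: str = "http://127.0.0.1:3000", timeout: float = 2.0) -> int:
    """Return the HTTP status code of /readyz; connection errors propagate."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 400 and 503 arrive here; the code is still meaningful.
        return err.code


if __name__ == "__main__":
    print(describe_readyz(fetch_readyz_code()))
```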
|
||
The same state information is also available via the `process_state` metric | ||
under the `/metrics` endpoint. | ||
|
||
## `/debug/pprof` | ||
|
||
The `http://127.0.0.1:3000/debug/pprof/` endpoint is Go's standard pprof | ||
profiler. This endpoint is only available if the `--debug` (or `-d`) flag is | ||
supplied (in addition to `--diag-addr`). | ||
|
||
## `/metrics` | ||
|
||
The `http://127.0.0.1:3000/metrics` endpoint serves the internal metrics | ||
Teleport is tracking. It is compatible with | ||
[Prometheus](https://prometheus.io/) collectors. | ||
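For quick checks without a Prometheus server, the text output of `/metrics` can be read directly. The sketch below pulls a single unlabeled sample, such as the `process_state` metric mentioned earlier, out of the Prometheus text format. The helper name `read_sample` and the sample text are made up for illustration, and the parser deliberately ignores labeled series; the meaning of each numeric value is not covered here.

```python
from typing import Optional


def read_sample(metrics_text: str, name: str) -> Optional[float]:
    """Return the value of the first unlabeled sample called `name`, if present."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


if __name__ == "__main__":
    # Hypothetical excerpt of /metrics output, for illustration only.
    sample = "\n".join([
        "# TYPE process_state gauge",
        "process_state 0",
    ])
    print(read_sample(sample, "process_state"))
```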
|
||
The following metrics are available: | ||
|
||
| Name | Type | Component | Description | | ||
| - | - | - | - | | ||
|
Removed this section, as it was a duplicate of what's in metrics.mdx.
IMO, the troubleshooting page should be focused on "things aren't working, what steps can I take to fix them" and metrics is more focused on "how can I monitor to ensure things are operating correctly and detect when things start to go wrong."