gravitational · zmb3 · Mar 17, 2022 · Mar 11, 2022 · Mar 16, 2022 · Mar 16, 2022
diff --git a/docs/pages/setup/admin/troubleshooting.mdx b/docs/pages/setup/admin/troubleshooting.mdx
@@ -15,27 +15,9 @@ run with verbose logging enabled by passing it `-d` flag.
   It is not recommended to run Teleport in production with verbose logging as it generates a substantial amount of data.
 </Admonition>
 
-Sometimes you may want to reset [`teleport`](../reference/cli.mdx#teleport) to a clean
-state. This can be accomplished by erasing everything under `"data_dir"`
-directory. Assuming the default location, `rm -rf /var/lib/teleport/*` will do.
-
-Teleport also supports HTTP endpoints for monitoring purposes. They are disabled
-by default, but you can enable them:
-
-```code
-$ sudo teleport start --diag-addr=127.0.0.1:3000
-```
-
-Now you can see the monitoring information by visiting several endpoints:
-
-- `http://127.0.0.1:3000/metrics` is the list of internal metrics Teleport is tracking. It is compatible with [Prometheus](https://prometheus.io/)
-  collectors. For a full list of metrics review our [metrics reference](../reference/metrics.mdx).
-- `http://127.0.0.1:3000/healthz` returns "OK" if the process is healthy or
-  `503` otherwise.
-- `http://127.0.0.1:3000/readyz` is similar to `/healthz`, but it returns "OK"
-  *only after* the node successfully joined the cluster, i.e.it draws the difference between "healthy" and "ready".
-- `http://127.0.0.1:3000/debug/pprof/` is Golang's standard profiler. It's only
-  available when `-d` flag is given in addition to `--diag-addr`
+Sometimes you may want to reset [`teleport`](../reference/cli.mdx#teleport) to a
+clean state. This can be accomplished by erasing everything under the `data_dir`
+directory, which defaults to `/var/lib/teleport/`.
 
 ## Debug dump
 

diff --git a/docs/pages/setup/reference/metrics.mdx b/docs/pages/setup/reference/metrics.mdx
@@ -1,30 +1,79 @@
 ---
-title: Teleport Metrics
-description: How to set up Prometheus to monitor Teleport for SSH and Kubernetes access
-h1: Metrics
+title: Teleport Diagnostics
+description: How to use Teleport's health, readiness, profiling, and monitoring endpoints.
 ---
 
-## Teleport Prometheus endpoint
-
-Teleport provides HTTP endpoints for monitoring purposes. They are disabled
-by default, but you can enable them using the `--diag-addr` flag to `teleport start`:
+Teleport provides HTTP endpoints for monitoring purposes. They are disabled by
+default, but you can enable them using the `--diag-addr` flag when running
+`teleport start`:
 
 ```code
 $ sudo teleport start --diag-addr=127.0.0.1:3000
 ```
 
-Now you can see the monitoring information by visiting several endpoints:
-
-- `http://127.0.0.1:3000/metrics` is the list of internal metrics Teleport is
-  tracking. It is compatible with [Prometheus](https://prometheus.io/)
-  collectors.
-- `http://127.0.0.1:3000/healthz` returns "OK" if the process is healthy or
-  `503` otherwise.
-- `http://127.0.0.1:3000/readyz` is similar to `/healthz`, but it returns "OK"
-  *only after* the node successfully joined the cluster, i.e.it draws the
-  difference between "healthy" and "ready".
-- `http://127.0.0.1:3000/debug/pprof/` is Golang's standard profiler. It's only
-  available when `-d` flag is given in addition to `--diag-addr`
+Now you can collect monitoring information from several endpoints.
+
+## `/healthz`
+
+The `http://127.0.0.1:3000/healthz` endpoint responds with a body of
+`{"status":"ok"}` and an HTTP 200 OK status code if the process is running.
+
+This is a simple check, suitable for determining if the Teleport process is
+still running.
+
+## `/readyz`
+
+The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its
+response includes information about the state of the process.
+
+The response body is a JSON object of the form:
+
+```
+{ "status": "a status message here" }
+```
+
+### `/readyz` and heartbeats
+
+If a Teleport component fails to execute its heartbeat procedure, it will enter
+a degraded state. Teleport will begin recovering from this state when a
+heartbeat completes successfully.
+
+The first successful heartbeat will transition Teleport into a recovering state.
+
+A second consecutive successful heartbeat will cause Teleport to transition to
+the OK state, so long as more than 10 seconds have elapsed since the
+first successful heartbeat.
+
+Teleport heartbeats run every 5 seconds. This means that depending on the timing
+of heartbeats, it can take 10-20 seconds after connectivity is restored for
+`/readyz` to start reporting healthy again.
+
+### Status codes
+
+The status code of the response can be one of:
+
+- HTTP 200 OK: Teleport is operating normally.
+- HTTP 503 Service Unavailable: Teleport has encountered a connection error and
+  is running in a degraded state. This happens when a Teleport heartbeat fails.
+- HTTP 400 Bad Request: Teleport is either in its initial startup phase or
+  has begun recovering from a degraded state.
+
+The same state information is also available via the `process_state` metric
+under the `/metrics` endpoint.
+
+## `/debug/pprof`
+
+The `http://127.0.0.1:3000/debug/pprof/` endpoint is Go's standard pprof
+profiler. This endpoint is only available if the `--debug` (or `-d`) flag is
+supplied (in addition to `--diag-addr`).
+
+## `/metrics`
+
+The `http://127.0.0.1:3000/metrics` endpoint serves the internal metrics
+Teleport is tracking. It is compatible with
+[Prometheus](https://prometheus.io/) collectors.
+
+The following metrics are available:
 
 | Name | Type | Component | Description |
 | - | - | - | - |

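Review note: the `/readyz` status codes documented in `metrics.mdx` can be consumed by a small client. The following is a minimal sketch, not part of this change. The names `readiness`, `classify`, and `checkReady` are hypothetical, and the `httptest` server in `main` only stands in for a Teleport process started with `--diag-addr` so the example runs on its own. It assumes the response body is the `{"status": "..."}` object described in the docs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// readiness is a hypothetical label for the documented /readyz outcomes.
type readiness string

const (
	ready      readiness = "ok"
	degraded   readiness = "degraded"
	recovering readiness = "starting or recovering"
	unknown    readiness = "unknown"
)

// classify maps a /readyz HTTP status code to the state described in the docs.
func classify(code int) readiness {
	switch code {
	case http.StatusOK:
		return ready
	case http.StatusServiceUnavailable:
		return degraded
	case http.StatusBadRequest:
		return recovering
	default:
		return unknown
	}
}

// checkReady queries a /readyz URL and returns the classified state together
// with the "status" message from the JSON body.
func checkReady(url string) (readiness, string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return unknown, "", err
	}
	defer resp.Body.Close()

	var body struct {
		Status string `json:"status"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return classify(resp.StatusCode), "", err
	}
	return classify(resp.StatusCode), body.Status, nil
}

func main() {
	// Stand-in server (an assumption for this sketch): it always answers 503
	// with a placeholder message, imitating a degraded process.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprint(w, `{"status":"example message"}`)
	}))
	defer srv.Close()

	state, msg, err := checkReady(srv.URL + "/readyz")
	fmt.Println(state, msg, err)
}
```

Against a real deployment, the URL would be `http://127.0.0.1:3000/readyz` when Teleport is started with the `--diag-addr` value shown in the docs.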
diff --git a/lib/service/state.go b/lib/service/state.go
@@ -88,7 +88,7 @@ func (f *processState) update(event Event) {
 
 	component, ok := event.Payload.(string)
 	if !ok {
-		f.process.log.Errorf("TeleportDegradedEvent broadcasted without component name, this is a bug!")
+		f.process.log.Errorf("%v broadcasted without component name, this is a bug!", event.Name)
 		return
 	}
 	s, ok := f.states[component]
@@ -118,7 +118,7 @@ func (f *processState) update(event Event) {
 			s.recoveryTime = f.process.Clock.Now()
 			f.process.log.Infof("Teleport component %q is recovering from a degraded state.", component)
 		case stateRecovering:
-			if f.process.Clock.Now().Sub(s.recoveryTime) > defaults.HeartbeatCheckPeriod*2 {
+			if f.process.Clock.Since(s.recoveryTime) > defaults.HeartbeatCheckPeriod*2 {
 				s.state = stateOK
 				f.process.log.Infof("Teleport component %q has recovered from a degraded state.", component)
 			}