-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add livez
endpoint
#2418
feat: add livez
endpoint
#2418
Conversation
pkg/app/server.go
Outdated
mux.Handle(livezPath, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { | ||
|
||
// Query the Kube API to make sure we are not affected by a network outage. | ||
got := client.CoreV1().RESTClient().Get().AbsPath("/apis/").Do(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we query Kubernetes' API readyz endpoint https://kubernetes.io/docs/reference/using-api/health-checks/ instead to detect an outage of the kubernetes apiserver (I read readyz as "able to serve traffic")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, API (/apis
) discoverability used here is a subset of /livez
. I'll make the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be /livez
or rather /readyz
?
The Kubernetes API Server could be healthy (/livez = true) but not be able to accept client request (/ready = false).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/livez
knows when to restart the container, and thus knows if there's been an outage. Besides, it makes sense to point the our /livez
to the cluster's /livez
to ensure the same thing.
readinessProbe
is currently set to the telemetry metrics' availability in our jsonnet
config, which seems okay. A more robust approach would be coupling that with the cluster's /readyz
and exposing that under a dedicated /readyz
(we don't have a dedicated endpoint just yet). This would mean that (a) the cluster components are ready, and (b) the binary itself is ready. I can open another PR for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me! Let's open another PR for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question: We already have a /healthz path, what would this /livez path then be used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/healthz
will send out a 200
if the binary is running, /readyz
will send out a 200
if the exposition machinery is working as expected (we are ready to serve metrics), and /livez
will send out a 200
if none of the collectors are affected by any outages (collectors depend on the Kube API to gather data).
A /healthz
endpoint would be more suitable for a startupProbe
, which is especially useful if we believe the binary takes a considerable time to start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying! Could you add this to the README and update our jsonnet as well?
/assign @mrueg |
a608b6b
to
c46e08e
Compare
README.md
Outdated
|
||
The following probes are available, and follow the [Kubernetes best practices](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes): | ||
|
||
* `livenessProbe`: Checks if the application is not affected by an outage, and is able to access the Kube API by querying the cluster's `/livez` endpoint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `livenessProbe`: Checks if the application is not affected by an outage, and is able to access the Kube API by querying the cluster's `/livez` endpoint. | |
* `/livez`: Checks if the application is not affected by an outage, and is able to access the Kube API by querying the cluster's `/livez` endpoint. |
README.md
Outdated
The following probes are available, and follow the [Kubernetes best practices](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes): | ||
|
||
* `livenessProbe`: Checks if the application is not affected by an outage, and is able to access the Kube API by querying the cluster's `/livez` endpoint. | ||
* `readinessProbe`: Checks if the application is ready to serve metrics by querying its own telemetry metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have one already?
README.md
Outdated
@@ -342,6 +342,13 @@ Note that your GCP identity is case sensitive but `gcloud info` as of Google Clo | |||
|
|||
After running the above, if you see `Clusterrolebinding "cluster-admin-binding" created`, then you are able to continue with the setup of this service. | |||
|
|||
#### Probes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#### Probes | |
#### Health endpoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer the former since per-port endpoints are not consistent across the binary and its telemetry expositions. Also, we don't have a dedicated /readyz
but the readinessProbe
makes use of the telemetry port's /metrics
to determine if we are able to serve requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Friendly ping. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a section for endpoints as well, PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove this paragraph? I don't think it's specific to kube-state-metrics and having the endpoints documented is good enough. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, PTAL.
README.md
Outdated
@@ -342,6 +342,13 @@ Note that your GCP identity is case sensitive but `gcloud info` as of Google Clo | |||
|
|||
After running the above, if you see `Clusterrolebinding "cluster-admin-binding" created`, then you are able to continue with the setup of this service. | |||
|
|||
#### Probes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove this paragraph? I don't think it's specific to kube-state-metrics and having the endpoints documented is good enough. :)
Add a `livez` endpoint to identify network outages. This helps in restarting the binary if such as case is observed. Signed-off-by: Pranshu Srivastava <[email protected]> Signed-off-by: Pranshu Srivastava <[email protected]>
/lgtm thanks! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mrueg, rexagod The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Add a
livez
endpoint to identify network outages. This helps in restarting the binary if such as case is observed.