-
Notifications
You must be signed in to change notification settings - Fork 39.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update metrics for rotate kubelet server certificate #122834
update metrics for rotate kubelet server certificate #122834
Conversation
/cc @SergeyKanzhelev I am working on promoting this feature to GA. I have a retrospective kep (from @SergeyKanzhelev!) and we indentified that these metrics should at least be promoted to beta. I opened this up but not sure if the retrospective KEP should be merged first or not. |
cc81559
to
71e9bf7
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kannon92 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR may require stable metrics review. Stable metrics are guaranteed to not change. Please review the documentation for the requirements and lifecycle of stable metrics and ensure that your metrics meet these guidelines. |
Changelog suggestion -Bump metrics for RotateKubeletServerCertificate to BETA
+Fixed stability level for metrics associated with kubelet certificate rotation (these are now **beta**). If it's not a fix, change “Fixed” to “Promoted“. |
/retest |
CLA has an issue. Going to see if reopening will trigger it again. That worked. |
for sig-instrumentation review |
subsystem: kubelet | ||
help: Histogram of the number of seconds the previous certificate lived before being | ||
rotated. | ||
type: Histogram |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is histogram the right way to measure this? The choice seems a little odd to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't really answer yes or no to this. I mainly bumped this metrics which have been in alpha for a long while to beta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will at least need to have a consensus on https://github.com/kubernetes/kubernetes/pull/122834/files#r1472191704 if we want to bump this metric to BETA since once it is graduated it will be hard to change it in the future.
- 1.24416e+08 | ||
- name: certificate_manager_server_ttl_seconds | ||
subsystem: kubelet | ||
help: Gauge of the shortest TTL (time-to-live) of the Kubelet's serving certificate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does shortest mean here? Does this mean that each time the certificate is rotated, this metric is updated to be min(previous metric value, ttl of new cert)?
- 1.24416e+08 | ||
- name: certificate_manager_server_ttl_seconds | ||
subsystem: kubelet | ||
help: Gauge of the shortest TTL (time-to-live) of the Kubelet's serving certificate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x509 certs don't actually have a TTL field. Does TTL here refer to the remaining TTL (time until expiry), or the total TTL (length of validity period)?
stabilityLevel: BETA | ||
- name: server_expiration_renew_errors | ||
subsystem: kubelet | ||
help: Counter of certificate renewal errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help: Counter of certificate renewal errors. | |
help: Counter of server certificate renewal errors. |
subsystem: kubelet | ||
help: Gauge of the shortest TTL (time-to-live) of the Kubelet's serving certificate. | ||
The value is in seconds until certificate expiry (negative if already expired). | ||
If serving certificate is invalid or unused, the value will be +INF. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does unused mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this was autogenerated from the comments in kubelet.go
for the metric.
We can discuss the commented metrics but I'm not sure if we should change this or not..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't change the generated metadata, but you can always modify the metric. In this particular case, you could update the HELP text.
If serving certificate is invalid or unused, the value will be +INF. | ||
type: Gauge | ||
stabilityLevel: BETA | ||
- name: server_expiration_renew_errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be prepended with certificate_manager_
to match the other metrics?
- name: certificate_manager_server_rotation_seconds | ||
subsystem: kubelet | ||
help: Histogram of the number of seconds the previous certificate lived before being | ||
rotated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a weird metric... it's measured as the time since it's validity start period to the time it was rotated:
kubernetes/staging/src/k8s.io/client-go/util/certificate/certificate_manager.go
Lines 595 to 597 in 7080b51
if old := m.updateCached(cert); old != nil && m.certificateRotation != nil { | |
m.certificateRotation.Observe(m.now().Sub(old.Leaf.NotBefore).Seconds()) | |
} |
This seems less useful than measuring the amount of valid time remaining (i.e. how close are we cutting it). It's also not measuring how long the cert was actually in use.
What is the purpose of this metric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I also don't really understand how someone would use this information and take actions.
/assign |
@@ -88,6 +88,35 @@ | |||
help: The count of hidden metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tallclair all of this was autogenerated from the metrics. I guess beta metrics are actually documented so I'm not sure if I can/will change anything in this file. I can update the comments for the metrics but not sure about that.
ping @dgrisonnet Curious on your thoughts for @tallclair! I didn't add the original comments for this metrics. I just wanted to promote the existing ones to beta but I'm open to making the comments more useful especially now that we will start documenting them more. |
@@ -61,7 +61,7 @@ func NewKubeletServerCertificateManager(kubeClient clientset.Interface, kubeCfg | |||
Subsystem: metrics.KubeletSubsystem, | |||
Name: "server_expiration_renew_errors", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a counter so it needs to be suffixed by _total
: https://prometheus.io/docs/practices/naming/
So I would suggest renaming the metric to server_expiration_renew_errors_total
and highlight that change in the changelog.
- name: certificate_manager_server_rotation_seconds | ||
subsystem: kubelet | ||
help: Histogram of the number of seconds the previous certificate lived before being | ||
rotated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I also don't really understand how someone would use this information and take actions.
subsystem: kubelet | ||
help: Histogram of the number of seconds the previous certificate lived before being | ||
rotated. | ||
type: Histogram |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will at least need to have a consensus on https://github.com/kubernetes/kubernetes/pull/122834/files#r1472191704 if we want to bump this metric to BETA since once it is graduated it will be hard to change it in the future.
subsystem: kubelet | ||
help: Gauge of the shortest TTL (time-to-live) of the Kubelet's serving certificate. | ||
The value is in seconds until certificate expiry (negative if already expired). | ||
If serving certificate is invalid or unused, the value will be +INF. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't change the generated metadata, but you can always modify the metric. In this particular case, you could update the HELP text.
Just to clarify on the comments here, I commented on the auto-generated code since that's what was surfaced in this PR. The comments really apply to the metrics code, not the generated files here. I know you're just promoting what is already there, but as @dgrisonnet mentioned, promoting the metrics makes it harder to change them, so now is the time to evaluate the state of the metrics and think about what changes we want to make to them. |
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
While drafting the retrospective KEP for RotateKubeletServerCertificate, we realize that the metrics are still in alpha stage.
This PR bumps the metrics to Beta. They have been in for a long time as alpha. Our goal is push this feature to GA.
Retrospective KEP is kubernetes/enhancements#4411.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: