Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update metrics for rotate kubelet server certificate #122834

Closed

Conversation

kannon92
Copy link
Contributor

@kannon92 kannon92 commented Jan 17, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

While drafting the retrospective KEP for RotateKubeletServerCertificate, we realize that the metrics are still in alpha stage.

This PR bumps the metrics to Beta. They have been in for a long time as alpha. Our goal is push this feature to GA.

Retrospective KEP is kubernetes/enhancements#4411.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Promote stability level for metrics associated with kubelet server certificate rotation (these are now **beta**).

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/issues/267

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 17, 2024
@kannon92
Copy link
Contributor Author

/cc @SergeyKanzhelev
/triage accepted
/priority important-longterm

I am working on promoting this feature to GA. I have a retrospective kep (from @SergeyKanzhelev!) and we indentified that these metrics should at least be promoted to beta.

I opened this up but not sure if the retrospective KEP should be merged first or not.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 17, 2024
@kannon92 kannon92 force-pushed the kubelet-certificate-metrics-bump branch from cc81559 to 71e9bf7 Compare January 18, 2024 16:59
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/stable-metrics Issues or PRs involving stable metrics sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 18, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please assign rainbowmango, smarterclayton for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-triage-robot
Copy link

This PR may require stable metrics review.

Stable metrics are guaranteed to not change. Please review the documentation for the requirements and lifecycle of stable metrics and ensure that your metrics meet these guidelines.

@sftim
Copy link
Contributor

sftim commented Jan 19, 2024

Changelog suggestion

-Bump metrics for RotateKubeletServerCertificate to BETA
+Fixed stability level for metrics associated with kubelet certificate rotation (these are now **beta**).

If it's not a fix, change “Fixed” to “Promoted“.

@kannon92
Copy link
Contributor Author

/retest

@kannon92
Copy link
Contributor Author

kannon92 commented Jan 19, 2024

CLA has an issue. Going to see if reopening will trigger it again.

That worked.

@kannon92 kannon92 closed this Jan 19, 2024
@kannon92 kannon92 reopened this Jan 19, 2024
@dgrisonnet
Copy link
Member

for sig-instrumentation review
/assign

subsystem: kubelet
help: Histogram of the number of seconds the previous certificate lived before being
rotated.
type: Histogram
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is histogram the right way to measure this? The choice seems a little odd to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really answer yes or no to this. I mainly bumped this metrics which have been in alpha for a long while to beta.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will at least need to have a consensus on https://github.com/kubernetes/kubernetes/pull/122834/files#r1472191704 if we want to bump this metric to BETA since once it is graduated it will be hard to change it in the future.

- 1.24416e+08
- name: certificate_manager_server_ttl_seconds
subsystem: kubelet
help: Gauge of the shortest TTL (time-to-live) of the Kubelet's serving certificate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does shortest mean here? Does this mean that each time the certificate is rotated, this metric is updated to be min(previous metric value, ttl of new cert)?

- 1.24416e+08
- name: certificate_manager_server_ttl_seconds
subsystem: kubelet
help: Gauge of the shortest TTL (time-to-live) of the Kubelet's serving certificate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x509 certs don't actually have a TTL field. Does TTL here refer to the remaining TTL (time until expiry), or the total TTL (length of validity period)?

stabilityLevel: BETA
- name: server_expiration_renew_errors
subsystem: kubelet
help: Counter of certificate renewal errors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help: Counter of certificate renewal errors.
help: Counter of server certificate renewal errors.

subsystem: kubelet
help: Gauge of the shortest TTL (time-to-live) of the Kubelet's serving certificate.
The value is in seconds until certificate expiry (negative if already expired).
If serving certificate is invalid or unused, the value will be +INF.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does unused mean here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this was autogenerated from the comments in kubelet.go for the metric.

We can discuss the commented metrics but I'm not sure if we should change this or not..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't change the generated metadata, but you can always modify the metric. In this particular case, you could update the HELP text.

If serving certificate is invalid or unused, the value will be +INF.
type: Gauge
stabilityLevel: BETA
- name: server_expiration_renew_errors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be prepended with certificate_manager_ to match the other metrics?

- name: certificate_manager_server_rotation_seconds
subsystem: kubelet
help: Histogram of the number of seconds the previous certificate lived before being
rotated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a weird metric... it's measured as the time since it's validity start period to the time it was rotated:

if old := m.updateCached(cert); old != nil && m.certificateRotation != nil {
m.certificateRotation.Observe(m.now().Sub(old.Leaf.NotBefore).Seconds())
}

This seems less useful than measuring the amount of valid time remaining (i.e. how close are we cutting it). It's also not measuring how long the cert was actually in use.

What is the purpose of this metric?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I also don't really understand how someone would use this information and take actions.

@tallclair
Copy link
Member

/assign

@@ -88,6 +88,35 @@
help: The count of hidden metrics.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tallclair all of this was autogenerated from the metrics. I guess beta metrics are actually documented so I'm not sure if I can/will change anything in this file. I can update the comments for the metrics but not sure about that.

@kannon92
Copy link
Contributor Author

kannon92 commented Feb 6, 2024

ping @dgrisonnet

Curious on your thoughts for @tallclair! I didn't add the original comments for this metrics. I just wanted to promote the existing ones to beta but I'm open to making the comments more useful especially now that we will start documenting them more.

@@ -61,7 +61,7 @@ func NewKubeletServerCertificateManager(kubeClient clientset.Interface, kubeCfg
Subsystem: metrics.KubeletSubsystem,
Name: "server_expiration_renew_errors",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a counter so it needs to be suffixed by _total: https://prometheus.io/docs/practices/naming/

So I would suggest renaming the metric to server_expiration_renew_errors_total and highlight that change in the changelog.

- name: certificate_manager_server_rotation_seconds
subsystem: kubelet
help: Histogram of the number of seconds the previous certificate lived before being
rotated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I also don't really understand how someone would use this information and take actions.

subsystem: kubelet
help: Histogram of the number of seconds the previous certificate lived before being
rotated.
type: Histogram
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will at least need to have a consensus on https://github.com/kubernetes/kubernetes/pull/122834/files#r1472191704 if we want to bump this metric to BETA since once it is graduated it will be hard to change it in the future.

subsystem: kubelet
help: Gauge of the shortest TTL (time-to-live) of the Kubelet's serving certificate.
The value is in seconds until certificate expiry (negative if already expired).
If serving certificate is invalid or unused, the value will be +INF.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't change the generated metadata, but you can always modify the metric. In this particular case, you could update the HELP text.

@tallclair
Copy link
Member

Just to clarify on the comments here, I commented on the auto-generated code since that's what was surfaced in this PR. The comments really apply to the metrics code, not the generated files here. I know you're just promoting what is already there, but as @dgrisonnet mentioned, promoting the metrics makes it harder to change them, so now is the time to evaluate the state of the metrics and think about what changes we want to make to them.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 6, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kubelet area/stable-metrics Issues or PRs involving stable metrics cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants