
clusterloader2: use new apiserver latency SLI #2205

Merged: 3 commits merged into kubernetes:master on May 4, 2023

Conversation

@dgrisonnet (Member) commented Nov 23, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Kubernetes 1.23.0 introduced a new request latency metric that is better suited for SLOs because it does not account for webhook latency. Since this new metric fits clusterloader2's needs better than the general-purpose one, it would make sense to update the existing Prometheus rules.

Note that this implies moving from a STABLE metric to an ALPHA one, but considering that the new metric has already been around for three releases and has a real-world use case attached to it, it is very unlikely to be removed in the future.

Special notes for your reviewer:

This metric has been renamed in Kubernetes 1.26 (kubernetes/kubernetes#112679) which means that the queries will only work on clusters from 1.26 onward. Would that be problematic for the support policy of clusterloader2?

This PR will require the release of Kubernetes 1.26 as well as a new version of metrics-server based on the apiserver code post 1.26. For these reasons, I will put it on hold for now.
/hold

/cc @marseel

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 23, 2022
@marseel (Member) commented Nov 28, 2022

/assign @tosi3k

@tosi3k (Member) commented Nov 28, 2022

Thanks for the pull request.

This metric has been renamed in Kubernetes 1.26 (kubernetes/kubernetes#112679) which means that the queries will only work on clusters older than 1.26.

Judging by the linked PR's contents, I think you rather meant that this will work for Kubernetes 1.26+ (since you use the new name of the metric rather than the old one), right?

@dgrisonnet (Member Author)

Judging by the linked PR's contents I think you rather meant this will work for Kubernetes 1.26+ (since you use the new name of the metric rather than the old one), right?

Ah yes, definitely. I'll update the description, sorry for the confusion.

@tosi3k (Member) commented Nov 28, 2022

Regarding compatibility - WDYT about moving this change under some flag that would either use the old or the new metric (apiserver_request_duration_seconds_bucket vs. apiserver_request_sli_duration_seconds, respectively) in api_responsiveness_prometheus.go?

By default, we'd still use the old metric, but users could override that behavior by setting the flag in order to switch to the webhook-less latency metric.

For the configuration of dashboards I believe we could just include two separate charts for both metrics for now rather than fully replace the old metric in the graphs.
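
A rough sketch of what such a flag-based switch could look like; the flag name and helper below are illustrative assumptions, not the measurement's actual code:

package main

import (
    "flag"
    "fmt"
)

// Hypothetical flag: keep the old metric by default and let users opt in to
// the webhook-less latency metric.
var useSLILatencyMetric = flag.Bool("use-sli-latency-metric", false,
    "Use apiserver_request_sli_duration_seconds instead of apiserver_request_duration_seconds.")

// latencyMetricName returns the histogram name to plug into the PromQL queries.
func latencyMetricName() string {
    if *useSLILatencyMetric {
        return "apiserver_request_sli_duration_seconds_bucket"
    }
    return "apiserver_request_duration_seconds_bucket"
}

func main() {
    flag.Parse()
    fmt.Println("using metric:", latencyMetricName())
}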

@dgrisonnet (Member Author) commented Nov 28, 2022

Making it compatible is a good idea, but I don't think users should need this level of awareness. In my opinion it would be best if this were seamless and clusterloader2 just used the better-suited metric.

I can see two options to make it so that users don't have to know about this technical detail:

  1. Create a recording rule to use the latest metrics available. It would look something like this:
apiserver_request_sli_duration_seconds_bucket
or
apiserver_request_slo_duration_seconds_bucket
or
apiserver_request_duration_seconds_bucket

The potential problem with that approach is that if we don't aggregate away any labels, as in the example above, the evaluation will take quite some time given the sheer number of series returned. Also, since recording rules are evaluated periodically (at a configurable interval), the metrics won't be as fresh as they used to be.

  2. Add Kubernetes version awareness to clusterloader2 to be able to select particular metrics depending on the version of the cluster it is testing against.

@tosi3k (Member) commented Nov 29, 2022

Add Kubernetes version awareness to clusterloader2 to be able to select particular metrics depending on the version of the cluster it is testing against

I like this idea very much!

I think we might just focus on the K8s minor version against which we run ClusterLoader2 (instead of doing some sophisticated Semantic Versioning parsing).

We could fetch this piece of information simply from the /version endpoint at the beginning of CL2 execution, store it somewhere here or here, and later use this knob in the internals of the API responsiveness and Metrics Server measurements, via the measurement.Config we pass to the Gather methods, to decide which Prometheus query (i.e. metric) to run.
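
A minimal sketch of what that /version lookup could look like using client-go's discovery client; the package and helper names are illustrative assumptions, not CL2's actual code:

package clusterversion

import (
    "strconv"
    "strings"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/rest"
)

// MinorVersion queries the cluster's /version endpoint via the discovery
// client and returns the minor version as an integer (e.g. 26 for "26" or "26+").
func MinorVersion(config *rest.Config) (int, error) {
    dc, err := discovery.NewDiscoveryClientForConfig(config)
    if err != nil {
        return 0, err
    }
    info, err := dc.ServerVersion() // GET /version
    if err != nil {
        return 0, err
    }
    // Some providers report the minor version with a trailing "+".
    return strconv.Atoi(strings.TrimSuffix(info.Minor, "+"))
}

The returned minor version could then be stored in the measurement config and compared against the releases in which the metric was introduced and renamed.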

@dgrisonnet (Member Author) commented Nov 29, 2022

Thank you for the pointers, I will look into adding this functionality

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2023
@wojtek-t (Member)

@dgrisonnet - will you have time to push this forward?
We will need this also in the context of APF graduation, so I'm really interested in it.

@wojtek-t wojtek-t self-assigned this Mar 29, 2023
@dgrisonnet (Member Author)

I'll try to spend some time finishing this in the upcoming weeks.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 29, 2023
Kubernetes 1.23.0 introduced a new request latency metric that is better
suited for SLOs because it does not account for webhook latency. Since
this new metric fits clusterloader2's needs better than the
general-purpose one, it makes sense to update the existing Prometheus
rules.

Note that this implies moving from a STABLE metric to an ALPHA one, but
considering that the new metric has already been around for three
releases and has a real-world use case attached to it, it is very
unlikely to be removed in the future.

Signed-off-by: Damien Grisonnet <[email protected]>
@dgrisonnet dgrisonnet force-pushed the use-apiserver-sli branch 2 times, most recently from 8f6aa1b to 93d26a9 on April 28, 2023 09:32
@dgrisonnet (Member Author)

The code changes are done, but I still need to figure out how I can test all the areas that I've touched.

If anyone wants to have a look at the PR in the meantime, all the logic for selecting metrics depending on the cluster version has been added in the last two commits.

@wojtek-t (Member)

The code changes are done, but I still need to figure out how I can test all the areas that I've touched.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/perf-tests/2205/pull-perf-tests-clusterloader2/1651882214037655552

From the presubmit - it seems it doesn't work correctly even at head now:

I0428 09:44:30.045738   16698 prometheus.go:439] Waiting for Prometheus stack to become healthy...
W0428 09:45:00.083052   16698 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response (v1.Status): Failure: no endpoints available for service "http:prometheus-k8s:9090", reason: ServiceUnavailable

My suspicion is that prometheus is not coming up correctly...

@wojtek-t (Member)

@dgrisonnet (Member Author)

Thanks for the pointers, I'll try to reproduce the failures locally.

@wojtek-t (Member) left a comment

I looked through the code and it all looks fine. I didn't try to parse all the queries, but I'm fairly sure the problem is in one of them (also judging from the test results).

clusterloader2/pkg/measurement/util/metrics.go: 2 review comments (outdated, resolved)
Depending on the version of the cluster in which clusterloader2 runs,
different apiserver latency metrics are available:
- apiserver_request_duration_seconds
- apiserver_request_slo_duration_seconds (1.23 -> 1.26)
- apiserver_request_sli_duration_seconds (1.26+)

To make sure that clusterloader2 always uses the most accurate metric,
the apiserver measurement queries are now selected depending on the
version of the Kubernetes cluster.

Signed-off-by: Damien Grisonnet <[email protected]>
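
For context, a minimal sketch of the version-dependent selection described in the commit above; the version cutoffs follow the commit message, but the function and package names are illustrative assumptions rather than the PR's actual implementation:

package clusterversion

// apiserverLatencyMetric picks the most accurate latency histogram available
// for a given Kubernetes version:
//   1.26+        -> apiserver_request_sli_duration_seconds
//   1.23 - 1.25  -> apiserver_request_slo_duration_seconds
//   older        -> apiserver_request_duration_seconds
func apiserverLatencyMetric(major, minor int) string {
    switch {
    case major > 1 || (major == 1 && minor >= 26):
        return "apiserver_request_sli_duration_seconds_bucket"
    case major == 1 && minor >= 23:
        return "apiserver_request_slo_duration_seconds_bucket"
    default:
        return "apiserver_request_duration_seconds_bucket"
    }
}
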
@dgrisonnet (Member Author)

Yes, there was an extra parenthesis in the queries. I also took the opportunity to fix a segfault that I introduced in the unit tests, so they should be green now. But I still haven't tested the feature end-to-end.

@wojtek-t (Member)

But I still haven't tested the feature end-to-end.

We will get the test at head for free from the presubmit.
So we only need to test it against e.g. 1.26 and 1.22.
[Technically, 1.22 is out of support already, so probably testing 1.26 is enough...]

@wojtek-t (Member) commented May 2, 2023

I didn't validate the metrics for presubmits, but they look reasonable (so hopefully the right metric is used).

Ideally, we should try to run it against a 1.26 cluster too before merging.

The apiserver measurement recording rule and the various dashboards were
all using the apiserver_request_duration_seconds metric; however, we
want to move to the new SLO/SLI metrics since they give more precise
latency measurements. Because these metrics are only available in
certain versions of Kubernetes, we needed to make the recording rules
and dashboards version aware.

For the recording rules used by the measurements, we want the metrics we
use to be as precise as possible, so we chose to deduplicate the
existing queries with the new metrics and select which one to use in
code, depending on the cluster version.

For dashboards it is a bit more complex since we can't change the query
at runtime, so instead of taking the approach we took for measurements,
we added a new recording rule that uses the correct metric depending on
the cluster version. This has the benefit of not being intrusive to the
existing code, but since the results are now precomputed, the data will
be less fresh than it used to be; the difference is not very significant
from a dashboard perspective, though.

Signed-off-by: Damien Grisonnet <[email protected]>
@dgrisonnet (Member Author)

I tested on both 1.25.8 and 1.27.0 locally and the metrics are correctly selected.

Is there a way I could easily check that I didn't break the dashboards?

@wojtek-t (Member) commented May 4, 2023

I tested on both 1.25.8 and 1.27.0 locally and the metrics are correctly selected.

Great - thanks!

Is there a way I could easily check that I didn't break the dashboards?

This looks fine to me - we can always fix them later if we missed something.

/lgtm
/approve

/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 4, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit bc59038 into kubernetes:master May 4, 2023