
clusterloader2: use new apiserver latency SLI #2205

Merged: 3 commits merged into kubernetes:master on May 4, 2023

Conversation

@dgrisonnet (Member) commented Nov 23, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Kubernetes 1.23.0 introduced a new request latency metric that is better suited for SLOs because it does not account for webhook latency. Since this new metric fits clusterloader2's needs better than the general-purpose one, it would make sense to update the existing Prometheus rules.

Note that this implies moving from a STABLE metric to an ALPHA one, but considering that the new metric has already been around for three releases and has a real-world use case attached to it, it is very unlikely to be removed in the future.

Special notes for your reviewer:

This metric has been renamed in Kubernetes 1.26 (kubernetes/kubernetes#112679) which means that the queries will only work on clusters from 1.26 onward. Would that be problematic for the support policy of clusterloader2?

This PR will require the release of Kubernetes 1.26 as well as a new version of metrics-server based on the apiserver code post 1.26. For these reasons, I will put it on hold for now.
/hold

/cc @marseel

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 23, 2022
@marseel (Member) commented Nov 28, 2022

/assign @tosi3k

@tosi3k (Member) commented Nov 28, 2022

Thanks for the pull request.

This metric has been renamed in Kubernetes 1.26 (kubernetes/kubernetes#112679) which means that the queries will only work on clusters older than 1.26.

Judging by the linked PR's contents, I think you rather meant that this will work for Kubernetes 1.26+ (since you use the new name of the metric rather than the old one), right?

@dgrisonnet (Member Author)

Judging by the linked PR's contents I think you rather meant this will work for Kubernetes 1.26+ (since you use the new name of the metric rather than the old one), right?

Ah yes, definitely. I'll update the description, sorry for the confusion.

@tosi3k (Member) commented Nov 28, 2022

Regarding compatibility - WDYT about moving this change under some flag that would either use the old or the new metric (apiserver_request_duration_seconds_bucket vs. apiserver_request_sli_duration_seconds, respectively) in api_responsiveness_prometheus.go?

By default, we'd still use the old metric, but users could override that behavior by setting the flag in order to switch to the webhook-less latency metric.

For the configuration of dashboards I believe we could just include two separate charts for both metrics for now rather than fully replace the old metric in the graphs.
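
A rough sketch of what such a flag-based switch could look like; the flag name and helper below are illustrative assumptions, not the measurement's actual code:

package main

import (
    "flag"
    "fmt"
)

// Hypothetical flag: keep the old metric by default and let users opt in to
// the webhook-less latency metric.
var useSLILatencyMetric = flag.Bool("use-sli-latency-metric", false,
    "Use apiserver_request_sli_duration_seconds instead of apiserver_request_duration_seconds.")

// latencyMetricName returns the histogram name to plug into the PromQL queries.
func latencyMetricName() string {
    if *useSLILatencyMetric {
        return "apiserver_request_sli_duration_seconds_bucket"
    }
    return "apiserver_request_duration_seconds_bucket"
}

func main() {
    flag.Parse()
    fmt.Println("using metric:", latencyMetricName())
}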

@dgrisonnet (Member Author) commented Nov 28, 2022

Making it compatible is a good idea, but I don't think users should need this level of awareness. In my opinion it would be best if this were seamless and clusterloader2 just used the better-suited metric.

I can see two options to make it so that users don't have to know about this technical detail:

  1. Create a recording rule to use the latest metrics available. It would look something like this:
apiserver_request_sli_duration_seconds_bucket
or
apiserver_request_slo_duration_seconds_bucket
or
apiserver_request_duration_seconds_bucket

The potential problem with that approach is that if we don't aggregate away any labels, as in the example above, the evaluation will take quite some time given the sheer number of series returned. Also, since recording rules are evaluated periodically (at a configurable interval), the metrics won't be as fresh as they used to be.

  2. Add Kubernetes version awareness to clusterloader2 to be able to select particular metrics depending on the version of the cluster it is testing against.

@tosi3k (Member) commented Nov 29, 2022

Add Kubernetes version awareness to clusterloader2 to be able to select particular metrics depending on the version of the cluster it is testing against

I like this idea very much!

I think we might just focus on the K8s minor version against which we run ClusterLoader2 (instead of doing some sophisticated Semantic Versioning parsing).

We could fetch this piece of information simply from the /version endpoint at the beginning of CL2 execution, store it somewhere here or here, and later use this knob in the internals of the API responsiveness and Metrics Server measurements, via the measurement.Config we pass to the Gather methods, to decide which Prometheus query (i.e. metric) to run.
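
A minimal sketch of what that /version lookup could look like using client-go's discovery client; the package and helper names are illustrative assumptions, not CL2's actual code:

package clusterversion

import (
    "strconv"
    "strings"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/rest"
)

// MinorVersion queries the cluster's /version endpoint via the discovery
// client and returns the minor version as an integer (e.g. 26 for "26" or "26+").
func MinorVersion(config *rest.Config) (int, error) {
    dc, err := discovery.NewDiscoveryClientForConfig(config)
    if err != nil {
        return 0, err
    }
    info, err := dc.ServerVersion() // GET /version
    if err != nil {
        return 0, err
    }
    // Some providers report the minor version with a trailing "+".
    return strconv.Atoi(strings.TrimSuffix(info.Minor, "+"))
}

The returned minor version could then be stored in the measurement config and compared against the releases in which the metric was introduced and renamed.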

@dgrisonnet (Member Author) commented Nov 29, 2022

Thank you for the pointers, I will look into adding this functionality

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2023
@wojtek-t (Member)

@dgrisonnet - will you have time to push this forward?
We will need this also in the context of APF graduation, so I'm really interested in it.

@wojtek-t wojtek-t self-assigned this Mar 29, 2023
@dgrisonnet (Member Author)

I'll try to spend some time finishing this in the upcoming weeks.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 29, 2023
Kubernetes 1.23.0 introduced a new request latency metric that is better
suited for SLOs because it does not account for webhook latency. Since
this new metric fits clusterloader2's needs better than the
general-purpose one, it makes sense to update the existing Prometheus
rules.

Note that this implies moving from a STABLE metric to an ALPHA one, but
considering that the new metric has already been around for three
releases and has a real-world use case attached to it, it is very
unlikely to be removed in the future.

Signed-off-by: Damien Grisonnet <[email protected]>
@dgrisonnet dgrisonnet force-pushed the use-apiserver-sli branch 2 times, most recently from 8f6aa1b to 93d26a9 on April 28, 2023 09:32
@dgrisonnet (Member Author)

The code changes are done, but I still need to figure out how I can test all the areas that I've touched.

If anyone wants to have a look at the PR in the meantime, all the logic for selecting metrics depending on the cluster version has been added in the last two commits.

@wojtek-t (Member)

The code changes are done, but I still need to figure out how I can test all the areas that I've touched.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/perf-tests/2205/pull-perf-tests-clusterloader2/1651882214037655552

From the presubmit - it seems it doesn't work correctly even at head now:

I0428 09:44:30.045738   16698 prometheus.go:439] Waiting for Prometheus stack to become healthy...
W0428 09:45:00.083052   16698 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response (v1.Status): Failure: no endpoints available for service "http:prometheus-k8s:9090", reason: ServiceUnavailable

My suspicion is that prometheus is not coming up correctly...

@wojtek-t (Member)

@dgrisonnet (Member Author)

Thanks for the pointers, I'll try to reproduce the failures locally.

@wojtek-t (Member) left a comment

I looked through the code and it all looks fine. I didn't try to parse all the queries, but I'm fairly sure the problem is in one of them (also judging from the test results).

clusterloader2/pkg/measurement/util/metrics.go: 2 review comments (outdated, resolved)
Depending on the version of the cluster in which clusterloader2 runs,
different apiserver latency metrics are available:
- apiserver_request_duration_seconds
- apiserver_request_slo_duration_seconds (1.23 -> 1.26)
- apiserver_request_sli_duration_seconds (1.26+)

To make sure that clusterloader2 always uses the most accurate metric,
the apiserver measurement queries are now selected depending on the
version of the Kubernetes cluster.

Signed-off-by: Damien Grisonnet <[email protected]>
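
For context, a minimal sketch of the version-dependent selection described in the commit above; the version cutoffs follow the commit message, but the function and package names are illustrative assumptions rather than the PR's actual implementation:

package clusterversion

// apiserverLatencyMetric picks the most accurate latency histogram available
// for a given Kubernetes version:
//   1.26+        -> apiserver_request_sli_duration_seconds
//   1.23 - 1.25  -> apiserver_request_slo_duration_seconds
//   older        -> apiserver_request_duration_seconds
func apiserverLatencyMetric(major, minor int) string {
    switch {
    case major > 1 || (major == 1 && minor >= 26):
        return "apiserver_request_sli_duration_seconds_bucket"
    case major == 1 && minor >= 23:
        return "apiserver_request_slo_duration_seconds_bucket"
    default:
        return "apiserver_request_duration_seconds_bucket"
    }
}
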
@dgrisonnet (Member Author)

Yes, there was an extra parenthesis in the queries. I also took the opportunity to fix a segfault that I introduced in the unit tests, so they should be green now. But I still haven't tested the feature end-to-end.

@wojtek-t (Member)

But I still haven't tested the feature end-to-end.

We will get the test at head for free from the presubmit.
So we only need to test it against e.g. 1.26 and 1.22.
[Technically, 1.22 is out of support already, so probably testing 1.26 is enough...]

@wojtek-t (Member) commented May 2, 2023

I didn't validate the metrics for presubmits, but they look reasonable (so hopefully the right metric is used).

Ideally, we should try to run it against a 1.26 cluster too before merging.

The apiserver measurement recording rule and the various dashboards were
all using the apiserver_request_duration_seconds metric; however, we
want to move to the new SLO/SLI metrics since they give more precise
latency measurements. Because these metrics are only available in
certain versions of Kubernetes, we needed to make the recording rules
and dashboards version aware.

For the recording rules used by the measurements, we want the metrics we
use to be as precise as possible, so we chose to deduplicate the
existing queries with the new metrics and select which one to use in
code, depending on the cluster version.

For dashboards it is a bit more complex since we can't change the query
at runtime, so instead of taking the approach we took for measurements,
we added a new recording rule that uses the correct metric depending on
the cluster version. This has the benefit of not being intrusive to the
existing code, but since the results are now precomputed, the data will
be less fresh than it used to be; the difference is not very significant
from a dashboard perspective, though.

Signed-off-by: Damien Grisonnet <[email protected]>
@dgrisonnet (Member Author)

I tested on both 1.25.8 and 1.27.0 locally and the metrics are correctly selected.

Is there a way I could easily check that I didn't break the dashboards?

@wojtek-t (Member) commented May 4, 2023

I tested on both 1.25.8 and 1.27.0 locally and the metrics are correctly selected.

Great - thanks!

Is there a way I could easily check that I didn't break the dashboards?

This looks fine to me - we can always fix them later if we missed something.

/lgtm
/approve

/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 4, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit bc59038 into kubernetes:master May 4, 2023