clusterloader2: use new apiserver latency SLI #2205
Conversation
/assign @tosi3k
Thanks for the pull request.
Judging by the linked PR's contents I think you rather meant this will work for Kubernetes 1.26+ (since you use the new name of the metric rather than the old one), right?
Ah yes, definitely. I'll update the description, sorry for the confusion.
Regarding compatibility - WDYT about moving this change under some flag that would either use the old or the new metric? By default, we'd still use the old metric, but the user could override that behavior by setting the flag in order to switch to the webhook-less latency metric. For the configuration of dashboards, I believe we could just include two separate charts for both metrics for now rather than fully replace the old metric in the graphs.
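A minimal sketch of what such an opt-in flag could look like in clusterloader2's Go code. The flag name, constants, and function are hypothetical and only illustrate the idea, not the project's actual configuration surface:

```go
package main

import (
	"flag"
	"fmt"
)

// Metric names exposed by kube-apiserver; the SLI variant excludes webhook
// latency but only exists on newer clusters.
const (
	requestDurationMetric = "apiserver_request_duration_seconds"
	requestSLIMetric      = "apiserver_request_sli_duration_seconds"
)

// useSLILatencyMetric is a hypothetical opt-in flag: by default the old,
// stable metric is used and the user can switch to the webhook-less one.
var useSLILatencyMetric = flag.Bool("use-sli-latency-metric", false,
	"Use apiserver_request_sli_duration_seconds instead of apiserver_request_duration_seconds.")

// latencyMetricName returns the metric the measurement queries should target.
func latencyMetricName() string {
	if *useSLILatencyMetric {
		return requestSLIMetric
	}
	return requestDurationMetric
}

func main() {
	flag.Parse()
	fmt.Printf("building queries against %s_bucket\n", latencyMetricName())
}
```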
That's a good idea to make it compatible, but I don't think the users should have this level of awareness. It would be best in my opinion if this were seamless and clusterloader2 would just use the better-suited metrics. I can see two options to make it so that users don't have to know about this technical detail: either add a Prometheus recording rule that always records the right metric under a common name, or detect the cluster version in clusterloader2 and pick the metric in code.
The potential problem with the recording-rule approach is that if we don't aggregate away any labels in such a rule, the evaluation will take quite some time considering the sheer number of series to return. Also, since recording rules are evaluated on a periodic basis (configurable), the metrics won't be as fresh as they used to be.
I like this idea very much! I think that we might just focus on the K8s minor version (instead of having some sophisticated Semantic Versioning parsing) against which we run ClusterLoader2. We could fetch this piece of information simply from the
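A rough sketch of the minor-version check suggested above, assuming the cluster version string has already been obtained somehow (the comment above is truncated, so the source of that string is left open here). It uses the k8s.io/apimachinery version helpers rather than full semver parsing:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

// minorAtLeast reports whether a cluster version string like "v1.26.3" is at
// least the given major.minor. Only the major and minor components matter.
func minorAtLeast(clusterVersion string, major, minor uint) (bool, error) {
	v, err := version.ParseGeneric(clusterVersion)
	if err != nil {
		return false, err
	}
	min := version.MustParseGeneric(fmt.Sprintf("%d.%d", major, minor))
	return v.AtLeast(min), nil
}

func main() {
	ok, err := minorAtLeast("v1.26.3", 1, 26)
	if err != nil {
		panic(err)
	}
	fmt.Println("cluster is at least 1.26:", ok) // true
}
```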
Thank you for the pointers, I will look into adding this functionality.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
@dgrisonnet - will you have time to push this forward?
I'll try to spend some time finishing this in the upcoming weeks. /remove-lifecycle stale
In Kubernetes 1.23.0, a new request latency metric was introduced that is better suited for SLOs since it doesn't account for webhook latency. Since this new metric is better suited to clusterloader2 than the general-purpose one, it would make sense to update the existing Prometheus rules. Note that this implies moving from a STABLE metric to an ALPHA one, but considering that the new metric has already been around for 3 releases and that it has a real-life scenario attached to it, it is very unlikely to be removed in the future. Signed-off-by: Damien Grisonnet <[email protected]>
Force-pushed from 8f6aa1b to 93d26a9.
The code changes are done, but I still need to figure out how I can test all the areas that I've touched. If anyone wants to have a look at the PR in the meantime, all the logic for selecting metrics depending on the cluster version has been added in the last two commits.
From the presubmit - it seems it doesn't work correctly even at head now:
My suspicion is that Prometheus is not coming up correctly...
I guess this test explains more:
Force-pushed from 93d26a9 to d9c381e.
Thanks for the pointers, I'll try to reproduce the failures locally.
I looked through the code and it all looks fine. I didn't try to parse all the queries, but I'm fairly sure the problem is in some of the queries (also from looking into the test results).
Force-pushed from d9c381e to aba6be1.
Depending on the version of the cluster in which clusterloader2 runs, there are different apiserver latency metrics available:
- apiserver_request_duration_seconds
- apiserver_request_slo_duration_seconds (1.23 -> 1.26)
- apiserver_request_sli_duration_seconds (1.26+)

To make sure that clusterloader2 always uses the most accurate metric, the measurement queries for the apiserver are now updated depending on the version of the Kubernetes cluster. Signed-off-by: Damien Grisonnet <[email protected]>
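A rough sketch of what this version-based selection could look like; this is not the PR's actual code, and the function name is made up, but the metric names and version thresholds follow the commit message above:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

// apiserverLatencyMetric picks the most accurate latency histogram available
// for the given cluster version, falling back to the stable general-purpose
// metric on older clusters.
func apiserverLatencyMetric(clusterVersion *version.Version) string {
	switch {
	case clusterVersion.AtLeast(version.MustParseGeneric("1.26")):
		return "apiserver_request_sli_duration_seconds"
	case clusterVersion.AtLeast(version.MustParseGeneric("1.23")):
		return "apiserver_request_slo_duration_seconds"
	default:
		return "apiserver_request_duration_seconds"
	}
}

func main() {
	for _, v := range []string{"v1.22.0", "v1.25.8", "v1.27.0"} {
		fmt.Println(v, "->", apiserverLatencyMetric(version.MustParseGeneric(v)))
	}
}
```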
Yes, there was an extra parenthesis in the queries. I also took the opportunity to fix a segfault that I introduced in the unit tests, so they should be green now. But I still haven't tested the feature end-to-end.
Force-pushed from aba6be1 to 99b140b.
The test at head we will get for free from the presubmit.
I didn't validate the metrics for the presubmits, but they look reasonable (so hopefully the right metric is used). Ideally, we should try to run it against a 1.26 cluster too before merging.
The apiserver measurement recording rule and the various dashboards were all using the apiserver_request_duration_seconds metric, however we want to move to the new SLO/SLI metrics since they give more precise latency measurements. Because these metrics are only available in certain versions of Kubernetes, we needed to make the recording rules and dashboards version-aware.

For the recording rules used for measurements, we want the metrics that we use to be as precise as possible, so we chose to duplicate the existing queries with the new metrics and select which one to use in code depending on the cluster version.

For dashboards it is a bit more complex since we can't change the query at runtime, so instead of taking the approach that we took for measurements, we just added a new recording rule that uses the correct metric depending on the cluster version. This has the benefit of not being intrusive with the existing code, but at the same time, since the results are now precomputed, the data will be less fresh than it used to be; the difference is not very significant from a dashboard perspective, though.

Signed-off-by: Damien Grisonnet <[email protected]>
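For illustration only — not necessarily how this PR writes its rule — one way a single recording rule can stay version-agnostic is to chain the candidate histograms with PromQL's `or` operator, so whichever metric exists on the cluster feeds the precomputed series. The expression is shown as a Go string constant for consistency with the other sketches, and the label set is assumed:

```go
package main

import "fmt"

// dashboardRuleExpr is an illustrative recording-rule expression: "or" falls
// back to the next metric when the preferred one is absent on the cluster.
const dashboardRuleExpr = `
sum(rate(apiserver_request_sli_duration_seconds_bucket[5m])) by (verb, resource, le)
  or
sum(rate(apiserver_request_slo_duration_seconds_bucket[5m])) by (verb, resource, le)
  or
sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (verb, resource, le)
`

func main() {
	fmt.Println(dashboardRuleExpr)
}
```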
Force-pushed from 99b140b to ee3c9c1.
I tested on both 1.25.8 and 1.27.0 locally and the metrics are correctly selected. Is there a way I could easily check that I didn't break the dashboards?
Great - thanks!
This looks fine to me - we can always fix them later if we missed something. /lgtm /hold cancel
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dgrisonnet, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
In Kubernetes 1.23.0, a new request latency metric was introduced that is better suited for SLOs since it doesn't account for webhook latency. Since this new metric is better suited to clusterloader2 than the general-purpose one, it would make sense to update the existing Prometheus rules.
Note that this implies moving from a STABLE metric to an ALPHA one, but considering that the new metric has already been around for 3 releases and that it has a real-life scenario attached to it, it is very unlikely to be removed in the future.
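To make the change concrete, here is a hedged example of the kind of quantile query a measurement could issue once the webhook-less metric is selected; the exact labels and aggregation used by the real rules may differ:

```go
package main

import "fmt"

// latencyQuantileQuery builds a PromQL query for the 99th-percentile apiserver
// request latency over the given histogram metric. The label set shown here
// (verb, resource, subresource, scope) is only illustrative.
func latencyQuantileQuery(metric string) string {
	return fmt.Sprintf(
		"histogram_quantile(0.99, sum(rate(%s_bucket[5m])) by (verb, resource, subresource, scope, le))",
		metric)
}

func main() {
	// On a 1.26+ cluster the SLI metric would be used; older clusters fall
	// back to apiserver_request_duration_seconds.
	fmt.Println(latencyQuantileQuery("apiserver_request_sli_duration_seconds"))
}
```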
Special notes for your reviewer:
This metric was renamed in Kubernetes 1.26 (kubernetes/kubernetes#112679), which means that the queries will only work on clusters from 1.26 onward. Would that be problematic for the support policy of clusterloader2?
This PR will require the release of Kubernetes 1.26 as well as a new version of metrics-server based on the apiserver code post 1.26. For these reasons, I will put it on hold for now.
/hold
/cc @marseel