
Upgrade kube-prometheus-stack to 67.11.0 #2381

Conversation


@anders-elastisys anders-elastisys commented Dec 30, 2024

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • kind/adr

Application Developer notice

Prometheus has been upgraded to version 3.0. This includes changes to the Prometheus UI, and Prometheus v3 comes with changes that may affect existing PromQL expressions in alerts or dashboards. Please have a look at the Prometheus v3 migration guide.

Security notice

Upgraded Prometheus to v3.1.0 to address CVE-2024-45337

What does this PR do / why do we need this PR?

Noticed that the kube-prometheus-stack was falling behind a bit; this PR upgrades the Helm chart to v67.5.0, which also upgrades Prometheus to v3.0.
I checked the v3 migration guide and did not see that we currently use any of the breaking flags or configurations in our default Welkin config, but please verify whether any of them are used in some environments.

This also fixes some ARP metrics and a related log issue in the node-exporter (mentioned in the linked issue).

Alertmanager in the Management cluster is not upgraded; instead, the image version is pinned to the previous v0.26.0, because v0.27.0 deprecates the v1 API endpoint, which is still used by Thanos.
Once we upgrade Thanos to v0.35 or higher, the v2 endpoint will be the default (see the related upstream issue) and we can remove the image override.
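
As a sketch of what such a pin can look like: the kube-prometheus-stack chart exposes the Alertmanager image under alertmanager.alertmanagerSpec.image, so an override along these lines should work (hypothetical command for illustration; Welkin sets this through its own config rather than a raw helm invocation, and the key path should be verified against the chart version in use):

# Pin the Alertmanager image while taking the chart upgrade.
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 67.11.0 \
  --reuse-values \
  --set alertmanager.alertmanagerSpec.image.tag=v0.26.0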

Information to reviewers

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@anders-elastisys anders-elastisys marked this pull request as ready for review December 30, 2024 10:15
@anders-elastisys anders-elastisys requested review from a team as code owners December 30, 2024 10:15
@anders-elastisys anders-elastisys force-pushed the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch from ba80518 to e6fa82f Compare December 30, 2024 14:31
@OlleLarsson (Contributor)

Did you account for the change that they mention here? We seem to set these to false in the wc kps config

current_version=$(helm_do "${cluster}" get metadata -n monitoring kube-prometheus-stack -ojson | jq '.version' | tr -d '"')

log_info " - Checking if kube-prometheus-stack needs to be upgraded"
if [[ ! "${current_version}" < "$(echo -e "${new_version}\n${current_version}" | sort -V | tail -n1)" ]]; then
Contributor

Question: Does this comparison work with versions?

Wouldn't this just need to be

Suggested change
if [[ ! "${current_version}" < "$(echo -e "${new_version}\n${current_version}" | sort -V | tail -n1)" ]]; then
if [[ "${current_version}" != "${new_version}" ]]; then

@anders-elastisys (Contributor, Author) commented Jan 13, 2025

The comparison should work in most cases, since the string test compares lexicographically and the kps Helm release will usually only be a couple of major versions behind. But I like your suggestion; it also makes downgrades possible. PTAL 78f5a5c
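
To make the edge case concrete, here is a small bash illustration (hypothetical version numbers, not ones we ship) of where the lexicographic < inside [[ ]] can disagree with sort -V, and why the plain inequality is more robust:

#!/usr/bin/env bash

# Hypothetical versions where the digit width changes across the boundary.
current_version="9.5.0"
new_version="10.2.0"

# sort -V orders by version semantics and correctly picks 10.2.0 ...
highest="$(printf '%s\n' "${new_version}" "${current_version}" | sort -V | tail -n1)"
echo "highest by sort -V: ${highest}"   # -> 10.2.0

# ... but < inside [[ ]] compares strings, and "9.5.0" sorts after "10.2.0"
# lexicographically, so this check would wrongly conclude no upgrade is needed.
if [[ "${current_version}" < "${highest}" ]]; then
  echo "lexicographic check: upgrade needed"
else
  echo "lexicographic check: no upgrade needed (wrong for these versions)"
fi

# The suggested inequality sidesteps ordering entirely and also permits downgrades.
if [[ "${current_version}" != "${new_version}" ]]; then
  echo "versions differ: run the migration"
fi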

@Elias-elastisys (Contributor) left a comment

Do we need to care about the PromQL change to the dot token?

Just looking quickly, I found usage in both alerts and dashboards; I'm sure they are everywhere.

"expr": "min((time()-kube_job_status_completion_time{job_name=~\"harbor-backup-cronjob-.*\", cluster=~\"$cluster\"})/3600)",

Have you checked if all dashboards behave the same?

@anders-elastisys (Contributor, Author)

Have you checked if all dashboards behave the same?

I did a quick check comparing some dashboards that contain this regex and did not see any noticeable difference.
It is difficult to verify all the places where this is found, and I doubt that we have newlines in any label values that would cause problems here, but if we are unsure we could do as suggested in the migration doc and use [^\n] for all occurrences.
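
Applied to the expression quoted above, that migration-doc style rewrite would look like this (illustrative only; as discussed, the PR does not actually make this change):

min((time()-kube_job_status_completion_time{job_name=~"harbor-backup-cronjob-.*", cluster=~"$cluster"})/3600)

would become, to restore the pre-v3 behaviour of . not matching newlines:

min((time()-kube_job_status_completion_time{job_name=~"harbor-backup-cronjob-[^\n]*", cluster=~"$cluster"})/3600)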

@anders-elastisys anders-elastisys force-pushed the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch from d01f58c to 78f5a5c Compare January 13, 2025 15:38
@anders-elastisys (Contributor, Author)

@OlleLarsson I missed that, thanks, PTAL bb54e21

@Xartos (Contributor) left a comment

I did a quick check comparing some dashboards that contain this regex and did not see any noticeable difference.
It is difficult to verify all the places where this is found, and I doubt that we have newlines in any label values that would cause problems here, but if we are unsure we could do as suggested in the migration doc and use [^\n] for all occurrences.

Do you know if the upstream dashboards/rules have done this?

migration/v0.43/apply/10-kube-prometheus-stack.sh (review thread outdated, resolved)
@anders-elastisys (Contributor, Author) commented Jan 14, 2025

@OlleLarsson I missed that, thanks, PTAL bb54e21

Nvm, this got reverted in v64; I will go back to how it was before, since this change also causes our unit tests to fail due to schema issues.

@anders-elastisys (Contributor, Author)

Do you know if the upstream dashboards/rules have done this?

Not from what I have seen. You can see that the regular expressions used in the upstream kube-prometheus-stack alerts and dashboards that are part of this PR were not changed to address this, so I do not think we are affected.

@anders-elastisys anders-elastisys force-pushed the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch from 78f5a5c to b706d6e Compare January 15, 2025 08:39
@OlleLarsson (Contributor)

@OlleLarsson I missed that, thanks, PTAL bb54e21

Nvm, this got reverted in v64; I will go back to how it was before, since this change also causes our unit tests to fail due to schema issues.

Great, super that you noticed they reverted the changes in the release right after the one that introduced them 😄!

@viktor-f (Contributor) left a comment

I think this looks good. Note that I have not looked very closely at the changelogs.

@aarnq (Contributor) left a comment

Mainly reviewing the migration.

migration/v0.43/apply/10-kube-prometheus-stack.sh (review thread outdated, resolved)
@anders-elastisys anders-elastisys changed the title Upgrade kube-prometheus-stack to 67.5.0 Upgrade kube-prometheus-stack to 67.11.0 Jan 22, 2025
@anders-elastisys (Contributor, Author)

I updated this PR a bit and moved the migration script to the next version, since v0.43 is being released at the moment.
I also upgraded the chart to the latest minor version, v67.11, which upgrades Prometheus to v3.1.0, since there was a critical CVE in v3.0. I tested upgrading to this new chart version and did not see any major changes; there were some changes to upstream alerts that use the le label, but I did not see that we use this in any of our alerts.
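
For context on the le change: Prometheus v3 normalizes the values of the le label of classic histograms (and the quantile label of summaries) to float formatting, so a matcher written against the old integer form can stop matching newly ingested series. A hypothetical PromQL illustration, not taken from our alerts:

# Pre-v3 series may carry le="1"; series ingested by v3 carry le="1.0".
sum(rate(apiserver_request_duration_seconds_bucket{le="1"}[5m]))

# A tolerant matcher that covers both forms during the transition:
sum(rate(apiserver_request_duration_seconds_bucket{le=~"1(\\.0)?"}[5m]))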

@anders-elastisys (Contributor, Author)

@Xartos @aarnq @OlleLarsson since you all left comments, do you want to take another look before I merge?

@aarnq (Contributor) left a comment

LGTM

@anders-elastisys anders-elastisys force-pushed the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch from 9647c4c to 8648fb1 Compare January 28, 2025 12:19
@anders-elastisys anders-elastisys merged commit e868dec into main Jan 28, 2025
12 checks passed
@anders-elastisys anders-elastisys deleted the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch January 28, 2025 13:55
Successfully merging this pull request may close these issues.

Upgrade Kube-prometheus-stack-60.0.0