
Upgrade kube-prometheus-stack to 67.11.0 #2381

Conversation


@anders-elastisys anders-elastisys commented Dec 30, 2024

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • kind/adr

Application Developer notice

Prometheus has been upgraded to version 3.0. This includes changes to the Prometheus UI, and Prometheus v3 comes with changes that may affect existing PromQL expressions in alerts or dashboards. Please have a look at the Prometheus v3 migration guide.

Security notice

Upgraded Prometheus to v3.1.0 to address CVE-2024-45337

What does this PR do / why do we need this PR?

Noticed that the kube-prometheus-stack was falling behind a bit; this PR upgrades the Helm chart to v67.5.0, which also upgrades Prometheus to v3.0.
I checked the v3 migration guide and did not see that we currently use any of the breaking flags or configurations in our default Welkin config, but please verify whether any of them are used in some environments.

This also fixes some ARP metrics and a related log issue in the node-exporter (mentioned in the linked issue).

Alertmanager in the Management cluster is not upgraded; instead, the image version is pinned to the previous v0.26.0, because v0.27.0 deprecates the v1 API endpoint, which is still used by Thanos.
Once we upgrade Thanos to v0.35 or higher, the v2 endpoint will be the default (see the related upstream issue) and we can remove the image override.
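
As a sketch of what such a pin can look like: the kube-prometheus-stack chart exposes the Alertmanager image under alertmanager.alertmanagerSpec.image, so an override along these lines should work (hypothetical command for illustration; Welkin sets this through its own config rather than a raw helm invocation, and the key path should be verified against the chart version in use):

# Pin the Alertmanager image while taking the chart upgrade.
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 67.11.0 \
  --reuse-values \
  --set alertmanager.alertmanagerSpec.image.tag=v0.26.0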

Information to reviewers

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@anders-elastisys anders-elastisys marked this pull request as ready for review December 30, 2024 10:15
@anders-elastisys anders-elastisys requested review from a team as code owners December 30, 2024 10:15
@anders-elastisys anders-elastisys force-pushed the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch from ba80518 to e6fa82f Compare December 30, 2024 14:31
@OlleLarsson (Contributor)

Did you account for the change that they mention here? We seem to set these to false in the wc kps config

current_version=$(helm_do "${cluster}" get metadata -n monitoring kube-prometheus-stack -ojson | jq '.version' | tr -d '"')

log_info " - Checking if kube-prometheus-stack needs to be upgraded"
if [[ ! "${current_version}" < "$(echo -e "${new_version}\n${current_version}" | sort -V | tail -n1)" ]]; then
Contributor

Question: Does this comparison work with versions?

Wouldn't this just need to be

Suggested change
if [[ ! "${current_version}" < "$(echo -e "${new_version}\n${current_version}" | sort -V | tail -n1)" ]]; then
if [[ "${current_version}" != "${new_version}" ]]; then

@anders-elastisys (Contributor, Author) commented Jan 13, 2025

The comparison should work in most cases, since the string test compares lexicographically and the kps Helm release will usually only be a couple of major versions behind. But I like your suggestion; it also makes downgrades possible. PTAL 78f5a5c
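
To make the edge case concrete, here is a small bash illustration (hypothetical version numbers, not ones we ship) of where the lexicographic < inside [[ ]] can disagree with sort -V, and why the plain inequality is more robust:

#!/usr/bin/env bash

# Hypothetical versions where the digit width changes across the boundary.
current_version="9.5.0"
new_version="10.2.0"

# sort -V orders by version semantics and correctly picks 10.2.0 ...
highest="$(printf '%s\n' "${new_version}" "${current_version}" | sort -V | tail -n1)"
echo "highest by sort -V: ${highest}"   # -> 10.2.0

# ... but < inside [[ ]] compares strings, and "9.5.0" sorts after "10.2.0"
# lexicographically, so this check would wrongly conclude no upgrade is needed.
if [[ "${current_version}" < "${highest}" ]]; then
  echo "lexicographic check: upgrade needed"
else
  echo "lexicographic check: no upgrade needed (wrong for these versions)"
fi

# The suggested inequality sidesteps ordering entirely and also permits downgrades.
if [[ "${current_version}" != "${new_version}" ]]; then
  echo "versions differ: run the migration"
fi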

@Elias-elastisys (Contributor) left a comment

Do we need to care about the PromQL change to the dot token?

Just looking quickly, I found usage in both alerts and dashboards; I'm sure they are everywhere.

"expr": "min((time()-kube_job_status_completion_time{job_name=~\"harbor-backup-cronjob-.*\", cluster=~\"$cluster\"})/3600)",

Have you checked if all dashboards behave the same?

@anders-elastisys (Contributor, Author)

Have you checked if all dashboards behave the same?

I did a quick check comparing some dashboards that contain this regex and did not see any noticeable difference.
It is difficult to verify all the places where this is found, and I doubt that we have newlines in any label values that would cause problems here, but if we are unsure we could do as suggested in the migration doc and use [^\n] for all occurrences.
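
Applied to the expression quoted above, that migration-doc style rewrite would look like this (illustrative only; as discussed, the PR does not actually make this change):

min((time()-kube_job_status_completion_time{job_name=~"harbor-backup-cronjob-.*", cluster=~"$cluster"})/3600)

would become, to restore the pre-v3 behaviour of . not matching newlines:

min((time()-kube_job_status_completion_time{job_name=~"harbor-backup-cronjob-[^\n]*", cluster=~"$cluster"})/3600)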

@anders-elastisys anders-elastisys force-pushed the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch from d01f58c to 78f5a5c Compare January 13, 2025 15:38
@anders-elastisys (Contributor, Author)

@OlleLarsson I missed that, thanks, PTAL bb54e21

@Xartos (Contributor) left a comment

I did a quick check comparing some dashboards that contain this regex and did not see any noticeable difference.
It is difficult to verify all the places where this is found, and I doubt that we have newlines in any label values that would cause problems here, but if we are unsure we could do as suggested in the migration doc and use [^\n] for all occurrences.

Do you know if the upstream dashboards/rules have done this?

migration/v0.43/apply/10-kube-prometheus-stack.sh (review thread outdated, resolved)
@anders-elastisys (Contributor, Author) commented Jan 14, 2025

@OlleLarsson I missed that, thanks, PTAL bb54e21

Nvm, this got reverted in v64; I will go back to how it was before, since this change also causes our unit tests to fail due to schema issues.

@anders-elastisys (Contributor, Author)

Do you know if the upstream dashboards/rules have done this?

Not from what I have seen. You can see that the regular expressions used in the upstream kube-prometheus-stack alerts and dashboards that are part of this PR were not changed to address this, so I do not think we are affected.

@anders-elastisys anders-elastisys force-pushed the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch from 78f5a5c to b706d6e Compare January 15, 2025 08:39
@OlleLarsson (Contributor)

@OlleLarsson I missed that, thanks, PTAL bb54e21

Nvm, this got reverted in v64; I will go back to how it was before, since this change also causes our unit tests to fail due to schema issues.

Great, super that you noticed they reverted the changes in the release right after the one that introduced them 😄!

@viktor-f (Contributor) left a comment

I think this looks good. Note that I have not looked very closely at the changelogs.

@aarnq (Contributor) left a comment

Mainly reviewing the migration.

migration/v0.43/apply/10-kube-prometheus-stack.sh (review thread outdated, resolved)
@anders-elastisys anders-elastisys changed the title Upgrade kube-prometheus-stack to 67.5.0 Upgrade kube-prometheus-stack to 67.11.0 Jan 22, 2025
@anders-elastisys (Contributor, Author)

I updated this PR a bit and moved the migration script to the next version, since v0.43 is being released at the moment.
I also upgraded the chart to the latest minor version, v67.11, which upgrades Prometheus to v3.1.0, since there was a critical CVE in v3.0. I tested upgrading to this new chart version and did not see any major changes; there were some changes to upstream alerts that use the le label, but I did not see that we use this in any of our alerts.
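
For context on the le change: Prometheus v3 normalizes the values of the le label of classic histograms (and the quantile label of summaries) to float formatting, so a matcher written against the old integer form can stop matching newly ingested series. A hypothetical PromQL illustration, not taken from our alerts:

# Pre-v3 series may carry le="1"; series ingested by v3 carry le="1.0".
sum(rate(apiserver_request_duration_seconds_bucket{le="1"}[5m]))

# A tolerant matcher that covers both forms during the transition:
sum(rate(apiserver_request_duration_seconds_bucket{le=~"1(\\.0)?"}[5m]))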

@anders-elastisys (Contributor, Author)

@Xartos @aarnq @OlleLarsson since you all left comments, do you want to take another look before I merge?

@aarnq (Contributor) left a comment

LGTM

@anders-elastisys anders-elastisys force-pushed the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch from 9647c4c to 8648fb1 Compare January 28, 2025 12:19
@anders-elastisys anders-elastisys merged commit e868dec into main Jan 28, 2025
12 checks passed
@anders-elastisys anders-elastisys deleted the anders-elastisys/upgrade-kube-prometheus-stack-prometheus-v3 branch January 28, 2025 13:55
Successfully merging this pull request may close these issues.

Upgrade Kube-prometheus-stack-60.0.0