-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade kube-prometheus-stack to 67.11.0 #2381
Upgrade kube-prometheus-stack to 67.11.0 #2381
Conversation
ba80518
to
e6fa82f
Compare
Did you account for the change that they mention here? We seem to set these to false in the wc kps config |
current_version=$(helm_do "${cluster}" get metadata -n monitoring kube-prometheus-stack -ojson | jq '.version' | tr -d '"') | ||
|
||
log_info " - Checking if kube-promethes-stack needs to be upgraded" | ||
if [[ ! "${current_version}" < "$(echo -e "${new_version}\n${current_version}" | sort -V | tail -n1)" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Does this comparison work with versions?
Wouldn't this just need to be
if [[ ! "${current_version}" < "$(echo -e "${new_version}\n${current_version}" | sort -V | tail -n1)" ]]; then | |
if [[ "${current_version}" != "${new_version}" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comparison should work in most cases since it sorts lexicographically and the kps helm release will be a couple of major versions behind in most cases. But I like your suggestion, then it should be possible to do downgrade as well. PTAL 78f5a5c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to care about the PromQL change to the dot token?
Just looking quickly I found usage in both alerts and dashboards, I'm sure they are everywhere.
compliantkubernetes-apps/helmfile.d/charts/prometheus-alerts/templates/records/k8s.rules.yaml
Line 130 in c8783f3
"workload", "$1", "owner_name", "(.*)" |
compliantkubernetes-apps/helmfile.d/charts/grafana-dashboards/dashboards/backup-dashboard.json
Line 94 in c8783f3
"expr": "min((time()-kube_job_status_completion_time{job_name=~\"harbor-backup-cronjob-.*\", cluster=~\"$cluster\"})/3600)", |
Have you checked if all dashboards behave the same?
I did a quick check comparing some dashboards that contains this regex and did not see any noticeable difference. |
d01f58c
to
78f5a5c
Compare
@OlleLarsson I missed that, thanks, PTAL bb54e21 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick check comparing some dashboards that contains this regex and did not see any noticeable difference.
Difficult to verify all cases where this is found, I am doubting that we have newlines in any label values that would cause problems here, but if we are unsure we could do as suggested in the migration doc and use[^\n]
for all occurrences.
Do you know if the upstream dashboards/rules have done this?
Nvm, this got reverted in v64, I will go back to how it was before, since this also causes our unit tests to fail due to schema issues. |
Not from what I have seen, you can see that the regular expressions used in the upstream kube-prometheus-stack alerts and dashboards part of this PR did not get changed to address this new change, so I do not think we should be affected. |
78f5a5c
to
b706d6e
Compare
Great, super that you noticed that they reverted the changes in the release after where they introduced the changes 😄! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good. Note that I have not looked very closely at the changelogs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly reviewing the migration.
I updated this PR a bit, moved the migration script to next version since v0.43 is being released atm. |
@Xartos @aarnq @OlleLarsson since you all left comments, do you want to take another look before I merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
9647c4c
to
8648fb1
Compare
Warning
This is a public repository, ensure not to disclose:
What kind of PR is this?
Required: Mark one of the following that is applicable:
Optional: Mark one or more of the following that are applicable:
Important
Breaking changes should be marked
kind/admin-change
orkind/dev-change
depending on typeCritical security fixes should be marked with
kind/security
Application Developer notice
Prometheus has been upgraded to version 3.0. This includes changes to the Prometheus UI. Prometheus V3 comes with some changes that may affect existing PromQL expressions in alerts or dashboards. Please have a look at the Prometheus V3 migration guide.
Security notice
Upgraded Prometheus to v3.1.0 to address CVE-2024-45337
What does this PR do / why do we need this PR?
Noticed that the
kube-prometheus-stack
was falling behind a bit, this PR upgrades the Helm chart to v67.5.0 which also upgrades Prometheus to v3..I checked the v3 migration guide and I did not see that we are currently using any of breaking flags or configurations in our default Welkin config, but please verify if this is used in some environments.
This fixes some ARP metrics and a log issue caused by this in the node-exporter (this is mentioned in the linked issue).
Alertmanager in the Mangement cluster is not upgraded, instead the image version is fixed to previous v0.26.0 due to v0.27.0 deprecating the
v1
API endpoint, which is still used by Thanos.Once we upgrade Thanos to v0.35 or higher, the
v2
endpoint will be default (see related upstream issue) and we can remove the image override.Information to reviewers
Checklist