Duplicate metrics seem to be emitted #2262
Depending on how the story for monitoring metrics on k3s changes based on this ticket, we'll need to open a follow-up ticket. Context: rancher/rancher#28787 (comment)
See if this is an issue in v1.18.x - pretty sure it's not but would like to verify.
This happens with many different k3s versions. I was able to confirm quickly using v1.18.2 and v1.17.4 as well with a single-node k3s cluster. I tried to reproduce this in rke2 as well, but it was not reproducible there.

Issue Description: Rancher's Monitoring V2 attempts to scrape metrics for kube-proxy, kube-controller-manager, and kube-scheduler. Monitoring V1 never scraped these metrics, so this is only an issue when importing a k3s cluster into a Rancher v2.5.x setup and enabling Monitoring V2 on the cluster there. I agree with the recommendations noted in the original issue description.
I'm not sure the Prometheus exporter architecture and how Kubernetes uses it will allow us to split up the metrics without significantly modifying upstream code.
Based on recent convos, @aiyengar2, I would like to solve this by having the v2 monitoring for k3s just scrape one of the ports. Let's just do the kube-proxy port. Will this cause any problems on the monitoring side? K3s is different enough that it really makes sense to just scrape a single endpoint; the memory is shared across all k8s components.

Copying some relevant bits from a Slack convo:

@brandond - My understanding is that monitoring v1 didn't scrape the k3s control plane. So, it's been this way forever because, as @erikwilson noted, there is a common Prometheus instance for everything in the same process. IMO they should fix it by only scraping one port.

Me - In actually watching those metrics in Prometheus, is it important for the operator to know the source (scheduler vs controller-manager), and do you lose that context if you scrape a single port?

@brandond - They might be keying off the job or port or something in the alerts or dashboards? But then anything that expects to have separate metrics for latency or memory or error rates or whatever else would need some rework for k3s anyways. Shared caches and such are one of the things that make k3s so lightweight, but it means that a lot of things that would be individually monitorable on rke or rke2 are not here.
@davidnuzik @maggieliu - this is on the field's list of "Must Haves" for 2.5. Assuming @aiyengar2 doesn't have any significant pushback on the above solution, this issue should bounce back over to the Rancher side.
Thanks for looking into this! Opened up an issue on the monitoring side to track the effort described in #2262 (comment).
Since an issue was opened by @aiyengar2 to track this (rancher/rancher#29445), I don't think there's a need for this one to stay open? If I'm mistaken, someone please feel free to re-open this k3s issue.
Reopened for tracking purposes. The work is being done via rancher/rancher#29445. (Assigning to myself in working status; I should check in on the rancher/rancher issue and wait for release. When ready, assign to QA to test and validate, then close this K3s issue.)
Rancher issue rancher/rancher#29445 is closed and validated.
The solution as noted above is that k3s is continuing to emit the duplicate metrics. However, rancher monitoring v2 now scrapes only a single endpoint; note how this is scraping port 10249. In the previous version of monitoring v2, there were separate DaemonSets for each of these components. Based on the suggestions that were laid out in this issue, I have validated that this has been successfully fixed according to the design.
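For anyone who wants to double-check which control-plane endpoints the monitoring stack actually ends up scraping, here is a rough Go sketch (not part of the validation above) that lists a Prometheus server's active scrape targets via its standard /api/v1/targets API. The localhost:9090 address is an assumption, e.g. the rancher-monitoring Prometheus service reached via kubectl port-forward.

```go
// list_scrape_targets.go: print every active scrape target of a Prometheus
// server, so you can confirm which control-plane ports are being scraped.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// targetsResponse models only the parts of the /api/v1/targets response we need.
type targetsResponse struct {
	Data struct {
		ActiveTargets []struct {
			ScrapeURL string            `json:"scrapeUrl"`
			Labels    map[string]string `json:"labels"`
		} `json:"activeTargets"`
	} `json:"data"`
}

func main() {
	// Assumes the Prometheus API is reachable here (e.g. via `kubectl port-forward`).
	resp, err := http.Get("http://localhost:9090/api/v1/targets")
	if err != nil {
		fmt.Println("error querying Prometheus:", err)
		return
	}
	defer resp.Body.Close()

	var tr targetsResponse
	if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
		fmt.Println("error decoding response:", err)
		return
	}
	for _, t := range tr.Data.ActiveTargets {
		fmt.Printf("%s (job=%s)\n", t.ScrapeURL, t.Labels["job"])
	}
}
```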
@rancher-max if "k3s is continuing to emit the duplicate metrics", then I would think this still needs fixing in K3s and the issue should probably stay open. If you agree, please reopen.
I think it's okay to be closed based on this comment: #2262 (comment)
Environmental Info:
K3s Version:
k3s version v1.18.8+k3s1 (6b59531)
Node(s) CPU architecture, OS, and Version:
Linux 5.4.0-45-generic #49-Ubuntu SMP Wed Aug 26 13:38:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
1 master, 2 workers
Describe the bug:
The kube-controller-manager metrics endpoint is also sending kube-scheduler / kubelet / kube-apiserver / kube-proxy metrics. This results in an extra 21,598 metrics, which increases the memory requirements for monitoring k3s using Prometheus. Investigated in more detail here: rancher/rancher#28787 (comment).
Steps To Reproduce:
Curl http://localhost:10251/metrics, http://localhost:10252/metrics, and http://localhost:10249/metrics and inspect the output.
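As a rough aid for the step above, here is a minimal Go sketch (not part of the original report) that fetches the three endpoints and counts how many metric names show up on every port. It assumes it is run on the server node, where the ports are reachable on localhost.

```go
// duplicate_check.go: compare the metric names exposed on the three
// control-plane metrics ports. Illustrative sketch only.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// metricNames fetches a /metrics endpoint and returns the set of metric names,
// taken from the first token of every non-comment line of the text exposition.
func metricNames(url string) (map[string]bool, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	names := map[string]bool{}
	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // some exposition lines are long
	for scanner.Scan() {
		line := scanner.Text()
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at '{' (labels follow) or at the first space (value follows).
		if end := strings.IndexAny(line, "{ "); end > 0 {
			names[line[:end]] = true
		}
	}
	return names, scanner.Err()
}

func main() {
	// 10249 = kube-proxy, 10251 = kube-scheduler, 10252 = kube-controller-manager
	ports := []string{"10249", "10251", "10252"}
	sets := map[string]map[string]bool{}
	for _, p := range ports {
		names, err := metricNames("http://localhost:" + p + "/metrics")
		if err != nil {
			fmt.Println("error scraping port", p, ":", err)
			return
		}
		sets[p] = names
		fmt.Printf("port %s exposes %d metric names\n", p, len(names))
	}

	// If the endpoints were component-specific, the overlap would be small;
	// on the affected k3s versions the three sets are essentially identical.
	overlap := 0
	for name := range sets["10249"] {
		if sets["10251"][name] && sets["10252"][name] {
			overlap++
		}
	}
	fmt.Printf("metric names present on all three ports: %d\n", overlap)
}
```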
Expected behavior:
Metrics at those endpoints only correspond to the actual metrics for those specific components, i.e.
10249 - kube-proxy
10251 - kube-scheduler
10252 - kube-controller-manager
Actual behavior:
All three seem to be emitting the same metrics.
Additional context / logs:
Based on conversation with @brandond, it seems like the Prometheus lib might be using the same backend for all three exposed ports.
Either the metrics emitted to each endpoint should be backed by a different backend, or the metrics should only be emitted on one port (i.e. a set of metrics from the unified k3s-server).
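For illustration of the hypothesis above (this is not k3s's actual wiring), the sketch below shows how the Prometheus client library's shared default registry can produce this behaviour: when every component in a single process registers its collectors in the same registry and the same handler is served on several ports, every endpoint exposes the combined metric set. The ports and metric names here are made up for the demo.

```go
// shared_registry_demo.go: standalone illustration of a shared default
// registry being exposed on multiple ports.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Two collectors standing in for different "components" of the same process.
	schedulerOps := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "demo_scheduler_operations_total",
		Help: "Stand-in for a scheduler metric.",
	})
	controllerOps := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "demo_controller_manager_operations_total",
		Help: "Stand-in for a controller-manager metric.",
	})
	// Both end up in the shared default registry.
	prometheus.MustRegister(schedulerOps, controllerOps)

	// Serve the default registry on two different (arbitrary demo) ports,
	// mimicking one process exposing several per-component metrics endpoints.
	handler := promhttp.Handler()
	go func() {
		log.Fatal(http.ListenAndServe(":9101", handler))
	}()
	// Curling :9101/metrics and :9102/metrics now returns the same metric set,
	// because both listeners render the same underlying registry.
	log.Fatal(http.ListenAndServe(":9102", handler))
}
```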