
Duplicate metrics seem to be emitted #2262

Closed
aiyengar2 opened this issue Sep 16, 2020 · 13 comments

Labels: kind/bug (Something isn't working), kind/internal

@aiyengar2

Environmental Info:
K3s Version:
k3s version v1.18.8+k3s1 (6b59531)

Node(s) CPU architecture, OS, and Version:
Linux 5.4.0-45-generic #49-Ubuntu SMP Wed Aug 26 13:38:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 master, 2 workers

Describe the bug:
The kube-controller-manager metrics endpoint is also sending kube-scheduler / kubelet / kube-apiserver / kube-proxy metrics. This results in 21,598 extra metrics, which increases the memory requirements for monitoring k3s with Prometheus.

Investigated in more detail here: rancher/rancher#28787 (comment).

Steps To Reproduce:

  • Install K3s
  • SSH onto a server node
  • Run curl commands against http://localhost:10251/metrics, http://localhost:10252/metrics, and http://localhost:10249/metrics and inspect the output (a comparison sketch follows this list)
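
As a quick cross-check, below is a minimal sketch (Go, since that is what k3s itself is written in) that does the same comparison programmatically: it pulls /metrics from the three ports and counts how many metric families from the scheduler port also show up on the controller-manager port. It assumes it runs on a k3s server node where these endpoints answer on localhost without authentication, as in the versions discussed here; it is illustrative only and not part of k3s.

    package main

    import (
        "bufio"
        "fmt"
        "net/http"
        "strings"
    )

    // fetchMetricNames returns the set of metric family names exposed at url,
    // parsed from the "# TYPE <name> <type>" lines of the Prometheus text format.
    func fetchMetricNames(url string) (map[string]bool, error) {
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        names := map[string]bool{}
        scanner := bufio.NewScanner(resp.Body)
        scanner.Buffer(make([]byte, 1024*1024), 1024*1024)
        for scanner.Scan() {
            fields := strings.Fields(scanner.Text())
            if len(fields) >= 3 && fields[0] == "#" && fields[1] == "TYPE" {
                names[fields[2]] = true
            }
        }
        return names, scanner.Err()
    }

    func main() {
        // 10249: kube-proxy, 10251: kube-scheduler, 10252: kube-controller-manager
        ports := []string{"10249", "10251", "10252"}
        sets := map[string]map[string]bool{}
        for _, p := range ports {
            s, err := fetchMetricNames("http://localhost:" + p + "/metrics")
            if err != nil {
                fmt.Println("error fetching port", p+":", err)
                return
            }
            sets[p] = s
            fmt.Printf("port %s exposes %d metric families\n", p, len(s))
        }

        // On an affected node nearly every family on the scheduler port also
        // appears on the controller-manager port.
        shared := 0
        for name := range sets["10251"] {
            if sets["10252"][name] {
                shared++
            }
        }
        fmt.Printf("%d of %d families on :10251 also appear on :10252\n", shared, len(sets["10251"]))
    }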

Expected behavior:

Metrics at each endpoint should correspond only to the component that owns that port, i.e.
10249 - kube-proxy
10251 - kube-scheduler
10252 - kube-controller-manager

Actual behavior:

All three seem to be emitting the same metrics.

Additional context / logs:

Based on a conversation with @brandond, it seems like the Prometheus library might be using the same backend (registry) for all three exposed ports.

Either each endpoint should be backed by its own registry, or the metrics should only be emitted on one port (i.e. a single set of metrics from the unified k3s server process).
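
To illustrate the suspected root cause, here is a minimal, hypothetical sketch using prometheus/client_golang (not k3s's actual code): when every component compiled into one process registers its collectors into the same registry, and that registry's handler is served on several ports, every port returns the identical, combined set of metrics. The metric names below are stand-ins.

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        // One registry shared by every "component" in the binary, roughly what
        // happens when scheduler, controller-manager, kube-proxy, etc. all run
        // inside a single k3s server process.
        shared := prometheus.NewRegistry()

        schedulerAttempts := prometheus.NewCounter(prometheus.CounterOpts{
            Name: "demo_scheduler_attempts_total",
            Help: "Stand-in for a scheduler-owned metric.",
        })
        controllerSyncs := prometheus.NewCounter(prometheus.CounterOpts{
            Name: "demo_controller_syncs_total",
            Help: "Stand-in for a controller-manager-owned metric.",
        })
        shared.MustRegister(schedulerAttempts, controllerSyncs)

        // The same handler, backed by the same registry, is mounted on both
        // ports, so /metrics on :10251 and :10252 return identical output.
        mux := http.NewServeMux()
        mux.Handle("/metrics", promhttp.HandlerFor(shared, promhttp.HandlerOpts{}))

        go func() { log.Fatal(http.ListenAndServe(":10251", mux)) }()
        log.Fatal(http.ListenAndServe(":10252", mux))
    }

Under that model, the first option above amounts to giving each component its own registry, while the second simply picks one of these identical endpoints to scrape.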

@aiyengar2
Author

aiyengar2 commented Sep 16, 2020

Depending on how the story for monitoring k3s changes as a result of this ticket, we'll need to open an issue in rancher/rancher to modify the Rancher Monitoring chart's behavior for k3s clusters once this one is resolved.

Context: rancher/rancher#28787 (comment)

@davidnuzik
Contributor

davidnuzik commented Oct 7, 2020

See if this is an issue in v1.18.x - pretty sure it's not but would like to verify.
@rancher-max could you check this? Please leave a comment with your findings.

@rancher-max
Contributor

This happens with many different k3s versions; I was able to confirm it quickly on v1.18.2 and v1.17.4 as well, using a single-node k3s cluster. I also tried to reproduce this in RKE2, but it was not reproducible there.

Issue Description:

Rancher's Monitoring V2 attempts to scrape metrics for kube-proxy, kube-controller-manager, and kube-scheduler. Monitoring V1 never scraped these metrics, so this is only an issue when importing a k3s cluster into a rancher v2.5.x setup and enabling monitoring v2 on the cluster there.

I agree with the recommendations noted in the original issue description:

  1. Either ensure the appropriate metrics are emitted to each endpoint (WOULD NOT require a change in rancher) OR
  2. Only emit metrics on one port (WOULD require a change in rancher)

@brandond
Member

brandond commented Oct 8, 2020

I'm not sure that the Prometheus exporter architecture, and the way Kubernetes uses it, will allow us to split up the metrics without significantly modifying upstream code.
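
For contrast with the sketch earlier in this issue, here is a hypothetical sketch of what per-port separation would require: a dedicated registry per component, each served by its own handler. Since the upstream Kubernetes components register their collectors against a process-wide registry (the "common Prometheus instance" mentioned later in this thread), redirecting each of them to a dedicated registry like this is the part that would need upstream changes; the names below are illustrative only.

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // serveOn exposes a single registry's /metrics endpoint on its own port.
    func serveOn(addr string, reg *prometheus.Registry) {
        mux := http.NewServeMux()
        mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
        go func() { log.Fatal(http.ListenAndServe(addr, mux)) }()
    }

    func main() {
        // A dedicated registry per component instead of one shared registry.
        schedulerReg := prometheus.NewRegistry()
        controllerReg := prometheus.NewRegistry()

        schedulerReg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
            Name: "demo_scheduler_attempts_total",
            Help: "Stand-in for a scheduler-owned metric.",
        }))
        controllerReg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
            Name: "demo_controller_syncs_total",
            Help: "Stand-in for a controller-manager-owned metric.",
        }))

        // :10251 now serves only scheduler metrics, :10252 only controller metrics.
        serveOn(":10251", schedulerReg)
        serveOn(":10252", controllerReg)
        select {} // keep the process alive
    }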

@cjellick
Contributor

cjellick commented Oct 8, 2020

Based on recent convos, @aiyengar2, I would like to solve this by having the v2 monitoring for k3s scrape just one of the ports. Let's just do the kube-proxy port. Will this cause any problems on the monitoring side?

K3s is different enough that it really makes sense to just scrape a single endpoint. The memory is shared across all k8s components.

Copying some relevant bits from a Slack convo:

@brandond - My understanding is that monitoring v1 didn't scrape the k3s control plane. So it's been this way forever because, as @erikwilson noted, there is a common Prometheus instance for everything in the same process. IMO they should fix it by only scraping one port.

Me - when actually watching those metrics in Prometheus, is it important for the operator to know the source (scheduler vs controller-manager), and do you lose that context if you scrape a single port?

@brandond - They might be keying off the job or port or something in the alerts or dashboards? But then anything that expects to have separate metrics for latency or memory or error rates or whatever else would need some rework for k3s anyways. Shared caches and such are one of the things that make k3s so lightweight but it means that a lot of things that would be individually monitorable on rke or rke2 are not here.

@cjellick
Contributor

cjellick commented Oct 8, 2020

@davidnuzik @maggieliu - this is on the field's list of "Must Haves" for 2.5. Assuming @aiyengar2 doesn't have any significant pushback on the above solution, this issue should bounce back over to the Rancher side.

@aiyengar2
Author

Thanks for looking into this! Opened up an issue on the monitoring side to track the effort described in #2262 (comment).

@davidnuzik
Contributor

Since @aiyengar2 opened an issue to track this (rancher/rancher#29445), I don't think there's a need for this one to stay open. If I'm mistaken, please feel free to re-open this k3s issue.

@davidnuzik
Contributor

davidnuzik commented Oct 13, 2020

Reopened for tracking purposes. The work is being done via rancher/rancher#29445.
Once that lands and ships in a release, we can validate and close out this K3s issue.

(Assigning to myself in working status, as I should check in on the rancher/rancher issue and wait for the release. When it's ready, assign to QA to test and validate, then close this K3s issue.)

davidnuzik modified the milestones: v1.19.3+k3s1, v1.19.3+k3s2 (Oct 14, 2020)
@davidnuzik
Contributor

Rancher issue rancher/rancher#29445 is closed and validated.
@rancher-max can you give this one more quick test as a priority (this week)?

@rancher-max
Contributor

The solution, as noted above, is that k3s continues to emit the duplicate metrics; however, Rancher Monitoring v2 (v9.4.201) with k3sServer.enabled: true only scrapes one of the ports. Note the results in rancher/rancher#29445 (comment). The clearest way to see that only one port is being scraped for k3s is to run kubectl describe -n cattle-monitoring-system daemonset.apps/pushprox-k3s-server-client. Results show:

Containers:
   pushprox-client:
    Image:      rancher/pushprox-client:v0.1.0-rancher1-client
    Port:       <none>
    Host Port:  <none>
    Command:
      pushprox-client
    Args:
      --fqdn=$(HOST_IP)
      --proxy-url=$(PROXY_URL)
      --metrics-addr=$(PORT)
      --allow-port=10249
      --use-localhost

Note how this is scraping only port 10249. In the previous version of monitoring v2, there were the following DaemonSets: daemonset.apps/pushprox-kube-controller-manager-client, daemonset.apps/pushprox-kube-proxy-client, and daemonset.apps/pushprox-kube-scheduler-client, which scraped ports 10252, 10249, and 10251 respectively.
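
As a programmatic alternative to the kubectl describe check above, the following client-go sketch (a hypothetical helper, not part of the chart or of k3s) reads the DaemonSet's container args directly so you can confirm that --allow-port=10249 is the only port allowed; the kubeconfig path is an assumption and may need adjusting for your environment.

    package main

    import (
        "context"
        "fmt"
        "log"
        "os"
        "path/filepath"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Load a kubeconfig the same way kubectl would; the path is an assumption
        // (on a k3s server node it is typically /etc/rancher/k3s/k3s.yaml).
        kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
        config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
        if err != nil {
            log.Fatal(err)
        }
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            log.Fatal(err)
        }

        // Fetch the pushprox client DaemonSet deployed by monitoring v2 and print
        // its container args; only a single --allow-port entry is expected.
        ds, err := clientset.AppsV1().DaemonSets("cattle-monitoring-system").Get(
            context.TODO(), "pushprox-k3s-server-client", metav1.GetOptions{})
        if err != nil {
            log.Fatal(err)
        }
        for _, arg := range ds.Spec.Template.Spec.Containers[0].Args {
            fmt.Println(arg)
        }
    }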

Based on the suggestions that were laid out in this issue, I have validated that this has been successfully fixed according to the design.

@davidnuzik
Contributor

@rancher-max if "k3s is continuing to emit the duplicate metrics", then I would think this still needs fixing in K3s, and thus the issue should probably stay open. If you agree, please reopen.

@rancher-max
Contributor

I think it's okay to be closed based on this comment: #2262 (comment)
