
[Metricbeat] [Kubernetes] Fix metrics and dashboards from scheduler, proxy and controller manager #34161

Merged
merged 19 commits into from
Jan 5, 2023

Conversation

constanca-m
Contributor

@constanca-m constanca-m commented Jan 3, 2023

What does this PR do?

This PR updates the metrics collected by controller manager, proxy and scheduler kubernetes endpoint. Dashboards are also updated.

Why is it important?

Scheduler metrics:

  • process_max_fds added - interesting to relate to the already captured process_open_fds
  • rest_client_request_duration_seconds replaces deprecated metric http_request_duration_microseconds
  • rest_client_request_size_bytes replaces deprecated metric http_request_size_bytes
  • rest_client_response_size_bytes replaces deprecated metric http_response_size_bytes
  • rest_client_requests_total replaces deprecated metric http_requests_total
  • workqueue_longest_running_processor_seconds added
  • workqueue_unfinished_work_seconds added
  • workqueue_adds_total added
  • workqueue_depth added
  • workqueue_retries_total added
  • scheduler_pending_pods added
  • scheduler_preemption_victims replaces deprecated scheduler_pod_preemption_victims
  • scheduler_preemption_attempts_total - interesting to relate to scheduler_preemption_victims
  • scheduler_scheduling_attempt_duration_seconds replaces deprecated metric scheduler_e2e_scheduling_duration_seconds
  • scheduler_schedule_attempts_total removed, can be similarly obtained through scheduler_scheduling_attempt_duration_seconds
  • leader_election_master_status added
  • queue, event and profile from scheduler_* metrics added
  • name from workqueue_* and leader_* metrics added
  • handler removed, not available

Controller manager metrics:

  • rest_client_response_size_bytes
  • rest_client_request_size_bytes
  • node_collector_evictions_total replaces node_collector_evictions_number
  • url removed, not available

Proxy metrics:

  • process_max_fds - interesting to relate to the already captured process_open_fds
  • rest_client_request_duration_seconds replaces deprecated metric http_request_duration_microseconds
  • rest_client_request_size_bytes replaces deprecated metric http_request_size_bytes
  • rest_client_response_size_bytes replaces deprecated metric http_response_size_bytes
  • handler removed, not available
  • verb label added for the captured rest_client_* metrics (except rest_client_requests_total)

scheduler_scheduling_attempt_duration_seconds and rest_client_request_duration_seconds metrics converted to microseconds for the reason mentioned in #32486.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Metrics are relevant
  • The purpose of the dashboard is clear
  • The dashboards highlight possible problems

How to test this PR locally

  1. Build the metricbeat image using this PR branch.
  2. Deploy metricbeat with it and proxy, scheduler and controller manager metricsets enabled:
    - module: kubernetes
      metricsets:
        - proxy
      period: 10s
      host: ${NODE_NAME}
      hosts: ["localhost:10249"]
    
    - module: kubernetes
      metricsets:
        - controllermanager
      period: 10s
      host: ${NODE_NAME}
      hosts: ['https://0.0.0.0:10257']
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.verification_mode: "none"
    
    - module: kubernetes
      metricsets:
        - scheduler
      period: 10s
      host: ${NODE_NAME}
      hosts: ['https://0.0.0.0:10259']
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.verification_mode: "none"

Use, for example, kind to have access to control plane components endpoint and set up the new dashboards.

Related issues

Screenshots

Proxy dashboard:

It is not great to have requests, client error and server error counter rates in different visualizations, but I couldn't find a way to put them together without the visualization becoming too confusing to read (following the assumption that there will be more than one host selected, which seems more likely than not).

Scheduler dashboard:

Unlike the proxy dashboard, requests and responses counter rates are put together, as it is likely that there will be nodes that don't have control plane components, which makes it more readable.

Controller manager dashboard:
Followed by Workqueue, Process and HTTP Requests similar to scheduler dashboard.

Limitations

Limitation 1: If the user selects a period different than 10s (the default period as of today), the results will likely be incorrect.

Example: metricset.period 5s and minimum interval of 10s.
Data from rest_client_requests_total (counter) example:

Request data                                       01:00:00h  01:00:05h  01:00:10h
code="200",host="172.18.0.5:6443",method="GET"         10         15         40
code="201",host="172.18.0.5:6443",method="POST"         1          3          7
Total                                                  11         18         47

With an interval of 5s, the total of requests will be 11, 18 and 47 respectively.
With an interval of 10s, the total of requests will be 11 and 65 (18+47), because there is no way to do last_value(sum...) using a time series.
Even in visualizations that don't have that problem, higher intervals might not reflect any spikes.
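The bucketing effect above can be reproduced with a small Go sketch. The bucket alignment and the sum-all-samples-per-bucket behaviour are illustrative assumptions about how a date histogram aggregates the example data, not the actual Lens implementation:

```go
package main

import "fmt"

// Per-timestamp totals of rest_client_requests_total across both label sets,
// taken from the example table (01:00:00h, 01:00:05h, 01:00:10h).
var times = []int{0, 5, 10} // seconds since the first sample
var totals = []int{11, 18, 47}

// sumByInterval mimics a date-histogram "sum" aggregation: every sample whose
// timestamp lands in the same bucket is added together, with no per-series
// last_value step in between. offset models where the bucket boundaries fall.
func sumByInterval(interval, offset int) map[int]int {
	buckets := map[int]int{}
	for i, t := range times {
		buckets[(t+offset)/interval] += totals[i]
	}
	return buckets
}

func main() {
	fmt.Println(sumByInterval(5, 0))  // one sample per bucket: 11, 18, 47
	fmt.Println(sumByInterval(10, 5)) // 01:00:05h and 01:00:10h collapse: 11, 65
}
```

With the 10s interval, the two counter samples that fall in the same bucket are summed (18+47=65) instead of keeping only the latest value per series, which is the double counting described above.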

Limitation 2: Visualizations that rely on the average aren't always enough to identify problems, but it is not always possible to use the median instead, since some metrics split values between buckets and there is no way to know their real values.

@constanca-m constanca-m added bug Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team labels Jan 3, 2023
@constanca-m constanca-m self-assigned this Jan 3, 2023
@constanca-m constanca-m requested a review from a team as a code owner January 3, 2023 14:59
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jan 3, 2023
@mergify
Contributor

mergify bot commented Jan 3, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @constanca-m? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch, where \d is the minor version digit

@constanca-m constanca-m requested review from a team, gsantoro and devamanv and removed request for a team January 3, 2023 15:22
@constanca-m constanca-m added the backport-v8.6.0 Automated backport with mergify label Jan 3, 2023
@elasticmachine
Collaborator

elasticmachine commented Jan 3, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-01-04T16:54:28.976+0000

  • Duration: 59 min 47 sec

Test stats 🧪

Test Results: 0 failed, 3930 passed, 876 skipped (4806 total)

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@gsantoro
Contributor

gsantoro commented Jan 4, 2023

I can't add comments to linter issues if the line is not part of a diff, so I'll add my comment here.

Some issues:

  1. I can't replicate a linter issue invalid operation: cannot compare m.clusterMeta != nil (operator != not defined on untyped nil) (typecheck)

I have googled this error but I can't find an easy way to fix it.

@tetianakravchenko (since you added this line), any idea how to fix it?

I tried to replicate what might happen if we remove the if condition. A for loop with range over an empty map (code inside DeepUpdate) should just work, meaning that it won't error and it won't execute any code (as expected).

I also found this article, where the proposed better solution uses reflection to check for nil when the type is a map.
Any suggestions?
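A minimal Go sketch of the point about ranging over a nil map. Here deepUpdate is a hypothetical stand-in for the real DeepUpdate, and isNilMap follows the reflection idea from the linked article; neither is the actual beats code:

```go
package main

import (
	"fmt"
	"reflect"
)

// deepUpdate stands in for DeepUpdate: it just ranges over the input map.
// Ranging over a nil map in Go iterates zero times, so a nil guard before
// calling it is not needed for safety.
func deepUpdate(dst, src map[string]interface{}) {
	for k, v := range src {
		dst[k] = v
	}
}

// isNilMap is the reflection-based check the article proposes: it also
// catches a typed-nil map stored inside an interface value, where a plain
// v == nil comparison would report false.
func isNilMap(v interface{}) bool {
	if v == nil {
		return true
	}
	rv := reflect.ValueOf(v)
	return rv.Kind() == reflect.Map && rv.IsNil()
}

func main() {
	dst := map[string]interface{}{"a": 1}
	var src map[string]interface{} // nil map
	deepUpdate(dst, src)           // no panic, dst unchanged
	fmt.Println(dst, isNilMap(src))
}
```

So dropping the if condition should be safe for the DeepUpdate call itself; the reflection check is only needed if a typed-nil map can arrive wrapped in an interface.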

@constanca-m constanca-m requested a review from a team as a code owner January 4, 2023 13:26
@constanca-m constanca-m requested a review from a team as a code owner January 4, 2023 15:39
@constanca-m constanca-m requested review from cmacknz and faec and removed request for a team January 4, 2023 15:39
@gsantoro
Contributor

gsantoro commented Jan 4, 2023

There is something off with the darwin linter, since all those declared but not used (typecheck) errors at metricbeat/module/kubernetes/util/kubernetes.go don't make sense. Many of them complain that a variable is not used when the variable is actually used.

@cmacknz cmacknz removed request for cmacknz and faec January 4, 2023 19:11
@gsantoro
Contributor

gsantoro commented Jan 5, 2023

I have some questions about some dashboards:

  • proxy dashboard
    • why is it called Rules for latency metrics?
    • process
      • cpu usage increase over time in s? is it between 0 and 1 as in 0-100%?
      • resident memory variation in KB/s? is that a rate
  • scheduler dashboard
    • scheduling
      • attempts counter rate per result?? what does it mean?
    • workqueue
      • workqueue depth variation in s?
      • current unfinished work in s?
    • process
      • same questions as proxy dashboard
    • http requests
      • requests and responses is that a rate?

@gsantoro
Contributor

gsantoro commented Jan 5, 2023

@constanca-m I assume this PR was raised to get up to date with the latest release of Kubernetes. Do you mind providing a bit more context in this PR (or a link to an issue with the following details) about:

  • which K8s release we are trying to get up to date
  • where those metrics changes come from? I suppose the new release might have a list of added/removed metrics.

This is just so that we know in the future where those changes come from

@constanca-m
Contributor Author

  * why is it called `Rules` for latency metrics?

I called it Rules since both the visualizations relate to proxy rules:

  1. SyncProxyRulesLatency is the latency of one round of kube-proxy syncing proxy rules.
  2. NetworkProgrammingLatency is defined as the time it took to program the network - from the time the service or pod has changed to the time the change was propagated and the proper kube-proxy rules were synced.

Latency metrics include more visualizations.

  * process       
    * `cpu usage increase over time` in s? is it between 0 and 1 as in 0-100%?

No, it means that between time1 and time2, the CPU time increased one more second, not 100%. In the screenshot, having 0 and 1 is just a coincidence, but the increase is in seconds, not in %. Possible follow up question: why in s and not in %? There is not really an answer on which is better than the other, as they would end up saying the same thing: the increase on CPU time spent. If you meant the % of CPU being used, then there is no metric for that, that I know of.

Metric being used for that visualization is process_cpu_seconds_total: total user and system CPU time spent in seconds.
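For illustration, the increase over time of a counter such as process_cpu_seconds_total is just the difference between consecutive scrapes. This is a sketch with made-up sample values, not the actual visualization formula:

```go
package main

import "fmt"

// increase computes the per-interval growth of a monotonically increasing
// counter such as process_cpu_seconds_total: the value can only go up, so
// the difference between consecutive samples is the CPU seconds spent in
// that scrape interval.
func increase(samples []float64) []float64 {
	deltas := make([]float64, 0, len(samples)-1)
	for i := 1; i < len(samples); i++ {
		deltas = append(deltas, samples[i]-samples[i-1])
	}
	return deltas
}

func main() {
	// counter readings every 10s: the process spent 1s, then 0.5s, of CPU time
	fmt.Println(increase([]float64{120.0, 121.0, 121.5})) // [1 0.5]
}
```

The result is in seconds of CPU time per interval, which is why the axis unit is s rather than %.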

    * `resident memory variation` in KB/s? is that a rate

Yes, but I think variation is a clearer word for that visualization.

* scheduler dashboard  
  * scheduling      
    * `attempts counter rate per result`?? what does it mean?

It means exactly that: the attempts counter rate split by result. If you check the metric being used, scheduler_scheduling_attempt_duration_seconds, you can see that it is broken down by profile and result (code here). Possible results are scheduled, unschedulable and error (see here).

  * workqueue        
    * `workqueue depth variation` in s?

Not in seconds. It is a unit count. For example, a queue can have 10 people at 10am and 12 people at 11am; the variation is then (12-10)/h = 2 people/h.

    * `current unfinished work` in s?

Per its definition: how many seconds of work has been done that is still in progress and hasn't been considered by the longest running processor.

  * http requests
    * `requests and responses` is that a rate?

Yes.

which K8s release we are trying to get up to date
where those metrics changes come from? I suppose the new release might have a list of added/removed metrics.

The answer to these questions is in the README file inside the controller manager/scheduler/proxy folder.

@gsantoro

@gsantoro
Contributor

gsantoro commented Jan 5, 2023

@constanca-m

I called it Rules since both the visualizations relate to proxy rules:
SyncProxyRulesLatency is the latency of one round of kube-proxy syncing proxy rules.
NetworkProgrammingLatency is defined as the time it took to program the network - from the time the service or pod has changed to the time the change was propagated and the proper kube-proxy rules were synced.

thanks for the explanation. I am wondering if there is a way to add the above bit somehow to the dashboard. Or even just a link to docs with that explanation might be ok.

No, it means that between time1 and time2, the CPU time increased one more second, not 100%. In the screenshot, having 0 and 1 is just a coincidence, but the increase is in seconds, not in %. Possible follow up question: why in s and not in %? There is not really an answer on which is better than the other, as they would end up saying the same thing: the increase on CPU time spent. If you meant the % of CPU being used, then there is no metric for that, that I know of.

Ok, but it's still a bit confusing that CPU usage is measured in seconds. Let me see if I understand correctly: let's say in the 10s between two different measurements, the CPU spent 1s doing its job. From the name process_cpu_seconds_total it's a bit clearer now that the time is the total time in seconds, but it still doesn't clarify what the time1 to time2 range between two measurements is. Is the time range up to us and the frequency of our Elastic probe? Do we do any extra processing on top to get from a seconds_total to an increase over time? Increase over time seems to suggest to me that it can only increase compared to the previous period. Maybe I am being pedantic, but without any extra doc/links it is a bit unclear still.

....
for the rest of the answers, thanks for the explanation. it makes much more sense now. But I am still wondering if our visualisation names are enough to convey the meaning of a visualisation or if we should link them to docs with names of raw metrics used and their docs.

Overall I'm trying to reduce the number of SDH that might come our way because a customer doesn't fully grasp the meaning of a visualisation. Please let me know if this makes sense.

@constanca-m
Contributor Author

constanca-m commented Jan 5, 2023

Ok, but it's still a bit confusing that CPU usage is measured in seconds. Let me see if I understand correctly: let's say in the 10s between two different measurements, the CPU spent 1s doing its job. From the name process_cpu_seconds_total it's a bit clearer now that the time is the total time in seconds, but it still doesn't clarify what the time1 to time2 range between two measurements is. Is the time range up to us and the frequency of our Elastic probe? Do we do any extra processing on top to get from a seconds_total to an increase over time? Increase over time seems to suggest to me that it can only increase compared to the previous period. Maybe I am being pedantic, but without any extra doc/links it is a bit unclear still.

That is correct. The value can only increase, it is a counter.

Overall I'm trying to reduce the number of SDH that might come our way because a customer doesn't fully grasp the meaning of a visualisation. Please let me know if this makes sense.

This makes a lot of sense! I agree completely that the dashboards are not enough without more explanation, but the dashboards are also not the place to have them. So the solution would be, like you said, to create more documentation. I have talked about that with Miguel yesterday and the compromise is: I will create a draft for one of the components and then I will share it so we can see if it makes sense. The document will be added to our documentation (obs-docs) and it will be focused on the metrics we are using and what problems we can detect with them. For this, we will be using visualizations (like the counter rate for client error responses), and not the whole dashboard to save us the trouble of regular maintenance. It is not worth it to do that for all the components right now, without agreeing on a specific structure (it saves more time). I will try to do that in the upcoming days and share it with the team.

Edit: I created an issue for this: elastic/observability-docs#2474

@constanca-m constanca-m merged commit 7dc5e22 into elastic:main Jan 5, 2023
mergify bot pushed a commit that referenced this pull request Jan 5, 2023
…proxy and controller manager (#34161)

* Update deprecated metrics from controller manager, scheduler and proxy

* Update dashboards

(cherry picked from commit 7dc5e22)
@@ -106,6 +106,7 @@ https://github.com/elastic/beats/compare/v8.2.0\...main[Check the HEAD diff]
- Fix kafka dashboard field names {pull}33555[33555]
- Add tags to events based on parsed identifier. {pull}33472[33472]
- Support Oracle-specific connection strings in SQL module {issue}32089[32089] {pull}32293[32293]
- Remove deprecated metrics from controller manager, scheduler and proxy {pull}34161[34161]
Contributor


You also added new metrics, didn't you?

@@ -32,8 +32,8 @@ func TestEventMapping(t *testing.T) {
ptest.TestMetricSet(t, "kubernetes", "controllermanager",
ptest.TestCases{
{
MetricsFile: "./_meta/test/metrics.1.20",
Contributor


Dropping 1.20, since this k8s version is not supported anymore: https://endoflife.date/kubernetes

@@ -34,20 +34,24 @@ var mapping = &prometheus.MetricsMapping{
"process_open_fds": prometheus.Metric("process.fds.open.count"),
"process_max_fds": prometheus.Metric("process.fds.max.count"),
"process_start_time_seconds": prometheus.Metric("process.started.sec"),
// rest_client_request_duration_seconds buckets declared in
Contributor


is this comment not valid anymore?

Contributor Author


It is. I moved it to the README so it is in the same file as the other resources.

@@ -56,7 +60,6 @@ var mapping = &prometheus.MetricsMapping{
"host": prometheus.KeyLabel("host"),
"name": prometheus.KeyLabel("name"),
"zone": prometheus.KeyLabel("zone"),
"url": prometheus.KeyLabel("url"),
Contributor

@tetianakravchenko tetianakravchenko Jan 5, 2023


Was this field completely removed? From my understanding this field was just recently introduced - https://github.com/elastic/beats/pull/32037/files#diff-dff4473d065113acb05a9f69b5854880be7761fe5a0267e99f869cb1503e9bde

Does this field refer to rest-client metrics? In kubernetes/component-base@fd4965b the url label was at some point renamed to host, which we already use above, so we drop url in favor of host, right?

Contributor Author


I might be wrong, but I checked the supported versions. The oldest release date for a supported version is 07 Dec 2021, for 1.23. The commit that replaced url with host was done on the 19th of November 2021. I made the assumption that version 1.23 didn't support url anymore.


Contributor


If the difference is just a single field, I think it's more important to have backward compatibility with 1.23, so I'd prefer to keep that field.

Contributor Author


That seems to be right @tetianakravchenko, but how do we make sure of that? Do we compare all commits between the release date and today?

Contributor Author


@gsantoro Tanya doesn't intend to remove that field, but to use the same approach we use in here - two fields for the same label.

Contributor


That seems to be right @tetianakravchenko, but how do we make sure of that? Do we compare all commits between the release date and today?

@constanca-m I've selected the release-1.23 branch to see if this release still has url or host

@constanca-m
Contributor Author

https://github.com/Mergifyio backport 8.6

@mergify
Contributor

mergify bot commented Jan 12, 2023

backport 8.6

✅ Backports have been created

constanca-m pushed a commit that referenced this pull request Jan 12, 2023
…proxy and controller manager (#34161) (#34188)

* Update deprecated metrics from controller manager, scheduler and proxy

* Update dashboards

(cherry picked from commit 7dc5e22)

Co-authored-by: Constança Manteigas <[email protected]>
@constanca-m constanca-m deleted the fix-kubernetes-scheduler-metrics branch January 12, 2023 14:47
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
…proxy and controller manager (#34161)

* Update deprecated metrics from controller manager, scheduler and proxy

* Update dashboards
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment