Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UI] Compare Node Metrics with Average #2925

Closed
ChengYanJin opened this issue Nov 12, 2020 · 13 comments · Fixed by #3078
Closed

[UI] Compare Node Metrics with Average #2925

ChengYanJin opened this issue Nov 12, 2020 · 13 comments · Fixed by #3078
Assignees
Labels
complexity:hard Something that may require up to a week to fix priority:urgent Any issue we should jump in as soon as possible severity:major Major impact on live deployments (e.g. some non-critical feature is not working at all) topic:ui UI-related issues

Comments

@ChengYanJin
Copy link
Contributor

ChengYanJin commented Nov 12, 2020

Component: ui, nodes

Why this is needed:
We want to provide the System Admin a way to compare the select node with the average metric value.

What should be done:
The design:

  1. toggle OFF
    comparison-05-03-v5-metrics-avg-on-variant4-OFF@2x

  2. toggle ON
    comparison-05-03-v5-metrics-avg-on-variant4-ON@2x

  3. toggle ON with hover popup
    comparison-05-03-v5-metrics-avg-on-variant4-ON-popup@2x

Implementation proposal (strongly recommended):

  • https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
    Please consider using avg (calculate the average over dimensions) to get the average value of the nodes instead of implementing JS Code.
  • Set argument tooltip to true in LineChart in order to enable the tooltip in the charts.
  • When toggling on average, the URL should change as well.
  • (Additional) Please update the color of the lines with the new design(Hardcode the color for now)

Test plan:
TBD

@ChengYanJin ChengYanJin added topic:ui UI-related issues complexity:medium Something that requires one or few days to fix labels Nov 12, 2020
@thomasdanan
Copy link
Contributor

Actually, I am thinking about 3 distinct use cases where this feature could help:

  1. Being able to understand how a particular node is behaving compared to the rest of the nodes
  2. Compare one node to specific other nodes
  3. Quickly identify some nodes having strange behavior

@gdemonet
Copy link
Contributor

Actually, I am thinking about 3 distinct use cases where this feature could help:

1. Being able to understand how a particular node is behaving compared to the rest of the nodes

Could be useful to compare with the average (and maybe percentiles), not just any node(s) / all of them.

2. Compare one node to specific other nodes

3. Quickly identify some nodes having strange behavior

Again, comparing with aggregate values could be both easier for a user to read, and for a developer to handle.

@thomasdanan
Copy link
Contributor

Actually, I am thinking about 3 distinct use cases where this feature could help:

1. Being able to understand how a particular node is behaving compared to the rest of the nodes

Could be useful to compare with the average (and maybe percentiles), not just any node(s) / all of them.

2. Compare one node to specific other nodes

3. Quickly identify some nodes having strange behavior

Again, comparing with aggregate values could be both easier for a user to read, and for a developer to handle.

Yes, I think it is a very good idea @gdemonet (especially for use case 1).
For use case 3, I think the feature we are talking about will not help, because one would have to go through all the nodes, one by one to identify the one having some strange behavior. For this specific use, I think we are more looking for a Top10 kind of representation

@ChengYanJin
Copy link
Contributor Author

Very good point!! @gdemonet
Maybe we should always display the average value in the graph to give the users an idea of how this node behaviors than others.

So I guess we should limit the number of nodes/volumes the user can select @thomasdanan?

@Cuervino
Copy link

This "use case 3" is specific > the user doesn't have to start from a specific node, but want to identify the outliers from the full total.
In this case, a top5/top10 could work.

As for the moment, the logic is still to start from a selected node, what we can do is to have 4 ways of displaying the values

  1. -Raw (only the selected node)
  2. -Versus average values
  3. -Comparison / multiselection
  4. -Versus top5 & bottom5

I don't know how to define the top 5 / bottom 5 > for example, to take the nodes with the current "top latency" values, or to take the one with the "top latency" values over the period?

Anyway, here could be the next iteration.
Untitled

@gdemonet
Copy link
Contributor

Thanks @Cuervino for the fast proposal 👌

I agree that the four proposed scenarios are likely covering most, if not all, our use cases.
However, I think having the user choose between any of them is an extra step which may not be required. Here's my suggestion:

  • keep the toggle from the first design (either looking at a single node, or comparing it with others)
  • in the list of nodes to select for comparison, add a "fake" node for averages
  • top 3 / bottom 3 can be achieved if we have sorting in the list (not for now, and not on the metrics, so I'm not sure that's enough), or using other "fake" nodes

@ChengYanJin ChengYanJin changed the title [UI] Multi-node metrics comparison [UI] Multi-node metrics comparison design Nov 17, 2020
@ChengYanJin
Copy link
Contributor Author

Yes. I would agree with @gdemonet, with so many choices this metrics tab will become extremely crowded. Maybe we should build the top 3/5/10 in the Grafana dashboard or overview page.

At the same time, I am not sure to have a "fake" node appearing in the list. Where this node should locate when we add sorting or filtering?

I am thinking more to have an average/base line displayed in the graph when we active the multi-node comparison mode.
Just to give users a general overview. For more details, they should go to the Grafana node detailed dashboard which is accessible by clicking on "Advanced metrics".

@ChengYanJin ChengYanJin self-assigned this Nov 24, 2020
@ChengYanJin
Copy link
Contributor Author

ChengYanJin commented Nov 24, 2020

Before starting the implementation, I would like to summary a few details, so we can finalize the design.

1. The URL when the multi-node comparsion mode is enabled:

/nodes/<node-name>/metrics?multi=true&selected=<node1>,<node2>,<node3>...

2. The query I should use to get the average metrics of all the nodes?

"avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"system\"}[1m])) by (instance)",

Let's take CPU Usage as an example:

For a specific node:

(((count(count(node_cpu_seconds_total{instance=~"${instanceIP}:${PORT_NODE_EXPORTER}"}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle',instance=~"${instanceIP}:${PORT_NODE_EXPORTER}"}[5m])))) * 100) / count(count(node_cpu_seconds_total{instance=~"${instanceIP}:${PORT_NODE_EXPORTER}"}) by (cpu))

Cluster-wide query from Grafana:

CPU Usage

sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=""}) by (cluster)

image

3. Need to figure out how to dynamically add the line(s) in the chart

4. Need to have at least 6 colors (including variant color) ready for the line chart.

@JBWatenbergScality
Copy link
Contributor

I'm not sure how pertinent would it be to enable node to node comparison. Also the purposed screen design attached let me feel that all the tabs should contain aggregated values corresponding to the selected nodes in the table, which is not what we want to represent as the selection should only impact the metrics tab.

The use cases that I would see is comparing each single node with the average of others as Guillaume purposed to enable detection of nodes deviating from classical metrics level.

Additionally we might want to offer the possibility to compare average of metrics by groups of nodes based on their labels (for eg in multisite deployment or segregated network zones we will probably have nodes labelled with those availability information) so it might be interesting to see how the charge is spreaded across those different areas.

@thomasdanan
Copy link
Contributor

@ChengYanJin @gdemonet @Cuervino @JBWatenbergScality
I suggest that we start with a first iteration where one can compare the current node metric with the average of all other nodes, without implementing any kind of node selection to compare with in a first step.

@Cuervino
Copy link

Cuervino commented Jan 8, 2021

Here is a new design for comparing the Node against "Cluster Avg"

  1. toggle OFF
    comparison-05-03-v5-metrics-avg-on-variant4-OFF@2x

  2. toggle ON
    comparison-05-03-v5-metrics-avg-on-variant4-ON@2x

  3. toggle ON with hover popup
    comparison-05-03-v5-metrics-avg-on-variant4-ON-popup@2x

@ChengYanJin ChengYanJin assigned alexis-ld and unassigned ChengYanJin Jan 11, 2021
@gdemonet
Copy link
Contributor

@Cuervino TBH, I find the dotted lines to be almost invisible (maybe I'm blind 😅). Is there any way we can increase contrast so it's easier to spot?

@ChengYanJin
Copy link
Contributor Author

ChengYanJin commented Jan 11, 2021

@gdemonet @Cuervino
We can increase the opacity of the dotted lines.
Or we can choose more visiable stroke dash on vega-lite(the chart library) side. Example
image

@ChengYanJin ChengYanJin changed the title [UI] Multi-node metrics comparison design [UI] Compare Node Metrics with Average Jan 11, 2021
@thomasdanan thomasdanan added priority:urgent Any issue we should jump in as soon as possible severity:major Major impact on live deployments (e.g. some non-critical feature is not working at all) labels Jan 25, 2021
@ChengYanJin ChengYanJin added complexity:hard Something that may require up to a week to fix and removed complexity:medium Something that requires one or few days to fix labels Jan 31, 2021
@thomasdanan thomasdanan added this to the MetalK8s 2.8.0 milestone Feb 1, 2021
alexis-ld added a commit that referenced this issue Feb 8, 2021
alexis-ld added a commit that referenced this issue Feb 8, 2021
Updated charts legend labels and tooltip labels to include node name and fix typos to fit the design

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 8, 2021
This version enables strokeDash property on LineChart

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 8, 2021
Slightly decrease height because adding the legend made the charts overlow

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 8, 2021
alexis-ld added a commit that referenced this issue Feb 8, 2021
.1f formatting in tooltips, slightly smaller font in Legends

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 9, 2021
Change e.target to e.currentTarget to match React synthetic events

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 12, 2021
In case showAvg is true we request the metrics average for the whole cluster aside from the regular metrics in monitoring duck

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 12, 2021
Switch to enable/disable cluster avg metrics (data lives in URL)
Display cluster avg as a dashed line in node metrics charts

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 12, 2021
Slightly adapt charts titles and charts dimensions to fit the new design

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 12, 2021
In order to make node cluster avg display work

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 16, 2021
In case showAvg is true we request the metrics average for the whole cluster aside from the regular metrics in monitoring duck

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 16, 2021
Switch to enable/disable cluster avg metrics (data lives in URL)
Display cluster avg as a dashed line in node metrics charts

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 16, 2021
Slightly adapt charts titles and charts dimensions to fit the new design

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 16, 2021
In order to make node cluster avg display work

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 16, 2021
In case showAvg is true we request the metrics average for the whole cluster aside from the regular metrics in monitoring duck

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 16, 2021
Switch to enable/disable cluster avg metrics (data lives in URL)
Display cluster avg as a dashed line in node metrics charts

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 16, 2021
Slightly adapt charts titles and charts dimensions to fit the new design

Refs: #2925
alexis-ld added a commit that referenced this issue Feb 16, 2021
In order to make node cluster avg display work

Refs: #2925
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
complexity:hard Something that may require up to a week to fix priority:urgent Any issue we should jump in as soon as possible severity:major Major impact on live deployments (e.g. some non-critical feature is not working at all) topic:ui UI-related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants