Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ui: timeseries: add per-node CPU and memory usage charts #23733

Closed
wants to merge 2 commits into from

Conversation

vilterp
Copy link
Contributor

@vilterp vilterp commented Mar 12, 2018

In #23379, @petermattis asked for the ability to switch from aggregated to per-node charts, to see what nodes are overloaded. That would be a fairly general change to our TS code; in the meantime it seems useful to at least be able to see per-node CPU and memory to get a basic sense of what nodes are overloaded.

Before:

CPU time aggregated across the cluster:
image
Memory usage by type, but not by node:
image

After:

  • Add per-node CPU time and per-node memory
    image

@vilterp vilterp requested a review from a team March 12, 2018 16:10
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@bdarnell
Copy link
Contributor

Why did you remove the aggregate CPU time? Both are useful but at different times. We either need to keep both or have an option to switch between them. Ideally we'd show the aggregated view by default with some sort of alert to call attention to the per-node view if there is a significant discrepancy.

Pete Vilter added 2 commits March 12, 2018 13:16
Release note (admin ui change): add per-node CPU time graph
Release note (admin ui change): add per-node resident set size graph
@vilterp
Copy link
Contributor Author

vilterp commented Mar 12, 2018

Removed it because I didn't really see the utility of aggregated CPU (it doesn't help you debug which nodes are overloaded), but just added it back in and updated the screenshots — I guess it's helpful to see how much CPU you're using overall.

@vilterp
Copy link
Contributor Author

vilterp commented Mar 12, 2018

Toggling between them or alerting if there's a discrepancy in per-node usage would be nice. Would probably be hard to choose a good discrepancy threshold though. Anecdotally, looking at Jordan's cluster the usage between nodes is pretty different, with a bunch of nodes near zero and some above one:
image

(As an aside, having a bar chart in these tooltips to compare values would be great)

@vilterp vilterp requested review from couchand and mrtracy March 12, 2018 18:08
Copy link
Contributor

@mrtracy mrtracy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but had a suggestion for tooltips.

<LineGraph
title="Memory Usage (per-node RSS)"
sources={nodeSources}
tooltip={`Per-node Resident Set Size.`}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we've explained what RSS means anywhere else, including the aggregated Memory Usage chart where RSS is just referred to as "Total Memory".

I think (per-node Total Memory) is fine for the title, while the tooltip could be Total Memory used by the CockroachDB process on each node."

Copy link
Contributor

@couchand couchand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that something fancier would be nice, but is definitely out-of-scope for now.

One suggestion about the titles: "per node" and "per type" or something, since they're both (potentially) aggregated along different dimensions.

tooltip={`CPU usage per node, in CPU seconds per second. E.g. if a node has four
cores all at 100%, the value will be 4.`}
>
<Axis units={AxisUnits.Count} label="cpu seconds / second">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend that we keep the units consistent between the two CPU charts.

@@ -17,7 +17,62 @@ export default function (props: GraphDashboardProps) {
</LineGraph>,

<LineGraph
title="Memory Usage"
title="CPU Time (per-node)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why'd you swap the charts?

title="Memory Usage"
title="CPU Time (per-node)"
sources={nodeSources}
tooltip={`CPU usage per node, in CPU seconds per second. E.g. if a node has four
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should either sum the user and system time or specifically call out that this is just user time. I think I prefer that we sum them, since system time is nontrivial.

@vilterp
Copy link
Contributor Author

vilterp commented Jul 10, 2018

Superseded by work under #27085

@vilterp vilterp closed this Jul 10, 2018
@vilterp vilterp deleted the cpu-pct branch July 10, 2018 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants