
feat:add node group health and back off metrics #6396

Merged

Conversation

guopeng0 (Contributor)

What type of PR is this?

/kind feature

What this PR does / why we need it:

This pull request introduces new metrics that expose the health status and backoff state of node groups in the cluster. Cluster-level health metrics already exist; this PR adds equivalent metrics at the node group level.

The two new metrics being added are:
node_group_healthiness: indicates the health status of each node group, giving clear visibility into whether a node group is currently healthy. If a node group remains unhealthy for a prolonged period, this can trigger alerts for further investigation.
node_group_backoff_status: indicates the backoff status of each node group, along with the specific reason causing it (see the sketch below for the proposed metric shapes).
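
For illustration, here is a minimal sketch of the shape of the two proposed metrics. Cluster Autoscaler registers its metrics through its own metrics package rather than plain client_golang, so this only approximates the names and labels described above and is not the PR's final code:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// 1 if the node group is healthy enough for autoscaling, 0 otherwise.
	nodeGroupHealthiness = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "node_group_healthiness",
			Help:      "Whether or not the node group is healthy enough for autoscaling.",
		}, []string{"node_group"},
	)
	// 1 if the node group is currently in backoff, labelled with the reason.
	nodeGroupBackOffStatus = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "node_group_backoff_status",
			Help:      "Whether or not the node group is in backoff, and for which reason.",
		}, []string{"node_group", "reason"},
	)
)

func init() {
	prometheus.MustRegister(nodeGroupHealthiness, nodeGroupBackOffStatus)
}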

Implementation Details:
The proposed change adds an updateMetrics function to the ClusterStateRegistry struct. This function is called at the end of each RunOnce execution, ensuring that the new metrics are updated regularly. While these metrics are primarily relevant in scale-up scenarios, it is important to keep them updated continuously, not only during scale-up.

We kindly request your review of this PR and your thoughts on the proposed implementation.
Thank you for your time and consideration.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 21, 2023
@k8s-ci-robot k8s-ci-robot requested a review from x13n December 21, 2023 10:20
// UpdateNodeGroupBackOffStatus records if node group is backoff for not autoscaling
func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, reason string) {
if backOff {
nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(1)
Member

This will not work the way you want: once the backoff is over, the value of 1 will keep on being emitted for the old reason. Consider:

  1. UpdateNodeGroupBackOffStatus("foo", true, "reason")

    Under /metrics endpoint you'll get something like:

    node_group_backoff_status{node_group=foo,reason=reason} 1
    
  2. UpdateNodeGroupBackOffStatus("foo", false, "")

    Under /metrics endpoint you'll get something like:

    node_group_backoff_status{node_group=foo,reason=reason} 1
    node_group_backoff_status{node_group=foo,reason=""} 0
    

What you may want to do instead is to use a string metric with value equal to the backoff reason.

Contributor Author

I agree with the suggestion and will make the necessary modifications as soon as possible.

Contributor Author

Modifications done and PR submitted.

Member

Apologies for not being clear - I think the current version of the code will have exactly the same issue: I expect reason to be an empty string every time backoff is false. If you want to track backoff reasons like this, you need to keep track of the last reason value and explicitly clear it when backoff becomes false. Alternatively, you could implement a custom collector that will only emit metrics with up-to-date reason. This would be the cleanest solution. With the existing code, in the scenario from my previous comment, after backoff is over, you get:

node_group_backoff_status{node_group=foo,reason=reason} 1
node_group_backoff_status{node_group=foo,reason=""} 0

If you instead track nodegroup -> reason mapping and clear the last reason you will instead get:

node_group_backoff_status{node_group=foo,reason=reason} 0

If you decide to write a custom collector, then once backoff is over you would simply get empty output: no node_group_backoff_status series for that node group at all.
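
For illustration, a minimal sketch of the first alternative (tracking the last reason per node group and explicitly clearing it), written against a client_golang-style nodeGroupBackOffStatus gauge with node_group and reason labels; lastBackoffReason is a hypothetical helper, not code from this PR:

// lastBackoffReason remembers which reason label was last reported as 1 for
// each node group, so the stale series can be zeroed when the backoff ends
// or the reason changes.
var lastBackoffReason = map[string]string{}

func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, reason string) {
	if prev, ok := lastBackoffReason[nodeGroup]; ok && prev != reason {
		// Stop reporting 1 for the previously emitted reason.
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, prev).Set(0)
	}
	if backOff {
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(1)
		lastBackoffReason[nodeGroup] = reason
	} else {
		delete(lastBackoffReason, nodeGroup)
	}
}

With this bookkeeping, the scenario above ends with node_group_backoff_status{node_group=foo,reason=reason} 0 instead of a stale 1.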

Contributor Author

Thank you for raising this issue, and you’re right, I misunderstood it. It is indeed not expected behavior that the previous nodeGroup backoff metric continues to be reported as 1 after the backoff is over. I will take some time to reconsider how to address this and submit the code changes as soon as possible.

Contributor Author

Modifications done and PR submitted.

Contributor Author

@x13n Please review my changes

x13n (Member) commented Dec 28, 2023

/assign

@guopeng0 guopeng0 requested a review from x13n December 29, 2023 06:46
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 29, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 2, 2024
// UpdateNodeGroupBackOffStatus records if node group is backoff for not autoscaling
func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, errorClass cloudprovider.InstanceErrorClass) {
if backOff {
nodeGroupBackOffStatus.WithLabelValues(nodeGroup).Set(float64(errorClass))
Member

That is quite hacky and also meaningless when aggregated. Adding error codes together doesn't make sense, and yet that is exactly what will happen when querying across multiple node groups. The previous approach with a label for the error code makes sense; you just need to make sure it stops reporting 1 when the node group is no longer in backoff.

guopeng0 (Contributor Author) commented Jan 9, 2024

@x13n
Thank you for your suggestion. I just wanted to confirm the expected metric behavior:
There are two ways to display this metric, since the handling differs depending on the reason.
I am not sure which approach is more appropriate, and I would appreciate any advice you can provide.

  • Option 1:

When there is no backoff and no backoff has occurred before,
node_group_backoff_status{node_group=foo,reason=""} 0

When there is a backoff
node_group_backoff_status{node_group=foo,reason=reason1} 1

When the reason changes (although this probability is low), it is possible to see
node_group_backoff_status{node_group=foo,reason=reason1} 0
node_group_backoff_status{node_group=foo,reason=reason2} 1

After recovery
node_group_backoff_status{node_group=foo,reason=reason1} 0
node_group_backoff_status{node_group=foo,reason=reason2} 0

  • Option 2:

When there is no backoff and no backoff has occurred before,
node_group_backoff_status{node_group=foo,reason=""} 0

When there is a backoff
node_group_backoff_status{node_group=foo,reason=reason1} 1

When the reason changes (although this probability is low), it is possible to see
node_group_backoff_status{node_group=foo,reason=reason2} 1

After recovery
node_group_backoff_status{node_group=foo,reason=reason2} 0

Member

I think both are fine, though I slightly prefer the former. The reason is that whatever is collecting these metrics will be able to clearly say that a given metric stream is "done" when the backoff reason changes or goes away. With the second approach I think there can be artifacts on the graph, since a dashboard won't correlate different streams without some careful query crafting.

The way I would implement this is by keeping a map with a {node_group, reason} pair as the key. It would then be updated in bulk (a sketch follows this list):

  1. Iterate through the map and set all values to 0
  2. Iterate through the node groups and set corresponding {node_group, reason} value to 1 (or true). Optionally, inject {node_group, ""} keys set to 0 for all node groups that are not in backoff. Not sure if that second part is actually useful for anything, perhaps not.
  3. Use a custom collector that will read values from the map when prometheus endpoint is scraped.
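
For illustration, a hypothetical sketch of that map-plus-custom-collector idea in plain client_golang terms (all names here are illustrative, not the PR's final API):

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// backoffKey identifies a {node group, reason} pair.
type backoffKey struct {
	nodeGroup string
	reason    string
}

// backoffCollector keeps the current backoff state in a map and only emits
// what the map contains when the /metrics endpoint is scraped.
type backoffCollector struct {
	mu     sync.Mutex
	desc   *prometheus.Desc
	status map[backoffKey]bool
}

func newBackoffCollector() *backoffCollector {
	return &backoffCollector{
		desc: prometheus.NewDesc(
			"cluster_autoscaler_node_group_backoff_status",
			"Whether or not the node group is in backoff, and for which reason.",
			[]string{"node_group", "reason"}, nil),
		status: map[backoffKey]bool{},
	}
}

// update runs once per autoscaler loop: zero every previously reported pair,
// then mark the node groups that are currently in backoff.
func (c *backoffCollector) update(currentBackoffs map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.status {
		c.status[k] = false
	}
	for nodeGroup, reason := range currentBackoffs {
		c.status[backoffKey{nodeGroup, reason}] = true
	}
}

func (c *backoffCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *backoffCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, inBackoff := range c.status {
		value := 0.0
		if inBackoff {
			value = 1.0
		}
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, value, k.nodeGroup, k.reason)
	}
}

Whether a recovered {node_group, reason} pair keeps being reported as 0 (keys are zeroed) or disappears entirely (keys are deleted instead) is a design choice that corresponds to the two options discussed above.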

Contributor Author

Understood, thank you! I will submit the code changes as soon as possible.

Contributor Author

Modifications done and PR submitted. Because the errorCode is provided by the node provider and is open-ended, the collector is dynamic and iterates over all reasons that have ever been reported.

guopeng0 (Contributor Author) commented Jan 19, 2024

@x13n Please review my changes

@guopeng0 guopeng0 requested a review from x13n January 13, 2024 11:37
x13n (Member) left a comment

Apologies for the late response; added some comments.


// updateNodeGroupBackoffStatusMetrics returns information about backoff situation and reason of the node group
func (csr *ClusterStateRegistry) updateNodeGroupBackoffStatusMetrics(nodeGroup string, backoffStatus backoff.Status) {
backoffReasonStatus := make(BackoffReasonStatus)
Member

nit: Do you actually need to allocate this map every time? Zeroing the existing one would have the exact same effect, right? You're iterating over it anyway.

Contributor Author

Modifications done and PR submitted

@@ -116,6 +116,9 @@ type ScaleUpFailure struct {
Time time.Time
}

// BackoffReasonStatus contains information about backoff status and reason
type BackoffReasonStatus map[string]int
Member

Since the values here will always be either 0 or 1, maybe map[string]bool would make more sense?

Contributor Author

Modifications done and PR submitted

@@ -267,6 +267,11 @@ const (
InstanceDeleting InstanceState = 3
)

const (
// UnknownErrorCode means that the cloud provider has not provided an error code.
UnknownErrorCode = "unknown"
Member

Why is this in cloudprovider package?

Contributor Author

Modifications done and PR submitted

@@ -462,6 +467,38 @@ func (csr *ClusterStateRegistry) updateNodeGroupMetrics() {
metrics.UpdateNodeGroupsCount(autoscaled, autoprovisioned)
}

// UpdateSafeScaleUpMetricsForNodeGroup queries the health status and backoff situation of the node group and updates metrics
func (csr *ClusterStateRegistry) UpdateSafeScaleUpMetricsForNodeGroup(now time.Time) {
Member

Both the ForNodeGroup suffix and the comment above suggest this is about a specific node group, while in fact it iterates over all of them.

Contributor Author

Modifications done and PR submitted

@@ -406,6 +406,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) caerrors.AutoscalerErr
if err != nil {
klog.Errorf("AutoscalingStatusProcessor error: %v.", err)
}
a.clusterStateRegistry.UpdateSafeScaleUpMetricsForNodeGroup(currentTime)
Member

This would better fit as an AutoscalingStatusProcessor implementation - all the relevant params are already passed in the Process call above. Let's update this function:

func NewDefaultAutoscalingStatusProcessor() AutoscalingStatusProcessor {
return &NoOpAutoscalingStatusProcessor{}
}

to return a new processor that does the metric update. The new processor would also be a better place to keep the backoffReasonStatus map - clusterStateRegistry should be responsible for tracking the cluster state, not for metrics handling.
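
For illustration, a rough, hypothetical sketch of such a processor. The Process signature mirrors what the existing no-op processor is assumed to receive (AutoscalingContext, ClusterStateRegistry, current time), and metrics.UpdateNodeGroupHealthStatus is an illustrative helper name rather than the PR's final API:

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/clusterstate"
	"k8s.io/autoscaler/cluster-autoscaler/context"
	"k8s.io/autoscaler/cluster-autoscaler/metrics"
)

// nodeGroupMetricsProcessor owns the metric bookkeeping instead of
// ClusterStateRegistry, which stays responsible for cluster state only.
type nodeGroupMetricsProcessor struct {
	// node group -> reason -> whether backoff was reported in the last loop.
	backoffReasonStatus map[string]map[string]bool
}

func NewDefaultAutoscalingStatusProcessor() AutoscalingStatusProcessor {
	return &nodeGroupMetricsProcessor{backoffReasonStatus: map[string]map[string]bool{}}
}

func (p *nodeGroupMetricsProcessor) Process(ctx *context.AutoscalingContext, csr *clusterstate.ClusterStateRegistry, now time.Time) error {
	for _, ng := range ctx.CloudProvider.NodeGroups() {
		// Health metric: 1 if the node group is healthy enough for autoscaling.
		metrics.UpdateNodeGroupHealthStatus(ng.Id(), csr.IsNodeGroupHealthy(ng.Id()))
		// The backoff bulk update (zero previously reported reasons, then set
		// the current one) from the earlier sketches would also be driven from
		// here, keeping ClusterStateRegistry free of metrics handling.
	}
	return nil
}

func (p *nodeGroupMetricsProcessor) CleanUp() {}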

Contributor Author

Modifications done and PR submitted

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 23, 2024
@guopeng0 guopeng0 requested a review from x13n January 24, 2024 02:28
x13n (Member) commented Jan 24, 2024

Thanks for the changes!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2024
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: guopeng0, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2024
@k8s-ci-robot k8s-ci-robot merged commit a2f8902 into kubernetes:master Jan 24, 2024
6 checks passed
lukasmrtvy

How should this work for AWS when there are no spot instances available? For example: "Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed."

Thanks
