feat: add node group health and back off metrics #6396

Merged
12 changes: 12 additions & 0 deletions cluster-autoscaler/clusterstate/clusterstate.go
@@ -456,6 +456,18 @@ func (csr *ClusterStateRegistry) updateNodeGroupMetrics() {
metrics.UpdateNodeGroupsCount(autoscaled, autoprovisioned)
}

// UpdateSafeScaleUpMetricsForNodeGroup queries the health status and backoff situation of the node group and updates metrics
func (csr *ClusterStateRegistry) UpdateSafeScaleUpMetricsForNodeGroup(now time.Time) {
Member:

Both the ForNodeGroup suffix and the comment above suggest this is about a specific node group, while in fact it iterates over all of them.

Contributor Author:

Modifications done and PR submitted

	for _, nodeGroup := range csr.cloudProvider.NodeGroups() {
		if !nodeGroup.Exist() {
			continue
		}
		metrics.UpdateNodeGroupHealthStatus(nodeGroup.Id(), csr.IsNodeGroupHealthy(nodeGroup.Id()))
		backoffStatus := csr.backoff.BackoffStatus(nodeGroup, csr.nodeInfosForGroups[nodeGroup.Id()], now)
		metrics.UpdateNodeGroupBackOffStatus(nodeGroup.Id(), backoffStatus.IsBackedOff, backoffStatus.ErrorInfo.ErrorCode)
	}
}
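
// Editorial note, illustrative only (not part of this diff): assuming caNamespace is
// "cluster_autoscaler" and taking "ng-1" and the error code "QUOTA_EXCEEDED" as example
// values, the calls above would surface under /metrics roughly as:
//
//	cluster_autoscaler_node_group_healthiness{node_group="ng-1"} 1
//	cluster_autoscaler_node_group_backoff_status{node_group="ng-1",reason="QUOTA_EXCEEDED"} 1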

// IsNodeGroupSafeToScaleUp returns information about node group safety to be scaled up now.
func (csr *ClusterStateRegistry) IsNodeGroupSafeToScaleUp(nodeGroup cloudprovider.NodeGroup, now time.Time) NodeGroupScalingSafety {
isHealthy := csr.IsNodeGroupHealthy(nodeGroup.Id())
1 change: 1 addition & 0 deletions cluster-autoscaler/core/static_autoscaler.go
@@ -405,6 +405,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) caerrors.AutoscalerErr
if err != nil {
klog.Errorf("AutoscalingStatusProcessor error: %v.", err)
}
a.clusterStateRegistry.UpdateSafeScaleUpMetricsForNodeGroup(currentTime)
Member:

This would better fit as an AutoscalingStatusProcessor implementation - all the relevant params are already passed in the Process call above. Let's update this function:

func NewDefaultAutoscalingStatusProcessor() AutoscalingStatusProcessor {
	return &NoOpAutoscalingStatusProcessor{}
}

to return a new processor that does the metric update. The new processor would also be a better place to keep the backoffReasonStatus map - clusterStateRegistry should be responsible for tracking the cluster state, not for metrics handling.
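
Editorial note: a minimal sketch of the suggested processor, assuming the AutoscalingStatusProcessor interface takes the autoscaling context, the ClusterStateRegistry and the current time (as the Process call in RunOnce suggests), and assuming some exported way to read backoff status from the registry. All names below are hypothetical, not the PR's final code:

package status

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/clusterstate"
	"k8s.io/autoscaler/cluster-autoscaler/context"
	"k8s.io/autoscaler/cluster-autoscaler/metrics"
)

// nodeGroupMetricsProcessor emits per-node-group health and backoff metrics at the end
// of every loop, keeping metrics handling out of ClusterStateRegistry.
type nodeGroupMetricsProcessor struct{}

// NewDefaultAutoscalingStatusProcessor returns the metric-emitting processor instead of
// the previous no-op implementation.
func NewDefaultAutoscalingStatusProcessor() AutoscalingStatusProcessor {
	return &nodeGroupMetricsProcessor{}
}

// Process records health and backoff metrics for every existing node group.
func (p *nodeGroupMetricsProcessor) Process(ctx *context.AutoscalingContext, csr *clusterstate.ClusterStateRegistry, now time.Time) error {
	for _, nodeGroup := range ctx.CloudProvider.NodeGroups() {
		if !nodeGroup.Exist() {
			continue
		}
		metrics.UpdateNodeGroupHealthStatus(nodeGroup.Id(), csr.IsNodeGroupHealthy(nodeGroup.Id()))
		// BackoffStatusForNodeGroup is a hypothetical exported accessor: the registry's
		// backoff field is unexported, so some exported lookup would be needed here.
		backoffStatus := csr.BackoffStatusForNodeGroup(nodeGroup, now)
		metrics.UpdateNodeGroupBackOffStatus(nodeGroup.Id(), backoffStatus.IsBackedOff, backoffStatus.ErrorInfo.ErrorCode)
	}
	return nil
}

// CleanUp is part of the AutoscalingStatusProcessor interface.
func (p *nodeGroupMetricsProcessor) CleanUp() {}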

Contributor Author:

Modifications done and PR submitted

}()

// Check if there are any nodes that failed to register in Kubernetes
36 changes: 36 additions & 0 deletions cluster-autoscaler/metrics/metrics.go
@@ -208,6 +208,22 @@ var (
}, []string{"node_group"},
)

	nodesGroupHealthiness = k8smetrics.NewGaugeVec(
		&k8smetrics.GaugeOpts{
			Namespace: caNamespace,
			Name:      "node_group_healthiness",
			Help:      "Whether or not node group is healthy enough for autoscaling. 1 if it is, 0 otherwise.",
		}, []string{"node_group"},
	)

	nodeGroupBackOffStatus = k8smetrics.NewGaugeVec(
		&k8smetrics.GaugeOpts{
			Namespace: caNamespace,
			Name:      "node_group_backoff_status",
			Help:      "Whether or not node group is backed off from autoscaling. 1 if it is, 0 otherwise.",
		}, []string{"node_group", "reason"},
	)

/**** Metrics related to autoscaler execution ****/
lastActivity = k8smetrics.NewGaugeVec(
&k8smetrics.GaugeOpts{
@@ -431,6 +447,8 @@ func RegisterAll(emitPerNodeGroupMetrics bool) {
		legacyregistry.MustRegister(nodesGroupMinNodes)
		legacyregistry.MustRegister(nodesGroupMaxNodes)
		legacyregistry.MustRegister(nodesGroupTargetSize)
		legacyregistry.MustRegister(nodesGroupHealthiness)
		legacyregistry.MustRegister(nodeGroupBackOffStatus)
	}
}

@@ -536,6 +554,24 @@ func UpdateNodeGroupTargetSize(targetSizes map[string]int) {
}
}

// UpdateNodeGroupHealthStatus records whether the node group is healthy enough for autoscaling
func UpdateNodeGroupHealthStatus(nodeGroup string, healthy bool) {
	if healthy {
		nodesGroupHealthiness.WithLabelValues(nodeGroup).Set(1)
	} else {
		nodesGroupHealthiness.WithLabelValues(nodeGroup).Set(0)
	}
}

// UpdateNodeGroupBackOffStatus records whether the node group is backed off from autoscaling, and the reason
func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, reason string) {
	if backOff {
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(1)
Member:

This will not work the way you want: once the backoff is over, the value of 1 will keep on being emitted for the old reason. Consider:

  1. UpdateNodeGroupBackOffStatus("foo", true, "reason")

    Under /metrics endpoint you'll get something like:

    node_group_backoff_status{node_group=foo,reason=reason} 1
    
  2. UpdateNodeGroupBackOffStatus("foo", false, "")

    Under /metrics endpoint you'll get something like:

    node_group_backoff_status{node_group=foo,reason=reason} 1
    node_group_backoff_status{node_group=foo,reason=""} 0
    

What you may want to do instead is to use a string metric with value equal to the backoff reason.

Contributor Author:

I agree with the suggestion and will make the necessary modifications as soon as possible.

Contributor Author:

Modifications done and PR submitted.

Member:

Apologies for not being clear - I think the current version of the code will have exactly the same issue: I expect reason to be an empty string every time backoff is false. If you want to track backoff reasons like this, you need to keep track of the last reason value and explicitly clear it when backoff becomes false. Alternatively, you could implement a custom collector that will only emit metrics with up-to-date reason. This would be the cleanest solution. With the existing code, in the scenario from my previous comment, after backoff is over, you get:

node_group_backoff_status{node_group=foo,reason=reason} 1
node_group_backoff_status{node_group=foo,reason=""} 0

If you instead track nodegroup -> reason mapping and clear the last reason you will instead get:

node_group_backoff_status{node_group=foo,reason=reason} 0

If you decide to write a custom collector, once backoff is over, you could simply get empty output:

Contributor Author:

Thank you for raising this issue, and you’re right, I misunderstood it. It is indeed not expected behavior that the previous nodeGroup backoff metric continues to be reported as 1 after the backoff is over. I will take some time to reconsider how to address this and submit the code changes as soon as possible.

Contributor Author:

Modifications done and PR submitted.

Contributor Author:

@x13n Please review my changes

	} else {
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(0)
	}
}

// RegisterError records any errors preventing Cluster Autoscaler from working.
// No more than one error should be recorded per loop.
func RegisterError(err errors.AutoscalerError) {