feat: add node group health and backoff metrics #6396
Changes from 1 commit. Commits in this PR:

- 044c03d
- 1255c95
- 89241e4
- 849e9e7
- 23843ad
- ae0ab53
- 5773f50
- 68e661f
- 4b9d4b1
```diff
@@ -405,6 +405,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) caerrors.AutoscalerErr
 		if err != nil {
 			klog.Errorf("AutoscalingStatusProcessor error: %v.", err)
 		}
+		a.clusterStateRegistry.UpdateSafeScaleUpMetricsForNodeGroup(currentTime)
 	}()

 	// Check if there are any nodes that failed to register in Kubernetes
```

**Review comment:** This would better fit as an `AutoscalingStatusProcessor` (see `autoscaler/cluster-autoscaler/processors/status/autoscaling_status_processor.go`, lines 33 to 35 at a3a29cf): return a new processor that does the metric update. The new processor would also be a better place to keep the […]

**Author:** Modifications done and PR submitted.
```diff
@@ -208,6 +208,22 @@ var (
 		}, []string{"node_group"},
 	)

+	nodesGroupHealthiness = k8smetrics.NewGaugeVec(
+		&k8smetrics.GaugeOpts{
+			Namespace: caNamespace,
+			Name:      "node_group_healthiness",
+			Help:      "Whether or not node group is healthy enough for autoscaling. 1 if it is, 0 otherwise.",
+		}, []string{"node_group"},
+	)
+
+	nodeGroupBackOffStatus = k8smetrics.NewGaugeVec(
+		&k8smetrics.GaugeOpts{
+			Namespace: caNamespace,
+			Name:      "node_group_backoff_status",
+			Help:      "Whether or not node group is backed off from autoscaling. 1 if it is, 0 otherwise.",
+		}, []string{"node_group", "reason"},
+	)
+
 	/**** Metrics related to autoscaler execution ****/
 	lastActivity = k8smetrics.NewGaugeVec(
 		&k8smetrics.GaugeOpts{
```
```diff
@@ -431,6 +447,8 @@ func RegisterAll(emitPerNodeGroupMetrics bool) {
 		legacyregistry.MustRegister(nodesGroupMinNodes)
 		legacyregistry.MustRegister(nodesGroupMaxNodes)
 		legacyregistry.MustRegister(nodesGroupTargetSize)
+		legacyregistry.MustRegister(nodesGroupHealthiness)
+		legacyregistry.MustRegister(nodeGroupBackOffStatus)
 	}
 }
```
```diff
@@ -536,6 +554,24 @@ func UpdateNodeGroupTargetSize(targetSizes map[string]int) {
 	}
 }

+// UpdateNodeGroupHealthStatus records whether the node group is healthy enough for autoscaling
+func UpdateNodeGroupHealthStatus(nodeGroup string, healthy bool) {
+	if healthy {
+		nodesGroupHealthiness.WithLabelValues(nodeGroup).Set(1)
+	} else {
+		nodesGroupHealthiness.WithLabelValues(nodeGroup).Set(0)
+	}
+}
+
+// UpdateNodeGroupBackOffStatus records whether the node group is backed off from autoscaling
+func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, reason string) {
+	if backOff {
+		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(1)
+	} else {
+		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(0)
+	}
+}
+
 // RegisterError records any errors preventing Cluster Autoscaler from working.
 // No more than one error should be recorded per loop.
 func RegisterError(err errors.AutoscalerError) {
```

**Review comment:** This will not work the way you want: once the backoff is over, the value of 1 will keep on being emitted for the old reason. Consider: […] What you may want to do instead is to use a string metric with value equal to the backoff reason.

**Author:** I agree with the suggestion and will make the necessary modifications as soon as possible.

**Author:** Modifications done and PR submitted.

**Review comment:** Apologies for not being clear, I think the current version of the code will have exactly the same issue: I expect […] If you instead track a nodegroup -> reason mapping and clear the last reason, you will instead get […] If you decide to write a custom collector, once backoff is over, you could simply get empty output […]

**Author:** Thank you for raising this issue, and you're right, I misunderstood it. It is indeed not expected behavior that the previous node group backoff metric continues to be reported as 1 after the backoff is over. I will take some time to reconsider how to address this and submit the code changes as soon as possible.

**Author:** Modifications done and PR submitted.

**Author:** @x13n Please review my changes.
**Review comment:** Both the `ForNodeGroup` suffix and the comment above suggest this is about a specific node group, while in fact it iterates over all of them.

**Author:** Modifications done and PR submitted.