feat: add node group health and back off metrics #6396
Conversation
// UpdateNodeGroupBackOffStatus records if node group is backoff for not autoscaling
func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, reason string) {
	if backOff {
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(1)
This will not work the way you want: once the backoff is over, the value of 1 will keep on being emitted for the old reason. Consider:
- UpdateNodeGroupBackOffStatus("foo", true, "reason")
  Under the /metrics endpoint you'll get something like:
  node_group_backoff_status{node_group=foo,reason=reason} 1
- UpdateNodeGroupBackOffStatus("foo", false, "")
  Under the /metrics endpoint you'll get something like:
  node_group_backoff_status{node_group=foo,reason=reason} 1
  node_group_backoff_status{node_group=foo,reason=""} 0
What you may want to do instead is to use a string metric with value equal to the backoff reason.
I agree with the suggestion and will make the necessary modifications as soon as possible.
Modifications done and PR submitted.
Apologies for not being clear - I think the current version of the code will have exactly the same issue: I expect `reason` to be an empty string every time `backoff` is `false`. If you want to track backoff reasons like this, you need to keep track of the last `reason` value and explicitly clear it when `backoff` becomes `false`. Alternatively, you could implement a custom collector that will only emit metrics with an up-to-date reason. This would be the cleanest solution. With the existing code, in the scenario from my previous comment, after backoff is over, you get:
node_group_backoff_status{node_group=foo,reason=reason} 1
node_group_backoff_status{node_group=foo,reason=""} 0
If you instead track nodegroup -> reason mapping and clear the last reason you will instead get:
node_group_backoff_status{node_group=foo,reason=reason} 0
If you decide to write a custom collector, once backoff is over, you would simply get empty output.
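For illustration, here is a minimal sketch of the first option (keeping track of the last reason per node group and clearing it explicitly), using prometheus/client_golang directly. The variable names and the registration shown here are assumptions, not the PR's actual code:

```go
package metrics

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// nodeGroupBackOffStatus mirrors the gauge from the snippet above.
	nodeGroupBackOffStatus = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "node_group_backoff_status",
			Help: "Whether a node group is currently in backoff, labelled by reason.",
		}, []string{"node_group", "reason"})

	lastReasonMutex sync.Mutex
	// lastReason remembers the most recently reported reason per node group so
	// the stale series can be reset to 0 when backoff ends or the reason changes.
	lastReason = map[string]string{}
)

func init() {
	prometheus.MustRegister(nodeGroupBackOffStatus)
}

// UpdateNodeGroupBackOffStatus reports the backoff state of a node group and
// explicitly clears the previously reported reason when it is no longer valid.
func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, reason string) {
	lastReasonMutex.Lock()
	defer lastReasonMutex.Unlock()

	if prev, ok := lastReason[nodeGroup]; ok && prev != reason {
		// Zero out the series for the old reason so it no longer reads as 1.
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, prev).Set(0)
	}
	if backOff {
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(1)
		lastReason[nodeGroup] = reason
	} else {
		delete(lastReason, nodeGroup)
	}
}
```

With this bookkeeping, the scenario above ends with only `node_group_backoff_status{node_group=foo,reason=reason} 0` being emitted after backoff is over.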
Thank you for raising this issue, and you’re right, I misunderstood it. It is indeed not expected behavior that the previous nodeGroup backoff metric continues to be reported as 1 after the backoff is over. I will take some time to reconsider how to address this and submit the code changes as soon as possible.
Modifications done and PR submitted.
@x13n Please review my changes
/assign
// UpdateNodeGroupBackOffStatus records if node group is backoff for not autoscaling
func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, errorClass cloudprovider.InstanceErrorClass) {
	if backOff {
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup).Set(float64(errorClass))
That is quite hacky and also meaningless when aggregated. Adding error codes together doesn't make sense and yet it is exactly what will happen when querying for multiple node groups. The previous approach with label for error code makes sense, you just need to make sure it stops reporting 1 when the node group is no longer in backoff.
@x13n
Thank you for your suggestion. I just wanted to confirm the expected metric behavior:
There are two ways to display this metric, since it involves different handling scenarios based on the reason.
I am not sure which approach is more appropriate, and I would appreciate any advice you can provide.
- Option 1:
When there is no backoff and no backoff has occurred before,
node_group_backoff_status{node_group=foo,reason=""} 0
When there is a backoff
node_group_backoff_status{node_group=foo,reason=reason1} 1
When the reason changes (although this probability is low), it is possible to see
node_group_backoff_status{node_group=foo,reason=reason1} 0
node_group_backoff_status{node_group=foo,reason=reason2} 1
After recovery
node_group_backoff_status{node_group=foo,reason=reason1} 0
node_group_backoff_status{node_group=foo,reason=reason2} 0
- Option 2:
When there is no backoff and no backoff has occurred before,
node_group_backoff_status{node_group=foo,reason=""} 0
When there is a backoff
node_group_backoff_status{node_group=foo,reason=reason1} 1
When the reason changes (although this probability is low), it is possible to see
node_group_backoff_status{node_group=foo,reason=reason2} 1
After recovery
node_group_backoff_status{node_group=foo,reason=reason2} 0
I think both are fine, though I slightly prefer the former. The reason is that whatever is collecting these metrics, will be able to clearly say that a given metric stream is "done" when backoff reason changes or goes away. With the second approach I think there can be artifacts on the graph, since a dashboard won't correlate different streams without some careful query crafting.
The way I would implement this is by keeping a map with {node_group, reason} pair as a key. It would then be updated in bulk:
- Iterate through the map and set all values to 0
- Iterate through the node groups and set the corresponding {node_group, reason} value to 1 (or `true`). Optionally, inject {node_group, ""} keys set to 0 for all node groups that are not in backoff. Not sure if that second part is actually useful for anything, perhaps not.
- Use a custom collector that will read values from the map when the prometheus endpoint is scraped (a rough sketch of this follows below).
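A minimal sketch of that map-plus-custom-collector idea, written against prometheus/client_golang; the type and function names are illustrative only and not the actual PR code:

```go
package metrics

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// backoffKey identifies a single metric stream: one node group and one reason.
type backoffKey struct {
	nodeGroup string
	reason    string
}

// backoffCollector emits node_group_backoff_status gauges straight from an
// in-memory map, so only the currently tracked {node_group, reason} pairs are reported.
type backoffCollector struct {
	mu     sync.Mutex
	status map[backoffKey]bool
	desc   *prometheus.Desc
}

func newBackoffCollector() *backoffCollector {
	return &backoffCollector{
		status: map[backoffKey]bool{},
		desc: prometheus.NewDesc(
			"node_group_backoff_status",
			"Whether a node group is currently in backoff, by reason.",
			[]string{"node_group", "reason"}, nil),
	}
}

// Update replaces the map contents in bulk, as suggested above: everything is
// reset to false first, then the currently backed-off pairs are set to true.
func (c *backoffCollector) Update(backedOff map[string]string /* node group -> reason */) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.status {
		c.status[k] = false
	}
	for nodeGroup, reason := range backedOff {
		c.status[backoffKey{nodeGroup, reason}] = true
	}
}

func (c *backoffCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

// Collect is called on every scrape of the prometheus endpoint and reads the
// current map state, so stale reasons disappear as soon as they are dropped.
func (c *backoffCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, inBackoff := range c.status {
		value := 0.0
		if inBackoff {
			value = 1.0
		}
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, value, k.nodeGroup, k.reason)
	}
}
```

The collector would be registered once with prometheus.MustRegister and Update would be called on every autoscaler iteration.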
Understood, thank you! I will submit the code changes as soon as possible.
Modifications done and PR submitted. Because the errorCode is provided by the node provider and is open-ended, the collector is dynamic and iterates over all reasons that have ever been reported.
@x13n Please review my changes
Apologies for the late response, added some comments.
// updateNodeGroupBackoffStatusMetrics returns information about backoff situation and reason of the node group
func (csr *ClusterStateRegistry) updateNodeGroupBackoffStatusMetrics(nodeGroup string, backoffStatus backoff.Status) {
	backoffReasonStatus := make(BackoffReasonStatus)
nit: Do you actually need to allocate this map every time? Zeroing the existing one would have the exact same effect, right? You're iterating over it anyway.
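A tiny self-contained illustration of the nit (the reason string used here is made up): zeroing the existing map in place avoids the per-call allocation and has the same observable effect:

```go
package main

import "fmt"

// BackoffReasonStatus maps a backoff reason to 0 or 1, as in the PR.
type BackoffReasonStatus map[string]int

// reset zeroes the existing map in place instead of allocating a new one;
// since the map is iterated over anyway, the result is identical.
func reset(s BackoffReasonStatus) {
	for reason := range s {
		s[reason] = 0
	}
}

func main() {
	s := BackoffReasonStatus{"QuotaExceeded": 1}
	reset(s)
	fmt.Println(s) // map[QuotaExceeded:0]
}
```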
Modifications done and PR submitted
@@ -116,6 +116,9 @@ type ScaleUpFailure struct {
	Time time.Time
}

// BackoffReasonStatus contains information about backoff status and reason
type BackoffReasonStatus map[string]int
Since the values here will always be either 0 or 1, maybe `map[string]bool` would make more sense?
Modifications done and PR submitted
@@ -267,6 +267,11 @@ const (
	InstanceDeleting InstanceState = 3
)

const (
	// UnknownErrorCode means that the cloud provider has not provided an error code.
	UnknownErrorCode = "unknown"
Why is this in the `cloudprovider` package?
Modifications done and PR submitted
@@ -462,6 +467,38 @@ func (csr *ClusterStateRegistry) updateNodeGroupMetrics() {
	metrics.UpdateNodeGroupsCount(autoscaled, autoprovisioned)
}

// UpdateSafeScaleUpMetricsForNodeGroup queries the health status and backoff situation of the node group and updates metrics
func (csr *ClusterStateRegistry) UpdateSafeScaleUpMetricsForNodeGroup(now time.Time) {
Both the `ForNodeGroup` suffix and the comment above suggest this is about a specific node group, while in fact it iterates over all of them.
Modifications done and PR submitted
@@ -406,6 +406,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) caerrors.AutoscalerErr
	if err != nil {
		klog.Errorf("AutoscalingStatusProcessor error: %v.", err)
	}
	a.clusterStateRegistry.UpdateSafeScaleUpMetricsForNodeGroup(currentTime)
This would better fit as an `AutoscalingStatusProcessor` implementation - all the relevant params are already passed in the `Process` call above. Let's update this function:
autoscaler/cluster-autoscaler/processors/status/autoscaling_status_processor.go
Lines 33 to 35 in a3a29cf
func NewDefaultAutoscalingStatusProcessor() AutoscalingStatusProcessor {
	return &NoOpAutoscalingStatusProcessor{}
}
to return a new processor that does the metric update. The new processor would also be a better place to keep the `backoffReasonStatus` map - `clusterStateRegistry` should be responsible for tracking the cluster state, not for metrics handling.
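A rough sketch of what such a processor could look like. The interface signature, package paths, and the metrics helpers used below are assumptions based on this discussion, not the exact upstream API:

```go
package status

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/clusterstate"
	"k8s.io/autoscaler/cluster-autoscaler/context"
	"k8s.io/autoscaler/cluster-autoscaler/metrics"
)

// nodeGroupMetricsProcessor (hypothetical name) updates per-node-group health
// and backoff metrics after every autoscaling iteration, keeping the
// backoffReasonStatus bookkeeping out of ClusterStateRegistry.
type nodeGroupMetricsProcessor struct {
	// backoffReasonStatus tracks which {node_group, reason} pairs were last
	// reported, so stale series can be reset to 0 on the next iteration.
	backoffReasonStatus map[string]map[string]bool
}

// Process is invoked once per RunOnce iteration; the signature is assumed to
// match the AutoscalingStatusProcessor interface referenced above.
func (p *nodeGroupMetricsProcessor) Process(ctx *context.AutoscalingContext, csr *clusterstate.ClusterStateRegistry, now time.Time) error {
	for _, ng := range ctx.CloudProvider.NodeGroups() {
		// UpdateNodeGroupHealthStatus is an assumed helper for illustration.
		metrics.UpdateNodeGroupHealthStatus(ng.Id(), csr.IsNodeGroupHealthy(ng.Id()))
		// Backoff reporting would first reset previously reported reasons for
		// ng.Id() and then mark the currently active reason, as discussed above.
	}
	return nil
}

// CleanUp releases any resources held by the processor (nothing to do here).
func (p *nodeGroupMetricsProcessor) CleanUp() {}
```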
Modifications done and PR submitted
…om/guopeng0/autoscaler into feature/node_group_healthy_metrics
Thanks for the changes! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: guopeng0, x13n
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
How should this work for AWS when there are no spots available? For example ... Thanks
What type of PR is this?
/kind feature
What this PR does / why we need it:
This pull request introduces new metrics to display the health status and backoff situation of node groups in the cluster. Cluster-level health metrics already exist; this PR adds additional metrics at the node group level.
The two new metrics being added are:
node_group_healthiness: This metric will indicate the health status of each node group, providing clear visibility into whether the node group is currently healthy or not. If a node group remains unhealthy for a prolonged period, it can trigger alerts for further investigation.
node_group_backoff_status: This metric displays the backoff status of each node group, along with the specific reason causing it.
Implementation Details:
The proposed changes add an updateMetrics function to ClusterStateRegistry. This function is called at the end of each RunOnce execution, ensuring that the new metrics are regularly updated. While these metrics are primarily relevant to scale-up scenarios, it is important to keep them updated continuously, not just during scale-up.
We kindly request your review of this PR and your thoughts on the proposed implementation.
Thank you for your time and consideration.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: