
feat:add node group health and back off metrics #6396

Merged

Conversation

guopeng0 (Contributor)

What type of PR is this?

/kind feature

What this PR does / why we need it:

This pull request introduces new metrics that expose the health status and backoff state of node groups in the cluster. Cluster-level health metrics already exist; this PR adds equivalent metrics at the node group level.

The two new metrics being added are:
node_group_healthiness: indicates the health status of each node group, giving clear visibility into whether a node group is currently healthy. If a node group remains unhealthy for a prolonged period, this can trigger alerts for further investigation.
node_group_backoff_status: indicates the backoff status of each node group, along with the specific reason causing it (see the sketch below for the proposed metric shapes).
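
For illustration, here is a minimal sketch of the shape of the two proposed metrics. Cluster Autoscaler registers its metrics through its own metrics package rather than plain client_golang, so this only approximates the names and labels described above and is not the PR's final code:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// 1 if the node group is healthy enough for autoscaling, 0 otherwise.
	nodeGroupHealthiness = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "node_group_healthiness",
			Help:      "Whether or not the node group is healthy enough for autoscaling.",
		}, []string{"node_group"},
	)
	// 1 if the node group is currently in backoff, labelled with the reason.
	nodeGroupBackOffStatus = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "node_group_backoff_status",
			Help:      "Whether or not the node group is in backoff, and for which reason.",
		}, []string{"node_group", "reason"},
	)
)

func init() {
	prometheus.MustRegister(nodeGroupHealthiness, nodeGroupBackOffStatus)
}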

Implementation Details:
The proposed change adds an updateMetrics function to the ClusterStateRegistry struct. This function is called at the end of each RunOnce execution, ensuring that the new metrics are updated regularly. While these metrics are primarily relevant in scale-up scenarios, it is important to keep them updated continuously, not only during scale-up.

We kindly request your review of this PR and your thoughts on the proposed implementation.
Thank you for your time and consideration.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 21, 2023
@k8s-ci-robot k8s-ci-robot requested a review from x13n December 21, 2023 10:20
// UpdateNodeGroupBackOffStatus records if node group is backoff for not autoscaling
func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, reason string) {
if backOff {
nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(1)
Member

This will not work the way you want: once the backoff is over, the value of 1 will keep on being emitted for the old reason. Consider:

  1. UpdateNodeGroupBackOffStatus("foo", true, "reason")

    Under /metrics endpoint you'll get something like:

    node_group_backoff_status{node_group=foo,reason=reason} 1
    
  2. UpdateNodeGroupBackOffStatus("foo", false, "")

    Under /metrics endpoint you'll get something like:

    node_group_backoff_status{node_group=foo,reason=reason} 1
    node_group_backoff_status{node_group=foo,reason=""} 0
    

What you may want to do instead is to use a string metric with value equal to the backoff reason.

Contributor Author

I agree with the suggestion and will make the necessary modifications as soon as possible.

Contributor Author

Modifications done and PR submitted.

Member

Apologies for not being clear - I think the current version of the code will have exactly the same issue: I expect reason to be an empty string every time backoff is false. If you want to track backoff reasons like this, you need to keep track of the last reason value and explicitly clear it when backoff becomes false. Alternatively, you could implement a custom collector that will only emit metrics with up-to-date reason. This would be the cleanest solution. With the existing code, in the scenario from my previous comment, after backoff is over, you get:

node_group_backoff_status{node_group=foo,reason=reason} 1
node_group_backoff_status{node_group=foo,reason=""} 0

If you instead track nodegroup -> reason mapping and clear the last reason you will instead get:

node_group_backoff_status{node_group=foo,reason=reason} 0

If you decide to write a custom collector, then once backoff is over you would simply get empty output: no node_group_backoff_status series for that node group at all.
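
For illustration, a minimal sketch of the first alternative (tracking the last reason per node group and explicitly clearing it), written against a client_golang-style nodeGroupBackOffStatus gauge with node_group and reason labels; lastBackoffReason is a hypothetical helper, not code from this PR:

// lastBackoffReason remembers which reason label was last reported as 1 for
// each node group, so the stale series can be zeroed when the backoff ends
// or the reason changes.
var lastBackoffReason = map[string]string{}

func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, reason string) {
	if prev, ok := lastBackoffReason[nodeGroup]; ok && prev != reason {
		// Stop reporting 1 for the previously emitted reason.
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, prev).Set(0)
	}
	if backOff {
		nodeGroupBackOffStatus.WithLabelValues(nodeGroup, reason).Set(1)
		lastBackoffReason[nodeGroup] = reason
	} else {
		delete(lastBackoffReason, nodeGroup)
	}
}

With this bookkeeping, the scenario above ends with node_group_backoff_status{node_group=foo,reason=reason} 0 instead of a stale 1.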

Contributor Author

Thank you for raising this issue, and you’re right, I misunderstood it. It is indeed not expected behavior that the previous nodeGroup backoff metric continues to be reported as 1 after the backoff is over. I will take some time to reconsider how to address this and submit the code changes as soon as possible.

Contributor Author

Modifications done and PR submitted.

Contributor Author

@x13n Please review my changes

x13n (Member) commented Dec 28, 2023

/assign

@guopeng0 guopeng0 requested a review from x13n December 29, 2023 06:46
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 29, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 2, 2024
// UpdateNodeGroupBackOffStatus records if node group is backoff for not autoscaling
func UpdateNodeGroupBackOffStatus(nodeGroup string, backOff bool, errorClass cloudprovider.InstanceErrorClass) {
if backOff {
nodeGroupBackOffStatus.WithLabelValues(nodeGroup).Set(float64(errorClass))
Member

That is quite hacky and also meaningless when aggregated. Adding error codes together doesn't make sense, and yet that is exactly what will happen when querying across multiple node groups. The previous approach with a label for the error code makes sense; you just need to make sure it stops reporting 1 when the node group is no longer in backoff.

guopeng0 (Contributor Author) commented Jan 9, 2024

@x13n
Thank you for your suggestion. I just wanted to confirm the expected metric behavior:
There are two ways to display this metric, since the handling differs depending on the reason.
I am not sure which approach is more appropriate, and I would appreciate any advice you can provide.

  • Option 1:

When there is no backoff and no backoff has occurred before,
node_group_backoff_status{node_group=foo,reason=""} 0

When there is a backoff
node_group_backoff_status{node_group=foo,reason=reason1} 1

When the reason changes (although this probability is low), it is possible to see
node_group_backoff_status{node_group=foo,reason=reason1} 0
node_group_backoff_status{node_group=foo,reason=reason2} 1

After recovery
node_group_backoff_status{node_group=foo,reason=reason1} 0
node_group_backoff_status{node_group=foo,reason=reason2} 0

  • Option 2:

When there is no backoff and no backoff has occurred before,
node_group_backoff_status{node_group=foo,reason=""} 0

When there is a backoff
node_group_backoff_status{node_group=foo,reason=reason1} 1

When the reason changes (although this probability is low), it is possible to see
node_group_backoff_status{node_group=foo,reason=reason2} 1

After recovery
node_group_backoff_status{node_group=foo,reason=reason2} 0

Member

I think both are fine, though I slightly prefer the former. The reason is that whatever is collecting these metrics will be able to clearly say that a given metric stream is "done" when the backoff reason changes or goes away. With the second approach I think there can be artifacts on the graph, since a dashboard won't correlate different streams without some careful query crafting.

The way I would implement this is by keeping a map with a {node_group, reason} pair as the key. It would then be updated in bulk (a sketch follows this list):

  1. Iterate through the map and set all values to 0
  2. Iterate through the node groups and set corresponding {node_group, reason} value to 1 (or true). Optionally, inject {node_group, ""} keys set to 0 for all node groups that are not in backoff. Not sure if that second part is actually useful for anything, perhaps not.
  3. Use a custom collector that will read values from the map when prometheus endpoint is scraped.
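
For illustration, a hypothetical sketch of that map-plus-custom-collector idea in plain client_golang terms (all names here are illustrative, not the PR's final API):

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// backoffKey identifies a {node group, reason} pair.
type backoffKey struct {
	nodeGroup string
	reason    string
}

// backoffCollector keeps the current backoff state in a map and only emits
// what the map contains when the /metrics endpoint is scraped.
type backoffCollector struct {
	mu     sync.Mutex
	desc   *prometheus.Desc
	status map[backoffKey]bool
}

func newBackoffCollector() *backoffCollector {
	return &backoffCollector{
		desc: prometheus.NewDesc(
			"cluster_autoscaler_node_group_backoff_status",
			"Whether or not the node group is in backoff, and for which reason.",
			[]string{"node_group", "reason"}, nil),
		status: map[backoffKey]bool{},
	}
}

// update runs once per autoscaler loop: zero every previously reported pair,
// then mark the node groups that are currently in backoff.
func (c *backoffCollector) update(currentBackoffs map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.status {
		c.status[k] = false
	}
	for nodeGroup, reason := range currentBackoffs {
		c.status[backoffKey{nodeGroup, reason}] = true
	}
}

func (c *backoffCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *backoffCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, inBackoff := range c.status {
		value := 0.0
		if inBackoff {
			value = 1.0
		}
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, value, k.nodeGroup, k.reason)
	}
}

Whether a recovered {node_group, reason} pair keeps being reported as 0 (keys are zeroed) or disappears entirely (keys are deleted instead) is a design choice that corresponds to the two options discussed above.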

Contributor Author

Understood, thank you! I will submit the code changes as soon as possible.

Contributor Author

Modifications done and PR submitted. Because the errorCode is provided by the node provider and is open-ended, the collector is dynamic and iterates over all reasons that have ever been reported.

guopeng0 (Contributor Author) commented Jan 19, 2024

@x13n Please review my changes

@guopeng0 guopeng0 requested a review from x13n January 13, 2024 11:37
x13n (Member) left a comment

Apologies for the late response; added some comments.


// updateNodeGroupBackoffStatusMetrics returns information about backoff situation and reason of the node group
func (csr *ClusterStateRegistry) updateNodeGroupBackoffStatusMetrics(nodeGroup string, backoffStatus backoff.Status) {
backoffReasonStatus := make(BackoffReasonStatus)
Member

nit: Do you actually need to allocate this map every time? Zeroing the existing one would have the exact same effect, right? You're iterating over it anyway.

Contributor Author

Modifications done and PR submitted

@@ -116,6 +116,9 @@ type ScaleUpFailure struct {
Time time.Time
}

// BackoffReasonStatus contains information about backoff status and reason
type BackoffReasonStatus map[string]int
Member

Since the values here will always be either 0 or 1, maybe map[string]bool would make more sense?

Contributor Author

Modifications done and PR submitted

@@ -267,6 +267,11 @@ const (
InstanceDeleting InstanceState = 3
)

const (
// UnknownErrorCode means that the cloud provider has not provided an error code.
UnknownErrorCode = "unknown"
Member

Why is this in cloudprovider package?

Contributor Author

Modifications done and PR submitted

@@ -462,6 +467,38 @@ func (csr *ClusterStateRegistry) updateNodeGroupMetrics() {
metrics.UpdateNodeGroupsCount(autoscaled, autoprovisioned)
}

// UpdateSafeScaleUpMetricsForNodeGroup queries the health status and backoff situation of the node group and updates metrics
func (csr *ClusterStateRegistry) UpdateSafeScaleUpMetricsForNodeGroup(now time.Time) {
Member

Both the ForNodeGroup suffix and the comment above suggest this is about a specific node group, while in fact it iterates over all of them.

Contributor Author

Modifications done and PR submitted

@@ -406,6 +406,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) caerrors.AutoscalerErr
if err != nil {
klog.Errorf("AutoscalingStatusProcessor error: %v.", err)
}
a.clusterStateRegistry.UpdateSafeScaleUpMetricsForNodeGroup(currentTime)
Member

This would better fit as an AutoscalingStatusProcessor implementation - all the relevant params are already passed in the Process call above. Let's update this function:

func NewDefaultAutoscalingStatusProcessor() AutoscalingStatusProcessor {
return &NoOpAutoscalingStatusProcessor{}
}

to return a new processor that does the metric update. The new processor would also be a better place to keep the backoffReasonStatus map - clusterStateRegistry should be responsible for tracking the cluster state, not for metrics handling.
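
For illustration, a rough, hypothetical sketch of such a processor. The Process signature mirrors what the existing no-op processor is assumed to receive (AutoscalingContext, ClusterStateRegistry, current time), and metrics.UpdateNodeGroupHealthStatus is an illustrative helper name rather than the PR's final API:

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/clusterstate"
	"k8s.io/autoscaler/cluster-autoscaler/context"
	"k8s.io/autoscaler/cluster-autoscaler/metrics"
)

// nodeGroupMetricsProcessor owns the metric bookkeeping instead of
// ClusterStateRegistry, which stays responsible for cluster state only.
type nodeGroupMetricsProcessor struct {
	// node group -> reason -> whether backoff was reported in the last loop.
	backoffReasonStatus map[string]map[string]bool
}

func NewDefaultAutoscalingStatusProcessor() AutoscalingStatusProcessor {
	return &nodeGroupMetricsProcessor{backoffReasonStatus: map[string]map[string]bool{}}
}

func (p *nodeGroupMetricsProcessor) Process(ctx *context.AutoscalingContext, csr *clusterstate.ClusterStateRegistry, now time.Time) error {
	for _, ng := range ctx.CloudProvider.NodeGroups() {
		// Health metric: 1 if the node group is healthy enough for autoscaling.
		metrics.UpdateNodeGroupHealthStatus(ng.Id(), csr.IsNodeGroupHealthy(ng.Id()))
		// The backoff bulk update (zero previously reported reasons, then set
		// the current one) from the earlier sketches would also be driven from
		// here, keeping ClusterStateRegistry free of metrics handling.
	}
	return nil
}

func (p *nodeGroupMetricsProcessor) CleanUp() {}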

Contributor Author

Modifications done and PR submitted

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 23, 2024
@guopeng0 guopeng0 requested a review from x13n January 24, 2024 02:28
x13n (Member) commented Jan 24, 2024

Thanks for the changes!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2024
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: guopeng0, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2024
@k8s-ci-robot k8s-ci-robot merged commit a2f8902 into kubernetes:master Jan 24, 2024
6 checks passed
lukasmrtvy

How should this work for AWS when there are no spot instances available? For example: "Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed."

Thanks
