Skip to content

Commit

Permalink
Fix deadlock in DecreaseTargetSize by filtering placeholders
Browse files Browse the repository at this point in the history
This commit resolves an issue where the DecreaseTargetSize function could enter a deadlock state by attempting to scale down when only placeholder nodes, marked as 'placeholderUnfulfillableStatus', are present. The updated function now filters out these placeholder nodes before calculating if a decrease in target size is permissible, ensuring that only operational nodes are considered in the scaling process. This prevents erroneous scaling activities that could impact cluster stability and node management in AWS Auto Scaling Groups.

- Added checks to exclude placeholders in DecreaseTargetSize calculations.

- Enhanced logging for better clarity when instance statuses are fetched but continue despite errors.

Fixes #6128
  • Loading branch information
ruiscosta committed May 10, 2024
1 parent 3dda9c1 commit f731506
Showing 1 changed file with 28 additions and 4 deletions.
32 changes: 28 additions & 4 deletions cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go
Original file line number Diff line number Diff line change
Expand Up @@ -288,22 +288,46 @@ func (ng *AwsNodeGroup) AtomicIncreaseSize(delta int) error {
// DecreaseTargetSize decreases the target size of the node group. This function
// doesn't permit to delete any existing node and can be used only to reduce the
// request for new nodes that have not been yet fulfilled. Delta should be negative.
// It is assumed that cloud provider will not delete the existing nodes if the size
// when there is an option to just decrease the target.
// It is assumed that the cloud provider will not delete the existing nodes if there is an
// option to just decrease the target.
func (ng *AwsNodeGroup) DecreaseTargetSize(delta int) error {
// Ensure that the delta is negative to prevent an increase in size
if delta >= 0 {
return fmt.Errorf("size decrease size must be negative")
return fmt.Errorf("size decrease must be negative")
}

// Get the current size of the ASG
size := ng.asg.curSize

// Retrieve all nodes of the ASG
nodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)
if err != nil {
return err
return err // Error retrieving nodes
}

// Filter out any nodes that are marked as placeholders and cannot be fulfilled
filteredNodes := make([]AwsInstanceRef, 0)
for i := range nodes {
node := nodes[i]
instanceStatus, err := ng.awsManager.GetInstanceStatus(node)
if err != nil {
// Log if unable to get instance status but continue processing
klog.V(4).Infof("Could not get instance status, continuing anyways: %v", err)
} else if instanceStatus != nil && *instanceStatus == placeholderUnfulfillableStatus {
// Skip over placeholders that are unfulfillable
continue
}
filteredNodes = append(filteredNodes, node)
}
nodes = filteredNodes

// Check that the new size will not be less than the number of existing nodes
if int(size)+delta < len(nodes) {
return fmt.Errorf("attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
size, delta, len(nodes))
}

// Set the new size of the ASG
return ng.awsManager.SetAsgSize(ng.asg, size+delta)
}

Expand Down

0 comments on commit f731506

Please sign in to comment.