Don't pile up successive full refreshes during AWS scaledowns

Force-refreshing everything on every `DeleteNodes` call causes slowdowns and throttling on large clusters with many ASGs (and a lot of activity). That function might be called several times in a row during scale-down (once for each ASG having a node to be removed). Each time, the forced refresh re-discovers all ASGs and all LaunchConfigurations, then re-lists all instances from the discovered ASGs.

That immediate refresh isn't required anyway, as the cache's DeleteInstances concrete implementation will decrement the nodegroup size, and we can schedule a grouped refresh for the next loop iteration.

As a later step, I'm considering splitting the asgCache.generate() function to support per-ASG refreshes (and maybe per-ASG cache TTLs + jitter, to spread API calls). But this change should address the current issue for now.
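
To illustrate the "schedule a grouped refresh" idea, here is a minimal, self-contained sketch. It is not the autoscaler's code: the `manager` type, its `refreshCount` field, and the one-minute `refreshInterval` value are hypothetical stand-ins, and the interval check in `Refresh()` is an assumption about how the periodic refresh is gated. It only shows that back-dating `lastRefresh` makes the next periodic refresh fire once, however many deletions preceded it.

```go
package main

import (
	"fmt"
	"time"
)

// refreshInterval reuses the constant name from the diff; the value is an assumption.
const refreshInterval = 1 * time.Minute

// manager is a hypothetical stand-in for AwsManager.
type manager struct {
	lastRefresh  time.Time
	refreshCount int // number of full (expensive) refreshes performed
}

// forceRefresh simulates the expensive rediscovery of all ASGs and instances.
func (m *manager) forceRefresh() {
	m.refreshCount++
	m.lastRefresh = time.Now()
}

// Refresh is the rate-limited entry point, assumed to be called once per loop
// iteration. It does nothing while the last refresh is younger than refreshInterval.
func (m *manager) Refresh() {
	if m.lastRefresh.Add(refreshInterval).After(time.Now()) {
		return
	}
	m.forceRefresh()
}

// DeleteInstances no longer refreshes immediately. It back-dates lastRefresh
// so that the next Refresh() call is guaranteed to run a full refresh.
func (m *manager) DeleteInstances() {
	m.lastRefresh = time.Now().Add(-refreshInterval)
}

func main() {
	m := &manager{}
	m.Refresh() // first loop iteration: full refresh
	m.Refresh() // still fresh: skipped

	// Scale-down touching five ASGs in a row: no refresh happens here.
	for i := 0; i < 5; i++ {
		m.DeleteInstances()
	}

	m.Refresh() // next loop iteration: one grouped refresh
	m.Refresh() // fresh again: skipped

	fmt.Println(m.refreshCount) // prints 2
}
```

With the previous behaviour (a forced refresh inside each deletion call), the same sequence would have performed six full refreshes instead of two.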
kubernetes · Jan 6, 2021 · 0f745a5
1 parent 7761d70
Showing 1 changed file with 3 additions and 2 deletions.
diff --git a/cluster-autoscaler/cloudprovider/aws/aws_manager.go b/cluster-autoscaler/cloudprovider/aws/aws_manager.go
@@ -294,8 +294,9 @@ func (m *AwsManager) DeleteInstances(instances []*AwsInstanceRef) error {
 	if err := m.asgCache.DeleteInstances(instances); err != nil {
 		return err
 	}
-	klog.V(2).Infof("Some ASG instances might have been deleted, forcing ASG list refresh")
-	return m.forceRefresh()
+	klog.V(2).Infof("Some ASG instances might have been deleted, scheduling an ASG list refresh")
+	m.lastRefresh = time.Now().Add(-refreshInterval)
+	return nil
 }
 
 // GetAsgNodes returns Asg nodes.