Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scale-down-delay-after-delete parameter doesn't work properly #3568

Closed
marwanad opened this issue Sep 30, 2020 · 12 comments
Closed

scale-down-delay-after-delete parameter doesn't work properly #3568

marwanad opened this issue Sep 30, 2020 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@marwanad
Copy link
Member

I believe that started around the time the switch to the new scale down processor happened.

The parameter goes into effect and puts scaledown in cooldown mode based on the lastScaleDownDeleteTime.

a.lastScaleDownDeleteTime.Add(a.ScaleDownDelayAfterDelete).After(currentTime)

However, this is only set when the scale down status result is ScaleDownNodeDeleted

if scaleDownStatus.Result == status.ScaleDownNodeDeleted {
a.lastScaleDownDeleteTime = currentTime
a.clusterStateRegistry.Recalculate()
}

And it seems with the switch to the async deletions, this is no longer set:

go func() {
// Finishing the delete process once this goroutine is over.
var result status.NodeDeleteResult
defer func() { sd.nodeDeletionTracker.AddNodeDeleteResult(toRemove.Node.Name, result) }()
defer sd.nodeDeletionTracker.SetNonEmptyNodeDeleteInProgress(false)
nodeGroup, found := candidateNodeGroups[toRemove.Node.Name]
if !found {
result = status.NodeDeleteResult{ResultType: status.NodeDeleteErrorFailedToDelete, Err: errors.NewAutoscalerError(
errors.InternalError, "failed to find node group for %s", toRemove.Node.Name)}
return
}
result = sd.deleteNode(toRemove.Node, toRemove.PodsToReschedule, nodeGroup)
if result.ResultType != status.NodeDeleteOk {
klog.Errorf("Failed to delete %s: %v", toRemove.Node.Name, result.Err)
return
}
if readinessMap[toRemove.Node.Name] {
metrics.RegisterScaleDown(1, gpu.GetGpuTypeForMetrics(gpuLabel, availableGPUTypes, toRemove.Node, nodeGroup), metrics.Underutilized)
} else {
metrics.RegisterScaleDown(1, gpu.GetGpuTypeForMetrics(gpuLabel, availableGPUTypes, toRemove.Node, nodeGroup), metrics.Unready)
}
}()

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 30, 2020
@marwanad
Copy link
Member Author

@MaciekPytel @towca

@marwanad
Copy link
Member Author

Should we be returning ScaleDownNodeDeleted instead of ScaleDownNodeDeleteStarted - it seems that the later isn't used around so it seems like the most straightforward fix

@towca
Copy link
Collaborator

towca commented Oct 1, 2020

Thanks for pointing this out Marwan! Although I'd stick to using ScaleDownNodeDeleteStarted since it better conveys what actually happens. I'll send out a fix shortly.

@towca
Copy link
Collaborator

towca commented Oct 1, 2020

The bug should be fixed by #3570.

@towca towca closed this as completed Oct 1, 2020
@ryaneorth
Copy link

thanks @towca @marwanad for finding/fixing this issue! any idea when/if this will make it into a patch release for versions 1.16.x/1.17.x?

@ryaneorth
Copy link

@MaciekPytel - maybe you know the answer to the above question?

@marwanad
Copy link
Member Author

marwanad commented Oct 9, 2020

@ryaneorth we can prepare the cherry-pick PRs. I'm guessing we'll have one set of patch releases before K8s 1.20 and another around 1.20.

@ryaneorth
Copy link

Sounds great, thanks @marwanad . I'm happy to perform the cherry-picks if you'd like - let me know!

@marwanad
Copy link
Member Author

marwanad commented Oct 9, 2020

@ryaneorth that would be great, thanks!

@ryaneorth
Copy link

Done!

#3597
#3598
#3599
#3600

@ryaneorth
Copy link

All of the above cherry-picks are complete. @MaciekPytel - do you have any information as to when the next patches will be released?

@marwanad
Copy link
Member Author

@ryaneorth keep an 👀 on #3611

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

4 participants