-
Notifications
You must be signed in to change notification settings - Fork 979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Termination now only evicts pods that don't tolerate Unschedulable #479
Conversation
Will any of this change with https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown ? |
}, | ||
}); err != nil { | ||
// If an eviction fails, we need to eventually try again | ||
zap.S().Debugf("Continuing after failing to evict pod %s from node %s, %s", p.Name, p.Spec.NodeName, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to be more crisp with this log line (i.e. 429 or not)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to figure out if we can differentiate the case where PDBs are misconfigured vs a server side error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to punt on this for an upcoming PR on more robust pod evictions. Currently, both 429 and 500 error codes will need an exponential back off and retry, which is currently what we do.
@@ -92,7 +99,27 @@ func (t *Terminator) terminate(ctx context.Context, node *v1.Node) error { | |||
return nil | |||
} | |||
|
|||
// getPods returns a list of pods scheduled to a node | |||
// evictPods returns true if there are no evictable pods | |||
func (t *Terminator) evictPods(ctx context.Context, pods []*v1.Pod) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to make this function return (bool,error) and differentiate between a 429 (backoff) and other errors like 403.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if that's quite necessary as both would require an exponential backoff and retry. Anytime we would be returning false here, we would need to retry anyways. I can see in the future as we implement more robust eviction logic that this might be necessary, but as we don't want to block all evictions on one failed eviction, this works right now.
pkg/utils/pod/scheduling.go
Outdated
} | ||
|
||
// ToleratesTaint returns true if the pod tolerates the taint | ||
// ToleratesAllTaints returns an error if the pod does not tolerate the taints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ToleratesAllTaints
-> ToleratesTaints
? the taints
-> all of the taints
?
Issue #, if available:
When draining nodes, we did not gracefully evict any daemonset pods, but left them to forcefully terminate along with the node, when no other pods remained.
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.