
Termination now only evicts pods that don't tolerate Unschedulable #479

Merged: njtran merged 4 commits into aws:main from drainDaemon on Jun 30, 2021
Conversation

@njtran (Contributor) commented on Jun 25, 2021

Issue #, if available:
When draining nodes, we did not gracefully evict daemonset pods; once no other pods remained, they were left to terminate forcefully along with the node.

Description of changes:

  • This change makes it so that we evict only the pods that will not be rescheduled onto the node once it is cordoned, i.e. pods that do not tolerate the unschedulable taint (see the sketch after this list)
  • Additionally, this removes the duplicated, unused Reconcile() function in the Reallocation controller.
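
A minimal sketch of the new filter (illustrative only; the helper name and exact shape of the PR's code are assumptions): pods that tolerate the node.kubernetes.io/unschedulable taint would simply be rescheduled back onto the cordoned node, so only the rest are evicted.

import (
	v1 "k8s.io/api/core/v1"
)

// Sketch: return the pods worth evicting from a cordoned node, i.e. those
// that do not tolerate the unschedulable taint (daemonset pods usually do).
func evictablePods(pods []*v1.Pod) []*v1.Pod {
	unschedulable := v1.Taint{
		Key:    v1.TaintNodeUnschedulable, // "node.kubernetes.io/unschedulable"
		Effect: v1.TaintEffectNoSchedule,
	}
	evictable := []*v1.Pod{}
	for _, pod := range pods {
		tolerates := false
		for _, toleration := range pod.Spec.Tolerations {
			if toleration.ToleratesTaint(&unschedulable) {
				tolerates = true
				break
			}
		}
		if !tolerates {
			evictable = append(evictable, pod)
		}
	}
	return evictable
}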

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@JacobGabrielson (Contributor) commented on the diff:

	},
}); err != nil {
	// If an eviction fails, we need to eventually try again
	zap.S().Debugf("Continuing after failing to evict pod %s from node %s, %s", p.Name, p.Spec.NodeName, err.Error())

I think we need to be more crisp with this log line (i.e. say whether it was a 429 or not).

Contributor:

We also need to figure out whether we can differentiate the case where PDBs are misconfigured vs. a server-side error.

@njtran (Author):

I want to punt on this for an upcoming PR on more robust pod evictions. Both 429 and 500 error codes need an exponential backoff and retry, which is what we currently do.
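
(For context, a minimal sketch of the retry described above, assuming a controller-runtime reconciler on the PR's Terminator; the method name is illustrative, not the PR's code:)

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Sketch: rather than failing the drain on one eviction error, requeue the
// work item; the workqueue's default rate limiter applies per-item
// exponential backoff, so 429s and 500s get retried the same way.
func (t *Terminator) reconcileEvictions(ctx context.Context, pods []*v1.Pod) (reconcile.Result, error) {
	if allEvicted := t.evictPods(ctx, pods); !allEvicted {
		return reconcile.Result{Requeue: true}, nil
	}
	return reconcile.Result{}, nil
}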

@@ -92,7 +99,27 @@ func (t *Terminator) terminate(ctx context.Context, node *v1.Node) error {
return nil
}

-// getPods returns a list of pods scheduled to a node
+// evictPods returns true if there are no evictable pods
 func (t *Terminator) evictPods(ctx context.Context, pods []*v1.Pod) bool {
Contributor:
I think we need to make this function return (bool, error) and differentiate between a 429 (backoff) and other errors like a 403.

@njtran (Author):
I don't know if that's quite necessary, as both would require an exponential backoff and retry; any time we return false here, we need to retry anyway. I can see this becoming necessary as we implement more robust eviction logic, but since we don't want to block all evictions on one failed eviction, this works for now.
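
(For reference, a minimal sketch of the differentiation suggested above, using the error helpers in k8s.io/apimachinery; the function name and the policy it encodes are assumptions:)

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// Sketch: classify an eviction failure so a caller could back off on a 429
// (the Eviction API returns 429 when a PDB currently blocks the eviction)
// while surfacing hard failures such as a 403.
func classifyEvictionError(err error) (retry bool, fatal error) {
	if err == nil {
		return false, nil
	}
	if apierrors.IsTooManyRequests(err) {
		return true, nil // 429: blocked by a PDB right now; back off and retry
	}
	if apierrors.IsForbidden(err) {
		return false, err // 403: retrying will not help
	}
	return true, nil // treat everything else (e.g. 500s) as retryable
}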

(Outdated thread on pkg/controllers/terminating/v1alpha1/terminate.go, resolved.)
}

-// ToleratesTaint returns true if the pod tolerates the taint
+// ToleratesAllTaints returns an error if the pod does not tolerate the taints
Contributor:
ToleratesAllTaints -> ToleratesTaints? the taints -> all of the taints?
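
(A minimal sketch of a helper matching the new doc comment; the signature and error message are assumptions, not the PR's code:)

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Sketch: ToleratesAllTaints returns an error naming the first taint the pod
// does not tolerate, or nil if the pod tolerates all of them.
func ToleratesAllTaints(pod *v1.Pod, taints []v1.Taint) error {
	for i := range taints {
		if !toleratesTaint(pod, &taints[i]) {
			return fmt.Errorf("pod %s does not tolerate taint %s with effect %s",
				pod.Name, taints[i].Key, taints[i].Effect)
		}
	}
	return nil
}

func toleratesTaint(pod *v1.Pod, taint *v1.Taint) bool {
	for _, toleration := range pod.Spec.Tolerations {
		if toleration.ToleratesTaint(taint) {
			return true
		}
	}
	return false
}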

@ellistarn previously approved these changes on Jun 30, 2021
@njtran merged commit 2f939eb into aws:main on Jun 30, 2021
@njtran deleted the drainDaemon branch on July 1, 2021 at 19:54