
Termination now only evicts pods that don't tolerate Unschedulable #479

Merged: njtran merged 4 commits into aws:main from drainDaemon on Jun 30, 2021
Conversation

@njtran (Contributor) commented on Jun 25, 2021

Issue #, if available:
When draining nodes, we did not gracefully evict daemonset pods; once no other pods remained, they were left to terminate forcefully along with the node.

Description of changes:

  • This change makes it so that we evict only the pods that will not be rescheduled onto the node once it is cordoned, i.e. pods that do not tolerate the unschedulable taint (see the sketch after this list)
  • Additionally, this removes the duplicated, unused Reconcile() function in the Reallocation controller.
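
A minimal sketch of the new filter (illustrative only; the helper name and exact shape of the PR's code are assumptions): pods that tolerate the node.kubernetes.io/unschedulable taint would simply be rescheduled back onto the cordoned node, so only the rest are evicted.

import (
	v1 "k8s.io/api/core/v1"
)

// Sketch: return the pods worth evicting from a cordoned node, i.e. those
// that do not tolerate the unschedulable taint (daemonset pods usually do).
func evictablePods(pods []*v1.Pod) []*v1.Pod {
	unschedulable := v1.Taint{
		Key:    v1.TaintNodeUnschedulable, // "node.kubernetes.io/unschedulable"
		Effect: v1.TaintEffectNoSchedule,
	}
	evictable := []*v1.Pod{}
	for _, pod := range pods {
		tolerates := false
		for _, toleration := range pod.Spec.Tolerations {
			if toleration.ToleratesTaint(&unschedulable) {
				tolerates = true
				break
			}
		}
		if !tolerates {
			evictable = append(evictable, pod)
		}
	}
	return evictable
}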

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@JacobGabrielson (Contributor) commented on the diff:

	},
}); err != nil {
	// If an eviction fails, we need to eventually try again
	zap.S().Debugf("Continuing after failing to evict pod %s from node %s, %s", p.Name, p.Spec.NodeName, err.Error())

I think we need to be more crisp with this log line (i.e. say whether it was a 429 or not).

Contributor:

We also need to figure out whether we can differentiate the case where PDBs are misconfigured vs. a server-side error.

@njtran (Author):

I want to punt on this for an upcoming PR on more robust pod evictions. Both 429 and 500 error codes need an exponential backoff and retry, which is what we currently do.
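
(For context, a minimal sketch of the retry described above, assuming a controller-runtime reconciler on the PR's Terminator; the method name is illustrative, not the PR's code:)

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Sketch: rather than failing the drain on one eviction error, requeue the
// work item; the workqueue's default rate limiter applies per-item
// exponential backoff, so 429s and 500s get retried the same way.
func (t *Terminator) reconcileEvictions(ctx context.Context, pods []*v1.Pod) (reconcile.Result, error) {
	if allEvicted := t.evictPods(ctx, pods); !allEvicted {
		return reconcile.Result{Requeue: true}, nil
	}
	return reconcile.Result{}, nil
}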

@@ -92,7 +99,27 @@ func (t *Terminator) terminate(ctx context.Context, node *v1.Node) error {
return nil
}

-// getPods returns a list of pods scheduled to a node
+// evictPods returns true if there are no evictable pods
 func (t *Terminator) evictPods(ctx context.Context, pods []*v1.Pod) bool {
Contributor:
I think we need to make this function return (bool, error) and differentiate between a 429 (backoff) and other errors like a 403.

@njtran (Author):
I don't know if that's quite necessary, as both would require an exponential backoff and retry; any time we return false here, we need to retry anyway. I can see this becoming necessary as we implement more robust eviction logic, but since we don't want to block all evictions on one failed eviction, this works for now.
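
(For reference, a minimal sketch of the differentiation suggested above, using the error helpers in k8s.io/apimachinery; the function name and the policy it encodes are assumptions:)

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// Sketch: classify an eviction failure so a caller could back off on a 429
// (the Eviction API returns 429 when a PDB currently blocks the eviction)
// while surfacing hard failures such as a 403.
func classifyEvictionError(err error) (retry bool, fatal error) {
	if err == nil {
		return false, nil
	}
	if apierrors.IsTooManyRequests(err) {
		return true, nil // 429: blocked by a PDB right now; back off and retry
	}
	if apierrors.IsForbidden(err) {
		return false, err // 403: retrying will not help
	}
	return true, nil // treat everything else (e.g. 500s) as retryable
}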

(Outdated thread on pkg/controllers/terminating/v1alpha1/terminate.go, resolved.)
}

-// ToleratesTaint returns true if the pod tolerates the taint
+// ToleratesAllTaints returns an error if the pod does not tolerate the taints
Contributor:
ToleratesAllTaints -> ToleratesTaints? the taints -> all of the taints?
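
(A minimal sketch of a helper matching the new doc comment; the signature and error message are assumptions, not the PR's code:)

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Sketch: ToleratesAllTaints returns an error naming the first taint the pod
// does not tolerate, or nil if the pod tolerates all of them.
func ToleratesAllTaints(pod *v1.Pod, taints []v1.Taint) error {
	for i := range taints {
		if !toleratesTaint(pod, &taints[i]) {
			return fmt.Errorf("pod %s does not tolerate taint %s with effect %s",
				pod.Name, taints[i].Key, taints[i].Effect)
		}
	}
	return nil
}

func toleratesTaint(pod *v1.Pod, taint *v1.Taint) bool {
	for _, toleration := range pod.Spec.Tolerations {
		if toleration.ToleratesTaint(taint) {
			return true
		}
	}
	return false
}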

@ellistarn previously approved these changes on Jun 30, 2021
@njtran merged commit 2f939eb into aws:main on Jun 30, 2021
@njtran deleted the drainDaemon branch on July 1, 2021 at 19:54