
Trigger termination logic on watch events #446

Merged: 4 commits, Jun 24, 2021
Conversation

njtran (Contributor) commented Jun 10, 2021:

Issue #, if available:

Description of changes:

  • Termination controller reacts to all node events, and reconciles on a per-node basis, rather than querying listNodes for every reconcile.
  • GenericController can work with controllers triggered on an interval or by watch events.
  • Added status patching so that the node's status is updated based on the result of the reconcile loop.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

-func (p *InstanceProvider) Terminate(ctx context.Context, nodes []*v1.Node) error {
-	if len(nodes) == 0 {
-		return nil
+func (p *InstanceProvider) Terminate(ctx context.Context, node *v1.Node) error {
Contributor:
What is EC2's delete instance default quota? We're moving from an n -> 1 model for deletion.

Contributor Author (njtran):

Since this is in line with how we're doing watch events, I'd like to re-evaluate this when we get to scale testing.

Contributor:

FYI, it's pretty high for TerminateInstances, so this is probably fine, but a good thing to watch for when scale testing:

1000 max bucket size, 20/second refill rate

https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html#throttling-limits
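The quota model in the linked docs is a token bucket: a burst allowance of 1000 tokens refilled at 20 tokens/second. A minimal stdlib sketch (the type and field names are illustrative, not from Karpenter or the AWS SDK) shows why moving to per-node deletes is fine for bursts but bounded at the refill rate under sustained load:

```go
package main

import "fmt"

// bucket is an illustrative token bucket matching EC2's documented
// throttling for mutating calls like TerminateInstances:
// up to 1000 tokens, refilled at 20 tokens per second.
type bucket struct {
	tokens, max, refillPerSec int
}

// tick refills the bucket for the given number of elapsed seconds,
// capped at the bucket's maximum.
func (b *bucket) tick(seconds int) {
	b.tokens += b.refillPerSec * seconds
	if b.tokens > b.max {
		b.tokens = b.max
	}
}

// take consumes one token, reporting whether the call was admitted.
func (b *bucket) take() bool {
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := &bucket{tokens: 1000, max: 1000, refillPerSec: 20}
	// A full bucket admits a burst of 1000 calls...
	calls := 0
	for b.take() {
		calls++
	}
	fmt.Println(calls) // 1000
	// ...after which sustained throughput is bounded by the refill rate.
	b.tick(1)
	fmt.Println(b.tokens) // 20
}
```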

}
zap.S().Infof("Terminated node %s", node.Name)
Contributor:
This logline will conflict with :95

Contributor Author (njtran), Jun 11, 2021:
Since the Node controller will clean up the node even if we fail to delete, I'll change :95 to say "deleted node object" and :97 to refer to the terminated instance.

WDYT?

@@ -116,6 +116,7 @@ const (
 ProvisionerUnderutilizedPhase = "underutilized"
 ProvisionerTerminablePhase = "terminable"
 ProvisionerDrainingPhase = "draining"
+ProvisionerDrainedPhase = "drained"
Contributor:

It doesn't look like this phase is ever set or used

Contributor Author (njtran):

We do some checks to see if it has the label, but I forgot to add the patch that changes the node object to contain this. Thanks.

@@ -41,10 +40,10 @@ func (c *GenericController) Reconcile(ctx context.Context, req reconcile.Request
 // 1. Read Spec
 resource := c.For()
 if err := c.Get(ctx, req.NamespacedName, resource); err != nil {
-if errors.IsNotFound(err) {
+if errors.IsNotFound(err) || errors.IsGone(err) {
Contributor:
What could cause the resource to be "Gone"? I'm not aware of Kubernetes using this. I also haven't seen it in other operator implementations.

Contributor Author (njtran):

Jacob recently commented about this. Here's the doc that convinced me to include this.
https://kubernetes.io/docs/reference/using-api/api-concepts/#410-gone-responses

Contributor:

BTW, I hadn't done all my homework so it's possibly unnecessary. According to the docs you found it looks like maybe not?

When using resourceVersionMatch=NotOlderThan and limit is set, clients must handle HTTP 410 "Gone" responses. For example, the client might retry with a newer resourceVersion or fall back to resourceVersion="".

(I'm not sure if NotOlderThan and limit is actually being passed anywhere under the covers?)

Contributor:

Necessarily we're using latest here, so Gone is not possible.

Contributor:

Ok, seems worth reverting - apologies for any churn I may have caused here.


// drain takes an expired, non-drained node and evicts its pods
func (t *Terminator) drain(ctx context.Context, node *v1.Node) (bool, error) {
if phase, ok := node.Labels[provisioning.ProvisionerPhaseLabel]; ok && phase == provisioning.ProvisionerDrainedPhase {
Contributor:
You don't need both of these checks, if not ok, then phase will be "". You can just do node.Labels[provisioning.ProvisionerPhaseLabel] == provisioning.ProvisionerDrainedPhase
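The suggestion works because indexing a Go map with a missing key yields the value type's zero value, so comparing directly against a non-empty phase string is safe without the two-value `ok` form. A minimal stdlib sketch (the label key here is illustrative):

```go
package main

import "fmt"

func main() {
	// Analogue of node.Labels; the phase key is absent initially.
	labels := map[string]string{}

	// Indexing a missing key returns the zero value "", so a direct
	// comparison against a non-empty constant like "drained" is
	// simply false; no separate "ok" check is needed.
	fmt.Println(labels["provisioner/phase"] == "drained") // false

	labels["provisioner/phase"] = "drained"
	fmt.Println(labels["provisioner/phase"] == "drained") // true
}
```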

Contributor:

Also -- looking at the code, how does this work? Is ProvisionerDrainedPhase ever set? IMO, we shouldn't even need ProvisionerDrainedPhase

return false
},
// Deleting nodes directly is not supported.
// Our termination design will add a webhook to validate delete requests, and then send an update call.
Contributor:
You can probably drop both of these comments. Consider inlining all of these funcs as well

CreateFunc: func(e event.CreateEvent) bool { return false }, // no-op
DeleteFunc: func(e event.DeleteEvent) bool { return false }, // no-op
...

}

// getPods returns a list of pods scheduled to a node
func (t *Terminator) getPods(ctx context.Context, nodeName string) ([]*v1.Pod, error) {
Contributor:
Consider just removing this function entirely. You're looping through podList.Items anyways, so it may not be worth it to convert to a []*v1.Pod. This function doesn't actually remove any lines of code in the implementation above, you could do:

podList := &v1.PodList{}
if err := t.kubeClient.List(ctx, podList, client.MatchingFields{"spec.nodeName": nodeName}); err != nil {
    return nil, fmt.Errorf("listing pods on node %s, %w", nodeName, err)
}
empty := true
for _, p := range podList.Items {
   ...
}

@njtran njtran changed the title Trigger termination logic on watch events [WAITING FOR #466 MERGE] Trigger termination logic on watch events Jun 22, 2021
@njtran njtran changed the title [WAITING FOR #466 MERGE] Trigger termination logic on watch events Trigger termination logic on watch events Jun 23, 2021
"go.uber.org/zap"
v1 "k8s.io/api/core/v1"
Contributor:
the v1 alias isn't needed here.

return reconcile.Result{}, fmt.Errorf("terminating nodes, %w", err)
}
}
return reconcile.Result{Requeue: !drained}, nil
}

func (c *Controller) Watches(context.Context) (source.Source, handler.EventHandler, builder.WatchesOption) {
func (c *Controller) Watches() (source.Source, handler.EventHandler, builder.WatchesOption) {
Contributor:
Can we add comments to all these public funcs in the controller? Some have comments and some don't, but it'd be nice for people not familiar with writing controllers, like me, to understand what these funcs do at a high level.

@@ -46,7 +46,7 @@ type Controller interface {
// a. Source: the resource that is being watched
Contributor:
This comment should start with the func name, i.e. Watches.

zap.S().Errorf("Controller failed to reconcile kind %s, %s",
resource.GetObjectKind().GroupVersionKind().Kind, err.Error())
return reconcile.Result{Requeue: true}, nil
return result, err
Contributor:
Don't we want to patch if we get an error?

@@ -46,7 +46,7 @@ type Controller interface {
// a. Source: the resource that is being watched
// b. EventHandler: which controller objects to be reconciled
// c. WatchesOption: which events can be filtered out before processed
Watches(context.Context) (source.Source, handler.EventHandler, builder.WatchesOption)
Contributor:
We can remove the Interval() API here now as well, right?

Contributor Author (njtran):

I'm keeping the Interval() here as the allocation and reallocation controllers still rely on it.
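The split being discussed is that one generic controller can drive reconciles either on a polling interval or by watch events. A minimal sketch of one way to express that, where a zero interval marks a watch-driven controller; all names here are illustrative stand-ins, not Karpenter's actual types:

```go
package main

import (
	"fmt"
	"time"
)

// Controller is an illustrative slice of a generic controller
// interface: an Interval of 0 means "watch-driven, no polling",
// while a positive Interval means the controller is polled.
type Controller interface {
	Interval() time.Duration
}

// terminationController reacts to node watch events, so it
// reports no polling interval.
type terminationController struct{}

func (terminationController) Interval() time.Duration { return 0 }

// allocationController still relies on interval-based reconciling.
type allocationController struct{}

func (allocationController) Interval() time.Duration { return 5 * time.Second }

// driveMode reports how the generic machinery would drive a controller.
func driveMode(c Controller) string {
	if c.Interval() == 0 {
		return "watch"
	}
	return "interval"
}

func main() {
	fmt.Println(driveMode(terminationController{})) // watch
	fmt.Println(driveMode(allocationController{}))  // interval
}
```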

@@ -91,6 +91,7 @@ rules:
resources:
- nodes
- pods
- nodes/status
Contributor (ellistarn), Jun 24, 2021:

We shouldn't be patching node status.

if _, err := c.Controller.Reconcile(ctx, resource); err != nil {
// 4. Reconcile
result, err := c.Controller.Reconcile(ctx, resource)
if err != nil {
zap.S().Errorf("Controller failed to reconcile kind %s, %s",
Contributor:
oneline!

@ellistarn ellistarn merged commit 687795b into aws:main Jun 24, 2021
@njtran njtran deleted the terminationWatcher branch July 1, 2021 19:54
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023