-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trigger termination logic on watch events #446
Conversation
pkg/cloudprovider/aws/instance.go
Outdated
func (p *InstanceProvider) Terminate(ctx context.Context, nodes []*v1.Node) error { | ||
if len(nodes) == 0 { | ||
return nil | ||
func (p *InstanceProvider) Terminate(ctx context.Context, node *v1.Node) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is EC2's delete instance default quota? We're moving from an n -> 1 model for deletion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is in line with how we're doing watch events, I'd like to re-evaluate this when we get to scale testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI It's pretty high for TerminateInstances
so probably fine but a good thing to watch for when scale testing:
1000 Max and 20 Refill rate
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html#throttling-limits
pkg/controllers/provisioning/v1alpha1/reallocation/utilization.go
Outdated
Show resolved
Hide resolved
} | ||
zap.S().Infof("Terminated node %s", node.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logline will conflict with :95
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the Node controller will clean up the node even if we fail to delete, I'll change :95 to be delete node object
and :97 to refer to terminated instance
.
WDYT?
@@ -116,6 +116,7 @@ const ( | |||
ProvisionerUnderutilizedPhase = "underutilized" | |||
ProvisionerTerminablePhase = "terminable" | |||
ProvisionerDrainingPhase = "draining" | |||
ProvisionerDrainedPhase = "drained" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like this phase is ever set or used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do some checks to see if it has the label, but I forgot to add the patch that changes the node object to contain this. Thanks.
pkg/controllers/controller.go
Outdated
@@ -41,10 +40,10 @@ func (c *GenericController) Reconcile(ctx context.Context, req reconcile.Request | |||
// 1. Read Spec | |||
resource := c.For() | |||
if err := c.Get(ctx, req.NamespacedName, resource); err != nil { | |||
if errors.IsNotFound(err) { | |||
if errors.IsNotFound(err) || errors.IsGone(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What could cause the resource to be "Gone". I'm not aware of kubernetes using this. I also haven't seen it in other operator implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jacob recently commented about this. Here's the doc that convinced me to include this.
https://kubernetes.io/docs/reference/using-api/api-concepts/#410-gone-responses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I hadn't done all my homework so it's possibly unnecessary. According to the docs you found it looks like maybe not?
When using resourceVersionMatch=NotOlderThan and limit is set, clients must handle HTTP 410 "Gone" responses. For example, the client might retry with a newer resourceVersion or fall back to resourceVersion="".
(I'm not sure if NotOlderThan and limit is actually being passed anywhere under the covers?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Necessarily we're using latest here, so Gone is not possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, seems worth reverting - apologies for any churn I may have caused here.
|
||
// terminateNodes takes in a list of expired non-drained nodes and calls drain on them | ||
func (t *Terminator) drain(ctx context.Context, node *v1.Node) (bool, error) { | ||
if phase, ok := node.Labels[provisioning.ProvisionerPhaseLabel]; ok && phase == provisioning.ProvisionerDrainedPhase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need both of these checks, if not ok, then phase will be "". You can just do node.Labels[provisioning.ProvisionerPhaseLabel] == provisioning.ProvisionerDrainedPhase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also -- looking at the code, how does this work? Is ProvisionerDrainedPhase ever set? IMO, we shouldn't even need ProvisionerDrainedPhase
return false | ||
}, | ||
// Deleting nodes directly is not supported. | ||
// Our termination design will add a webhook to validate delete requests, and then send an update call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably drop both of these comments. Consider inlining all of these funcs as well
CreateFunc: func(e event.CreateEvent) bool { return false }, // no-op
DeleteFunc: func(e event.DeleteEvent) bool { return false }, // no-op
...
} | ||
|
||
// getPods returns a list of pods scheduled to a node | ||
func (t *Terminator) getPods(ctx context.Context, nodeName string) ([]*v1.Pod, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider just removing this function entirely. You're looping through podList.Items anyways, so it may not be worth it to convert to a []*v1.Pod. This function doesn't actually remove any lines of code in the implementation above, you could do:
podList := &v1.PodList{}
if err := t.kubeClient.List(ctx, pods, client.MatchingFields{"spec.nodeName": node.Name}); err != nil {
return nil, fmt.Errorf("listing pods on node %s, %w", nodeName, err)
}
empty := true
for _, p := range podList.Items {
...
}
"go.uber.org/zap" | ||
v1 "k8s.io/api/core/v1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the v1 alias isn't needed here.
return reconcile.Result{}, fmt.Errorf("terminating nodes, %w", err) | ||
} | ||
} | ||
return reconcile.Result{Requeue: !drained}, nil | ||
} | ||
|
||
func (c *Controller) Watches(context.Context) (source.Source, handler.EventHandler, builder.WatchesOption) { | ||
func (c *Controller) Watches() (source.Source, handler.EventHandler, builder.WatchesOption) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add comments to all these public funcs in the controller. Some have comments and some don't, but it'd be nice for people not familiar with writing controllers, like me, to understand what these funcs do at a high level.
pkg/controllers/types.go
Outdated
@@ -46,7 +46,7 @@ type Controller interface { | |||
// a. Source: the resource that is being watched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should start with the func and be Watches
pkg/controllers/controller.go
Outdated
zap.S().Errorf("Controller failed to reconcile kind %s, %s", | ||
resource.GetObjectKind().GroupVersionKind().Kind, err.Error()) | ||
return reconcile.Result{Requeue: true}, nil | ||
return result, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want to patch if we get an error?
@@ -46,7 +46,7 @@ type Controller interface { | |||
// a. Source: the resource that is being watched | |||
// b. EventHandler: which controller objects to be reconciled | |||
// c. WatchesOption: which events can be filtered out before processed | |||
Watches(context.Context) (source.Source, handler.EventHandler, builder.WatchesOption) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the Interval() API here now as well, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm keeping the Interval() here as the allocation an reallocation controllers still rely on it.
@@ -91,6 +91,7 @@ rules: | |||
resources: | |||
- nodes | |||
- pods | |||
- nodes/status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be patching node status.
pkg/controllers/controller.go
Outdated
if _, err := c.Controller.Reconcile(ctx, resource); err != nil { | ||
// 4. Reconcile | ||
result, err := c.Controller.Reconcile(ctx, resource) | ||
if err != nil { | ||
zap.S().Errorf("Controller failed to reconcile kind %s, %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oneline!
Issue #, if available:
Description of changes:
listNodes
for every reconcile.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.