Enabling the ability to drain daemon set pods, with priority #8619
Conversation
Hi @mmerrill3. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from ea5d85a to 552f94d
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mmerrill3. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 552f94d to 094b9bd
…f deletion Signed-off-by: mmerrill3 <[email protected]>
/ok-to-test
@mmerrill3: The following tests failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Even without the problem that the code in this PR is likely to evict a daemonset pod before a workload pod depending on it, I have serious reservations about the approach that I've mentioned in the linked issue. If there are daemonsets with local state that it is desirable to flush, this should be addressed for all cases of node deletion, not just rolling update. A better approach would be to address graceful termination, #7119.
@@ -141,6 +151,9 @@ type RollingUpdateOptions struct {
// InstanceGroupRoles is the list of roles we should rolling-update
// if not specified, all instance groups will be updated
InstanceGroupRoles []string

// IgnoreDaemonsets when a node is drained, daemon set pods will also be drained if false. Default is true.
IgnoreDaemonsets bool
I dislike negative options. Better would be to name it EvictDaemonsets. Is using the same flag name/sense as kubectl drain important? They're not really ignoring daemonsets for the rolling update.
//for backwards compatibility, the default priority is highest. If a daemon-set is found
//and ignore-daemonsets is set to false, the priority will be the lowest.
//This sort will return the order of pods to delete, with the highest priority first.
sort.SliceStable(pods, func(i, j int) bool {
evictPods() uses goroutines to evict all pods in parallel, so sorting the list has little effect. DaemonSet pods will still be evicted before the workloads that depend on them.
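To make the concern concrete, here is a minimal, self-contained Go sketch of parallel eviction; it is not the actual kops evictPods code, and the fakePod type and pod names are invented for illustration. Once each eviction runs in its own goroutine, the order of the sorted slice only controls launch order, not completion order.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

type fakePod struct {
	name string
}

// evictAllInParallel launches one goroutine per pod, so the slice order gives
// no guarantee about which eviction finishes first.
func evictAllInParallel(pods []fakePod) {
	var wg sync.WaitGroup
	for _, p := range pods {
		wg.Add(1)
		go func(p fakePod) {
			defer wg.Done()
			// Stand-in for the eviction API call, which takes a variable
			// amount of time per pod.
			time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
			fmt.Println("evicted", p.name)
		}(p)
	}
	wg.Wait()
}

func main() {
	// Sorted so the daemonset pod comes last, yet it can still be evicted
	// before the workloads that depend on it.
	evictAllInParallel([]fakePod{{"workload-a"}, {"workload-b"}, {"daemonset-x"}})
}
```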
return list
}

func TestRollingUpdateDaemonSetMixedPods(t *testing.T) { |
This is a test of the draining code, not of rolling update. This only tests the ordering of the list returned by GetPodsForDeletion. It does not test anything about how that list is handled by DeleteOrEvictPods, which is why it failed to catch that the sorting was ineffective.
}
status.priority = priority
The logic here is a little convoluted. It takes the podDeleteStatus of the last-called filter, except it replaces the priority field with the lowest priority. Except that if the podDeleteStatus indicates non-deletion, the priority of that last call is never considered; it even uses podDeletionPriorityHighest if the first filter indicated non-deletion. I don't think the code should be modifying the returned podDeleteStatus structs.
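One way to avoid mutating the filters' results, sketched here under assumptions: derive the eviction priority from the pod itself rather than rewriting podDeleteStatus. The constants and the deletionPriority helper are hypothetical (podDeletionPriorityDaemonset is the name proposed later in this review); this is not kops or kubectl code.

```go
package drain

import corev1 "k8s.io/api/core/v1"

// Hypothetical priority constants; podDeletionPriorityHighest mirrors the name
// used in the PR, podDeletionPriorityDaemonset is the name proposed in review.
const (
	podDeletionPriorityDaemonset = iota // sorted last, evicted after workloads
	podDeletionPriorityHighest          // sorted first
)

// deletionPriority computes the ordering from the pod's owner references
// instead of overwriting the podDeleteStatus returned by earlier filters.
func deletionPriority(pod corev1.Pod) int {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return podDeletionPriorityDaemonset
		}
	}
	return podDeletionPriorityHighest
}
```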
}

- return makePodDeleteStatusWithWarning(false, daemonSetWarning)
+ return makePodDeleteStatusWithWarning(false, daemonSetWarning, podDeletionPriorityLowest)
Why should this and the previous be Lowest? Better would be podDeletionPriorityDaemonset.
groups := make(map[string]*cloudinstances.CloudInstanceGroup)
makeGroup(groups, c.K8sClient, cloud, "node-1", kopsapi.InstanceGroupRoleNode, 1, 1)
err := c.RollingUpdate(groups, cluster, &kopsapi.InstanceGroupList{})
assert.NoError(t, err, "rolling update")
This isn't testing a whole lot. The test would pass even if the IgnoreDaemonsets field was unused.
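For context, a rough sketch of how such a test could assert more, assuming the rolling update is driven against a fake clientset; the countEvictions helper is invented for illustration and is not one of the existing test helpers. Counting eviction requests would make the test fail if the IgnoreDaemonsets field were silently ignored.

```go
package rollingupdate_test

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
)

// countEvictions registers a reactor on the fake clientset so a test can
// assert that eviction requests were actually issued, instead of only
// checking that RollingUpdate returned no error.
func countEvictions(client *fake.Clientset) *int {
	count := 0
	client.PrependReactor("create", "pods", func(action k8stesting.Action) (bool, runtime.Object, error) {
		// Pod evictions arrive as creates on the pods/eviction subresource.
		if action.GetSubresource() == "eviction" {
			count++
		}
		return false, nil, nil
	})
	return &count
}
```

A test that sets IgnoreDaemonsets to false could then assert that the count includes the daemonset pods it created on the node.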
@johngmyers thanks for all the feedback and your insight. Before I go forward, do we want to address all the use cases where pods can get drained from a node, or just the rolling update feature of kops? To confirm, I see three use cases. One is a user initiated rolling update, via kops. The next use case is node deletion via an ASG, or even via the cloud providers' console or a CLI. Finally, there's node deletion for any other reason. Yes, this PR only addresses the first use case, which is the user initiated rolling update command. To address #7119, I'm not sure if the kubelet sends a delete node request to the API server in this case. If so, maybe we could leverage that. That's nice since the rolling-update also sends a delete node signal to the API server, and we can come up with one solution, based around that delete node API request.
No, kubelet does not send a delete node API request when the instances are terminated via the AWS console.
I feel like we are at a design impasse here, since one use case is a CLI use case where we do a rolling update, and kops is in the flow. The other use case is an ASG or another tool terminating an instance, and kops is not in the flow. The solutions in #7119, in particular the last one where a systemd service is registered at shutdown, are not initiated by kops. The rolling update can be controlled by kops, so that's what this PR is for. I don't see a way, without introducing another component (operator, systemd unit, daemon set), of handling the "cloud" termination gracefully. Maybe that's the hang up. During a rolling update, I care when the delete request is made to the API server. In the other cases, the delete node API call doesn't get made, and it's not relevant.
According to the comment in the code, which was added in e982087, rolling update deletes the node because the cloud provider could create a replacement with the same name and that could get the same node object with the cordoning in place. One possibility would be to move the node deletion to the systemd stop script, assuming we would have enough confidence that it would reliably run.

It seems odd to me that kubelet would reuse a node resource like this. One would think that the deletion-through-console case would also suffer from such stale data in reused node resources. Perhaps we could have kops-controller delete such orphaned nodes when the replacement instance goes through bootstrapping?

@justinsb what would you say to moving the node deletion that is in rolling update to someplace later? The assertion is that Weave uses the node deletion event as a signal to deallocate that node's IP address assignments, which is a problem if that node's Weave pod is still running and communicating with other Weave pods.
@mmerrill3 to answer your first question, I would prefer we spend some time exploring whether we can find a systematic solution. If we can, then work done on rolling update would likely be wasted. If there were to be more work on rolling update, it would probably be to add a second drain call to evict the daemonsets. I'm not sure this would need to be an option.
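To make the "second drain call" idea concrete, here is a rough sketch under assumptions; drainFunc and drainNodeInTwoPasses are invented names, and this is not existing kops code.

```go
package rollingupdate

import (
	"context"

	"k8s.io/client-go/kubernetes"
)

// drainFunc stands in for whatever single-pass drain helper is used; the
// boolean selects whether daemonset-managed pods are included in the pass.
type drainFunc func(ctx context.Context, client kubernetes.Interface, node string, includeDaemonSets bool) error

// drainNodeInTwoPasses evicts ordinary workloads first, then makes a second
// pass for the daemonset pods once nothing on the node depends on them.
func drainNodeInTwoPasses(ctx context.Context, client kubernetes.Interface, node string, drain drainFunc) error {
	if err := drain(ctx, client, node, false); err != nil {
		return err
	}
	return drain(ctx, client, node, true)
}
```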
Thanks @johngmyers, I like your idea that the node deletion doesn't need to occur during rolling update, but rather that the node deletion action can move to a workflow that handles graceful node termination, as discussed in #7119. That would allow us to do graceful terminations the same way for both use cases.
This PR is going to be closed in favor of a different approach.
@mmerrill3: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Closing to come at this from a different angle.
This PR adds the ability to drain daemon set pods during a rolling update. Daemon set pods will be deleted last. This is to address issue #8391.
Signed-off-by: mmerrill3 [email protected]