Enabling the ability to drain daemon set pods, with priority #8619
Conversation
Hi @mmerrill3. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from ea5d85a to 552f94d
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mmerrill3. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 552f94d to 094b9bd
…f deletion Signed-off-by: mmerrill3 <[email protected]>
/ok-to-test
@mmerrill3: The following tests failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Even without the problem that the code in this PR is likely to evict a daemonset pod before a workload pod depending on it, I have serious reservations about the approach that I've mentioned in the linked issue. If there are daemonsets with local state that it is desirable to flush, this should be addressed for all cases of node deletion, not just rolling update. A better approach would be to address graceful termination, #7119.
@@ -141,6 +151,9 @@ type RollingUpdateOptions struct {
// InstanceGroupRoles is the list of roles we should rolling-update
// if not specified, all instance groups will be updated
InstanceGroupRoles []string

// IgnoreDaemonsets when a node is drained, daemon set pods will also be drained if false. Default is true.
IgnoreDaemonsets bool
I dislike negative options. Better would be to name it EvictDaemonsets. Is using the same flag name/sense as kubectl drain important? They're not really ignoring daemonsets for the rolling update.
//for backwards compatibility, the default priority is highest. If a daemon-set is found
//and ignore-daemonsets is set to false, the priority will be the lowest.
//This sort will return the order of pods to delete, with the highest priority first.
sort.SliceStable(pods, func(i, j int) bool {
evictPods() uses goroutines to evict all pods in parallel, so sorting the list has little effect. DaemonSet pods will still be evicted before the workloads that depend on them.
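To make the concern concrete, here is a minimal, self-contained Go sketch of parallel eviction; it is not the actual kops evictPods code, and the fakePod type and pod names are invented for illustration. Once each eviction runs in its own goroutine, the order of the sorted slice only controls launch order, not completion order.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

type fakePod struct {
	name string
}

// evictAllInParallel launches one goroutine per pod, so the slice order gives
// no guarantee about which eviction finishes first.
func evictAllInParallel(pods []fakePod) {
	var wg sync.WaitGroup
	for _, p := range pods {
		wg.Add(1)
		go func(p fakePod) {
			defer wg.Done()
			// Stand-in for the eviction API call, which takes a variable
			// amount of time per pod.
			time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
			fmt.Println("evicted", p.name)
		}(p)
	}
	wg.Wait()
}

func main() {
	// Sorted so the daemonset pod comes last, yet it can still be evicted
	// before the workloads that depend on it.
	evictAllInParallel([]fakePod{{"workload-a"}, {"workload-b"}, {"daemonset-x"}})
}
```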
return list
}

func TestRollingUpdateDaemonSetMixedPods(t *testing.T) { |
This is a test of the draining code, not of rolling update. This only tests the ordering of the list returned by GetPodsForDeletion. It does not test anything about how that list is handled by DeleteOrEvictPods, which is why it failed to catch that the sorting was ineffective.
}
status.priority = priority
The logic here is a little convoluted. It takes the podDeleteStatus of the last-called filter, except it replaces the priority field with the lowest priority. Except that if the podDeleteStatus indicates non-deletion, the priority of that last call is never considered; it even uses podDeletionPriorityHighest if the first filter indicated non-deletion. I don't think the code should be modifying the returned podDeleteStatus structs.
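One way to avoid mutating the filters' results, sketched here under assumptions: derive the eviction priority from the pod itself rather than rewriting podDeleteStatus. The constants and the deletionPriority helper are hypothetical (podDeletionPriorityDaemonset is the name proposed later in this review); this is not kops or kubectl code.

```go
package drain

import corev1 "k8s.io/api/core/v1"

// Hypothetical priority constants; podDeletionPriorityHighest mirrors the name
// used in the PR, podDeletionPriorityDaemonset is the name proposed in review.
const (
	podDeletionPriorityDaemonset = iota // sorted last, evicted after workloads
	podDeletionPriorityHighest          // sorted first
)

// deletionPriority computes the ordering from the pod's owner references
// instead of overwriting the podDeleteStatus returned by earlier filters.
func deletionPriority(pod corev1.Pod) int {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return podDeletionPriorityDaemonset
		}
	}
	return podDeletionPriorityHighest
}
```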
}

- return makePodDeleteStatusWithWarning(false, daemonSetWarning)
+ return makePodDeleteStatusWithWarning(false, daemonSetWarning, podDeletionPriorityLowest)
Why should this and the previous be Lowest? Better would be podDeletionPriorityDaemonset.
groups := make(map[string]*cloudinstances.CloudInstanceGroup)
makeGroup(groups, c.K8sClient, cloud, "node-1", kopsapi.InstanceGroupRoleNode, 1, 1)
err := c.RollingUpdate(groups, cluster, &kopsapi.InstanceGroupList{})
assert.NoError(t, err, "rolling update")
This isn't testing a whole lot. The test would pass even if the IgnoreDaemonsets field was unused.
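For context, a rough sketch of how such a test could assert more, assuming the rolling update is driven against a fake clientset; the countEvictions helper is invented for illustration and is not one of the existing test helpers. Counting eviction requests would make the test fail if the IgnoreDaemonsets field were silently ignored.

```go
package rollingupdate_test

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
)

// countEvictions registers a reactor on the fake clientset so a test can
// assert that eviction requests were actually issued, instead of only
// checking that RollingUpdate returned no error.
func countEvictions(client *fake.Clientset) *int {
	count := 0
	client.PrependReactor("create", "pods", func(action k8stesting.Action) (bool, runtime.Object, error) {
		// Pod evictions arrive as creates on the pods/eviction subresource.
		if action.GetSubresource() == "eviction" {
			count++
		}
		return false, nil, nil
	})
	return &count
}
```

A test that sets IgnoreDaemonsets to false could then assert that the count includes the daemonset pods it created on the node.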
@johngmyers thanks for all the feedback and your insight. Before I go forward, do we want to address all the use cases where pods can get drained from a node, or just the rolling update feature of kops? To confirm, I see three use cases. One is a user initiated rolling update, via kops. The next use case is node deletion via an ASG, or even via the cloud providers' console or a CLI. Finally, there's node deletion for any other reason. Yes, this PR only addresses the first use case, which is the user initiated rolling update command. To address #7119, I'm not sure if the kubelet sends a delete node request to the API server in this case. If so, maybe we could leverage that. That's nice since the rolling-update also sends a delete node signal to the API server, and we can come up with one solution, based around that delete node API request.
No, kubelet does not send a delete node API request when the instances are terminated via the AWS console.
I feel like we are at a design impasse here, since one use case is a CLI use case where we do a rolling update, and kops is in the flow. The other use case is an ASG or another tool terminating an instance, and kops is not in the flow. The solutions in #7119, in particular the last one where a systemd service is registered at shutdown, are not initiated by kops. The rolling update can be controlled by kops, so that's what this PR is for. I don't see a way, without introducing another component (operator, systemd unit, daemon set), of handling the "cloud" termination gracefully. Maybe that's the hang up. During a rolling update, I care when the delete request is made to the API server. In the other cases, the delete node API call doesn't get made, and it's not relevant.
According to the comment in the code, which was added in e982087, rolling update deletes the node because the cloud provider could create a replacement with the same name and that could get the same node object with the cordoning in place. One possibility would be to move the node deletion to the systemd stop script, assuming we would have enough confidence that it would reliably run.

It seems odd to me that kubelet would reuse a node resource like this. One would think that the deletion-through-console case would also suffer from such stale data in reused node resources. Perhaps we could have kops-controller delete such orphaned nodes when the replacement instance goes through bootstrapping?

@justinsb what would you say to moving the node deletion that is in rolling update to someplace later? The assertion is that Weave uses the node deletion event as a signal to deallocate that node's IP address assignments, which is a problem if that node's Weave pod is still running and communicating with other Weave pods.
@mmerrill3 to answer your first question, I would prefer we spend some time exploring whether we can find a systematic solution. If we can, then work done on rolling update would likely be wasted. If there were to be more work on rolling update, it would probably be to add a second drain call to evict the daemonsets. I'm not sure this would need to be an option.
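To make the "second drain call" idea concrete, here is a rough sketch under assumptions; drainFunc and drainNodeInTwoPasses are invented names, and this is not existing kops code.

```go
package rollingupdate

import (
	"context"

	"k8s.io/client-go/kubernetes"
)

// drainFunc stands in for whatever single-pass drain helper is used; the
// boolean selects whether daemonset-managed pods are included in the pass.
type drainFunc func(ctx context.Context, client kubernetes.Interface, node string, includeDaemonSets bool) error

// drainNodeInTwoPasses evicts ordinary workloads first, then makes a second
// pass for the daemonset pods once nothing on the node depends on them.
func drainNodeInTwoPasses(ctx context.Context, client kubernetes.Interface, node string, drain drainFunc) error {
	if err := drain(ctx, client, node, false); err != nil {
		return err
	}
	return drain(ctx, client, node, true)
}
```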
Thanks @johngmyers, I like your idea that the node deletion doesn't need to occur during rolling update, but rather that the node deletion action can move to a workflow that handles graceful node termination, as discussed in #7119. That would allow us to do graceful terminations the same way for both use cases.
This PR is going to be closed in favor of a different approach.
@mmerrill3: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Closing to come at this from a different angle.
This PR adds the ability to drain daemon set pods during a rolling update. Daemon set pods will be deleted last. This is to address issue #8391.
Signed-off-by: mmerrill3 [email protected]