
Enabling the ability to drain daemon set pods, with priority #8619

Closed
mmerrill3 wants to merge 1 commit

Conversation

mmerrill3

This PR adds the ability to drain daemon set pods during a rolling update. Daemon set pods will be deleted last. This addresses issue #8391.

Signed-off-by: mmerrill3 [email protected]

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 24, 2020
@k8s-ci-robot
Contributor

Hi @mmerrill3. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 24, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mmerrill3
To complete the pull request process, please assign zetaab
You can assign the PR to them by writing /assign @zetaab in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hakman
Member

hakman commented Feb 24, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 24, 2020
@k8s-ci-robot
Contributor

@mmerrill3: The following tests failed, say /retest to rerun all failed tests:

Test name                     Commit   Details  Rerun command
pull-kops-verify-bazel        094b9bd  link     /test pull-kops-verify-bazel
pull-kops-e2e-kubernetes-aws  094b9bd  link     /test pull-kops-e2e-kubernetes-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@johngmyers johngmyers left a comment
Member

Even setting aside the problem that the code in this PR is likely to evict a daemonset pod before a workload pod depending on it, I have serious reservations about the approach, which I've mentioned in the linked issue. If there are daemonsets with local state that it is desirable to flush, this should be addressed for all cases of node deletion, not just rolling update. A better approach would be to address graceful termination, #7119.

@@ -141,6 +151,9 @@ type RollingUpdateOptions struct {
// InstanceGroupRoles is the list of roles we should rolling-update
// if not specified, all instance groups will be updated
InstanceGroupRoles []string

// IgnoreDaemonsets when a node is drained, daemon set pods will also be drained if false. Default is true.
IgnoreDaemonsets bool
@johngmyers johngmyers Feb 25, 2020
Member

I dislike negative options. Better would be to name it EvictDaemonsets.

Is using the same flag name/sense as kubectl drain important? They're not really ignoring daemonsets for the rolling update.
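
For illustration, a positive-sense option might look like the sketch below (the field name follows the suggestion above, and the doc comment and package name are illustrative assumptions only):

```go
package rollingupdate

// RollingUpdateOptions is a hypothetical excerpt; the real struct has many
// more fields. A positive-sense flag avoids double negatives such as
// --ignore-daemonsets=false.
type RollingUpdateOptions struct {
	// EvictDaemonsets, when true, also drains DaemonSet-managed pods during a
	// rolling update. Defaulting to false preserves the current behavior.
	EvictDaemonsets bool
}
```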

//for backwards compatibility, the default priority is highest. If a daemon-set is found
//and ignore-daemonsets is set to false, the priority will be the lowest.
//This sort will return the order of pods to delete, with the highest priority first.
sort.SliceStable(pods, func(i, j int) bool {
Member

evictPods() uses goroutines to evict all pods in parallel, so sorting the list has little effect. DaemonSet pods will still be evicted before the workloads that depend on them.
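
To illustrate why the sort has little effect, and what an ordering-preserving alternative could look like, here is a minimal standalone sketch (the pod type and eviction callback are simplified stand-ins, not kops code):

```go
package main

import (
	"fmt"
	"sync"
)

// pod is a simplified stand-in for corev1.Pod.
type pod struct {
	name      string
	daemonSet bool
}

// evictAllParallel mirrors the fan-out pattern described above: one goroutine
// per pod, so the order of the input slice has no effect on eviction order.
func evictAllParallel(pods []pod, evict func(pod)) {
	var wg sync.WaitGroup
	for _, p := range pods {
		wg.Add(1)
		go func(p pod) {
			defer wg.Done()
			evict(p)
		}(p)
	}
	wg.Wait()
}

// evictInTwoPhases drains workload pods first and only then DaemonSet pods,
// which is the ordering the sort was trying (ineffectively) to achieve.
func evictInTwoPhases(pods []pod, evict func(pod)) {
	var workload, daemonSets []pod
	for _, p := range pods {
		if p.daemonSet {
			daemonSets = append(daemonSets, p)
		} else {
			workload = append(workload, p)
		}
	}
	evictAllParallel(workload, evict)   // phase 1: everything else
	evictAllParallel(daemonSets, evict) // phase 2: DaemonSet pods last
}

func main() {
	pods := []pod{{"weave-abc", true}, {"web-1", false}, {"web-2", false}}
	evictInTwoPhases(pods, func(p pod) { fmt.Println("evicting", p.name) })
}
```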

return list
}

func TestRollingUpdateDaemonSetMixedPods(t *testing.T) {
Member

This is a test of the draining code, not of rolling update.

This only tests the ordering of the list returned by GetPodsForDeletion. It does not test anything about how that list is handled by DeleteOrEvictPods, which is why it failed to catch that the sorting was ineffective.

}
status.priority = priority
Member

The logic here is a little convoluted. It takes the podDeleteStatus of the last-called filter but replaces the priority field with the lowest priority. However, if the podDeleteStatus indicates non-deletion, the priority of that last call is never considered, and it can even end up using podDeletionPriorityHighest when the first filter indicated non-deletion.

I don't think the code should be modifying the returned podDeleteStatus structs.

}

return makePodDeleteStatusWithWarning(false, daemonSetWarning)
return makePodDeleteStatusWithWarning(false, daemonSetWarning, podDeletionPriorityLowest)
Member

Why should this and the previous be Lowest? Better would be podDeletionPriorityDaemonset.
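
For example, a dedicated constant would make the intent explicit (a sketch only; apart from podDeletionPriorityDaemonset, suggested above, and podDeletionPriorityHighest, mentioned earlier, the names and values are assumptions, not actual kops identifiers):

```go
package drain

// Hypothetical deletion priorities; pods with a higher priority are drained
// earlier, so DaemonSet pods go last.
const (
	podDeletionPriorityDaemonset = iota
	podDeletionPriorityDefault
	podDeletionPriorityHighest
)
```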

groups := make(map[string]*cloudinstances.CloudInstanceGroup)
makeGroup(groups, c.K8sClient, cloud, "node-1", kopsapi.InstanceGroupRoleNode, 1, 1)
err := c.RollingUpdate(groups, cluster, &kopsapi.InstanceGroupList{})
assert.NoError(t, err, "rolling update")
Member

This isn't testing a whole lot. The test would pass even if the IgnoreDaemonsets field were unused.

@johngmyers
Member

Run hack/update-bazel.sh and commit the results in order to pass the pull-kops-verify-bazel and Travis checks.

@mmerrill3
Author

@johngmyers thanks for all the feedback and your insight. Before I go forward, do we want to address all the use cases where pods can get drained from a node, or just the rolling update feature of kops? To confirm, I see three use cases. One is a user-initiated rolling update via kops. The next is node deletion via an ASG, or even via the cloud provider's console or a CLI. Finally, there's node deletion for any other reason.

Yes, this PR only addresses the first use case, which is the user-initiated rolling update command.

To address #7119, I'm not sure whether the kubelet sends a delete node request to the API server in this case. If so, maybe we could leverage that. That would be nice, since the rolling update also sends a delete node request to the API server, and we could come up with one solution based around that delete node API request.

@mmerrill3
Author

No, kubelet does not send a delete node API request when instances are terminated via the AWS console.

@mmerrill3
Author

I feel like we are at a design impasse here, since one use case is a CLI use case where we do a rolling update and kops is in the flow. The other use case is an ASG or another tool terminating an instance, and kops is not in the flow. The solutions in #7119, in particular the last one where a systemd service is registered at shutdown, are not initiated by kops.

The rolling update can be controlled by kops, so that's what this PR is for. I don't see a way of handling the "cloud" termination gracefully without introducing another component (operator, systemd unit, daemon set).
Moreover, the solutions for #7119 won't help when we want to make sure that all pods are removed before we delete the node through the k8s API server during a rolling update. For instance, if I set up a node with a systemd stop script to gracefully remove all the pods upon node termination, it will run too late: the node will already have been deleted from the API server. My goal is to gracefully remove all pods before the call is made to the API server to delete the node.

Maybe that's the hang-up. During a rolling update, I care when the delete request is made to the API server. In the other cases, the delete node API call doesn't get made, so it's not relevant.

@johngmyers
Member

According to the comment in the code, which was added in e982087, rolling update deletes the node because the cloud provider could create a replacement with the same name and that could get the same node object with the cordoning in place.

One possibility would be to move the node deletion to the systemd stop script, assuming we would have enough confidence that it would reliably run.

It seems odd to me that kubelet would reuse a node resource like this. One would think that the deletion-through-console case would also suffer from such stale data in reused node resources. Perhaps we could have kops-controller delete such orphaned nodes when the replacement instance goes through bootstrapping?

@justinsb what would you say to moving the node deletion that is in rolling update to someplace later? The assertion is that Weave uses the node deletion event as a signal to deallocate that node's IP address assignments, which is a problem if that node's Weave pod is still running and communicating with other Weave pods.

@johngmyers
Member

@mmerrill3 to answer your first question, I would prefer we spend some time exploring whether we can find a systematic solution. If we can, then work done on rolling update would likely be wasted.

If there were to be more work on rolling update, it would probably be to add a second drain call to evict the daemonsets. I'm not sure this would need to be an option.
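
Sketching that idea: after the normal drain (which skips DaemonSet pods), a second pass could remove the DaemonSet-managed pods still left on the node. A minimal sketch using client-go; the helper name, package name, and the use of plain deletes rather than the eviction API are assumptions for illustration, not kops code:

```go
package instancegroups

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictDaemonSetPods removes the DaemonSet-managed pods still running on a
// node after the normal drain pass, so they get a chance to shut down before
// the node object is deleted.
func evictDaemonSetPods(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		for _, owner := range p.OwnerReferences {
			if owner.Kind != "DaemonSet" {
				continue
			}
			fmt.Printf("deleting DaemonSet pod %s/%s\n", p.Namespace, p.Name)
			if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
			break
		}
	}
	return nil
}
```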

@mmerrill3
Author

Thanks @johngmyers, I like your idea that the node deletion doesn't need to occur during rolling update, and can instead move to a workflow that handles graceful node termination, as discussed in #7119. That would allow us to do graceful terminations the same way for both use cases.
Nodeup could add a systemd "shutdown" unit to be invoked when the instance is being shut down. We can make that configurable, enabled by default, as a general kops configuration option. That shutdown systemd service needs to be able to send API commands to the API server, so we would maybe need to do something like protokube. As opposed to being run to set up volumes, DNS, etc., a new "postkube" container would run after we are done with the node as a member of the cluster, and would allow us to delete the node, remove any remaining pods if need be, gracefully close mounts, or anything else really.
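
As a rough illustration of what such a shutdown-time component could do once it can reach the API server, here is a minimal sketch; the NODE_NAME/KUBECONFIG wiring is an assumption, and a real version would drain any remaining pods and close mounts before this final step:

```go
package main

import (
	"context"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// A sketch of the "postkube" idea: on instance shutdown, remove this node from
// the cluster so components like Weave see the deletion only after the node's
// pods are gone.
func main() {
	nodeName := os.Getenv("NODE_NAME") // assumption: injected by the shutdown unit

	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatalf("building kubeconfig: %v", err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}

	// In the real flow we would remove remaining pods and clean up first;
	// here we only show the final node deletion.
	if err := client.CoreV1().Nodes().Delete(context.TODO(), nodeName, metav1.DeleteOptions{}); err != nil {
		log.Fatalf("deleting node %s: %v", nodeName, err)
	}
	log.Printf("node %s deleted from the API server", nodeName)
}
```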

@mmerrill3
Author

This PR is going to be closed in favor of a different approach.

@k8s-ci-robot
Contributor

@mmerrill3: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 4, 2020
@mmerrill3
Author

Closing to come at this from a different angle.

@mmerrill3 mmerrill3 closed this Mar 5, 2020