kube-aws: Drain nodes before shutting them down #465
Conversation
Force-pushed from a0117e5 to df90a06
@mumoshu this looks awesome, very exciting! I don't know that we need the \cc @aaronlevy @cgag
--kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
  drain $$(hostname) \
  --force'
Does this also need --ignore-daemonsets=true? The man page implies to me that you'd need both if you had a daemonset and an unmanaged pod, even though the name "force" makes me think "force" would be enough.
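For reference, a minimal sketch of the drain invocation with both flags, in line with the snippet above; the docker arguments (network mode, volume mount) are assumptions for illustration, not the PR's literal unit definition:

```sh
# Hypothetical sketch: cordon and drain the local node before shutdown.
# --force evicts unmanaged (bare) pods; --ignore-daemonsets skips DaemonSet-managed
# pods, which would otherwise make `kubectl drain` fail.
docker run --rm \
  --net=host \
  -v /etc/kubernetes:/etc/kubernetes:ro \
  mumoshu/kubectl:v1.2.2 \
  kubectl --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
    drain "$(hostname)" \
    --force \
    --ignore-daemonsets
```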
@mumoshu can you provide a little more detail regarding:
enable: true
content: |
  [Unit]
  Description=drain this k8s node to give running pods time to gracefully shut down before stopping kubelet
I'm not seeing how this unit keeps the kubelet running until drained.
Looks like systemd shuts down services in the reverse of the boot order [1], so After=kubelet.service should mean this unit is shut down before the kubelet.
[1] https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Before=
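To make the ordering concrete, here is a hedged sketch of how such a drainer unit could be wired up in cloud-config. The unit name, ExecStop wiring, and docker arguments are assumptions for illustration, not the PR's exact contents:

```yaml
- name: kube-node-drainer.service
  enable: true
  content: |
    [Unit]
    Description=drain this k8s node to give running pods time to gracefully shut down before stopping kubelet
    # Because shutdown ordering is the reverse of startup ordering,
    # After=kubelet.service means this unit is stopped (running ExecStop below)
    # before the kubelet itself is stopped.
    After=kubelet.service
    Requires=kubelet.service

    [Service]
    Type=oneshot
    RemainAfterExit=true
    # Nothing to do at start; the real work happens in ExecStop during shutdown.
    ExecStart=/usr/bin/true
    # Assumed drain invocation; %H expands to the node's hostname.
    ExecStop=/usr/bin/docker run --rm --net=host \
      -v /etc/kubernetes:/etc/kubernetes:ro \
      mumoshu/kubectl:v1.2.2 kubectl \
      --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
      drain %H --force --ignore-daemonsets

    [Install]
    WantedBy=multi-user.target
```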
ah, thanks @cgag
To recap in-person discussions between @aaronlevy, @philips and myself: Summary: we will leave this node-draining behavior gated on experimental for now, as @mumoshu has it in this PR, defaulting to false. Node draining is certainly a good thing to do if you're taking a node out of commission forever. If you're just rebooting the machine, it's not clear whether draining it is desirable. Factors in this decision include the number of pods on that node, how long it takes to re-schedule them, and whether or not those pods have any affinity for resources on that host. (An example discussed would be a local cache implemented via a host mount.) In the future:
@mumoshu: could we get a rebase on this PR please ;)
Force-pushed from 9eb9197 to eb27164
@colhom @aaronlevy @cgag @philips Thanks for your review 🙇 Regarding the experimental bit, I agree with leaving it as is. The context is that I believe my tests around this functionality don't cover edge cases I can't even imagine, hence experimental. As I'm planning to deploy CoreOS+K8s onto my production environment, I wish I could end up with a battle-tested solution that we could mark non-experimental with some confidence. Googling the whole internet and consulting myself hasn't revealed something like that for me though 😃 The example of a local cache implemented via a host mount that you've mentioned is very insightful for me. I personally think of it as one of those (not even edge?) cases. Also, the integration based on taints-and-tolerations and more fine-grained events seems very interesting/promising to me 👍
@cgag Thanks for the good catch #465 (comment)! I have confirmed that it was actually failing when draining a node running DaemonSet-managed pods and fixed it in ae43f64
@aaronlevy Sure. I had hoped that a kubelet could stop gracefully when, during the shutdown process, CoreOS tries to stop it (by sending SIGTERM to it via systemd via docker). So I tested it. My test showed that the kubelet doesn't gracefully stop while CoreOS is shutting down. I have confirmed it by repeating
nodeDrainer: NodeDrainer{
	Enabled:        false,
	KubectlImage:   "mumoshu/kubectl:v1.2.2",
	KubectlCommand: "kubectl",
When the default KubectlImage and KubectlCommand values are changed once the /hyperkube kubectl issue is fixed, we'll have to edit tests like this to reflect the changes to the defaults.

I would like to see constants defaultHyperkubeImage and defaultKubectlCommand:

- defined in config.go
- applied as defaults to the Cluster struct here
- used in the config tests like this, which infer the default value
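A hedged sketch of what that could look like in config.go; the constant and function names are illustrative, and Cluster/NodeDrainer are assumed to be the structs already defined in the package:

```go
package config

// Hypothetical default constants; the reviewer suggested names like
// defaultHyperkubeImage, spelled defaultKubectlImage here to match the
// current KubectlImage field shown in the snippet above.
const (
	defaultKubectlImage   = "mumoshu/kubectl:v1.2.2"
	defaultKubectlCommand = "kubectl"
)

// newDefaultCluster applies the defaults to the existing Cluster struct,
// so tests can reference the same constants instead of hard-coded strings.
func newDefaultCluster() *Cluster {
	return &Cluster{
		NodeDrainer: NodeDrainer{
			Enabled:        false,
			KubectlImage:   defaultKubectlImage,
			KubectlCommand: defaultKubectlCommand,
		},
	}
}
```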
Sounds nice! Just added 3171bd6 for it 👍
@mumoshu I talked with @aaronlevy and we have decided to take a stab at the upstream hyperkube issue (kubernetes/kubernetes#24088) that's blocking this PR. Give us a day or two to look into this.
There is also this issue, which takes a different approach for CoreOS auto-updates. They are somewhat orthogonal but highly related: coreos/bugs#1274
kubernetes/kubernetes#25112 has been merged. @aaronlevy could we cherry-pick this into the next hyperkube image?
@mumoshu what you describe is more or less the expected behavior. The actual window for marking the node as unhealthy / pod eviction can be set via flags on the controller-manager (e.g.

@colhom are you sure that's the right issue?
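For context, a sketch of the kube-controller-manager flags of that era which control this window; the values shown are illustrative assumptions, not recommendations:

```sh
# Hypothetical sketch: the window before an unreachable node is marked NotReady
# and its pods are evicted is governed by controller-manager flags such as:
#   --node-monitor-grace-period (default roughly 40s at the time)
#   --pod-eviction-timeout      (default roughly 5m at the time)
# Illustrative invocation (other required flags omitted):
kube-controller-manager \
  --node-monitor-grace-period=40s \
  --pod-eviction-timeout=5m0s
```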
Oops, I meant kubernetes/kubernetes#25512. @mumoshu we're going to wait until v1.3 (with that PR in it) is released before going forward with this PR.
@colhom I'm happy to work on it now as you've merged the 1.3-updates branch into master.
@aaronlevy Thanks for confirming! Yes, it might be expected behavior. As it's been quite a long time since we've talked, let me recap for clarification. I believe a node shut down by (1) CoreOS's auto-update process, (2) an ASG scale-in, (3) manual maintenance, or (4) an EC2 failure, etc., will result in some down-time/decreased functional pods until Kubernetes marks the node not-ready/unschedulable. The time to not-ready/unschedulable seems to be configurable via the flags you've mentioned. What I wanted to do here is mark the node before Kubernetes does, to completely avoid errors/delays in this case, so that Kubernetes can react to the decreased pods immediately. Also, even with the changes in this PR, in some cases I believe you'll need additional steps to take a node out of service without returning errors to customers, e.g. when the node is running pods or services that expose your web service through node/host ports for external load balancing. I guess the whole procedure would be as follows:
It is not addressed in this PR, and I don't even know how to do the whole thing cleanly enough.
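Not part of this PR, but to illustrate the kind of additional steps meant above, a hypothetical decommissioning sequence could look like the sketch below. The node name, instance ID, and ELB name are placeholders, and the ELB step only applies if you route external traffic to node/host ports through a classic ELB yourself:

```sh
#!/bin/sh
# Hypothetical decommissioning sketch, NOT what kube-aws does.
NODE="ip-10-0-0-202.ap-northeast-1.compute.internal"   # placeholder node name
INSTANCE_ID="i-0123456789abcdef0"                      # placeholder instance id
ELB_NAME="my-service-elb"                              # placeholder ELB name

# 1. Stop sending external traffic to the node's host/node ports.
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name "$ELB_NAME" \
  --instances "$INSTANCE_ID"

# 2. Cordon and drain the node so its pods are rescheduled elsewhere.
kubectl drain "$NODE" --force --ignore-daemonsets

# 3. Shut the instance down (or let the ASG terminate it).
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```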
Does a rolling replacement update on the controller ASG, followed by the workers. Punts on upgrading the etcd cluster - simply makes sure resource definitions don't change after create.
@colhom Definitely yes! Can I just rebase this branch onto the latest
@colhom sure 👍
…time to gracefully stop.

This change basically achieves it by running `docker run IMAGE kubectl drain THE_NODE --force` on the to-be-shut-down node before the kubelet gets SIGTERM in CoreOS' shutdown process.

Without this change, kubelets getting SIGTERM without a prior `drain` result in unfunctional (actually not schedulable) pods having status `Ready`.

With this change, when an ASG's desired cap is decreased, a node's status changes over time as follows:

* On desired cap change: STATUS=Ready
* On shut-down started: STATUS=Ready,SchedulingDisabled (<- pods are stopped and the status is changed by `kubectl drain`)
* On shut-down finished: STATUS=NotReady,SchedulingDisabled (<- it's `NotReady`, but it won't result in down-time because we already stopped both further scheduling and the pods)
* After a minute: the node disappears from the output of `kubectl get nodes`

Note that:

* This applies to manual shutdowns (via running `sudo systemctl shutdown`, for example) and automated shutdowns (triggered by AWS AutoScaling when nodes get rotated out of a group)
* We currently depend on the community docker image `mumoshu/kubectl` because `kubectl` included in the official `coreos/hyperkube` image doesn't work due to the issue kubernetes/kubernetes#24088 in Kubernetes. Once the issue is fixed and the CoreOS team has published a new hyperkube image with the updated Kubernetes, we can remove that dependency.
* The author considers this an experimental feature, so you shouldn't expect the configuration API regarding it to be stable. It may change in the future.

Re-format code introduced in the previous commit with gofmt to conform to the rules and make the build pass
…Set-managed pods

Before this change, you could reproduce the failure by running `sudo systemctl stop kube-node-drainer`, which showed an error message indicating this issue like:

```
May 09 05:04:08 ip-10-0-0-234.ap-northeast-1.compute.internal sh[15376]: error: DaemonSet-managed pods: cassandra-e3tdb (use --ignore-daemonsets to ignore)
May 09 05:04:08 ip-10-0-0-234.ap-northeast-1.compute.internal systemd[1]: kube-node-drainer.service: Control process exited, code=exited status=1
```

After this change, you can drain a node even if it is running DaemonSet-managed pods:

```
May 09 10:41:08 ip-10-0-0-202.ap-northeast-1.compute.internal sh[7671]: node "ip-10-0-0-202.ap-northeast-1.compute.internal" cordoned
May 09 10:41:08 ip-10-0-0-202.ap-northeast-1.compute.internal sh[7671]: WARNING: Skipping DaemonSet-managed pods: cassandra-jzfqo
May 09 10:41:08 ip-10-0-0-202.ap-northeast-1.compute.internal sh[7671]: node "ip-10-0-0-202.ap-northeast-1.compute.internal" drained
```
Force-pushed from 3171bd6 to d2a9085
@colhom Rebased and fixed a failing test!
Does this really depend on #608? I think this would be a great feature to have with spot instances, which can go away due to price fluctuations, but prices usually don't fluctuate in all AZs at once, so there are usually healthy nodes in other AZs to quickly take over the load.
kube-aws development has moved to its own top-level repository. If this is still something you would like to be merged, please re-open this PR under the new repository at: https://github.com/coreos/kube-aws
Drain nodes before shutting them down to give running pods time to gracefully stop.

This change basically achieves it by running `docker run IMAGE kubectl drain THE_NODE --force` on the to-be-shut-down node before the kubelet gets SIGTERM in CoreOS' shutdown process.

Depends #608
Refs #340

Without this change, kubelets getting SIGTERM without a prior `drain` result in unfunctional (actually not schedulable) pods having status `Ready`.

With this change, when an ASG's desired cap is decreased, a node's status changes over time as follows:

On desired cap change:
=> STATUS=Ready

On shut-down started:
=> STATUS=Ready,SchedulingDisabled (<- pods are stopped and the status is changed by `kubectl drain`)

On shut-down finished:
=> STATUS=NotReady,SchedulingDisabled (<- it's `NotReady`, but it won't result in down-time because we already stopped both further scheduling and the pods)

After a minute:
=> The node disappears from the output of `kubectl get nodes`

Note that:

* This applies to manual shutdowns (via running `sudo systemctl shutdown`, for example) and automated shutdowns (triggered by AWS AutoScaling when nodes get rotated out of a group)
* We currently depend on the community docker image `mumoshu/kubectl` because `kubectl` included in the official `coreos/hyperkube` image doesn't work due to the issue "hyperkube + kubectl - unknown flag" (kubernetes/kubernetes#24088) in Kubernetes. Once the issue is fixed and the CoreOS team has published a new hyperkube image with the updated Kubernetes, we can remove that dependency.