
Add reserve compute resources kubelet flags #2982

Merged · 3 commits merged into kubernetes:master on Jul 18, 2017

Conversation

itskingori
Member

Adds the kubelet flags needed to configure the compute resources reserved for system daemons.

Closes #1869.

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Jul 17, 2017
@k8s-ci-robot
Contributor

Hi @itskingori. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Jul 17, 2017
@blakebarnett left a comment

LGTM

@chrislovecnm
Contributor

PR looks great!

@itskingori did we want to do another PR to add defaults for this to kops? What is convenient for you? Once testing is done I can take another look, and probably give you a lgtm. Let me know if you wanted to add to this PR or add another.

Thanks again

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test label on Jul 17, 2017
@justinsb
Member

LGTM also. I didn't see any of these flags set by the kubernetes GCE installer, so I'm guessing we don't want to set them by default. But would love to hear about the use-case for this @itskingori

@justinsb
Member

Hmmm ... ignore my question about the use-case - I should have read the issue first. Now of course the question is why aren't these set by GCE? :-)

@itskingori
Member Author

@justinsb @chrislovecnm some context ...


The problem, and what led me down this path

On kops 1.5.1/k8s 1.5.3 I always kept getting context GC exceeded errors. When this happens the node has to be replaced; restarting Docker or the kubelet doesn't help. No one can explain why this happens, nor does anyone have a fix. I gave up on this. In hindsight (and I have no proof of this), I think it might be related to what I'm about to explain.

SO ... I just recently upgraded to kops 1.6.2/k8s 1.6.2. I'm no longer getting the context GC exceeded errors, but I'm getting another kind of problem, though Kubernetes seems to handle problems better now. Every now and then pods get stuck in a state of transition. By state of transition I mean that if a pod was Terminating, it stays stuck at that ... and if a pod was ContainerCreating it stays stuck at that. It's almost as if Kubernetes isn't able to update the state of the pod because Docker didn't do what it promised to do.

$ kubectl get pods -o wide --namespace=$ENVIRONMENT --no-headers
grafana-3323403255-kzg2w                      1/2       CrashLoopBackOff    8         1h
grafana-mysql-0                               0/1       ContainerCreating   0         3m

Unlike the previous issue ... there's a way to recover. Turns out that the pods stuck in this state have dead containers (correlation is not causation, but let's go on). When we have dead containers, /var/log/syslog is full of these:

Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: W0708 09:22:04.991325   13520 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "grafana-2597417898-prpwf_sandbox": Cannot find the network namespace, skipping pod network status for container {"docker" "4a2b4b51b0d64a64e8ffaaac53110c1b4ba019f37b755f09e67acb069d3e865f"}
Jul  8 09:22:04 ip-10-83-59-150 dockerd[1358]: time="2017-07-08T09:22:04.992595490Z" level=error msg="Handler for GET /v1.24/containers/331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f/json returned error: open /var/lib/docker/overlay/69353e3da8b9d11a16ee77343d8aa0208ca33bb5d66b300bfb9ea9e997e6d1ea/lower-id: no such file or directory"
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992798   13520 remote_runtime.go:273] ContainerStatus "331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f" from runtime service failed: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992827   13520 kuberuntime_container.go:385] ContainerStatus for 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f error: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992840   13520 kuberuntime_manager.go:858] getPodContainerStatuses for pod "grafana-2597417898-prpwf_sandbox(fe4927e6-6345-11e7-8ca3-0e5eca502a9e)" failed: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992858   13520 generic.go:239] PLEG: Ignoring events for pod grafana-2597417898-prpwf/sandbox: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f

If I clean them (the dead containers) out with the command below ... the stuck pods are able to proceed (and /var/log/syslog stops filling with the aforementioned errors).

$ docker rm $(docker ps -a | grep "Dead" | awk '{print $1}')

I'd also like to point out that the cluster also suddenly terminates nodes and replaces them 🤷‍♂️ ... I can't find any reason why. While I'm glad that Kubernetes self-heals ... I'd just prefer it didn't replace a node when it doesn't have to. By the way, I tried to set up a cronjob that cleans dead containers every minute, but I keep waking up to find that a bunch of my minions have been replaced. Which is why I filed #2928 (comment).

And so, all this leaves me with many questions:

  1. Why do we have a dead container in the first place?
  2. What is a dead container anyway? No one seems to know.
  3. Why wasn't kubelet able to clean up the dead pod (assuming it was supposed to)?
  4. Why is it that when I manually remove the dead pod ... any pod that was stuck in transition continues?

Lots of people are having this issue ... and it's unsolved. Initially I thought it was a CNI issue but @bboreham confirmed it wasn't (#2928 (comment)). And then he said:

Looks to me like your Docker is very unhappy, but I have no idea why.

And I went hmmm ... 🤔 🤔 🤔. I learnt that there's a difference between node capacity and allocatable capacity. And it looks something like this ...

      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------

I began to wonder ... everyone talks about setting resource requests and limits for pods ... which are then bin-packed onto the node, with little talk of the resources that the underlying system needs to run, i.e. the kernel, Docker, sshd, etc. Who makes sure these non-pod processes have what they need?

Turns out no one does. 😅

I checked the kops source ... and I saw we only set eviction thresholds. I figured the best reference for what these values should be is GKE ... which, to my surprise (just like Justin's), doesn't set them.

Ok, enough backstory ... what I think the problem is!

The TL;DR: starved non-pod resources ... probably critical system processes like Docker which need resources but don't have access to them. This is particularly difficult to debug because it depends on what is running on the cluster ... which people don't usually share. A few log lines aren't enough to provide that context.

I think the system is getting OOM'd in my case (I have no way to prove this) and the only protection I have is a RAM<100MB eviction threshold.

You know those two status checks that AWS has on the EC2 console? Like this 👇

[screenshot: status-check-tab]

In a span of 3 days, I'd say 7 minions would fail that check ☝️ ... and I get the exclamation mark right next to the green tick. There are many ways to troubleshoot this, but they all involve (1) getting onto the box, which I can't because I can't SSH in, and (2) looking at logs, which I don't have because the thing that ships my logs to ELK probably crashed as well. 😰

For example ... I deployed about 7 apps to my cluster. Each app consists of about 4 types of pods. Because I didn't know the resource expectations of my cluster I had only set requests. To make it worse ... I set these requests a long time ago when some of the pods had low usage, so now ... months later, several of the pods were sitting constantly above their resource requests, i.e. I had set the request to 1GB and the pod was using 1.6GB constantly ... and so on.

I hope you can see the potential problem here. Kubernetes will schedule based on resource requests, but then several of the pods are going to use way more than the node can handle. Memory is the biggest concern because it's not compressible. And free -m doesn't look pretty (very little free memory).

What I'm doing to try to solve the problem

The first thing I've done is set higher requests to give breathing room so there's some free memory even within the resource request band. That is, just because an app sits at 1GB doesn't mean the requests/limits are set at 1GB/1.5GB ... I'm giving it more breathing room, something like 1.5GB/2GB. And I've set limits while at it. This helps the memory pressure problem, but since everything can burst, it doesn't really guarantee resources for non-pod processes if even half of my pods burst constantly above their resource requests.
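
For illustration (made-up values to match the example above, not from my actual manifests), the resources stanza for such an app ends up looking roughly like this:

resources:
  requests:
    memory: 1.5Gi   # headroom above the ~1GB the app actually sits at
  limits:
    memory: 2Gi     # hard cap so a runaway container gets OOM-killed instead of starving the node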

The second thing I want to do (because it's a quick win over using these flags) is create a pod whose purpose is to reserve resources for non-pod processes (source). I'll probably use a DaemonSet for this, to act as a placeholder pod on each node - there's a rough DaemonSet sketch after the Pod spec below.

---
apiVersion: v1
kind: Pod
metadata:
  name: resource-reserver
spec:
  containers:
    - name: sleep-forever
      image: gcr.io/google_containers/pause:0.8.0
      resources:
        # requests default to limits, which gives the pod Guaranteed QoS
        limits:
          cpu: 100m
          memory: 1000Mi
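
And the DaemonSet variant would look roughly like this (just a sketch, untested; the apiVersion depends on the cluster version):

---
apiVersion: extensions/v1beta1  # apps/v1 on newer clusters
kind: DaemonSet
metadata:
  name: resource-reserver
spec:
  selector:
    matchLabels:
      app: resource-reserver
  template:
    metadata:
      labels:
        app: resource-reserver
    spec:
      containers:
      - name: sleep-forever
        image: gcr.io/google_containers/pause:0.8.0
        resources:
          # limits only, so requests default to limits and the pod gets Guaranteed QoS
          limits:
            cpu: 100m
            memory: 1000Mi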

The third thing I want to do is set these flags so that the scheduler can deduct them from node capacity during scheduling. This feels more intrusive and less flexible if you need to change things later ... but we should consider setting these, IMHO.
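
Concretely, I'm imagining something along these lines in the cluster spec (just a sketch with the example values from the Kubernetes docs - the exact field names follow the kubelet config spec and may differ):

spec:
  kubelet:
    enforceNodeAllocatable: "pods,system-reserved,kube-reserved"
    kubeReserved:
      cpu: "100m"
      memory: "100Mi"
    kubeReservedCgroup: "/kube-reserved"
    systemReserved:
      cpu: "100m"
      memory: "100Mi"
    systemReservedCgroup: "/system-reserved"
    evictionHard: "memory.available<100Mi"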

Answers to your questions

did we want to do another PR to add defaults for this to kops? What is convenient for you? Once testing is done I can take another look, and probably give you a lgtm. Let me know if you wanted to add to this PR or add another.

@chrislovecnm I didn't want to set defaults because GKE/GCE don't seem to set these (I asked a friend to check whether his node capacity is the same as his allocatable). Also, I'm not too sure what the proper values should be, especially the cgroup stuff ... I got those values from the examples in the Kubernetes source.

I didn't see any of these flags set by the kubernetes GCE installer, so I'm guessing we don't want to set them by default. But would love to hear about the use-case for this ...

@justinsb I guess I've given you more context above. As for why they don't set these, I think it's because the requirements of said system processes vary depending on load? Larger clusters probably need more ... and so they provide a benchmark tool to help come up with what values to set.

I was hoping to leverage the kops team's experience here by creating a PR and spurring a discussion of what these values should be ... or what sensible defaults we could have. I do think we should set these, even if to small values, but be cognisant that it will affect people with small clusters.

@chrislovecnm
Contributor

Talked with @itskingori on slack, and we are going to merge this.

@justinsb marking this as blocks-next as this is a big issue that @itskingori and, I assume, others are hitting.

@chrislovecnm merged commit e75e069 into kubernetes:master on Jul 18, 2017
@chrislovecnm
Contributor

/assign
/lgtm

@itskingori deleted the node_allocatable_resources branch on July 18, 2017 16:19
@blakebarnett

Related issue (though it may only be caused by disk pressure): kubernetes/kubernetes#45896 (comment)

Only fixed in 1.7 :(

k8s-github-robot pushed a commit that referenced this pull request Aug 14, 2017
Automatic merge from submit-queue

Add documentation on handling node resources

At a minimum, this is meant to give more context on why the feature in #2982 was added and to offer some recommendations on what to consider when evaluating node system resources.

I hope this spurs some discussion and that the recommendations I make may be assessed further. For example ... in one of the links I referenced, we're advised to set `system-reserved` **only if we know what we are doing** (which I can't say I do 💯% ... 🤷‍♂️), and we're even warned to set it only if we really need to.
mumoshu pushed a commit to kubernetes-retired/kube-aws that referenced this pull request Jun 11, 2018
### What

I decided to PR support for [reserve resources](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/) back to kube-aws.
If you want more background, the author of a related Kops PR does a fantastic job of documenting his journey/struggle with very similar issues to ours:
kubernetes/kops#2982

I realise this has already been possible with $KUBELET_OPTS, but I believe these config options are important enough to warrant being promoted to first-class citizens, giving them more visibility alongside the other kubelet config options.

### Why

We’ve had problems with Docker daemons locking up, with loads of orphaned/dead containers lying around not being cleaned up - this is most definitely a Docker daemon problem caused by our usage of old AMIs.
During these investigations we noticed that some of our apps appear badly sized and are doing some weird things on the hosts - things that are interfering with kube daemons and OS daemons, and that could be mitigated by reserving resources.

### Notes

#### Defaults

I thought I’d check kops to see if they have any idea about sensible defaults, but it appears they just took the values in the documentation (source: the author of the kops PR). As such, I’ve chosen not to set any default values, beyond mentioning the Kubernetes docs values in the cluster.yaml comments.
Please shout if you feel kube-aws should enforce some minimal ones.
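
Roughly the shape I have in mind (the keys below are hypothetical placeholders, not the actual kube-aws schema - the real option names are in the cluster.yaml comments this PR adds):

# hypothetical sketch, not the real kube-aws cluster.yaml keys
kubelet:
  kubeReserved:
    cpu: "100m"      # example values from the Kubernetes docs, not recommendations
    memory: "100Mi"
  systemReserved:
    cpu: "100m"
    memory: "100Mi"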

#### cgroups & enforcement

I haven’t included the --kube-reserved-cgroup or --system-reserved-cgroup options since I’m unsure if they apply right now to the way kube-aws's systemd units are structured. I’ve also left out --enforce-node-allocatable as that relies on the cgroup options being set.
Please shout if you’d like me to add them.

### Changelog

* Add kube & system reserved resources to controller and worker configs and templated kubelet options

* Add unit test for kube & system resources on kubelet options

* Add kube & system reserved example configuration into root cluster.yaml template
Labels: blocks-next, cncf-cla: yes