
Add reserve compute resources kubelet flags #2982

Merged · 3 commits merged into kubernetes:master on Jul 18, 2017

Conversation

itskingori
Member

Adds the kubelet flags needed to configure the compute resources reserved for system daemons.

Closes #1869.

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Jul 17, 2017
@k8s-ci-robot
Contributor

Hi @itskingori. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Jul 17, 2017
@blakebarnett left a comment

LGTM

@chrislovecnm
Contributor

PR looks great!

@itskingori did we want to do another PR to add defaults for this to kops? What is convenient for you? Once testing is done I can take another look, and probably give you a lgtm. Let me know if you wanted to add to this PR or add another.

Thanks again

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test label on Jul 17, 2017
@justinsb
Member

LGTM also. I didn't see any of these flags set by the kubernetes GCE installer, so I'm guessing we don't want to set them by default. But would love to hear about the use-case for this @itskingori

@justinsb
Member

Hmmm ... ignore my question about the use-case - I should have read the issue first. Now of course the question is why aren't these set by GCE? :-)

@itskingori
Member Author

@justinsb @chrislovecnm some context ...


The problem, and what led me down this path

On kops 1.5.1/k8s 1.5.3 I always kept getting context GC exceeded errors. When this happens the node has to be replaced; restarting Docker or the kubelet doesn't help. No one can explain why this happens, nor does anyone have a fix. I gave up on this. In hindsight (and I have no proof of this), I think it might be related to what I'm about to explain.

SO ... I just recently upgraded to kops 1.6.2/k8s 1.6.2. I'm no longer getting the context GC exceeded errors, but I'm getting another kind of problem, though Kubernetes seems to handle problems better now. Every now and then pods get stuck in a state of transition. By state of transition I mean that if a pod was Terminating, it stays stuck at that ... and if a pod was ContainerCreating it stays stuck at that. It's almost as if Kubernetes isn't able to update the state of the pod because Docker didn't do what it promised to do.

$ kubectl get pods -o wide --namespace=$ENVIRONMENT --no-headers
grafana-3323403255-kzg2w                      1/2       CrashLoopBackOff    8         1h
grafana-mysql-0                               0/1       ContainerCreating   0         3m

Unlike the previous issue ... there's a way to recover. Turns out that the pods stuck in this state have dead containers (correlation is not causation, but let's go on). When we have dead containers, /var/log/syslog is full of these:

Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: W0708 09:22:04.991325   13520 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "grafana-2597417898-prpwf_sandbox": Cannot find the network namespace, skipping pod network status for container {"docker" "4a2b4b51b0d64a64e8ffaaac53110c1b4ba019f37b755f09e67acb069d3e865f"}
Jul  8 09:22:04 ip-10-83-59-150 dockerd[1358]: time="2017-07-08T09:22:04.992595490Z" level=error msg="Handler for GET /v1.24/containers/331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f/json returned error: open /var/lib/docker/overlay/69353e3da8b9d11a16ee77343d8aa0208ca33bb5d66b300bfb9ea9e997e6d1ea/lower-id: no such file or directory"
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992798   13520 remote_runtime.go:273] ContainerStatus "331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f" from runtime service failed: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992827   13520 kuberuntime_container.go:385] ContainerStatus for 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f error: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992840   13520 kuberuntime_manager.go:858] getPodContainerStatuses for pod "grafana-2597417898-prpwf_sandbox(fe4927e6-6345-11e7-8ca3-0e5eca502a9e)" failed: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992858   13520 generic.go:239] PLEG: Ignoring events for pod grafana-2597417898-prpwf/sandbox: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f

If I clean them (the dead containers) out with the command below ... the stuck pods are able to proceed (and /var/log/syslog stops filling with the aforementioned errors).

$ docker rm $(docker ps -a | grep "Dead" | awk '{print $1}')

I'd also like to point out that the cluster also suddenly terminates nodes and replaces them 🤷‍♂️ ... I can't find any reason why. While I'm glad that Kubernetes self-heals ... I'd just prefer it didn't replace a node when it doesn't have to. By the way, I tried to set up a cronjob that cleans dead containers every minute, but I keep waking up to find that a bunch of my minions have been replaced. Which is why I filed #2928 (comment).

And so, all this leaves me with many questions:

  1. Why do we have a dead container in the first place?
  2. What is a dead container anyway? No one seems to know.
  3. Why wasn't kubelet able to clean up the dead pod (assuming it was supposed to)?
  4. Why is it that when I manually remove the dead pod ... any pod that was stuck in transition continues?

Lots of people are having this issue ... and it's unsolved. Initially I thought it was a CNI issue but @bboreham confirmed it wasn't (#2928 (comment)). And then he said:

Looks to me like your Docker is very unhappy, but I have no idea why.

And I went hmmm ... 🤔 🤔 🤔. I learnt that there's a difference between node capacity and allocatable capacity. And it looks something like this ...

      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------

I began to wonder ... everyone talks about setting resource requests and limits for pods ... which are then bin-packed onto the node, with little talk of the resources that the underlying system needs to run, i.e. the kernel, Docker, sshd, etc. Who makes sure these non-pod processes have what they need?

Turns out no one does. 😅

I checked the kops source ... and I saw we only set eviction thresholds. I figured the best reference for what these values should be is GKE ... which, to my surprise (just like Justin's), doesn't set them.

Ok, enough backstory ... what I think the problem is!

The TL;DR: starved non-pod resources ... probably critical system processes like Docker which need resources but don't have access to them. This is particularly difficult to debug because it depends on what is running on the cluster ... which people don't usually share. A few log lines aren't enough to provide that context.

I think the system is getting OOM'd in my case (I have no way to prove this) and the only protection I have is a RAM<100MB eviction threshold.

You know those two status checks that AWS has on the EC2 console? Like this 👇

[screenshot: status-check-tab]

In a span of 3 days, I'd say 7 minions would fail that check ☝️ ... and I get the exclamation mark right next to the green tick. There are many ways to troubleshoot this, but they all involve (1) getting onto the box, which I can't because I can't SSH in, and (2) looking at logs, which I don't have because the thing that ships my logs to ELK probably crashed as well. 😰

For example ... I deployed about 7 apps to my cluster. Each app consists of about 4 types of pods. Because I didn't know the resource expectations of my cluster I had only set requests. To make it worse ... I set these requests a long time ago when some of the pods had low usage, so now ... months later, several of the pods were sitting constantly above their resource requests, i.e. I had set the request to 1GB and the pod was using 1.6GB constantly ... and so on.

I hope you can see the potential problem here. Kubernetes will schedule based on resource requests, but then several of the pods are going to use way more than the node can handle. Memory is the biggest concern because it's not compressible. And free -m doesn't look pretty (very little free memory).

What I'm doing to try to solve the problem

The first thing I've done is set higher requests to give breathing room so there's some free memory even within the resource request band. That is, just because an app sits at 1GB doesn't mean the requests/limits are set at 1GB/1.5GB ... I'm giving it more breathing room, something like 1.5GB/2GB. And I've set limits while at it. This helps the memory pressure problem, but since everything can burst, it doesn't really guarantee resources for non-pod processes if even half of my pods burst constantly above their resource requests.
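
For illustration (made-up values to match the example above, not from my actual manifests), the resources stanza for such an app ends up looking roughly like this:

resources:
  requests:
    memory: 1.5Gi   # headroom above the ~1GB the app actually sits at
  limits:
    memory: 2Gi     # hard cap so a runaway container gets OOM-killed instead of starving the node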

The second thing I want to do (because it's a quick win over using these flags) is create a pod whose purpose is to reserve resources for non-pod processes (source). I'll probably use a DaemonSet for this, to act as a placeholder pod on each node - there's a rough DaemonSet sketch after the Pod spec below.

---
apiVersion: v1
kind: Pod
metadata:
  name: resource-reserver
spec:
  containers:
    - name: sleep-forever
      image: gcr.io/google_containers/pause:0.8.0
      resources:
        # requests default to limits, which gives the pod Guaranteed QoS
        limits:
          cpu: 100m
          memory: 1000Mi
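
And the DaemonSet variant would look roughly like this (just a sketch, untested; the apiVersion depends on the cluster version):

---
apiVersion: extensions/v1beta1  # apps/v1 on newer clusters
kind: DaemonSet
metadata:
  name: resource-reserver
spec:
  selector:
    matchLabels:
      app: resource-reserver
  template:
    metadata:
      labels:
        app: resource-reserver
    spec:
      containers:
      - name: sleep-forever
        image: gcr.io/google_containers/pause:0.8.0
        resources:
          # limits only, so requests default to limits and the pod gets Guaranteed QoS
          limits:
            cpu: 100m
            memory: 1000Mi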

The third thing I want to do is set these flags so that the scheduler can deduct them from node capacity during scheduling. This feels more intrusive and less flexible if you need to change things later ... but we should consider setting these, IMHO.
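
Concretely, I'm imagining something along these lines in the cluster spec (just a sketch with the example values from the Kubernetes docs - the exact field names follow the kubelet config spec and may differ):

spec:
  kubelet:
    enforceNodeAllocatable: "pods,system-reserved,kube-reserved"
    kubeReserved:
      cpu: "100m"
      memory: "100Mi"
    kubeReservedCgroup: "/kube-reserved"
    systemReserved:
      cpu: "100m"
      memory: "100Mi"
    systemReservedCgroup: "/system-reserved"
    evictionHard: "memory.available<100Mi"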

Answers to your questions

did we want to do another PR to add defaults for this to kops? What is convenient for you? Once testing is done I can take another look, and probably give you a lgtm. Let me know if you wanted to add to this PR or add another.

@chrislovecnm I didn't want to set defaults because GKE/GCE don't seem to set these (I asked a friend to check whether his node capacity is the same as his allocatable). Also, I'm not too sure what the proper values should be, especially the cgroup stuff ... I got those values from the examples in the Kubernetes source.

I didn't see any of these flags set by the kubernetes GCE installer, so I'm guessing we don't want to set them by default. But would love to hear about the use-case for this ...

@justinsb I guess I've given you more context above. As for why they don't set these, I think it's because the requirements of said system processes vary depending on load? Larger clusters probably need more ... and so they provide a benchmark tool to help come up with what values to set.

I was hoping to leverage the kops team's experience here by creating a PR and spurring a discussion of what these values should be ... or what sensible defaults we could have. I do think we should set these, even if to small values, but be cognisant that it will affect people with small clusters.

@chrislovecnm
Contributor

Talked with @itskingori on slack, and we are going to merge this.

@justinsb marking this as blocks-next as this is a big issue that @itskingori and, I assume, others are hitting.

@chrislovecnm merged commit e75e069 into kubernetes:master on Jul 18, 2017
@chrislovecnm
Contributor

/assign
/lgtm

@itskingori deleted the node_allocatable_resources branch on July 18, 2017 16:19
@blakebarnett

Related issue (though it may only be caused by disk pressure): kubernetes/kubernetes#45896 (comment)

Only fixed in 1.7 :(

k8s-github-robot pushed a commit that referenced this pull request Aug 14, 2017
Automatic merge from submit-queue

Add documentation on handling node resources

At a minimum, this is meant to give more context on why the feature in #2982 was added and to offer some recommendations on what to consider when evaluating node system resources.

I hope this spurs some discussion and that the recommendations I make may be assessed further. For example ... in one of the links I referenced, we're advised to set `system-reserved` **only if we know what we are doing** (which I can't say I do 💯% ... 🤷‍♂️), and we're even warned to set it only if we really need to.
mumoshu pushed a commit to kubernetes-retired/kube-aws that referenced this pull request Jun 11, 2018
### What

I decided to PR support for [reserve resources](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/) back to kube-aws.
If you want more background, the author of a related Kops PR does a fantastic job of documenting his journey/struggle with very similar issues to ours:
kubernetes/kops#2982

I realise this has already been possible with $KUBELET_OPTS, but I believe these config options are important enough to warrant being promoted to first-class citizens, giving them more visibility alongside the other kubelet config options.

### Why

We’ve had problems with Docker daemons locking up, with loads of orphaned/dead containers lying around not being cleaned up - this is most definitely a Docker daemon problem caused by our usage of old AMIs.
During these investigations we noticed that some of our apps appear badly sized and are doing some weird things on the hosts - things that are interfering with kube daemons and OS daemons, and that could be mitigated by reserving resources.

### Notes

#### Defaults

I thought I’d check kops to see if they have any idea about sensible defaults, but it appears they just took the values in the documentation (source: the author of the kops PR). As such, I’ve chosen not to set any default values, beyond mentioning the Kubernetes docs values in the cluster.yaml comments.
Please shout if you feel kube-aws should enforce some minimal ones.
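
Roughly the shape I have in mind (the keys below are hypothetical placeholders, not the actual kube-aws schema - the real option names are in the cluster.yaml comments this PR adds):

# hypothetical sketch, not the real kube-aws cluster.yaml keys
kubelet:
  kubeReserved:
    cpu: "100m"      # example values from the Kubernetes docs, not recommendations
    memory: "100Mi"
  systemReserved:
    cpu: "100m"
    memory: "100Mi"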

#### cgroups & enforcement

I haven’t included the --kube-reserved-cgroup or --system-reserved-cgroup options since I’m unsure if they apply right now to the way kube-aws's systemd units are structured. I’ve also left out --enforce-node-allocatable as that relies on the cgroup options being set.
Please shout if you’d like me to add them.

### Changelog

* Add kube & system reserved resources to controller and worker configs and templated kubelet options

* Add unit test for kube & system resources on kubelet options

* Add kube & system reserved example configuration into root cluster.yaml template
Labels: blocks-next, cncf-cla: yes