Add reserve compute resources kubelet flags #2982
Conversation
Hi @itskingori. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
LGTM
PR looks great! @itskingori did we want to do another PR to add defaults for this to kops? Whatever is convenient for you. Once testing is done I can take another look, and probably give you an lgtm. Let me know if you want to add to this PR or open another. Thanks again. /ok-to-test
LGTM also. I didn't see any of these flags set by the kubernetes GCE installer, so I'm guessing we don't want to set them by default. But I would love to hear about the use-case for this @itskingori
Hmmm ... ignore my question about the use-case - I should have read the issue first. Now of course the question is why aren't these set by GCE? :-)
@justinsb @chrislovecnm some context ...

### Problem, what led me down this path

On kops 1.5.1/k8s 1.5.3 I kept getting context GC exceeded errors. When this happens the node has to be replaced; restarting Docker or restarting the kubelet doesn't help. There's no one who can explain why this happens, nor does anyone have a fix, so I gave up on it. In hindsight (and I have no proof of this), I think it might be related to what I'm about to explain.

SO ... I just recently upgraded to kops 1.6.2/k8s 1.6.2. I'm no longer getting the context GC exceeded errors, but I'm getting another kind of problem. Kubernetes seems to handle problems better now, though. Every now and then pods get stuck in a state of transition. By state of transition I mean that if a pod was ...
Unlike the previous issue ... there's a way to recover. It turns out that the pods stuck in this state have dead containers (correlation is not causation, but let's go on). When we have dead containers ...
If I clean them (the dead containers) out with the command below ... the stuck pods are able to proceed (and ...
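For illustration, cleaning out dead containers typically looks something like this (a sketch using Docker's status filter, not necessarily the exact command referenced above):

# list the IDs of containers Docker reports as "dead" and remove them
# (-r is GNU xargs' no-run-if-empty, so nothing happens when there are no dead containers)
docker ps -aq --filter status=dead | xargs -r docker rm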
I'd also like to point out that the cluster also suddenly terminates nodes and replaces them 🤷♂️ ... I can't find any reason why. While I'm glad that Kubernetes self-heals ... I'd just prefer that it not replace a node when it doesn't have to. By the way, I tried to set up a cronjob that does this (clean dead containers) every minute, but I keep waking up to find that a bunch of my minions have been replaced. Which is why I filed #2928 (comment). And so, all this leaves me with many questions:
Lots of people are having this issue ... and it's unsolved. Initially I thought it was a CNI issue, but @bboreham confirmed it wasn't #2928 (comment). And then he said:
And I went hmmm ... 🤔 🤔 🤔. I learnt that there's a difference between node capacity and allocatable capacity. And it looks something like this ...
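Roughly, based on the Kubernetes node-allocatable model (a sketch, with `<node-name>` as a placeholder):

# Node Capacity
# ├── kube-reserved        resources set aside for the kubelet, container runtime, etc.
# ├── system-reserved      resources set aside for OS daemons (sshd, udev, kernel, ...)
# ├── eviction-threshold   headroom before the kubelet starts evicting pods
# └── Allocatable          what the scheduler is allowed to bin-pack pods into
#
# i.e. Allocatable = Capacity - kube-reserved - system-reserved - eviction-threshold
kubectl describe node <node-name>   # prints both the Capacity and Allocatable sections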
I began to wonder ... everyone talks about setting resource requests and limits for pods ... which are then bin-packed onto the node, with little talk of the resources that the underlying system needs to run, i.e. the kernel, Docker, sshd, etc. Who makes sure these non-pod processes have what they need? Turns out no one does. 😅 I checked the kops source ... and I saw we only set ...

### Ok, enough backstory ... what I think the problem is!

The TL;DR: starved non-pod resources ... probably critical system processes like Docker which need resources but don't have access to them. This is particularly difficult to debug because it depends on what is running on the cluster ... which people don't usually share. A few log lines aren't enough to provide that context. I think the system is getting OOM'd in my case (I have no way to prove this) and the only protection I have is a RAM<100MB eviction threshold.

You know those two status checks that AWS has on the EC2 console? Like this 👇 In a span of 3 days, I'd say 7 minions would fail that check ☝️ ... and I get the exclamation mark right next to the green tick. There are many ways to troubleshoot this, but they all involve (1) getting onto the box, which I can't do because I can't SSH in, and (2) looking at logs, which I don't have because the thing that ships my logs to ELK probably crashed as well. 😰

For example ... I deployed about 7 apps to my cluster. Each app consists of about 4 types of pods. Because I didn't know the resource expectations of my cluster, I had only set requests. To make it worse ... I set these requests a long time ago when some of the pods had low usage, so now ... months later, several of the pods were sitting constantly above their resource requests, i.e. I had set the requests to 1GB and a pod was using 1.6GB constantly ... and so on. I hope you can see the potential problem here. Kubernetes will schedule based on resource requests, but then several of the pods are gonna use up way more than what the node can handle. Memory is the biggest concern because it's not compressible.

### What I'm doing to try to solve the problem

The first thing I've done is set higher requests to give breathing room so there's some free memory even within the resource request band. That is, just because an app sits at 1GB doesn't mean the requests/limits are set at 1GB/1.5GB ... I'm giving it more breathing room, something like 1.5GB/2GB. And I've set limits while at it. This would ease the memory pressure problem, but since everything can burst, it doesn't really guarantee resources for non-pod processes if even half of my pods burst constantly above their resource requests.

The second thing I want to do (because it's a quick win over using these flags) is create a pod whose purpose is to reserve resources for non-pod processes (source). I'll probably use a DaemonSet for this, to act as a placeholder pod on each node:

---
apiVersion: v1
kind: Pod
metadata:
  name: resource-reserver
spec:
  containers:
  - name: sleep-forever
    image: gcr.io/google_containers/pause:0.8.0
    resources:
      # requests would default to limits and therefore set it to Guaranteed QoS
      limits:
        cpu: 100m
        memory: 1000Mi

The third thing I want to do is set these flags so that the scheduler can deduct them from node capacity during scheduling (sketched below). This feels more intrusive, and less flexible if you need to change the values ... but we should consider setting these, IMHO.
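For reference, these are the kubelet flags being discussed; the values here are placeholders for illustration, not recommendations:

# Illustrative kubelet invocation (placeholder values, alongside whatever other flags the node already uses)
kubelet \
  --kube-reserved=cpu=100m,memory=256Mi \
  --system-reserved=cpu=100m,memory=256Mi \
  --eviction-hard=memory.available<100Mi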
### Answers to your questions

@chrislovecnm I didn't want to set defaults because GKE/GCE don't seem to set these (I asked a friend to check whether his node capacity is the same as his allocatable). Also, I'm not too sure what the proper values should be, especially the cgroup stuff ... I got those values from the examples in the Kubernetes source.
@justinsb I guess I've given you more context above. As for why they don't set these, I think it's because the requirements of said system processes vary depending on load? Larger clusters probably need more ... and so they provide a benchmark tool to help with coming up with what values to set. See:
I was hoping to leverage the kops team's experience here by creating a PR and spurring a discussion of what these values should be ... or what sensible defaults we could have. I do think we should have these set, even if to small values, but be cognisant that it will affect people with small clusters.
Talked with @itskingori on Slack, and we are going to merge this. @justinsb marking this as blocks next, as this is a big issue that @itskingori and, I assume, others are hitting.
/assign
Related issue (though it may only be caused by disk pressure): kubernetes/kubernetes#45896 (comment). Only fixed in 1.7 :(
Automatic merge from submit-queue

Add documentation on handling node resources

At a minimum, this is meant to give more context on why the feature in #2982 was added, and it attempts to give some recommendations on what to consider when evaluating node system resources. I hope this spurs some discussion and that the recommendations I make may be assessed further. For example ... in one of the links I referenced, we're advised to set `system-reserved` **only if we know what we are doing** (which I can't say I do 💯% ... 🤷♂️), and we're even warned to only set it if we really need to.
### What

I decided to PR back support to [reserve resources](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/) to kube-aws. If you want more background, the author of a related kops PR does a fantastic job of documenting his journey/struggle with very similar issues to ours: kubernetes/kops#2982

I realise this has already been possible with $KUBELET_OPTS, but I believe the config options are important enough to warrant being promoted to first-class citizens, to give them more visibility alongside the other kubelet config options.

### Why

We've had problems with Docker daemons locking up with loads of orphaned/dead containers lying around not being cleaned up - this is most definitely a Docker daemon problem caused by our usage of old AMIs. During these investigations we noticed that some of our apps appear badly sized and are doing some weird things on the hosts - things that are interfering with kube daemons and OS daemons, and that could be mitigated by reserving resources.

### Notes

#### Defaults

I thought I'd check kops to see if they have any idea about sensible defaults, but it appears they just took the values in the documentation (source: the author of the kops PR) - as such, I've chosen not to set any default values, outside of just mentioning the Kubernetes docs ones in the cluster.yaml comments. Please shout if you feel like kube-aws should enforce some minimal ones.

#### cgroups & enforcement

I haven't included --kube-reserved-cgroup or --system-reserved-cgroup options since I'm unsure if they apply right now to the way kube-aws's systemd units are structured. I've also left out --enforce-node-allocatable as that relies on the cgroup options being set. Please shout if you'd like me to add them.

### Changelog

* Add kube & system reserved resources to controller and worker configs and templated kubelet options
* Add unit test for kube & system resources on kubelet options
* Add kube & system reserved example configuration into root cluster.yaml template
Adds the kubelet flags needed to configure the resources set aside for system daemons. Relevant reading:
Closes #1869.