k3s causes a high load average #294
Comments
@drym3r is this a fresh install? Do you have anything deployed in Kubernetes? Do you see which processes are taking CPU? |
I'm seeing the same behavior on my laptop (vanilla Ubuntu 18.04 - i5-3427u CPU). |
@ibuildthecloud Yes, fresh install. I have things installed right now, but I saw this behaviour before I did. Nothing is taking much CPU; like @seabrookmx, I only see a high load average. The load average is also affected by disk IO, but I haven't detected any activity there either. |
Seeing the same kind of k3s-server CPU usage (continuous 25-30%) on CentOS 7.6, vanilla k3s install, otherwise idle server. The only resource deployed is kubevirt 0.17, but nothing is actually deployed for kubevirt to manage. Not sure if it's related, but I am also seeing #463 at the same time. |
The CPU load of my small single-node cluster is also quite high:
k3s manages to produce more load than the elasticsearch process and other resource-intensive workloads. |
v0.6.0-rc5 also causes a high cpu load... |
I'm seeing about 20-25% cpu usage of a single Skylake Xeon core constantly. Is this expected behaviour? With https://microk8s.io it's only a couple percent. |
v0.7.0-rc5 also causes a high cpu load... @ibuildthecloud are you guys going to tackle this issue anytime soon? |
I'm using the latest stable version and still see too high a load average, but it's a little better now. I'm seeing between 0.82, 0.93, 0.86 and 1.21, 0.93, 1.30. Even when it's under 1 there is too little traffic to justify it, but at least it's not blocking any more. |
What kind of load avg, or idle CPU usage, would be considered normal? |
I'm getting a similarly high load average on my Raspberry Pi cluster. |
Without a way to reproduce this is hard to fix; any info about OS version, hardware, and k3s version is greatly appreciated. I am not too worried about 10-30% usage, but 100%+ is not good. If there is anything helpful in the logs, sharing that would also help. It would be good to obtain some profiling data from k3s; we may need to make a special compilation to help with that: https://github.com/golang/go/wiki/Performance |
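For anyone who wants to provide that data, a minimal profiling sketch with Linux perf (assuming perf is installed and the process matches "k3s server"; the sampling rate and duration are illustrative only):

```sh
# Find the k3s server PID (assumes a standard systemd install)
K3S_PID=$(pgrep -f 'k3s server' | head -n1)

# Sample stacks at 99 Hz for 60 seconds
sudo perf record -F 99 -g -p "$K3S_PID" -- sleep 60

# Summarize the hottest call paths
sudo perf report --stdio | head -n 40
```

Without debug symbols in the binary the stacks may be partially unreadable, which is what the special compilation mentioned above would address.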
Some details from me:
Excerpt from syslog on master:
The "orphaned pod" messages refer to kubernetes/kubernetes#60987 . Removing the orphans doesn't seem to have any effect on the load. The "EOF" messages continue to be logged. Aside from these two I can see no other log messages from k3s during normal (idle) operation. |
The EOF messages should not be causing an issue. They are fixed in later versions of k3s; the fact that versions like v0.7.0-rc5 still have this issue is concerning. Does the CPU spike right away, or does it take time to build up? What are the ulimits of the k3s server process? What is the iostat of the k3s server process? What services/things are installed that might be using the k8s API? And does disabling that thing reduce CPU consumption? |
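To collect the values asked for above, something along these lines works on most installs (a sketch; pidstat comes from the sysstat package and is my assumption, not something prescribed in the thread):

```sh
K3S_PID=$(pgrep -f 'k3s server' | head -n1)

# Effective limits of the running process
cat /proc/"$K3S_PID"/limits

# Per-process disk I/O, sampled every 5 seconds, 3 times
pidstat -d -p "$K3S_PID" 5 3
```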
Regarding CPU & I/O, here's a grafana screenshot of the two servers: https://drive.google.com/open?id=17CJi95dO98z9bW6oMJtiqiym1yEuIQ5i. "beryllium" is the k3s server host, "protactinium" is the agent with an additional kafka broker installed. The curves are pretty representative of the behavior all day, including shortly after a restart. ulimit of "k3s server":
|
Does this have anything to do with kubernetes/kubernetes#64137? That has to do with orphaned systemd cgroup watches causing kubelet CPU usage to gradually ramp up until a node dies. Several threads attribute this to a bad interaction between certain kernel versions and certain systemd versions. It's particularly exacerbated by cron jobs since they can have such a short lifecycle. Anyhow, if this is the same thing, then I posted a DaemonSet that you can use as a workaround on that ticket until someone gets around to fixing the kernel or systemd or whatever. |
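A quick way to check whether the cgroup leak described in kubernetes/kubernetes#64137 is in play is to watch the cgroup count grow on an otherwise idle node (a hedged diagnostic, assuming cgroup v1 paths):

```sh
# Count cgroups in the systemd and memory hierarchies; on an idle node
# these numbers should stay roughly constant between runs.
find /sys/fs/cgroup/systemd -type d | wc -l
find /sys/fs/cgroup/memory -type d | wc -l

# Re-run after a few minutes (or after cron jobs have fired) and compare.
```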
I was also able to replicate this pretty trivially on Ubuntu 18.04. This is with a cluster with nothing deployed to it, and yet the k3s server consumed 20-30% CPU. I also evaluated microk8s and standard Kubernetes deployed with kubeadm. Both only consumed about ~3% CPU at idle with nothing deployed on the same machine. These results were consistent across attempts and after I applied deployments to the cluster. The load was immediate right after bring-up. No logs emitted from the server suggest anything going wrong. This is a show-stopper for me moving forward with k3s, unfortunately. |
The latest version 0.8.1 also suffers from this problem. I wonder how someone would run k3s on a low-power edge device when k3s itself consumes so many CPU cycles. |
If this is trivial to replicate it would be good to have steps to reproduce. Ubuntu 18.04 seems to be the common theme, so I tried creating a VM:
However, after launching k3s with |
I'm seeing this on k3s 1.0.1. I've been using rke before and haven't had this problem on the same host. In my case I'm using NixOS - https://nixos.org. |
I've investigated how load average works, and it does in fact make sense that on a Raspberry Pi 3, which has 4 cores, a load average between 1 and 4 is within capacity. I still think that a load average of 2.69 for a master with little usage is too much, but since it's within what the hardware supports, I'll close this issue. If somebody doesn't agree, feel free to open another one or reopen this one. PS: on the same machine, an agent without containers has a load average of around 0.3, FYI. |
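For reference, the comparison being made here is load average against core count; a quick way to eyeball it (generic Linux commands, not specific to k3s):

```sh
# Number of CPU cores
nproc

# 1-, 5- and 15-minute load averages (first three fields)
cat /proc/loadavg

# Rule of thumb: sustained load average above the core count means work is queueing.
```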
Same behavior with my tests.
k3s always consumes about 20% CPU in top after initialization. That puts it 18% away from being usable on an edge device; 20% for a management tool when idling is a joke. A full install of Apache, PHP, MariaDB, Gitea, Postfix and Dovecot, all in Docker containers, hovers around 3-5% when idling, on a slow Raspberry Pi 3. Easy to reproduce. |
Have similar issue here: idle cluster with 5-15 load average on master.
Apps: istio + a single nginx serving static content that gets no requests at all. Versions:
No pods are restarting or scaling, so that is out of the picture. I suspect it may be related to health checks, but it seems really strange that 19 pods can produce so much load (especially considering that all health checks should be async IO and run once per 2 seconds at most). The strangest thing is that the more I observe (sitting in SSH), the lower the load average becomes... Just after installation of the whole cluster the load avg was about 4-5; after a good sleep (about 7 hours) I wanted to check that all was OK and it was 13-15; after some investigation (collecting logs, inspecting versions, etc.) it dropped to 5 and is now rising again. I don't understand why or how this is possible. All this time (regardless of load avg) the k3s daemon process jumps from 0-5% to X% (where X is roughly load avg * 100) for about a second and then back to 0-5%
k3s journalctl logs (last 100 lines):
|
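On Linux the load average counts not only runnable tasks but also tasks stuck in uninterruptible sleep (usually disk or network I/O), which can explain a high load average with low CPU usage. A small diagnostic sketch (hypothetical, not taken from this report):

```sh
# List tasks currently in uninterruptible sleep ("D" state);
# a persistent population of these inflates the load average.
ps -eo state,pid,comm,wchan | awk '$1 == "D"'
```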
Do you have many workloads which use an |
No there are only http probes |
Although it might be that k3s executes http probes similarly to how the reference implementation executes |
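To see which probe types workloads actually use, a kubectl one-liner like the following can help (a sketch; it only inspects liveness probes, adjust for readiness/startup probes as needed):

```sh
# Print namespace/pod plus any exec-based liveness probe commands;
# pods that only use http probes show an empty second column.
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.containers[*].livenessProbe.exec.command}{"\n"}{end}'
```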
Not as far as I can see. Still takes up a lot of CPU even on low-end Intel VMs |
I know this is not the place for this but I'll go ahead and ask anyway
|
All of these components are useful even with a single-node setup, and that's why they are enabled by default. But here is a small description of these components in case you want to disable some:
As for the scheduler, cloud controller and network policy: those are key components, there is no real need to remove them. |
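As a hedged example of what disabling the optional packaged components can look like (flag names have changed across k3s versions, older releases used --no-deploy, so check the docs for your release):

```sh
# Start the server without the optional packaged add-ons;
# keep only the ones you actually need.
k3s server \
  --disable traefik \
  --disable servicelb \
  --disable metrics-server \
  --disable local-storage
```

The scheduler, cloud controller and network policy controller discussed above are better left enabled, as noted.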
I'm still seeing this on my VM. Did anyone solve this or find a workaround? |
The issue is not resolved. |
/usr/share/bcc/tools/filetop (from bcc-tools) + iotop are saying this is due to etcd being really write intensive. REALLY write intensive. 2-4MB/s of constant writes when Kubernetes is completely idle (default installation of 3 masters and 2 workers). This is an SSD killer (and when the SLC buffer fills on your cheap QLC/TLC drive, the SSD will grind to a halt) and completely unacceptable in a homelab environment. I also wonder how this affects performance if the etcd db is on BTRFS. I guess I will try MySQL next ;) |
Yes, this is why k3s ships with sqlite by default, or external DB as the default HA option. Etcd is incredibly write intensive. |
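One way to see what the default datastore is doing on disk is simply to watch the k3s data directory (paths assumed from a default server install; iotop needs root):

```sh
# Default sqlite datastore location on a k3s server
ls -lh /var/lib/rancher/k3s/server/db/

# Accumulated write totals for the k3s process, only while it does I/O
sudo iotop -ao -p "$(pgrep -f 'k3s server' | head -n1)"
```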
This issue still seems trivial to reproduce across any hardware. Despite being closed for over 2 years it still has tons of activity and consistent reports. I gave up on k3s long ago because of it. Suggestions:
|
I am facing the same issue. Raspberry Pi 4 with Ubuntu 21.04 (AArch64), k3s v1.20.7+k3s1, single-node installation. CPU is constantly around 20-30%, so one full core is used permanently by k3s-server. I made a flamegraph according to the documentation and have attached it. I have no idea what I'm looking at, so maybe someone else can see where the problem is :-D Basically I did this:
|
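The exact steps were attached rather than quoted above; for anyone wanting to produce a similar flamegraph, a generic perf + FlameGraph sketch (not necessarily the procedure used here) looks like this:

```sh
git clone https://github.com/brendangregg/FlameGraph
K3S_PID=$(pgrep -f 'k3s server' | head -n1)

# Sample the k3s server for two minutes
sudo perf record -F 99 -g -p "$K3S_PID" -- sleep 120
sudo perf script > out.perf

# Fold the stacks and render an interactive SVG
./FlameGraph/stackcollapse-perf.pl out.perf | ./FlameGraph/flamegraph.pl > k3s.svg
```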
20% on a Pi4b is about right, see: resource-profiling |
Here's darkstar's flamegraph for anyone driving by. @darkstar From that flamegraph we can see the entire time is spent inside containerd, but we don't have symbols to get any introspection on the stack there. We can see about half of the time is spent purely within containerd, and a little less than the other half ends up in a system call, el0_sync, from containerd. To get a fuller picture, make sure you're compiling k3s-server with debug symbols so we can get more detail on the stacktrace for both k3s and containerd. I really don't know this project that well, but a quick look shows this build script omitting debug symbols with @brandond Those numbers seem unreasonably high. I think I remember plain k8s deployed with kubeadm to a single node taking ~3-5% on a consumer Intel Skylake 3.1GHz. There is likely some optimization that can be done here. |
"20% of a core" is different to "25% overall" (which is basically "100% of a core" as the RPi4 has 4 cores), so a factor 5 too high. @AdamDorwart I'll try and see what I can do about the symbols, thanks for the quick analysis |
Does anyone have a debug binary ready that they could share? Or a document explaining how to cross-build for aarch64 on x86_64? |
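The full k3s build goes through its own scripts, but for a plain Go binary the generic cross-compilation recipe is roughly this (a sketch only; it does not reproduce the official k3s build pipeline or its static linking setup):

```sh
# Cross-compile for 64-bit ARM from an x86_64 host, keeping debug symbols
# (i.e. without the usual -ldflags "-w -s" stripping).
GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -o k3s-arm64 .
```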
Here the problem was connected to k3s and iptables, when dumping |
@plu that sounds like #3117 (comment) |
Fresh install on latest Raspbian, jumps between 20-60% CPU usage, did the dump with |
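A quick way to check whether duplicated iptables rules are piling up, as described in the linked comment (generic commands; the "few hundred" baseline is a rough guess, not a documented figure):

```sh
# Total number of rules; on an idle single node this is usually a few hundred.
sudo iptables-save | wc -l

# Show rules that appear more than once, most-duplicated first.
sudo iptables-save | sort | uniq -cd | sort -rn | head
```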
Same issue here. On my AWS instance, iostat reports: avg-cpu: %user %nice %system %iowait %steal %idle
11.53 0.16 3.80 2.23 81.90 0.38
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
loop0 3.25 138.15 0.00 0.00 305637 0 0
loop1 0.03 0.18 0.00 0.00 400 0 0
loop2 0.11 1.05 0.00 0.00 2325 0 0
loop3 0.03 0.49 0.00 0.00 1088 0 0
loop4 1.54 57.79 0.00 0.00 127851 0 0
loop5 0.03 0.17 0.00 0.00 377 0 0
loop6 0.12 1.08 0.00 0.00 2398 0 0
loop7 0.03 0.51 0.00 0.00 1119 0 0
xvda 138.26 3681.40 318.29 0.00 8144350 704156 0
--- update --- |
Same issue here on a 6-node cluster (3 servers, 3 agents). |
Which iptables version are you running? If it's anything below 1.8.6 you can just get rid of it since it has some nasty bug. Simply uninstall the iptables package, as k3s brings its own version that works fine |
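To check the installed iptables version and backend before deciding whether to remove it (the 1.8.6 threshold is the one cited above; the alternatives query is Debian/Ubuntu-specific):

```sh
iptables --version          # e.g. "iptables v1.8.4 (nf_tables)" or "(legacy)"
update-alternatives --display iptables 2>/dev/null | head -n 3
```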
Whatever version is in Ubuntu 20.04 (Focal Fossa). It seems that the problem may actually be etcd. I don't see why it has to write as much as it does. If the cluster is idle, then it should see that there is no change and not try to actually write anything. Or at least keep the writes in a cache that it flushes less frequently. I've read conflicting suggestions for etcd. One says to make sure it's writing to SSDs because spinning rust drives are too slow (again, it shouldn't care how long it takes to write. Ever heard of async IO?). The other advice is that it writes way too much and will kill SSDs. |
That's another problem, but unrelated. I'm using k3s with external PostgreSQL (no etcd anywhere), and idle k3s process regularly used ~30% CPU. |
I'm going to post here even though the issue is closed, given that many other users are experiencing the same problem. High CPU usage on the master node (50% CPU avg on all cores) of a 2-node cluster. Profiling results: this may be specific to my workload, so it would be nice to gather some profiles from other clusters/workloads to narrow down the issue.
The most CPU-busy calls are related to sqlite and some file-read syscalls. |
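To corroborate this kind of profile without rebuilding anything, a syscall summary over a short window can show how much time goes into reads and fsyncs (strace adds noticeable overhead, so only sample briefly; commands are illustrative):

```sh
K3S_PID=$(pgrep -f 'k3s server' | head -n1)

# Count and time syscalls across all threads for 30 seconds,
# then print the summary table (interrupting with Ctrl-C also works).
sudo timeout 30 strace -c -f -p "$K3S_PID"
```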
Same issue here, it happened suddenly and persists after a few restarts. k3s version v1.24.4+k3s1 (c3f830e) |
I also notice this CPU consumption, between 14% and 50% with almost no traffic. I switched the apps of a custom-made NAS running on an Odroid from docker-compose to k3s. Plex was able to software-transcode some simple videos before; with k3s, that's not possible. I assume it's because the ~20% of CPU that k3s uses is really needed on this very humble hardware. It was an experiment anyway. I'll have to go back to docker-compose. |
Has this been resolved? I'm on a fresh install of 23.1.10.3 and am seeing some serious CPU load issues (around 80% on a 12-core machine). |
@Menethoran this is a year-old issue; please open a new issue and fill out the issue template. It would also be helpful to make clear whether you are talking about 80% of a core or 80% of all available CPU resources. What process specifically is consuming the CPU time? How many nodes are in your cluster, and what sort of workload are they running? |
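For anyone answering those questions on their own cluster, the usual starting points are (generic commands; kubectl top needs metrics-server, which k3s ships by default unless disabled):

```sh
# Top CPU consumers on the node
top -b -n 1 -o %CPU | head -n 15

# Per-node and per-pod usage as Kubernetes sees it
kubectl top nodes
kubectl top pods -A --sort-by=cpu
```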
Describe the bug
I'm not sure if it's a bug, but I don't think it's expected behaviour. Running k3s on any computer causes a very high load average. To give a concrete example, I'll describe the situation on my Raspberry Pi 3 node.
When running k3s, I have a load average of:
Without running it, but still having the containers up, I have a load average of:
To Reproduce
I just run it without any special arguments, just how it is installed by the sh installer.
Expected behavior
The load average should be under 1.
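For completeness, the standard installer invocation referred to in the reproduction step is the documented one-liner (shown here for convenience; pin a channel or version as appropriate):

```sh
curl -sfL https://get.k3s.io | sh -
```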