Rook/Ceph on k3s on Fedora CoreOS is extremely slow #329
Thanks for the report. I would suggest checking the logs to see if you are hitting any timeouts.
Additionally, is this using Ceph via the in-kernel driver, or is it purely containerized userspace?
Rook does claim it requires a modern kernel with the […] modules; I added both modules to […] anyway. How do you recommend debugging I/O on FCOS? Right now it doesn't have any tools like […].
I found […].
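For what it's worth, one way to get I/O debugging tools onto a minimal FCOS host without layering packages is to run them from a privileged container. This is only an illustrative sketch; the image, tool, and flags are my own choices, not something confirmed in this thread:

```
# Run iotop from a throwaway Fedora container against the host's processes.
sudo podman run --rm -it --privileged --pid=host --net=host \
    registry.fedoraproject.org/fedora:latest \
    bash -c 'dnf -y install iotop && iotop'
```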
The issues seen here are unlikely related to ceph, as this is the preparation procedure before a new ceph component is initialized. The log above is from a tool called ceph-volume, which is a python script that sets up LVM volumes for the OSD (a ceph daemon) to use.
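Since ceph-volume drives plain LVM under the hood, the standard LVM tools can show how far the preparation actually got on the data disk. A minimal check, assuming the default LVM-backed OSD layout described above:

```
# List block devices and any PV/VG/LV that ceph-volume has created so far.
lsblk
sudo pvs && sudo vgs && sudo lvs
```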
I have eliminated K3S, Rook, and Ceph as the cause of the issue. I am able to use the same ignition config with RedHat CoreOS (with trivial adjustments for the older ignition) and Ceph can start a cluster quickly.
Same log snippet as above: it takes ~16 seconds on RedHat versus 4 hours (!) on Fedora.
Could the log lines mentioning […] be relevant? My ignition files for both […]
Did you manage to figure this out? I have encountered the same issue.
We encountered the same problem and our solution was to kill each OSD prepare pod, after which the OSD eventually came up. This is not really a permanent solution but it might help.
My 'solution' was to stay with the RedHat images instead of the Fedora images. So far I haven't had the opportunity to take another look (waiting six hours for each test run really sucks). I've put off upgrading my nodes because of this. Anyway, this was an experiment for my home lab so I may just go with a different solution entirely if/when I need to rebuild the cluster.
The behaviour of the latest Rook/Ceph is still the same on the latest Fedora CoreOS release, […]. The latest release of RedHat CoreOS, […]
I'm wondering whether it would be the same behavior on standard Fedora vs. RHEL.
Due to hardware failure, I need to rebuild my cluster already. Can you suggest some test scenarios I can try to narrow down the cause?
I was trying to emulate this with VMs and attached strace to an unhappy osd-prepare pod.
BLUF:
Poking around some more, it looks like this Popen call in ceph-volume is causing the slowness. Minimal repro case is:

```
# this may hang for a long time depending upon how large `ulimit -n` is
python2 <<< "import subprocess; subprocess.Popen('true', close_fds=True)"
```

The behavior of `close_fds=True` scales with `ulimit -n`, which inside the prepare pod is enormous:

```
[root@rook-ceph-osd-prepare-node0-mx2dn /]# ulimit -n
1073741816
```

Looks like this was brought up in https://bugs.python.org/issue8052. Comparing Python 2 and Python 3 under strace:

```
# many matches
strace -f python2 <<< "import subprocess; subprocess.Popen('true', close_fds=True)" |& grep EBADF

# zero matches
strace -f python3 <<< "import subprocess; subprocess.Popen('true', close_fds=True)" |& grep EBADF
```

If this is in fact the issue, I'd be interested in how RHCOS differs here. We should be using the same impacted […]
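A quick way to corroborate this locally (a sketch, assuming `python2` and bash are available in the same container): lower the shell's fd limit before launching the repro and the hang disappears.

```
# Same repro as above, but with the soft fd limit lowered first.
# The value 1024 is arbitrary; with it, Popen returns immediately instead
# of working through ~1e9 candidate file descriptors.
bash -c 'ulimit -n 1024; time python2 <<< "import subprocess; subprocess.Popen(\"true\", close_fds=True)"'
```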
Dug into this some more. Taking the original ignition file, I was able to fully mitigate this issue by setting a lower open-file limit for the k3s service. If that confirms the `close_fds` theory above, then IMO there's nothing for FCOS (whose only crime is having high open fd limits) to do here. If I'm right, then the real issue is Python 2.7, which is EOL and which Rook is still running.
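For anyone wanting to apply the same kind of mitigation by hand, a workaround along these lines should work. The drop-in path and the 1048576 value are illustrative (1048576 is what most container projects and the eventual upstream fix use), not an exact copy of the ignition change described above:

```
# Cap the open-file limit for the k3s service with a systemd drop-in.
sudo mkdir -p /etc/systemd/system/k3s.service.d
sudo tee /etc/systemd/system/k3s.service.d/10-limitnofile.conf <<'EOF'
[Service]
LimitNOFILE=1048576
EOF
sudo systemctl daemon-reload
sudo systemctl restart k3s
```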
Wow @mcsaucy. Nice investigation!
Thanks a lot @mcsaucy for investigating this! I really appreciate it. I should have guessed that this would be yet another issue caused by ulimits on a RedHat OS... I rebuilt my cluster with a lower `LimitNOFILE` for k3s. A google search for 'container LimitNOFILE' shows that most major projects set this already. I'm also curious why this appears on FCOS but not RHCOS; one link suggests it is due to how the kernel is tuned, with […]. Ultimately, if the fix is good enough for the big container projects, it should be good enough for k3s.
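A quick way to compare the relevant knobs on an FCOS node versus an RHCOS node (assuming the difference really is in kernel/systemd defaults; the settings below are the usual suspects, not something confirmed in this thread):

```
# Kernel ceiling for per-process file descriptors, the limit this shell
# inherited, and what systemd applies to the k3s unit.
sysctl fs.nr_open
ulimit -n
systemctl show k3s -p LimitNOFILE
```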
Fix is merged in upstream k3s.
When k3s is installed on an OS with default high ulimits, performance issues can be observed. This was discovered on CoreOS where the default value is 1073741816. Symptoms include very slow file operations; for example, installing a Rook/Ceph cluster takes ~6 hours instead of ~10 minutes. A google search for 'container LimitNOFILE' will show that most major projects set this already, including the (unused) containerd systemd unit found in this repository at /vendor/github.com/containerd/containerd/containerd.service. k3OS is not affected because the default there is already 1048576. See description in coreos/fedora-coreos-tracker#329
@stellirin Is it possible you made a mistake in phrasing the comment? As I read this issue, having no limit (or an effectively unbounded one) is what causes the performance issues. The comment currently says "Having non-zero limits causes performance issues" and then a non-zero limit is set, so until I found this issue I assumed I was supposed to set the limit to zero to get better performance.
@AndreKR you're right, it does seem contradictory, but if I remember correctly I simply copied the comment from an upstream change on containerd or dockerd.
@stellirin Indeed, I can find the same comment in multiple recommendations for Docker service files (1 2 3), but in all of them they actually seem to mean you should not set any limits, so exactly the opposite of what was found in this issue.
I use a custom-built FCOS image for my own K8S (kubeadm, not k3s) and experienced this same issue. Adding a similar open-file limit override allowed rook-ceph to finish building successfully (all other times I would get an error about Ceph not responding).
I use Fedora CoreOS to set up a 3-node k3s cluster on local hardware (3x Core i5, 12 GB RAM). I boot each node manually via PXE and the k3s cluster eventually configures itself as expected.
Each node has a second HDD so I want to set up a distributed file system with Rook/Ceph. I basically follow the default installation process (example instructions on medium.com).
I am able to bring up the Ceph cluster but provisioning is extremely slow. Most instructions suggest this should not take more than a few minutes, but for me on Fedora CoreOS it takes over 6 hours.
I have an issue open at rook/rook#4470 but I suspect it is something about Fedora CoreOS that causes this behaviour. Many commands executed by the Ceph processes hang for approximately 7 minutes, with `top` showing a process stuck at 100% CPU on a single core. In the logs I see even trivial commands like `ln -s` get stuck. Unfortunately I'm new to Ceph so debugging this is slow, and I don't see this behaviour described anywhere else.
There are a lot of moving parts here, but any insight from the Fedora CoreOS side that may help explain and debug this behaviour will be greatly appreciated.
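As a generic first step when a process spins at 100% CPU like this (a sketch, not something from the original report), attaching strace to the PID shown by `top` reveals what it is busy doing; in this issue the culprit turned out to be an enormous run of close() calls from Python 2's close_fds handling, as discussed in the comments above.

```
# Attach to the spinning process and watch the syscall storm.
sudo strace -f -p <PID> -e trace=close
```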
ignition.yaml
rpm-ostree status