
Rook/Ceph on k3s on Fedora CoreOS is extremely slow #329

Closed
stellirin opened this issue Dec 15, 2019 · 22 comments
@stellirin

I use Fedora CoreOS to set up a 3-node k3s cluster on local hardware (3x Core i5, 12 GB RAM). I boot each node manually via PXE and the k3s cluster eventually configures itself as expected.

Each node has a second HDD so I want to set up a distributed file system with Rook/Ceph. I basically follow the default installation process (example instructions on medium.com).

I am able to bring up the Ceph cluster but provisioning is extremely slow. Most instructions suggest this should not take more than a few minutes, but for me on Fedora CoreOS it takes over 6 hours.

I have an issue open at rook/rook#4470 but I suspect it is something about Fedora CoreOS that causes this behaviour. Many commands executed by the Ceph processes hang for approximately 7 minutes, with top showing a process stuck at 100% CPU on a single core.

In the logs I see even trivial commands like ln -s get stuck:

2019-12-11 01:08:15.851205 I | Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b /dev/sdb
2019-12-11 01:15:41.436758 I |  stdout: Physical volume "/dev/sdb" successfully created.
2019-12-11 01:15:41.484904 I |  stdout: Volume group "ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b" successfully created
2019-12-11 01:23:07.163973 I | Running command: /usr/sbin/lvcreate --yes -l 298 -n osd-data-8eadc725-d865-4be2-8ac7-25be2fca23ad ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b
2019-12-11 01:30:32.676035 I |  stdout: Logical volume "osd-data-8eadc725-d865-4be2-8ac7-25be2fca23ad" created.
2019-12-11 02:22:12.183093 I | Running command: /bin/ceph-authtool --gen-print-key
2019-12-11 02:29:37.255539 I | Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new edd97953-8143-4a93-9ed1-ce86e922bc5b
2019-12-11 04:33:08.015851 I | Running command: /bin/ceph-authtool --gen-print-key
2019-12-11 04:40:34.765191 I | Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
2019-12-11 04:47:59.120653 I | Running command: /bin/chown -h ceph:ceph /dev/ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b/osd-data-8eadc725-d865-4be2-8ac7-25be2fca23ad
2019-12-11 04:55:32.812044 I | Running command: /bin/chown -R ceph:ceph /dev/mapper/ceph--36ac78de--a0dd--47e7--b63f--1bcd937c500b-osd--data--8eadc725--d865--4be2--8ac7--25be2fca23ad
2019-12-11 05:02:57.915667 I | Running command: /bin/ln -s /dev/ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b/osd-data-8eadc725-d865-4be2-8ac7-25be2fca23ad /var/lib/ceph/osd/ceph-2/block
2019-12-11 05:10:05.876488 I | Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-2/activate.monmap

Unfortunately I'm new to Ceph so debugging this is slow, and I don't see this behaviour described anywhere else.

There are a lot of moving parts here, but any insight from the Fedora CoreOS side that may help explain and debug this behaviour will be greatly appreciated.

ignition.yaml

variant: fcos
version: 1.0.0

passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - "ssh-rsa ..."

storage:

  disks:
    - device: /dev/sdb
      wipe_table: true

  files:
    - path: "/usr/local/bin/k3s"
      contents:
        source: "https://github.com/rancher/k3s/releases/download/v1.0.0/k3s"
        verification:
          hash: sha512-bbea8cd86a5c660e646b002a013593d25d9c20bb1194f2d9bd4ecaa60eb9d43bda368e313c5f09df6c6d5d2602e3235bb366336c1a2fd57f3042f05ff09f3de4
      mode: 0755
      user:
        name: root
      group:
        name: root
    - path: "/usr/local/etc/k3s/token"
      contents:
        # uuidgen | base64 -w 0
        source: "data:text/plain;charset=utf-8;base64,NWE5NzAzYzktOTcyOC00NjAyLWE0ZmYtNGRjY2Q0NDIwOGNkCg=="
      mode: 0400
      user:
        name: root
      group:
        name: root

systemd:
  units:

    - name: k3s.service
      enabled: true
      contents: |
        [Unit]
        Description=Lightweight Kubernetes
        Documentation=https://k3s.io
        After=network-online.target

        [Install]
        WantedBy=multi-user.target

        [Service]
        Type=notify
        KillMode=process
        Delegate=yes
        LimitNOFILE=infinity
        LimitNPROC=infinity
        LimitCORE=infinity
        TasksMax=infinity
        TimeoutStartSec=0
        Restart=always
        RestartSec=5s
        ExecStartPre=-/sbin/modprobe br_netfilter
        ExecStartPre=-/sbin/modprobe overlay
        ExecStart=/usr/local/bin/k3s server \
          --cluster-domain example.com \
          --token-file /usr/local/etc/k3s/token
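
For context, a config like the one above is transpiled into the Ignition JSON that the machine actually consumes. A minimal sketch, assuming the Fedora CoreOS Config Transpiler (fcct, since renamed Butane) reading from stdin; the exact invocation varies between fcct versions:

fcct --pretty --strict < ignition.yaml > ignition.ign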

rpm-ostree status

State: idle
AutomaticUpdates: disabled
Deployments:
● ostree://fedora:fedora/x86_64/coreos/testing
                   Version: 31.20191211.1 (2019-12-11T17:33:17Z)
                    Commit: f42843346587ae8cb8a022b5d7364864057fd2f6b54ca0b1dbf6f35faac63e0a
              GPGSignature: Valid signature by 7D22D5867F2A4236474BF7B850CB390B3C3359C4

@lucab
Contributor

lucab commented Dec 16, 2019

Thanks for the report.
It looks like those commands may be hanging for a long time, waiting on I/O. I've no specific Ceph knowledge, but those are usually symptoms of misconfigurations or networking troubles.

I would suggest checking the logs to see if you are hitting any timeouts.
Also, double-check network traffic: if Ceph traffic is running on top of the overlay network, there could be a bottleneck there. Silly MTU mismatches, DNS fallbacks, and misrouted multicast are common mistakes.
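
As a starting point, a few generic checks along those lines (a sketch, not Ceph-specific; interface names and addresses are examples taken from this thread):

# inspect MTU on the physical NIC and on the overlay/CNI interface
ip -d link show eth0
ip -d link show flannel.1

# verify that full-size packets pass between nodes without fragmentation
ping -M do -s 1472 192.168.16.102

# check for slow or failing DNS lookups from a node
time getent hosts k3s-server-0.example.net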

@lucab
Contributor

lucab commented Dec 16, 2019

Additionally, is this using Ceph via the in-kernel driver, or is it purely containerized userspace?

@stellirin
Author

Rook does claim it requires a modern kernel with the rbd module. However, lsmod shows that neither the rbd module nor even the ceph module has been loaded so far.

I added both modules to /etc/modules-load.d anyway and started a new cluster, but it hasn't had any effect so far; it shows the same behaviour (it looks like the initial provisioning will again take 6 hours).
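
For reference, a minimal sketch of loading those modules persistently (the file name is arbitrary):

# persist the modules across reboots
printf 'rbd\nceph\n' | sudo tee /etc/modules-load.d/ceph.conf

# load them immediately without rebooting
sudo modprobe rbd
sudo modprobe ceph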

How do you recommend debugging I/O on FCOS? Right now it doesn't have tools like iotop or iftop.

@stellirin
Author

I found rpm-ostree install and was able to get iftop and iotop installed. With both tools I don't see any significant load.
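
For anyone else hitting this, a sketch of the package layering step (on FCOS of this era the layered packages only become available after a reboot):

sudo rpm-ostree install iotop iftop
sudo systemctl reboot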

@jan--f

jan--f commented Dec 17, 2019

The issues seen here are unlikely to be related to Ceph itself, as this is the preparation procedure that runs before a new Ceph component is initialized. The log above is from a tool called ceph-volume, a Python script that sets up LVM volumes for the OSD (a Ceph daemon) to use.
So this could be related to LVM usage (though this seems to work in other k8s environments), or it could be caused by running a Python script.

@stellirin
Author

I have eliminated k3s, Rook, and Ceph as the cause of the issue. I am able to use the same Ignition config with Red Hat CoreOS (with trivial adjustments for the older Ignition spec) and Ceph starts a cluster quickly.

NAME                       STATUS   ROLES    AGE     VERSION         INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                   KERNEL-VERSION                CONTAINER-RUNTIME
k3s-server-0.example.net   Ready    master   7m20s   v1.16.3-k3s.2   192.168.16.101   <none>        Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa)   4.18.0-80.11.2.el8_0.x86_64   containerd://1.3.0-k3s.4
k3s-agent-0.example.net    Ready    worker   6m47s   v1.16.3-k3s.2   192.168.16.102   <none>        Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa)   4.18.0-80.11.2.el8_0.x86_64   containerd://1.3.0-k3s.4
k3s-agent-1.example.net    Ready    worker   5m47s   v1.16.3-k3s.2   192.168.16.103   <none>        Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa)   4.18.0-80.11.2.el8_0.x86_64   containerd://1.3.0-k3s.4

versus:

NAME                       STATUS   ROLES    AGE     VERSION         INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                KERNEL-VERSION           CONTAINER-RUNTIME
k3s-server-0.example.net   Ready    master   10m     v1.16.3-k3s.2   192.168.16.101   <none>        Fedora CoreOS preview 31.20191217.2.0   5.3.16-300.fc31.x86_64   containerd://1.3.0-k3s.4
k3s-agent-0.example.net    Ready    worker   6m38s   v1.16.3-k3s.2   192.168.16.102   <none>        Fedora CoreOS preview 31.20191217.2.0   5.3.16-300.fc31.x86_64   containerd://1.3.0-k3s.4
k3s-agent-1.example.net    Ready    worker   6m5s    v1.16.3-k3s.2   192.168.16.103   <none>        Fedora CoreOS preview 31.20191217.2.0   5.3.16-300.fc31.x86_64   containerd://1.3.0-k3s.4

The same log snippet as above takes ~16 seconds on RHCOS versus ~4 hours (!) on Fedora CoreOS:

2020-01-02 11:20:04.738365 I | Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172 /dev/sdb
2020-01-02 11:20:05.178519 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will manage logical volume symlinks in device directory.
2020-01-02 11:20:05.178832 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will obtain device list by scanning device directory.
2020-01-02 11:20:05.452117 I |  stdout: Physical volume "/dev/sdb" successfully created.
2020-01-02 11:20:05.498573 I |  stdout: Volume group "ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172" successfully created
2020-01-02 11:20:05.986049 I | Running command: /usr/sbin/lvcreate --yes -l 298 -n osd-data-cd39837c-7c20-4ab3-a0db-613fedb56ca0 ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172
2020-01-02 11:20:06.428890 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will manage logical volume symlinks in device directory.
2020-01-02 11:20:06.429223 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will obtain device list by scanning device directory.
2020-01-02 11:20:06.443223 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, device-mapper library will manage device nodes in device directory.
2020-01-02 11:20:06.536848 I |  stdout: Logical volume "osd-data-cd39837c-7c20-4ab3-a0db-613fedb56ca0" created.
2020-01-02 11:20:09.782896 I | Running command: /bin/ceph-authtool --gen-print-key
2020-01-02 11:20:10.247722 I | Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 233b1cfa-9c90-4988-bae6-538c42ea58cd
2020-01-02 11:20:18.700221 I | Running command: /bin/ceph-authtool --gen-print-key
2020-01-02 11:20:19.165869 I | Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
2020-01-02 11:20:19.603050 I | Running command: /bin/chown -h ceph:ceph /dev/ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172/osd-data-cd39837c-7c20-4ab3-a0db-613fedb56ca0
2020-01-02 11:20:20.037883 I | Running command: /bin/chown -R ceph:ceph /dev/mapper/ceph--c51e73c4--71ce--48e7--b6f7--4c4338916172-osd--data--cd39837c--7c20--4ab3--a0db--613fedb56ca0
2020-01-02 11:20:20.458045 I | Running command: /bin/ln -s /dev/ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172/osd-data-cd39837c-7c20-4ab3-a0db-613fedb56ca0 /var/lib/ceph/osd/ceph-1/block
2020-01-02 11:20:20.895316 I | Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-1/activate.monmap

Could the log lines mentioning DM_DISABLE_UDEV point to something missing on the Fedora CoreOS side?

My Ignition files for both FCOS and RHCOS can be found at: https://github.com/stellirin/coreos-k3s-ceph

@mattrapa

mattrapa commented Mar 10, 2020

Did you manage to figure this out? I have encountered the same issue.

@tomaaron

We encountered the same problem and our workaround was to kill each OSD prepare pod, after which the OSDs came up. This is not really a permanent solution but it might help.
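
A rough sketch of that workaround, assuming the default rook-ceph namespace and the app=rook-ceph-osd-prepare label that Rook applies to those pods:

kubectl -n rook-ceph delete pod -l app=rook-ceph-osd-prepare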

@stellirin
Author

My 'solution' was to stay with the RedHat images instead of the Fedora images. So far I haven't had the opportunity to take another look (waiting six hours for each test run really sucks). I've put off upgrading my nodes because of this.

Anyway, this was an experiment for my home lab so I may just go with a different solution entirely if/when I need to rebuild the cluster.

@stellirin
Author

The behaviour of the latest Rook/Ceph is still the same on the latest Fedora CoreOS release, 31.20200223.3.0.

The latest release of RedHat CoreOS, 4.3.0, does not show the behaviour, so I used this when I rebuilt my cluster. I guess I probably won't revisit this topic again for some time as I want to have some stability for a few months.

@LorbusChris
Contributor

I'm wondering whether it would be the same behavior on standard Fedora vs. RHEL

@stellirin
Author

Due to hardware failure I need to rebuild my cluster already. Can you suggest some test scenarios I can try to narrow down the cause?

@mcsaucy

mcsaucy commented May 2, 2020

I was trying to emulate this with VMs and attached strace to an unhappy ceph-volume process. Not sure if it's a red herring, but there's a whole lot of:

...
close(779248286)                        = -1 EBADF (Bad file descriptor)
close(779248287)                        = -1 EBADF (Bad file descriptor)
close(779248288)                        = -1 EBADF (Bad file descriptor)
close(779248289)                        = -1 EBADF (Bad file descriptor)
...

@mcsaucy

mcsaucy commented May 2, 2020

BLUF:

[root@rook-ceph-osd-prepare-node0-mx2dn /]# ulimit -n
1073741816
[root@rook-ceph-osd-prepare-node0-mx2dn /]# time timeout 15 /usr/bin/python2.7 /usr/sbin/ceph-volume inventory --format json /dev/sdb || echo ":("

real    0m15.003s
user    0m0.062s
sys     0m0.030s
:(
[root@rook-ceph-osd-prepare-node0-mx2dn /]# ulimit -n 1024
[root@rook-ceph-osd-prepare-node0-mx2dn /]# time timeout 15 /usr/bin/python2.7 /usr/sbin/ceph-volume inventory --format json /dev/sdb || echo ":("
{"available": true, "lvs": [], "rejected_reasons": [], "sys_api": {"size": 5368709120.0, "scheduler_mode": "bfq", "rotational": "1", "vendor": "ATA", "human_readable_size": "5.00 GB", "sectors": 0, "sas_device_handle": "", "rev": "2.5+", "sas_address": "", "locked": 0, "sectorsize": "512", "removable": "0", "path": "/dev/sdb", "support_discard": "512", "model": "QEMU HARDDISK", "ro": "0", "nr_requests": "2", "partitions": {}}, "path": "/dev/sdb", "device_id": "QEMU_HARDDISK_QM00002"}

real    0m0.291s
user    0m0.148s
sys     0m0.077s
[root@rook-ceph-osd-prepare-node0-mx2dn /]#

Poking around some more, it looks like this Popen call in ceph-volume is causing the slowness. A minimal repro case is:

# this may hang for a long time depending upon how large `ulimit -n` is
python2 <<< "import subprocess; subprocess.Popen('true', close_fds=True)"

The behavior of close_fds in Python 2.7 seems to involve closing ~all fds up to ulimit -n, which translates to about a billion wasted close(2) calls (~30 minutes for me) per Popen:

[root@rook-ceph-osd-prepare-node0-mx2dn /]# ulimit -n
1073741816
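
To confirm the limit dependence, the same repro can be timed with a lowered soft limit in a subshell (a sketch; the slow case inherits the huge container default shown above):

# fast: lower the soft fd limit for just this subshell
time bash -c 'ulimit -n 1024; python2 -c "import subprocess; subprocess.Popen(\"true\", close_fds=True)"'

# slow: the same command without lowering the limit may spin for many minutes
time python2 -c 'import subprocess; subprocess.Popen("true", close_fds=True)'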

Looks like this was brought up in https://bugs.python.org/issue8052, but an strace of the test case suggests the fix didn't get backported to 2.7:

# many matches
strace -f python2 <<< "import subprocess; subprocess.Popen('true', close_fds=True)" |& grep EBADF
# zero matches
strace -f python3 <<< "import subprocess; subprocess.Popen('true', close_fds=True)" |& grep EBADF

If this is in fact the issue, I'd be interested in how RHCOS differs here. We should be using the same impacted python2.7 binary on each side, so I wonder if the container on RHCOS gets a different max FD limit.

@mcsaucy

mcsaucy commented May 2, 2020

Dug into this some more.

Taking the original Ignition file, I was able to fully mitigate this issue by changing k3s.service's LimitNOFILE from infinity to a smaller number (1048576, in my case). Provisioning time dropped from hours to ~10 minutes. Can you see if that works for you, @stellirin?
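
If editing the Ignition config and re-provisioning is not convenient, a systemd drop-in on an already-running node is a sketch of the same change (standard override path; a restart is required):

sudo mkdir -p /etc/systemd/system/k3s.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/k3s.service.d/override.conf
[Service]
LimitNOFILE=1048576
EOF
sudo systemctl daemon-reload
sudo systemctl restart k3s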

If so, then IMO there's nothing for FCOS (whose only crime is having high open fd limits) to do here. If I'm right, then the real issue is Python 2.7. Since that's EOL and Rook is still running ceph-volume with it, moving it to Python 3 would be a good call. If that's not feasible, a potential hack would be to set a stricter rlimit on the number of open files (performance is tolerable at 1 million) somewhere in Rook-land.

@dustymabe
Member

Wow @mcsaucy. Nice investigation!

@stellirin
Author

Thanks a lot @mcsaucy for investigating this! I really appreciate it. I should have guessed that this would be yet another issue caused by ulimits on a RedHat OS...

I rebuilt my cluster with LimitNOFILE=1048576 and the issue is fully mitigated for me on FCOS. Provisioning a new Rook/Ceph cluster takes the same amount of time as on RHCOS and k3OS.

A Google search for container LimitNOFILE actually brings up a few similar issues, with the result that upstream containerd and Docker both set the same limit, based on similar performance issues.

I'm also curious why this appears on FCOS but not RHCOS. One link suggests it is due to how the kernel is tuned, with sudo sysctl fs.nr_open giving fs.nr_open = 1073741816 on FCOS, which is what we see as ulimit -n inside the containers. I didn't have time to check the values on RHCOS.
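
For anyone checking their own nodes, the comparison is roughly (a sketch; the pod name is a placeholder):

# on the host
sudo sysctl fs.nr_open

# inside a pod scheduled on that node
kubectl exec <some-pod> -- sh -c 'ulimit -n'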

Ultimately, if the fix is good enough for the big container projects, it should be good enough for k3s.

@stellirin
Author

Fix is merged in upstream k3s.

erikwilson pushed a commit to erikwilson/k3s-ansible that referenced this issue May 11, 2020
When k3s is installed on an OS with default high ulimits, performance
issues can be observed. This was discovered on CoreOS where the default
value is 1073741816. Symptoms include very slow file operations; for example,
installing a Rook/Ceph cluster will take ~6 hours instead of ~10 minutes.

A google search for 'container LimitNOFILE' will show that most major
projects set this already, including the (unused) containerd systemd unit
found in this repository at /vendor/github.com/containerd/containerd/containerd.service

k3OS is not affected because the default there is already 1048576.

See description in coreos/fedora-coreos-tracker#329
@AndreKR

AndreKR commented Mar 18, 2021

@stellirin Is it possible you made a mistake in phrasing the comment? As I read this issue, having no limit (infinity) causes performance issues, right?

The comment currently says "Having non-zero limits causes performance issues" and then a non-zero limit is set, so until I found this issue I assumed I was supposed to set the limit to zero to get better performance.

@stellirin
Author

@AndreKR you're right, it does seem contradictory, but if I remember correctly I simply copied the comment from an upstream change on containerd or dockerd.

@AndreKR

AndreKR commented Mar 18, 2021

@stellirin Indeed I can find the same comment in multiple recommendations for Docker service files (1 2 3), but in all of them they actually seem to mean you should not set any limits, so exactly the opposite of what was found in this issue.

@travisby

I use a custom-built FCOS image for my own Kubernetes cluster (kubeadm, not k3s) and experienced this same issue.

Adding

$ cat /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitNOFILE=1048576

allowed rook-ceph to finish building successfully (all other times I would get an error about Ceph not responding).
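
As a follow-up, the drop-in only takes effect after reloading systemd and restarting containerd (and re-creating any pods that inherited the old limit):

sudo systemctl daemon-reload
sudo systemctl restart containerd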
