Pod failed to mount SparseLoopDevice Volume after node reboot #2726

Closed
thomasdanan opened this issue Aug 10, 2020 · 7 comments · Fixed by #2936
Labels: priority:medium (Medium priority issues, should only be postponed if no other option) · severity:minor (No impact, or minor impact, on live deployments) · topic:operations (Operations-related issues) · topic:storage (Issues related to storage) · topic:tooling (Everything related to tooling around MetalK8s, e.g. Vagrant or Terraform)


thomasdanan commented Aug 10, 2020

Component:

'containers', 'kubernetes', 'volumes'

What happened:

When provisioning a Volume using SparseLoopDevice mode, a pod consuming such a volume may fail to start after a node reboot, because it cannot mount the volume, as its path on disk may no longer exist:

Warning FailedMount 90s (x23 over 32m) kubelet, master-1 MountVolume.NewMounter initialization failed for volume "master-1-prometheus" : path "/dev/disk/by-uuid/183dd142-9c46-411f-b160-d99bbd03f6c9" does not exist

Workaround: restart salt-minion on every node where Volumes were created using SparseLoopDevice mode
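
For reference, one way to apply the workaround by hand (the service name and Salt target below are the usual defaults; adjust to the deployment):

# on each node hosting SparseLoopDevice volumes:
systemctl restart salt-minion

# or, from wherever the salt CLI is available, for all minions at once:
salt '*' service.restart salt-minion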

What was expected:

Since the Volume is persistent, its path on disk is expected to persist as well, so that the pod can mount it again after a reboot and retrieve the data stored on it.

Steps to reproduce

  • Deploy MetalK8s cluster
  • Provision a Volume using SparseLoopDevice mode, a PVC to bind it, and a pod to mount this PVC (see the example manifest after this list)
  • Wait for pod to start (and PVC to bind the volume's PV)
  • Reboot the node on which the volume was provisioned / pod was scheduled
  • Observe that the pod may stay in Pending state because it cannot find the device referenced by its PVC's volume
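
For illustration, a minimal sketch of the Volume from the second step; the apiVersion and field names are from the MetalK8s Volume CRD and should be double-checked against the documentation of the deployed version, and the node name, volume name and size are placeholders:

cat <<EOF | kubectl apply -f -
apiVersion: storage.metalk8s.scality.com/v1alpha1
kind: Volume
metadata:
  name: test-sparse-volume
spec:
  nodeName: master-1                    # node that will host the sparse file
  storageClassName: metalk8s-prometheus
  sparseLoopDevice:
    size: 10Gi                          # size of the backing sparse file
EOF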

Resolution proposal (optional):

thomasdanan added the topic:storage, topic:operations, priority:medium, severity:minor and topic:tooling labels on Aug 10, 2020
gdemonet changed the title from "POD failed to mount SparseLoopDevice Volume after node reboot" to "Pod failed to mount SparseLoopDevice Volume after node reboot" on Aug 10, 2020
slaperche-scality added a commit that referenced this issue Aug 11, 2020
The "Update pillar after volume provisioning" step should be outside of
the for loop because:
- its ID is not customized per-volume (duplicate ID)
- we only need to refresh once after deploying all the volumes

Closes: #2726
Closes: #2729
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Aug 11, 2020
The "Update pillar after volume provisioning" step should be outside of
the for loop because:
- its ID is not customized per-volume (duplicate ID)
- we only need to refresh once after deploying all the volumes

Refs: #2726
Closes: #2729
Signed-off-by: Sylvain Laperche <[email protected]>
TeddyAndrieux added a commit that referenced this issue Oct 29, 2020
In our snapshot upgrade state we start from an already installed machine
that has the issue with sparse loop devices, so add a workaround for
issue #2726

NicolasT commented Nov 12, 2020

This can be worked around by applying the Salt state responsible for creating the loop devices. This is supposed to happen at node reboot (using salt-minion's 'startup state execution' feature); however, it fails when salt-master is not (yet) available at the time the salt-minion service starts.

This is a nuisance, at least.

As discussed on Slack, I see two ways to 'fix' this:

  • Implement a CSI driver which provisions and handles sparse files as volumes (an interesting project, but quite a bit of work)
  • No longer make Salt responsible for setting up the loop devices; instead, rely on the system (systemd) to set these up and tear them down, e.g. at system boot.

Given the second approach, here's an outline of how this can be deployed. First, a unit template (e.g. /etc/systemd/system/[email protected]):

[Unit]
Description=Setup MetalK8s sparse loop device %I
RequiresMountsFor=/var/lib/metalk8s/storage/sparse

[Service]
Type=oneshot
ExecStart=/sbin/losetup --find --partscan "/var/lib/metalk8s/storage/sparse/%i"
ExecStop=/sbin/losetup -d "/dev/disk/by-uuid/%i"
TimeoutSec=60
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

(Note: this is slightly different from the current implementation in the module, where the --partscan option is only passed to 'block' volumes. However, this seems to work fine for sparse files which contain no partition table as well. May want to check with @slaperche-scality).

After creating the file, run systemctl daemon-reload (using the appropriate Salt state invocation).

  • Alter the Salt state responsible for setting up the volume on the system. This is currently mainly done in module/Python code. We may want to lift this a bit into 'state' land. In any case, it should

    • Create the sparse file
    • Format it (setting the UUID correctly, as already implemented)
    • systemctl enable --now metalk8s-sparse-volume@${UUID}.service (using the appropriate Salt state)

Similarly, removing the volume implies systemctl disable --now metalk8s-sparse-volume@${UUID}.service before removing the sparse file etc. (both sequences are sketched below).
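
Both sequences, sketched as the plain shell they would boil down to (a sketch only: it assumes ext4, reuses the UUID from the kubelet log above as a placeholder, and takes the sparse-file directory from the unit template; the real work would be driven through Salt states/modules):

UUID=183dd142-9c46-411f-b160-d99bbd03f6c9            # the Volume's UUID (placeholder)
SPARSE_DIR=/var/lib/metalk8s/storage/sparse

# volume creation
truncate --size 10G "${SPARSE_DIR}/${UUID}"          # create the sparse file
mkfs.ext4 -F -q -U "${UUID}" "${SPARSE_DIR}/${UUID}" # format it, forcing the filesystem UUID
systemctl enable --now "metalk8s-sparse-volume@${UUID}.service"

# volume removal
systemctl disable --now "metalk8s-sparse-volume@${UUID}.service"
rm -f "${SPARSE_DIR}/${UUID}"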

gdemonet self-assigned this on Nov 18, 2020
@slaperche-scality

(Note: this is slightly different from the current implementation in the module, where the --partscan option is only passed to 'block' volumes. However, this seems to work fine for sparse files which contain no partition table as well. May want to check with @slaperche-scality).

For the --partscan, it should be fine: I think in the earliest iterations of the first PR for the storage-operator I was using --partscan even though I was formatting the whole device (thus no partition table), but then we removed it because it wasn't needed.
Could try again to be sure, but it should work.
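
For reference, a quick throwaway check of this (whole file formatted, no partition table, attached with --partscan anyway):

% truncate -s 1G vol.img
% mkfs.ext4 -q -F vol.img
% losetup --find --partscan vol.img
% ls -l /dev/disk/by-uuid/ | grep loop   # the filesystem UUID symlink should still appear
% losetup -d /dev/loop0                  # adjust to whichever loop device was allocated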

@NicolasT

FWIW, I did not test what happens when you losetup -d /dev/disk/by-uuid/... when the backing sparse file has a partition table on it but no filesystem: in that case, /dev/disk/by-uuid/... may not exist, and we need to losetup -d ... some other path?

@slaperche-scality

For a formatted device, it's super simple: you can rely on /dev/disk/by-uuid/<volume-UUID> (for both SparseLoop and RawBlock, but we only care about SparseLoop here anyway).

For a raw (non-formatted) device, this gets trickier:

  • for SparseLoop it will be under /dev/disk/by-partuuid/<volume-UUID>
  • for RawBlockDevice (irrelevant here), either /dev/disk/by-partuuid/<volume-UUID> or /dev/disk/by-id/dm-uuid-LVM-…


NicolasT commented Nov 18, 2020

Thanks. Indeed this could imply we need two unit templates: one for 'file' volumes (with losetup -d /dev/disk/by-uuid/<volume-UUID>) and one for 'block' volumes. However, in that case, /dev/disk/by-partuuid/<volume-UUID> will be a symlink to a partition on a loop device (/dev/loopXpN), and losetup -d may not like that (it expects a plain /dev/loopX instead).

Alternatively, we write a script which can handle both 'file' and 'block' volumes, drop it in /usr/local/libexec and use it as the ExecStop entrypoint, which again requires only a single unit template (which seems kinda preferable?).

@slaperche-scality

losetup -d may not like that (it expects a plain /dev/loopX instead).

Yup, I can confirm that losetup -d doesn't like it: it wants the "root" loop device:

% truncate -s 10G foo
% sgdisk --largest-new 1 --partition-guid 1:7f561330-fb42-440f-83e9-e9f8aa31d2f4 --change-name 1:bar foo
Creating new GPT entries in memory.
Setting name!
partNum is 0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
% losetup -f --partscan foo
% losetup -l
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE         DIO LOG-SEC
/dev/loop0         0      0         0  0 /home/sylvain/foo   0     512
% ll /dev/disk/by-uuid/* | grep loop
% ll /dev/disk/by-partuuid/* | grep loop
lrwxrwxrwx 1 root root 13 18 nov.  14:20 /dev/disk/by-partuuid/7f561330-fb42-440f-83e9-e9f8aa31d2f4 -> ../../loop0p1
% losetup -d /dev/disk/by-partuuid/7f561330-fb42-440f-83e9-e9f8aa31d2f4
losetup: /dev/disk/by-partuuid/7f561330-fb42-440f-83e9-e9f8aa31d2f4: failed to use device: No such device
% losetup -d /dev/loop0

@NicolasT

Confirmed. In the ExecStop script, which is passed the volume UUID, we could:

  • Check whether /dev/disk/by-uuid/UUID exists. If so, losetup -d it
  • Check whether /dev/disk/by-partuuid/UUID exists. If not, abort
  • Retrieve the 'parent' device of the partition:
    • Either by running lsblk -no PKNAME /dev/disk/by-partuuid/UUID, or
    • By running something along the lines of basename $(realpath /sys/class/block/$(basename $(readlink /dev/disk/by-partuuid/UUID))/..)
      Then, pass /dev/PARENTNAME to losetup -d

Script design TBD of course; figuring out the device name 'read-only' first, then finally exec losetup -d $NAME at the end of the script, is likely 'better'; also add some logging and whatnot.
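
A rough sketch of such a script (the script name and install path are made up, and logging/error handling would need fleshing out):

#!/bin/bash
# /usr/local/libexec/metalk8s-sparse-volume-cleanup <volume-UUID>
# Detach the loop device backing a sparse volume, whether the volume is
# formatted (found under /dev/disk/by-uuid/) or raw with a partition table
# (found under /dev/disk/by-partuuid/).
set -eu

uuid=$1
device=""

if [ -e "/dev/disk/by-uuid/${uuid}" ]; then
    # Formatted volume: the symlink points at the loop device itself.
    device=$(realpath "/dev/disk/by-uuid/${uuid}")
elif [ -e "/dev/disk/by-partuuid/${uuid}" ]; then
    # Raw volume: the symlink points at a partition (/dev/loopXpN),
    # so resolve its parent loop device.
    parent=$(lsblk --noheadings --output PKNAME "/dev/disk/by-partuuid/${uuid}")
    device="/dev/${parent}"
else
    echo "No device found for volume ${uuid}, nothing to detach" >&2
    exit 0
fi

echo "Detaching ${device} (volume ${uuid})"
exec losetup --detach "${device}"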

gdemonet added a commit that referenced this issue Nov 25, 2020
The previous approach was relying entirely on Salt (through our custom
`metalk8s_volumes` execution/state module) to manage provisioning and
cleanup of loop devices (on sparse files).
The issues arise when (re)booting a node:
- Salt minion has to execute a startup state, which may come in when
kube-apiserver is not (yet) available
- `losetup --find` calls, with the version available for CentOS 7, are
not atomic, so we may lose some devices if we have too many (re-running
the state manually is the current workaround)

To fix these, we introduce two main changes:
- management of loop devices is delegated to systemd; using a unit
template file, we define a service per volume (passing the Volume's UUID
as a parameter), which will provision and attach the device on start,
and detach it on stop
- `losetup --find` invocations from these units are made sequential by
using an exclusive lockfile (this is not perfect, but avoids the need to
re-implement the command ourselves to include the fix from higher
versions of `losetup`)

Note that we cannot simply use `losetup --detach` in the unit template,
because sparse volumes may not always be discovered from the same path:
- either the volume is formatted, and we can find it in
/dev/disk/by-uuid/
- or it's kept raw, so we only have the UUID in the partition table, and
we can discover the device through /dev/disk/by-partuuid/

Fixes: #2726
gdemonet added a commit that referenced this issue Nov 25, 2020
We adjust the `lint-python` tox environment to:
- handle Python 2 and Python 3 files (we only run mypy on the latter)
- always run both pylint and mypy, and report exit codes for both in
case of failure

See: #2726
gdemonet added a commit that referenced this issue Nov 26, 2020
The previous approach was relying entirely on Salt (through our custom
`metalk8s_volumes` execution/state module) to manage provisioning and
cleanup of loop devices (on sparse files).
The issues arise when (re)booting a node:
- Salt minion has to execute a startup state, which may come in when
kube-apiserver is not (yet) available
- `losetup --find` calls, with the version available for CentOS 7, are
not atomic, so we may lose some devices if we have too many (re-running
the state manually is the current workaround)

To fix these, we introduce two main changes:
- management of loop devices is delegated to systemd; using a unit
template file, we define a service per volume (passing the Volume's UUID
as a parameter), which will provision and attach the device on start,
and detach it on stop
- `losetup --find` invocations from these units are made sequential by
using an exclusive lockfile (this is not perfect, but avoids the need to
re-implement the command ourselves to include the fix from losetup 2.25
and higher: see util-linux/util-linux@f7e21185)

Note that we cannot simply use `losetup --detach` in the unit template,
because sparse volumes may not always be discovered from the same path:
- either the volume is formatted, and we can find it in
/dev/disk/by-uuid/
- or it's kept raw, so we only have the UUID in the partition table, and
we can discover the device through /dev/disk/by-partuuid/

Fixes: #2726
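
For illustration, the ExecStart side with the exclusive lockfile could look roughly like this (the lock path is an assumption, not necessarily what the final implementation ships):

#!/bin/bash
# Attach the loop device for a sparse volume, serializing the non-atomic
# `losetup --find` (util-linux < 2.25) behind an exclusive lock.
set -eu

uuid=$1
backing_file="/var/lib/metalk8s/storage/sparse/${uuid}"

(
    flock --exclusive 9                          # one `losetup --find` at a time on this node
    losetup --find --partscan "${backing_file}"
) 9>/run/metalk8s-sparse-volume.lock
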
bert-e closed this as completed in 634da52 on Nov 27, 2020
gdemonet added a commit that referenced this issue Nov 27, 2020