Pod failed to mount SparseLoopDevice Volume after node reboot #2726

Closed
thomasdanan opened this issue Aug 10, 2020 · 7 comments · Fixed by #2936
Labels: priority:medium (Medium priority issues, should only be postponed if no other option) · severity:minor (No impact, or minor impact, on live deployments) · topic:operations (Operations-related issues) · topic:storage (Issues related to storage) · topic:tooling (Everything related to tooling around MetalK8s, e.g. Vagrant or Terraform)


thomasdanan commented Aug 10, 2020

Component:

'containers', 'kubernetes', 'volumes'

What happened:

When provisioning a Volume using SparseLoopDevice mode, a pod consuming such a volume may fail to start after a node reboot, because it cannot mount the volume, as its path on disk may no longer exist:

Warning FailedMount 90s (x23 over 32m) kubelet, master-1 MountVolume.NewMounter initialization failed for volume "master-1-prometheus" : path "/dev/disk/by-uuid/183dd142-9c46-411f-b160-d99bbd03f6c9" does not exist

Workaround: restart salt-minion on every node where Volumes were created using SparseLoopDevice mode
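
For reference, one way to apply the workaround by hand (the service name and Salt target below are the usual defaults; adjust to the deployment):

# on each node hosting SparseLoopDevice volumes:
systemctl restart salt-minion

# or, from wherever the salt CLI is available, for all minions at once:
salt '*' service.restart salt-minion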

What was expected:

Since the Volume is persistent, its path on disk is expected to persist as well, so that the pod can mount it again after a reboot and retrieve the data stored on it.

Steps to reproduce

  • Deploy MetalK8s cluster
  • Provision a Volume using SparseLoopDevice mode, a PVC to bind it, and a pod to mount this PVC (see the example manifest after this list)
  • Wait for pod to start (and PVC to bind the volume's PV)
  • Reboot the node on which the volume was provisioned / pod was scheduled
  • Observe that the pod may stay in Pending state because it cannot find the device referenced by its PVC's volume
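
For illustration, a minimal sketch of the Volume from the second step; the apiVersion and field names are from the MetalK8s Volume CRD and should be double-checked against the documentation of the deployed version, and the node name, volume name and size are placeholders:

cat <<EOF | kubectl apply -f -
apiVersion: storage.metalk8s.scality.com/v1alpha1
kind: Volume
metadata:
  name: test-sparse-volume
spec:
  nodeName: master-1                    # node that will host the sparse file
  storageClassName: metalk8s-prometheus
  sparseLoopDevice:
    size: 10Gi                          # size of the backing sparse file
EOF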

Resolution proposal (optional):

thomasdanan added the topic:storage, topic:operations, priority:medium, severity:minor and topic:tooling labels on Aug 10, 2020
gdemonet changed the title from "POD failed to mount SparseLoopDevice Volume after node reboot" to "Pod failed to mount SparseLoopDevice Volume after node reboot" on Aug 10, 2020
slaperche-scality added a commit that referenced this issue Aug 11, 2020
The "Update pillar after volume provisioning" step should be outside of
the for loop because:
- its ID is not customized per-volume (duplicate ID)
- we only need to refresh once after deploying all the volumes

Closes: #2726
Closes: #2729
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Aug 11, 2020
The "Update pillar after volume provisioning" step should be outside of
the for loop because:
- its ID is not customized per-volume (duplicate ID)
- we only need to refresh once after deploying all the volumes

Refs: #2726
Closes: #2729
Signed-off-by: Sylvain Laperche <[email protected]>
TeddyAndrieux added a commit that referenced this issue Oct 29, 2020
In our snapshot upgrade state we start from an already installed machine
that has the issue with sparse loop devices, so add a workaround for
issue #2726

NicolasT commented Nov 12, 2020

This can be worked around by applying the Salt state responsible for creating the loop devices. This is supposed to happen at node reboot (using salt-minion's 'startup state execution' feature); however, it fails when salt-master is not (yet) available at the time the salt-minion service starts.

This is a nuisance, at least.

As discussed on Slack, I see two ways to 'fix' this:

  • Implement a CSI driver which provisions and handles sparse files as volumes (an interesting project, but quite a bit of work)
  • No longer make Salt responsible for setting up the loop devices; instead, rely on the system (systemd) to set these up and tear them down, e.g. at system boot.

Given the second approach, here's an outline of how this can be deployed. First, a unit template (e.g. /etc/systemd/system/[email protected]):

[Unit]
Description=Setup MetalK8s sparse loop device %I
RequiresMountsFor=/var/lib/metalk8s/storage/sparse

[Service]
Type=oneshot
ExecStart=/sbin/losetup --find --partscan "/var/lib/metalk8s/storage/sparse/%i"
ExecStop=/sbin/losetup -d "/dev/disk/by-uuid/%i"
TimeoutSec=60
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

(Note: this is slightly different from the current implementation in the module, where the --partscan option is only passed to 'block' volumes. However, this seems to work fine for sparse files which contain no partition table as well. May want to check with @slaperche-scality).

After creating the file, run systemctl daemon-reload (using the appropriate Salt state invocation).

  • Alter the Salt state responsible for setting up the volume on the system. This is currently mainly done in module/Python code. We may want to lift this a bit into 'state' land. In any case, it should

    • Create the sparse file
    • Format it (setting the UUID correctly, as already implemented)
    • systemctl enable --now metalk8s-sparse-volume@${UUID}.service (using the appropriate Salt state)

Similarly, removing the volume implies systemctl disable --now metalk8s-sparse-volume@${UUID}.service before removing the sparse file etc. (both sequences are sketched below).
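
Both sequences, sketched as the plain shell they would boil down to (a sketch only: it assumes ext4, reuses the UUID from the kubelet log above as a placeholder, and takes the sparse-file directory from the unit template; the real work would be driven through Salt states/modules):

UUID=183dd142-9c46-411f-b160-d99bbd03f6c9            # the Volume's UUID (placeholder)
SPARSE_DIR=/var/lib/metalk8s/storage/sparse

# volume creation
truncate --size 10G "${SPARSE_DIR}/${UUID}"          # create the sparse file
mkfs.ext4 -F -q -U "${UUID}" "${SPARSE_DIR}/${UUID}" # format it, forcing the filesystem UUID
systemctl enable --now "metalk8s-sparse-volume@${UUID}.service"

# volume removal
systemctl disable --now "metalk8s-sparse-volume@${UUID}.service"
rm -f "${SPARSE_DIR}/${UUID}"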

gdemonet self-assigned this on Nov 18, 2020
@slaperche-scality

(Note: this is slightly different from the current implementation in the module, where the --partscan option is only passed to 'block' volumes. However, this seems to work fine for sparse files which contain no partition table as well. May want to check with @slaperche-scality).

For the --partscan, it should be fine: I think in the earliest iterations of the first PR for the storage-operator I was using --partscan even though I was formatting the whole device (thus no partition table), but then we removed it because it wasn't needed.
Could try again to be sure, but it should work.
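
For reference, a quick throwaway check of this (whole file formatted, no partition table, attached with --partscan anyway):

% truncate -s 1G vol.img
% mkfs.ext4 -q -F vol.img
% losetup --find --partscan vol.img
% ls -l /dev/disk/by-uuid/ | grep loop   # the filesystem UUID symlink should still appear
% losetup -d /dev/loop0                  # adjust to whichever loop device was allocated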

@NicolasT

FWIW, I did not test what happens when you losetup -d /dev/disk/by-uuid/... when the backing sparse file has a partition table on it but no filesystem: in that case, /dev/disk/by-uuid/... may not exist, and we need to losetup -d ... some other path?

@slaperche-scality

For a formatted device, it's super simple: you can rely on /dev/disk/by-uuid/<volume-UUID> (for both SparseLoop and RawBlock, but we only care about SparseLoop here anyway).

For a raw (non-formatted) device, this gets trickier:

  • for SparseLoop it will be under /dev/disk/by-partuuid/<volume-UUID>
  • for RawBlockDevice (irrelevant here), either /dev/disk/by-partuuid/<volume-UUID> or /dev/disk/by-id/dm-uuid-LVM-…


NicolasT commented Nov 18, 2020

Thanks. Indeed this could imply we need two unit templates: one for 'file' volumes (with losetup -d /dev/disk/by-uuid/<volume-UUID>) and one for 'block' volumes. However, in that case, /dev/disk/by-partuuid/<volume-UUID> will be a symlink to a partition on a loop device (/dev/loopXpN), and losetup -d may not like that (it expects a plain /dev/loopX instead).

Alternatively, we write a script which can handle both 'file' and 'block' volumes, drop it in /usr/local/libexec and use it as the ExecStop entrypoint, which again requires only a single unit template (which seems kinda preferable?).

@slaperche-scality

losetup -d may not like that (it expects a plain /dev/loopX instead).

Yup, I can confirm that losetup -d doesn't like it: it wants the "root" loop device:

% truncate -s 10G foo
% sgdisk --largest-new 1 --partition-guid 1:7f561330-fb42-440f-83e9-e9f8aa31d2f4 --change-name 1:bar foo
Creating new GPT entries in memory.
Setting name!
partNum is 0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
% losetup -f --partscan foo
% losetup -l
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE         DIO LOG-SEC
/dev/loop0         0      0         0  0 /home/sylvain/foo   0     512
% ll /dev/disk/by-uuid/* | grep loop
% ll /dev/disk/by-partuuid/* | grep loop
lrwxrwxrwx 1 root root 13 18 nov.  14:20 /dev/disk/by-partuuid/7f561330-fb42-440f-83e9-e9f8aa31d2f4 -> ../../loop0p1
% losetup -d /dev/disk/by-partuuid/7f561330-fb42-440f-83e9-e9f8aa31d2f4
losetup: /dev/disk/by-partuuid/7f561330-fb42-440f-83e9-e9f8aa31d2f4: failed to use device: No such device
% losetup -d /dev/loop0

@NicolasT

Confirmed. In the ExecStop script, which is passed the volume UUID, we could:

  • Check whether /dev/disk/by-uuid/UUID exists. If so, losetup -d it
  • Check whether /dev/disk/by-partuuid/UUID exists. If not, abort
  • Retrieve the 'parent' device of the partition:
    • Either by running lsblk -no PKNAME /dev/disk/by-partuuid/UUID, or
    • By running something along the lines of basename $(realpath /sys/class/block/$(basename $(readlink /dev/disk/by-partuuid/UUID))/..)
      Then, pass /dev/PARENTNAME to losetup -d

Script design TBD of course; figuring out the device name 'read-only' first, then finally exec losetup -d $NAME at the end of the script, is likely 'better'; also add some logging and whatnot.
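
A rough sketch of such a script (the script name and install path are made up, and logging/error handling would need fleshing out):

#!/bin/bash
# /usr/local/libexec/metalk8s-sparse-volume-cleanup <volume-UUID>
# Detach the loop device backing a sparse volume, whether the volume is
# formatted (found under /dev/disk/by-uuid/) or raw with a partition table
# (found under /dev/disk/by-partuuid/).
set -eu

uuid=$1
device=""

if [ -e "/dev/disk/by-uuid/${uuid}" ]; then
    # Formatted volume: the symlink points at the loop device itself.
    device=$(realpath "/dev/disk/by-uuid/${uuid}")
elif [ -e "/dev/disk/by-partuuid/${uuid}" ]; then
    # Raw volume: the symlink points at a partition (/dev/loopXpN),
    # so resolve its parent loop device.
    parent=$(lsblk --noheadings --output PKNAME "/dev/disk/by-partuuid/${uuid}")
    device="/dev/${parent}"
else
    echo "No device found for volume ${uuid}, nothing to detach" >&2
    exit 0
fi

echo "Detaching ${device} (volume ${uuid})"
exec losetup --detach "${device}"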

gdemonet added a commit that referenced this issue Nov 25, 2020
The previous approach was relying entirely on Salt (through our custom
`metalk8s_volumes` execution/state module) to manage provisioning and
cleanup of loop devices (on sparse files).
The issues arise when (re)booting a node:
- Salt minion has to execute a startup state, which may come in when
kube-apiserver is not (yet) available
- `losetup --find` calls, with the version available for CentOS 7, are
not atomic, so we may lose some devices if we have too many (re-running
the state manually is the current workaround)

To fix these, we introduce two main changes:
- management of loop devices is delegated to systemd; using a unit
template file, we define a service per volume (passing the Volume's UUID
as a parameter), which will provision and attach the device on start,
and detach it on stop
- `losetup --find` invocations from these units are made sequential by
using an exclusive lockfile (this is not perfect, but avoids the need to
re-implement the command ourselves to include the fix from higher
versions of `losetup`)

Note that we cannot simply use `losetup --detach` in the unit template,
because sparse volumes may not always be discovered from the same path:
- either the volume is formatted, and we can find it in
/dev/disk/by-uuid/
- or it's kept raw, so we only have the UUID in the partition table, and
we can discover the device through /dev/disk/by-partuuid/

Fixes: #2726
gdemonet added a commit that referenced this issue Nov 25, 2020
We adjust the `lint-python` tox environment to:
- handle Python 2 and Python 3 files (we only run mypy on the latter)
- always run both pylint and mypy, and report exit codes for both in
case of failure

See: #2726
gdemonet added a commit that referenced this issue Nov 26, 2020
The previous approach was relying entirely on Salt (through our custom
`metalk8s_volumes` execution/state module) to manage provisioning and
cleanup of loop devices (on sparse files).
The issues arise when (re)booting a node:
- Salt minion has to execute a startup state, which may come in when
kube-apiserver is not (yet) available
- `losetup --find` calls, with the version available for CentOS 7, are
not atomic, so we may lose some devices if we have too many (re-running
the state manually is the current workaround)

To fix these, we introduce two main changes:
- management of loop devices is delegated to systemd; using a unit
template file, we define a service per volume (passing the Volume's UUID
as a parameter), which will provision and attach the device on start,
and detach it on stop
- `losetup --find` invocations from these units are made sequential by
using an exclusive lockfile (this is not perfect, but avoids the need to
re-implement the command ourselves to include the fix from losetup 2.25
and higher: see util-linux/util-linux@f7e21185)

Note that we cannot simply use `losetup --detach` in the unit template,
because sparse volumes may not always be discovered from the same path:
- either the volume is formatted, and we can find it in
/dev/disk/by-uuid/
- or it's kept raw, so we only have the UUID in the partition table, and
we can discover the device through /dev/disk/by-partuuid/

Fixes: #2726
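
For illustration, the ExecStart side with the exclusive lockfile could look roughly like this (the lock path is an assumption, not necessarily what the final implementation ships):

#!/bin/bash
# Attach the loop device for a sparse volume, serializing the non-atomic
# `losetup --find` (util-linux < 2.25) behind an exclusive lock.
set -eu

uuid=$1
backing_file="/var/lib/metalk8s/storage/sparse/${uuid}"

(
    flock --exclusive 9                          # one `losetup --find` at a time on this node
    losetup --find --partscan "${backing_file}"
) 9>/run/metalk8s-sparse-volume.lock
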
bert-e closed this as completed in 634da52 on Nov 27, 2020
gdemonet added a commit that referenced this issue Nov 27, 2020