Skip to content
This repository has been archived by the owner on May 6, 2020. It is now read-only.

packet: Mounts inside /mnt/ can't be used by pods #73

Open
rata opened this issue Sep 27, 2019 · 2 comments
Open

packet: Mounts inside /mnt/ can't be used by pods #73

rata opened this issue Sep 27, 2019 · 2 comments

Comments

@rata
Copy link
Contributor

rata commented Sep 27, 2019

Node local storage was working when it was mounted to /mnt. However, we later changed that to support mounting hdd and ssd disks on /mnt/node-local-sdd-storage and /mnt/node-local-hdd-storage.

However, when doing so (#72) the mounts were not exposed to the pod and the pod, therefore, was just using the installation OS. This PR that fixed the issue has more details: #72

The quick fix was to mount on /mnt again, but we should investigate to make sure we understand why mounts under that (eg. in /mnt/node-local-storage) are not exposed to the pods and can't use them.

The goal is to either understand why mounts don't propagate and is not possible to change, or do the needed changes for the mounts to be propagated.

However, if we find out mount's can't be propagated, we can consider a potentially very nasty approach (only if we really have to. We should see #74 before doing this too) that is the following:

We can consider expose in the rkt container, like now is being down with /mnt/, /mnt/node-local-storage, /mnt/node-ssd-local-storage/, /mnt/node-hdd-local-storage, etc.

However, we can only use as volumes for rkt directories that do exist[1], so we should either select in the template based on the different setup_* flags (this is very messy), or we should always create all directories and rkt just exposes all (this is more elegant, as that can all be part of the systemd unit as with the rest of the ExecStartPre=/bin/mkdir .. we are already doing).

[1]: I tried replacing /mnt by /mnt/node-local-storage in the kubelet systemd unit, while that directory didn't exist, and the systemd unit failed to start with: Sep 27 11:03:25 rata-test-backend-worker-0 kubelet-wrapper[5459]: run: stat of host path /mnt/node-local-storage: stat /mnt/node-local-storage: no such file or directory

rata added a commit that referenced this issue Sep 27, 2019
The rkt pod is exposed with /mnt/ volume[1] and in this PR:

	https://github.com/kinvolk/lokomotive-kubernetes/pull/48/files#diff-b6c7caf796cd86bdcdd936319b1793a1R152

the location where the RAID 0 array is mounted was changed from `/mnt`
to `/mnt/<some-dir>`. This breaks using local volumes in pods, as it
seems the mount is not visible for the rkt container or the kubelet
running in rkt.

The investigation of the root cause of this issue, to understand why
mounts inside mnt (like in `/mnt/node-local-storage`) can't be used by
pods currently and what needs to be changed for that, is left as a for
future patches and issue #73 was created to investigate this.

This patch just makes the mount location back to `/mnt` for all the cases
(`setup_raid`, `setup_raid_*`) so it works on all cases. However, this
imposes one limitation: `setup_raid_hdd` and `setup_raid_ssd` are mutually
exclusive now.

This limitation is not limiting something that was working in master, in
fact `setup_raid_hdd` and `setup_raid_ssd` (when `setup_raid_ssd_fs` is set)
were completely broken since they were merged and pods never wrote to
those volumes, for the very same reason issue #73 states: mounts insde
`/mnt` are not visible to pods.

Therefore, this patches fixes node local storage while making those
options exclusive (only one can be set), and this is not a big problem
as those options never really worked.

Some alternative fixes were considered, like changing the path that is
exposed to rkt container to be /mnt/node-local-storage or
/mnt/node-hdd-local-storage, according to what option was used
(setup_raid, setup_raid_hdd, and exposing both if both are set), but
that was messy without any good reason (it is better to tackle #73
before doing something that ofuscated). So, we decided for this, more
simpler, approach.

This patch is just a minimal fix, "a revert" to mount in `/mnt/` again,
to make this work again on master.

As a side effect of this PR, too, another issue to reconsider if we need
so many `setup_raid_*` flags was created
(#74) and to even
consider a totally different approach than the current bash script
before it gets out of control: #75

[1]: https://github.com/kinvolk/lokomotive-kubernetes/blob/d59d071a451f45ac61c2524b94a146a6cde60401/packet/flatcar-linux/kubernetes/workers/cl/worker.yaml.tmpl#L65-L66
@rata
Copy link
Contributor Author

rata commented Sep 27, 2019

The second approach, i.e. the one that if possible I'd like to avoid, that exposes and mounts all the paths to the rkt container is implemented here: https://github.com/kinvolk/lokomotive-kubernetes/tree/rata/node-local-storage-inside-mnt. In particular, here: ba80265

rata added a commit that referenced this issue Sep 27, 2019
The rkt pod is exposed with /mnt/ volume[1] and in this PR:

	https://github.com/kinvolk/lokomotive-kubernetes/pull/48/files#diff-b6c7caf796cd86bdcdd936319b1793a1R152

the location where the RAID 0 array is mounted was changed from `/mnt`
to `/mnt/<some-dir>`. This breaks using local volumes in pods, as it
seems the mount is not visible for the rkt container or the kubelet
running in rkt.

The investigation of the root cause of this issue, to understand why
mounts inside mnt (like in `/mnt/node-local-storage`) can't be used by
pods currently and what needs to be changed for that, is left as a for
future patches and issue #73 was created to investigate this.

This patch just makes the mount location back to `/mnt` for all the cases
(`setup_raid`, `setup_raid_*`) so it works on all cases. However, this
imposes one limitation: `setup_raid_hdd` and `setup_raid_ssd` are mutually
exclusive now.

This limitation is not limiting something that was working in master, in
fact `setup_raid_hdd` and `setup_raid_ssd` (when `setup_raid_ssd_fs` is set)
were completely broken since they were merged and pods never wrote to
those volumes, for the very same reason issue #73 states: mounts insde
`/mnt` are not visible to pods.

Therefore, this patches fixes node local storage while making those
options exclusive (only one can be set), and this is not a big problem
as those options never really worked.

Some alternative fixes were considered, like changing the path that is
exposed to rkt container to be /mnt/node-local-storage or
/mnt/node-hdd-local-storage, according to what option was used
(setup_raid, setup_raid_hdd, and exposing both if both are set), but
that was messy without any good reason (it is better to tackle #73
before doing something that ofuscated). So, we decided for this, more
simpler, approach.

This patch is just a minimal fix, "a revert" to mount in `/mnt/` again,
to make this work again on master.

As a side effect of this PR, too, another issue to reconsider if we need
so many `setup_raid_*` flags was created
(#74) and to even
consider a totally different approach than the current bash script
before it gets out of control: #75

[1]: https://github.com/kinvolk/lokomotive-kubernetes/blob/d59d071a451f45ac61c2524b94a146a6cde60401/packet/flatcar-linux/kubernetes/workers/cl/worker.yaml.tmpl#L65-L66
@rata
Copy link
Contributor Author

rata commented Oct 15, 2019

From https://coreos.com/rkt/docs/latest/running-fly-stage1.html

The fly stage1 makes use of Linux mount propagation modes. If a volume source path is a mountpoint on the host, this mountpoint is made recursively shared before the host path is mounted on the target path in the container (...) User-provided volumes are not mounted recursively. This is a safety measure to prevent system crashes when multiple containers are started that mount / into the container.

Therefore, the behaviour is indeed deliberate and we should double check docs or consider the patch I posted in the previous comment (#73 (comment))

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant