WIP: transient /etc #2972

Closed
wants to merge 1 commit into from

@raballew raballew commented Aug 7, 2023

Working copy of #2970


err (EXIT_FAILURE, "failed to make writable /etc bind-mount at /sysroot.tmp/etc");
if (etc_transient)
{
/* Do we have a persistent overlayfs for /usr? If so, mount it now. */

This comment seems out of place: we're not handling /usr at the moment, nor is it persistent.

@alexlarsson
Member

I don't quite understand the handling of /etc here.

In the typical, current case, an ostree commit will contain /usr/etc, which during deploy will end up in $deploydir/usr/etc (as well as in the composefs image). During prepare_deployment_etc() we extract this into usr/etc, and then we copy it back to etc. Then, during merge_configuration_from(), we take diff usr/etc etc on the old deployment, and apply the result to the new etc.

Given the code in this MR, if transient etc is enabled, it always prefers the etc directory in the deploy dir over the usr/etc directory. This means that if you persist anything in `/sysroot/ostree/deploy/..../$previous_commit/etc`, then it will be merged into the new etc dir, which will be used as the lower dir in the overlayfs, and thus be persisted to the new boot.

Isn't the point of all this that with transient etc, we only ever make /etc an overlayfs with /usr/etc from the composefs and a tmpfs dir? That's what I think of when I hear transient, anyway.
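
To make the concern concrete, here is a rough sketch of the merge described above: local changes are whatever differs between usr/etc and etc in the old deployment, and they get carried into the new etc. This is a simplified illustration, not ostree's actual merge_configuration_from(); the function name merge_etc and the directory layout are hypothetical, and it only handles added or modified regular files (no deletions, symlinks, or xattrs).

```shell
# Hypothetical sketch: carry local /etc changes from an old deployment
# directory to a new one. Usage: merge_etc OLD_DEPLOY NEW_DEPLOY
merge_etc() {
    local old=$1 new=$2 rel
    # Start the new etc from the new defaults in usr/etc.
    mkdir -p "$new/etc"
    cp -a "$new/usr/etc/." "$new/etc/"
    # Anything in the old etc that is missing from, or differs from,
    # the old usr/etc counts as a local change and is copied over.
    (cd "$old/etc" && find . -type f) | while IFS= read -r rel; do
        if ! cmp -s "$old/usr/etc/$rel" "$old/etc/$rel" 2>/dev/null; then
            mkdir -p "$new/etc/$(dirname "$rel")"
            cp -a "$old/etc/$rel" "$new/etc/$rel"
        fi
    done
}
```

With this model, any file left in the previous deployment's etc ends up in the new etc, which is exactly the persistence that a transient /etc is supposed to avoid.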

@alexlarsson
Member

Actually, why are we using usr/etc as the lower anyway? Shouldn't we be using TMP_SYSROOT/usr/etc, which is the copy that is (possibly) on the composefs mount, rather than the one in the deploy dir?

@alexlarsson alexlarsson requested a review from cgwalters August 8, 2023 13:07
@alexlarsson
Member

Booted this and got:

[    5.063620] ostree-prepare-root[996]: Loading /etc/ostree/prepare-root.conf
[    5.066518] ostree-prepare-root[996]: sysroot.readonly configuration value: 1 (fs writable: 1)
[    5.068433] ostree-prepare-root[996]: composefs: Mounting with no digest or signature check
[    5.072483] ostree-prepare-root[996]: Resolved OSTree target to: /sysroot/ostree/deploy/fedora-coreos/deploy/3fe9ea20c2e61e182502cc32d34bb93214a81deacbfbd08375f7ae3f12df9e12.0
[    5.123628] ostree-prepare-root[996]: ostree-prepare-root: Failed to find etc: failed to stat etc: Success
[    4.943158] ostree-prepare-root[996]: ostree-prepare-root: Failed to find etc: failed to stat etc: Success
[    5.128123] ostree-prepare-root[996]: composefs: mounted successfully
[    5.136962] systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'.
[    5.142865] systemd[1]: Failed to start ostree-prepare-root.service - OSTree Prepare OS/.

@alexlarsson
Member

Fixing that I got:

[    5.222235] ostree-prepare-root[1005]: sysroot.readonly configuration value: 1 (fs writable: 1)
[    5.224991] ostree-prepare-root[1005]: composefs: Mounting with no digest or signature check
[    5.227061] ostree-prepare-root[1005]: Resolved OSTree target to: /sysroot/ostree/deploy/fedora-coreos/deploy/b0a3b28af0e50d195a87b4fcb32427f180eb837dbec83f91ffc217ab44cfb921.0
[    5.268330] overlayfs: empty lowerdir
[    5.269682] overlayfs: failed to resolve '/run/ostree/.private/etc-transient/upper': -2
[    5.273042] ostree-prepare-root[1005]: ostree-prepare-root: failed to mount transient etc overlayfs: No such file or directory
[    5.091384] ostree-prepare-root[1005]: ostree-prepare-root: failed to mount transient etc overlayfs: No such file or directory
[    5.277826] ostree-prepare-root[1005]: composefs: mounted successfully
[    5.282528] systemd[1]: ostree-prepare-root.service: Main process exited, code=exited, status=1/FAILURE
[    5.284302] systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'.
[    5.293295] systemd[1]: Failed to start ostree-prepare-root.service - OSTree Prepare OS/.

@raballew
Author

raballew commented Aug 8, 2023

> I don't quite understand the handling of /etc here.
>
> In the typical, current case, an ostree commit will contain /usr/etc, which during deploy will end up in $deploydir/usr/etc (as well as in the composefs image). During prepare_deployment_etc() we extract this into usr/etc, and then we copy it back to etc. Then, during merge_configuration_from(), we take diff usr/etc etc on the old deployment, and apply the result to the new etc.
>
> Given the code in this MR, if transient etc is enabled, it always prefers the etc directory in the deploy dir over the usr/etc directory. This means that if you persist anything in `/sysroot/ostree/deploy/..../$previous_commit/etc`, then it will be merged into the new etc dir, which will be used as the lower dir in the overlayfs, and thus be persisted to the new boot.
>
> Isn't the point of all this that with transient etc, we only ever make /etc an overlayfs with /usr/etc from the composefs and a tmpfs dir? That's what I think of when I hear transient, anyway.

I think the answer to this is partially given by #2970 (comment), which implies that there can be an empty /etc, which means find_etc will return /usr/etc (or NULL on error).

@alexlarsson
Member

> > Isn't the point of all this that with transient etc, we only ever make /etc an overlayfs with /usr/etc from the composefs and a tmpfs dir? That's what I think of when I hear transient, anyway.
>
> I think the answer to this is partially given by #2970 (comment), which implies that there can be an empty /etc, which means find_etc will return /usr/etc (or NULL on error).

I think that quote is just confused. We always create an /etc in the composefs image (as per comment #2958 (comment)), so there is no need for that commit at all from what I can see.

@raballew
Author

raballew commented Aug 8, 2023

If that's the case, let me drop the find_etc function and use TMP_SYSROOT/etc instead.

@raballew raballew force-pushed the prepare-root-transient-etc branch from db9fd18 to 905bf97 Compare August 8, 2023 15:51
@alexlarsson
Member

With this version (squashed back to one commit):
https://github.com/alexlarsson/ostree/commits/prepare-root-transient-etc

I get it to boot and start most services, including with selinux enabled.
However, there is one failure remaining:

× rpm-ostreed.service - rpm-ostree System Management Daemon
     Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: exit-code) since Tue 2023-08-08 17:37:46 UTC; 259ms ago
       Docs: man:rpm-ostree(1)
    Process: 1720 ExecStart=rpm-ostree start-daemon (code=exited, status=1/FAILURE)
   Main PID: 1720 (code=exited, status=1/FAILURE)
     Status: "error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found and in / sysroot, but bootloader entry not found"
        CPU: 25ms

Aug 08 17:37:46 cosa-devsh systemd[1]: Starting rpm-ostreed.service - rpm-ostree System Management Daemon...
Aug 08 17:37:46 cosa-devsh rpm-ostree[1720]: Reading config file '/etc/rpm-ostreed.conf'
Aug 08 17:37:46 cosa-devsh rpm-ostree[1720]: error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found and in / sysroot, but bootloader entry not found
Aug 08 17:37:46 cosa-devsh systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Aug 08 17:37:46 cosa-devsh systemd[1]: rpm-ostreed.service: Failed with result 'exit-code'.
Aug 08 17:37:46 cosa-devsh systemd[1]: Failed to start rpm-ostreed.service - rpm-ostree System Management Daemon.

Not sure what "Unexpected state: /run/ostree-booted found and in / sysroot, but bootloader entry not found" means. @cgwalters?

@raballew raballew force-pushed the prepare-root-transient-etc branch 4 times, most recently from b8293a1 to 2222aa9 Compare August 8, 2023 18:47
@raballew
Author

raballew commented Aug 8, 2023

@alexlarsson I am trying to reproduce your results, and I am only able to do so with the additional karg enforce=0; otherwise the following units fail:

dbus-broker.service
NetworkManager.service
rpm-ostreed.service
systemd-homed.service
dbus.socket

@alexlarsson
Member

alexlarsson commented Aug 9, 2023

@raballew Sorry, I can't get it to work; I must have forgotten I was in permissive mode.

I've done some research here, and here is what I think happens:

  • The selinux label of an overlayfs mount is stored as the security.selinux xattr on the upper directory of the overlayfs mount. In our case this is /run/ostree/.private/etc-transient/upper.

  • In the ostree-prepare-root service, we copy the real selinux label (xattr) from the sysroot usr/etc onto this upper dir.

  • At a later time systemd loads the selinux policy. At this point, all tmpfs inodes in memory get labeled with the associated label tmpfs_t. This overwrites the copy we did before.

  • During mount setup, systemd manually calls for relabeling of /run, which will recursively traverse /run (not crossing mounts) and relabel files according to the policy that was loaded from the sysroot. This will replace the tmpfs_t with var_run_t for the upper dir.

  • Now the /etc overlayfs mount root has the overwritten label from the upper dir. This causes various services to fail because of selinux policy.

A manual run of restorecon /etc will fix this. But /etc is not typically relabeled at boot, because it's generally part of a filesystem with persistent selinux labels.

I've tried to make the upper dir a separate mount to stop the var_run_t relabeling across devices, but that just results in getting tmpfs_t instead.

I also tried to mount the overlayfs with rootcontext=..., but that fails to mount at this point because selinux is not enabled.

I'm not sure there is anything else we can do other than expressing the expected labels for /run/ostree/.private/etc-transient/upper in the selinux policy, which would make systemd relabel get the right etc_t label.

@alexlarsson
Member

I see we have a service (ostree-remount) that runs early after switching to the sysroot before /etc is used. I think we can probably fix up /etc in that.

@alexlarsson
Member

> I see we have a service (ostree-remount) that runs early after switching to the sysroot before /etc is used. I think we can probably fix up /etc in that.

Unfortunately several AVCs we hit in /etc happen before this, so it won't work. I guess we really need an ostree selinux module.

@alexlarsson
Member

Or, we could perhaps just mount the transient /etc after selinux policy is loaded...

@alexlarsson
Member

So, I put this in /etc/selinux/targeted/contexts/files/file_contexts.local via an override:

/var/run/ostree/.private/etc-transient/upper	-d	system_u:object_r:etc_t:s0

After boot (permissive), the upper dir has the right label:

$ sudo ls -ldZ /var/run/ostree/.private/etc-transient/upper
drwxr-xr-x. 11 root root system_u:object_r:etc_t:s0 640 Aug  9 15:36 /var/run/ostree/.private/etc-transient/upper

Unfortunately /etc is still wrong:

$ ls -ldZ /etc
drwxr-xr-x. 1 root root system_u:object_r:tmpfs_t:s0 640 Aug  9 15:36 /etc

I guess the initial data is stored in the dcache?

@raballew
Author

@alexlarsson So, dropping the dcache with echo 1 > /proc/sys/vm/drop_caches should do the trick then, but it comes at the cost of a significant amount of I/O and CPU to recreate the dropped objects.

@alexlarsson
Member

alexlarsson commented Aug 10, 2023

@raballew I dunno if that is even enough. Some inodes are kept alive even over that if they are referenced elsewhere.

In fact, looking at this in more detail, it is even more complex.
After ostree-prepare-root runs, /sysroot/etc is now a tmpfs sourced from /run, and the selinux policy is not yet loaded.
For whatever reason, some files are changed in this directory in the initramfs before we switchroot to /sysroot and reexec systemd, which triggers the loading of the policy and the relabeling of /run.

Directly after a permissive boot we can see that even if we relabeled the upper dir after policy load, some files in the upper are still owned by var_run_t:

# ls -ldZ /run/ostree/.private/etc-transient/upper/
drwxr-xr-x. 11 root root system_u:object_r:etc_t:s0 640 Aug 10 10:04 /run/ostree/.private/etc-transient/upper/
# ls -lZ /run/ostree/.private/etc-transient/upper/
total 64
d---------. 2 root root system_u:object_r:etc_t:s0         40 Jan  1  1970 credstore
d---------. 2 root root system_u:object_r:etc_t:s0         40 Jan  1  1970 credstore.encrypted
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0    108 Aug 10 10:04 group
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0     79 Jan  1  1970 group-
-r--------. 1 root root system_u:object_r:var_run_t:s0    408 Aug 10 10:04 gshadow
-r--------. 1 root root system_u:object_r:var_run_t:s0    391 Jan  1  1970 gshadow-
drwxr-xr-x. 2 root root system_u:object_r:etc_t:s0        120 Aug 10 10:04 issue.d
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0    16243 Aug 10 10:04 ld.so.cache
-rw-r--r--. 1 root root system_u:object_r:locale_t:s0      81 Jan  1  1970 locale.conf
drwxr-xr-x. 3 root root system_u:object_r:etc_t:s0         60 Aug 10 10:04 lvm
-r--r--r--. 1 root root system_u:object_r:tmpfs_t:s0       33 Aug 10 10:04 machine-id
lrwxrwxrwx. 1 root root system_u:object_r:etc_t:s0         19 Jan  1  1970 mtab -> ../proc/self/mounts
lrwxrwxrwx. 1 root root system_u:object_r:etc_t:s0         21 Jan  1  1970 os-release -> ../usr/lib/os-release
drwxr-xr-x. 2 root root system_u:object_r:etc_t:s0         40 Jan  1  1970 pam.d
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0     93 Aug 10 10:04 passwd
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0     38 Jan  1  1970 passwd-
drwxr-xr-x. 2 root root system_u:object_r:var_run_t:s0     80 Aug 10 10:04 profile.d
lrwxrwxrwx. 1 root root system_u:object_r:net_conf_t:s0    39 Jan  1  1970 resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
-r--------. 1 root root system_u:object_r:var_run_t:s0    423 Aug 10 10:04 shadow
-r--------. 1 root root system_u:object_r:var_run_t:s0    397 Jan  1  1970 shadow-
drwxr-xr-x. 2 root root system_u:object_r:etc_t:s0        160 Aug 10 10:04 ssh
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0     18 Aug 10 10:04 subgid
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0      0 Jan  1  1970 subgid-
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0     18 Aug 10 10:04 subuid
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0      0 Jan  1  1970 subuid-
drwxr-xr-x. 4 root root system_u:object_r:var_run_t:s0     80 Aug 10 10:04 systemd
drwxr-xr-x. 2 root root system_u:object_r:etc_t:s0         60 Aug 10 10:04 udev

I think this means they were created by overlayfs due to writes to /sysroot/etc before the policy was loaded, and thus the default /run label was applied when labeling /run. That's not good; for example, shadow should be shadow_t.

This is probably some ignition thing fixing up the /etc labels, because the /usr/etc/shadow file is not shadow_t.

Additionally I see ld.so.cache and machine-id are even tmpfs_t instead of var_run_t. Were these not relabeled as var_run_t?

To make things even worse, various files are cached in the overlayfs mount before the relabel happens, which means that even if the upper dir files get relabeled, they are still tmpfs_t when viewed via the overlayfs /etc:

# ls -lZ /etc/ | grep tmpfs_t
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0       108 Aug 10 10:04 group
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        79 Jan  1  1970 group-
-r--------. 1 root root system_u:object_r:tmpfs_t:s0       408 Aug 10 10:04 gshadow
-r--------. 1 root root system_u:object_r:tmpfs_t:s0       391 Jan  1  1970 gshadow-
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0     16243 Aug 10 10:04 ld.so.cache
-r--r--r--. 1 root root system_u:object_r:tmpfs_t:s0        33 Aug 10 10:04 machine-id
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        93 Aug 10 10:04 passwd
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        38 Jan  1  1970 passwd-
drwxr-xr-x. 1 root root system_u:object_r:tmpfs_t:s0        80 Aug 10 10:04 profile.d
-r--------. 1 root root system_u:object_r:tmpfs_t:s0       423 Aug 10 10:04 shadow
-r--------. 1 root root system_u:object_r:tmpfs_t:s0       397 Jan  1  1970 shadow-
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        18 Aug 10 10:04 subgid
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0         0 Jan  1  1970 subgid-
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        18 Aug 10 10:04 subuid
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0         0 Jan  1  1970 subuid-
drwxr-xr-x. 1 root root system_u:object_r:tmpfs_t:s0        80 Aug 10 10:04 systemd

I guess what happened is that since these files were created before selinux policy was added they were in dcache, and the overlayfs inodes that correspond to the upper files were created before the /run relabel, so they aren't aware of the label change. (Really, we shouldn't change the files in the upper dir behind the back of overlayfs like this).

And indeed, after echo 3 > /proc/sys/vm/drop_caches the new labels are recovered:

# ls -lZ /etc/ | grep -v etc_t
total 1616
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0     108 Aug 10 10:04 group
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0      79 Jan  1  1970 group-
-r--------. 1 root root system_u:object_r:var_run_t:s0     408 Aug 10 10:04 gshadow
-r--------. 1 root root system_u:object_r:var_run_t:s0     391 Jan  1  1970 gshadow-
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0     16243 Aug 10 10:04 ld.so.cache
-rw-r--r--. 1 root root system_u:object_r:locale_t:s0       81 Jan  1  1970 locale.conf
-r--r--r--. 1 root root system_u:object_r:tmpfs_t:s0        33 Aug 10 10:04 machine-id
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0      93 Aug 10 10:04 passwd
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0      38 Jan  1  1970 passwd-
drwxr-xr-x. 1 root root system_u:object_r:var_run_t:s0      80 Aug 10 10:04 profile.d
lrwxrwxrwx. 1 root root system_u:object_r:net_conf_t:s0     39 Jan  1  1970 resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
-r--------. 1 root root system_u:object_r:var_run_t:s0     423 Aug 10 10:04 shadow
-r--------. 1 root root system_u:object_r:var_run_t:s0     397 Jan  1  1970 shadow-
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0      18 Aug 10 10:04 subgid
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0       0 Jan  1  1970 subgid-
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0      18 Aug 10 10:04 subuid
-rw-r--r--. 1 root root system_u:object_r:var_run_t:s0       0 Jan  1  1970 subuid-
drwxr-xr-x. 1 root root system_u:object_r:var_run_t:s0      80 Aug 10 10:04 systemd

I guess this is what happened to e.g. ld.so.cache. It was created after the relabel, but the root overlayfs had the pre-relabel tmpfs_t label in cache, so the newly created ld.so.cache file inherited that label.

We could avoid having the files created in etc be var_run_t by having the selinux policy specify etc_t also for files inside the upper dir. Then the files created before policy load would get etc_t. But we can't really fix the fact that the policy load "loses" the labels added on tmpfs files pre-policy-load (such as shadow_t), and we can't (easily) fix the fact that the relabel of upper doesn't affect the /etc overlayfs files.
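
The listings above are easier to compare when reduced to a count per SELinux type. A small helper along these lines could do that; the function name count_selinux_types is made up for illustration, and it assumes the `ls -lZ` column layout shown in this thread (context in the fifth column).

```shell
# Hypothetical helper: count entries per SELinux type from `ls -lZ`
# output on stdin. Assumes the security context is the fifth column,
# as in the listings in this thread.
count_selinux_types() {
    awk 'NF >= 5 && $5 ~ /:/ {
             split($5, ctx, ":")   # user:role:type:level
             n[ctx[3]]++
         }
         END { for (t in n) print n[t], t }' | sort -k2
}
```

For example, `ls -lZ /run/ostree/.private/etc-transient/upper/ | count_selinux_types` and `ls -lZ /etc | count_selinux_types` can be run before and after dropping caches to see which view still reports tmpfs_t.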

@alexlarsson
Member

Indeed, if I boot the system with this file_contexts.local:

/var/run/ostree/.private/etc-transient/upper(/.*)?		system_u:object_r:etc_t:s0

I get all etc_t for the files in upper that were var_run_t before:

# ls -lZ /run/ostree/.private/etc-transient/upper/
total 64
d---------. 2 root root system_u:object_r:etc_t:s0         40 Jan  1  1970 credstore
d---------. 2 root root system_u:object_r:etc_t:s0         40 Jan  1  1970 credstore.encrypted
-rw-r--r--. 1 root root system_u:object_r:etc_t:s0        108 Aug 10 11:33 group
-rw-r--r--. 1 root root system_u:object_r:etc_t:s0         79 Jan  1  1970 group-
-r--------. 1 root root system_u:object_r:etc_t:s0        408 Aug 10 11:33 gshadow
-r--------. 1 root root system_u:object_r:etc_t:s0        391 Jan  1  1970 gshadow-
drwxr-xr-x. 2 root root system_u:object_r:etc_t:s0        120 Aug 10 11:33 issue.d
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0    16243 Aug 10 11:33 ld.so.cache
-rw-r--r--. 1 root root system_u:object_r:locale_t:s0      81 Jan  1  1970 locale.conf
drwxr-xr-x. 3 root root system_u:object_r:etc_t:s0         60 Aug 10 11:33 lvm
-r--r--r--. 1 root root system_u:object_r:tmpfs_t:s0       33 Aug 10 11:33 machine-id
lrwxrwxrwx. 1 root root system_u:object_r:etc_t:s0         19 Jan  1  1970 mtab -> ../proc/self/mounts
lrwxrwxrwx. 1 root root system_u:object_r:etc_t:s0         21 Jan  1  1970 os-release -> ../usr/lib/os-release
drwxr-xr-x. 2 root root system_u:object_r:etc_t:s0         40 Jan  1  1970 pam.d
-rw-r--r--. 1 root root system_u:object_r:etc_t:s0         93 Aug 10 11:33 passwd
-rw-r--r--. 1 root root system_u:object_r:etc_t:s0         38 Jan  1  1970 passwd-
drwxr-xr-x. 2 root root system_u:object_r:etc_t:s0         80 Aug 10 11:33 profile.d
lrwxrwxrwx. 1 root root system_u:object_r:net_conf_t:s0    39 Jan  1  1970 resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
-r--------. 1 root root system_u:object_r:etc_t:s0        423 Aug 10 11:33 shadow
-r--------. 1 root root system_u:object_r:etc_t:s0        397 Jan  1  1970 shadow-
drwxr-xr-x. 2 root root system_u:object_r:etc_t:s0        160 Aug 10 11:33 ssh
-rw-r--r--. 1 root root system_u:object_r:etc_t:s0         18 Aug 10 11:33 subgid
-rw-r--r--. 1 root root system_u:object_r:etc_t:s0          0 Jan  1  1970 subgid-
-rw-r--r--. 1 root root system_u:object_r:etc_t:s0         18 Aug 10 11:33 subuid
-rw-r--r--. 1 root root system_u:object_r:etc_t:s0          0 Jan  1  1970 subuid-
drwxr-xr-x. 4 root root system_u:object_r:etc_t:s0         80 Aug 10 11:33 systemd
drwxr-xr-x. 2 root root system_u:object_r:etc_t:s0         60 Aug 10 11:33 udev

But they are still tmpfs_t in overlayfs (while still cached):

# ls -lZ /etc | grep -v etc_t
total 1616
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0       108 Aug 10 11:33 group
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        79 Jan  1  1970 group-
-r--------. 1 root root system_u:object_r:tmpfs_t:s0       408 Aug 10 11:33 gshadow
-r--------. 1 root root system_u:object_r:tmpfs_t:s0       391 Jan  1  1970 gshadow-
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0     16243 Aug 10 11:33 ld.so.cache
-rw-r--r--. 1 root root system_u:object_r:locale_t:s0       81 Jan  1  1970 locale.conf
-r--r--r--. 1 root root system_u:object_r:tmpfs_t:s0        33 Aug 10 11:33 machine-id
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        93 Aug 10 11:33 passwd
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        38 Jan  1  1970 passwd-
drwxr-xr-x. 1 root root system_u:object_r:tmpfs_t:s0        80 Aug 10 11:33 profile.d
lrwxrwxrwx. 1 root root system_u:object_r:net_conf_t:s0     39 Jan  1  1970 resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
-r--------. 1 root root system_u:object_r:tmpfs_t:s0       423 Aug 10 11:33 shadow
-r--------. 1 root root system_u:object_r:tmpfs_t:s0       397 Jan  1  1970 shadow-
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        18 Aug 10 11:33 subgid
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0         0 Jan  1  1970 subgid-
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0        18 Aug 10 11:33 subuid
-rw-r--r--. 1 root root system_u:object_r:tmpfs_t:s0         0 Jan  1  1970 subuid-
drwxr-xr-x. 1 root root system_u:object_r:tmpfs_t:s0        80 Aug 10 11:33 systemd
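
For reference, the difference between the two file_contexts.local entries tried so far is only in which paths the pattern covers: the first one (with -d) names just the upper directory itself, while the `(/.*)?` suffix also covers everything below it. The sketch below approximates that with grep, treating the pattern as a regex that must match the whole path. It is only an approximation: the real lookup is done by libselinux, and the helper name fc_matches is made up.

```shell
# Hypothetical helper: succeed if PATH fully matches REGEX, mimicking
# the whole-path matching of file_contexts entries.
# Usage: fc_matches REGEX PATH
fc_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

upper_only='/var/run/ostree/.private/etc-transient/upper'
upper_tree='/var/run/ostree/.private/etc-transient/upper(/.*)?'

# Report which pattern covers a file created inside the upper dir.
for re in "$upper_only" "$upper_tree"; do
    if fc_matches "$re" /var/run/ostree/.private/etc-transient/upper/shadow; then
        echo "covers upper/shadow: $re"
    fi
done
```

Only the second pattern covers upper/shadow, which is consistent with the files in upper switching from var_run_t to etc_t once that entry is used.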

@alexlarsson
Member

Ok, I finally figured out why this is happening:

ext4 (etc) uses xattrs (SECURITY_FS_USE_XATTR) to manage the selinux context. On initialization of an inode it reads the security.selinux xattr, parses it according to the policy, and then stores this (broken down) in the struct inode_security_struct for the inode. This struct is used to make security checks. It also hooks into setxattr to update this.

eg:

fs_use_xattr ext4 gen_context(system_u:object_r:fs_t,s0);

tmpfs uses a different mechanism, SECURITY_FS_USE_TRANS, where it never looks at any xattr. It just initializes the struct inode_security_struct with the default value, and then updates this directly when changing the label on a file. TRANS means "transition" and I guess this means it transitions from unlabeled to this, and then gets set to whatever you update it to when relabeling.

eg:

fs_use_trans tmpfs gen_context(system_u:object_r:tmpfs_t,s0);

@alexlarsson
Member

Hmm... So, there is another issue too. When ostree deploy creates the merged etc directory after it copies /usr/etc to /etc it does a relabel of etc, because the selinux labels are different in /usr/etc and /etc. This means that a plain overlayfs of /usr/etc will not work as /etc without doing a full relabel.

@alexlarsson
Member

Ugh, even more issues. When anonymous temporary files are created with overlayfs, they get created in the work dir, and that has a different selinux context, so I'm getting AVCs like:

avc:  denied  { create } for  pid=1 comm="systemd" name="#7" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=chr_file permissive=1
avc:  denied  { link } for  pid=1 comm="systemd" name="#7" dev="vda4" ino=125239 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=chr_file permissive=1
avc:  denied  { rename } for  pid=1 comm="systemd" name="#8" dev="vda4" ino=125239 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=chr_file permissive=1

Where the unlabeled_t is the work dir (I think).

@alexlarsson
Member

Also, even with this all worked around, I keep getting AVCs like this far after switchroot:

AVC avc:  denied  { relabelfrom } for  pid=1302 comm="systemd-hwdb" name="hwdb.bin" dev="vda4" ino=125254 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:systemd_hwdb_etc_t:s0 tclass=file permissive=1
AVC avc:  denied  { relabelto } for  pid=1302 comm="systemd-hwdb" name="hwdb.bin" dev="vda4" ino=125254 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:systemd_hwdb_etc_t:s0 tclass=file permissive=1

At this point things should not still be running as kernel_t, as that is the context in the initrd before the policy is loaded. systemd should be running as init_t.

I think the problem here is that the overlayfs mount happened before policy was loaded, and it saves the context from that point and uses it when writing things to the upper dir.
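
When going through many of these denials, it helps to reduce each one to the permission, the source type, and the target type, so that leftover kernel_t sources stand out. The following is a rough sketch that assumes the AVC line format shown in this thread; the function name summarize_avc is made up, and for real auditing a tool like ausearch is the better choice.

```shell
# Hypothetical helper: print "<permission> <source type> <target type>"
# for each AVC denial read on stdin. Assumes the log format shown above.
summarize_avc() {
    sed -nE 's/.*avc: +denied +\{ ([^}]*) \}.* scontext=[^:]*:[^:]*:([^:]*):[^ ]* tcontext=[^:]*:[^:]*:([^:]*):.*/\1 \2 \3/p'
}
```

For example, `journalctl -b | summarize_avc | sort | uniq -c` would group the denials; any line with kernel_t as the source type after switchroot points at the stale mount-time context.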

@alexlarsson
Member

Ok, so to summarize the problems I've found with using a transient etc
where the basic idea is to have ostree, in the initrd, set up /etc to be
an overlayfs mount of /usr/etc in the image backed on a transient
directory (for example in /run):

  • The selinux contexts of files in /usr/etc in the image are all etc_t,
    which is not right for when it is moved to /etc. Ostree deploy will
    relabel the files to the right thing when usr/etc is copied to etc.

    We could use the merged relabeled "etc" directory instead as the
    overlayfs lower. To do this with composefs we would have to make the
    composefs image contain a relabeled etc in it somewhere.

  • Overlayfs stores the calling context at mount time, including things
    like uid and selinux context. If a user is allowed (based on its
    uid/context) to do a file operation on the overlayfs file, the
    overlayfs filesystem applies the equivalent operation on the "upper"
    directory. At this point the overlayfs context is applied against
    the selinux context in the upper dir. This will always have the
    internal selinux context kernel_t, because the overlayfs was mounted
    in the initramfs before the policy was loaded.

    In practice, this means many operations fail against the /etc mount,
    because kernel_t is not very powerful.

  • On overlayfs, anonymous files (like tmpfiles) are created first in
    the work dir before being linked into the upper dir. The selinux
    context of this dir is unlabeled due to the early mount, which
    causes AVCs. And, even if it was labeled there may be problems, as
    whatever label was used may not have the permissions to do some
    things we need.

  • systemd relabels /run very early after re-execing into the sysroot
    and loading the selinux policy. If the overlayfs upper dir is in
    /run, then this will be relabeled as var_run_t. This can be worked
    around by having special selinux policy rules for the upper dir.

  • /run is on a tmpfs, and the selinux policy for tmpfs filesystems is
    fs_use_trans instead of the more typical fs_use_xattr. This means
    that the initial selinux context for files (i.e. at selinux policy
    load time, or later at file creation time) is set to the fs-wide
    default (tmpfs_t here). This means we can't affect the selinux
    context for any files in the upper dir that are created in the
    initrd. This includes the upper dir itself, the files created
    directly by ignition, and the directories indirectly created by
    ignition when overlayfs creates parent directories in the upper as
    needed for the directly created files.

    Ignition has code to "fix up" these direct labels, but the fixups
    are set on xattrs before policy load, so the fixups will be
    discarded.

Almost all these problems are related to things happening before
loading the selinux policy (i.e. in the initrd). I don't think it is
possible to work around all of these issues, and thus the only
workable approach is to mount the overlayfs after policy
load. However, systemd accesses (and writes to) /etc very early after
loading the policy, and there is no way to intercept before that to
mount the overlay.

So, the summary is that I don't think ostree can implement transient
/etc. But systemd can, and in fact there are transient rootfs options
in systemd which we need to research. The systemd.volatile=overlay
option seems to make everything a tmpfs+overlay, and we can maybe use
that.

Issues I foresee with this approach:

  • For the composefs case we would need to be able to expose a
    relabeled /etc, as we don't want to use the relabeled merged
    deploy/etc (since that is persistent and unverified).

  • It is impossible for ignition to modify this etc dir in the initrd.
    Any such changes would have to be made in the image itself, or
    possibly using bind mount tricks.

@alexlarsson
Member

alexlarsson commented Aug 11, 2023

Ok, I tried systemd.volatile=overlay. I had to add Before=systemd-volatile-root.service to ostree-prepare-root.service. But there is an issue with the /sysroot overlayfs mount covering the other submounts like /sysroot/sysroot and the var and etc mounts.
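
For anyone trying to reproduce this, one way to add that ordering without editing the unit file itself is a drop-in. This is a sketch and not what was done above: the function name and the drop-in file name are made up, and the target directory is passed in because the unit has to end up inside the initramfs for the ordering to matter.

```shell
#!/bin/bash
# Write a drop-in that orders ostree-prepare-root.service before
# systemd-volatile-root.service.
# $1 is the unit directory to write into (hypothetical; in practice it
# would be a systemd unit directory inside the initramfs tree).
write_ordering_dropin() {
    local unit_dir=$1
    local dropin_dir=$unit_dir/ostree-prepare-root.service.d
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/10-before-volatile-root.conf" <<'EOF'
[Unit]
Before=systemd-volatile-root.service
EOF
}
```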

@raballew
Author

raballew commented Aug 14, 2023

@alexlarsson I have followed your instructions in #2972 (comment) and am able to reproduce the error.

Then I used the following unit, mount-inplace-var.service:

[Unit]
Description=Ensure inplace mount of /var
DefaultDependencies=no
Conflicts=shutdown.target
Before=shutdown.target
AssertPathExists=/etc/initrd-release
ConditionKernelCommandLine=|systemd.volatile=overlay

OnFailure=emergency.target
OnFailureJobMode=isolate

RequiredBy=ignition-ostree-mount-var.service
Before=ignition-ostree-mount-var.service

Requires=systemd-volatile-root.service
After=systemd-volatile-root.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/mount-inplace-var mount

[Install]
WantedBy=initrd-root-fs.target

And /usr/sbin/mount-inplace-var

#!/bin/bash
set -euo pipefail

lower_var=/sysroot/var
upper_var=/run/systemd/overlay-sysroot/var

fatal() {
    echo "$@" >&2
    exit 1
}

if [ $# -ne 1 ] || { [[ $1 != mount ]] && [[ $1 != umount ]]; }; then
    fatal "Usage: $0 <mount|umount>"
fi

do_mount() {
    echo "dummy mounting"
}

do_umount() {
    echo "dummy unmounting"
}

"do_$1"

In mount.bu:

variant: fcos
version: 1.5.0
systemd:
  units:
   - name: mount-inplace-var.service
     enabled: true
     contents_local: mount-inplace-var.service
storage:
  files:
    - path: /usr/sbin/mount-inplace-var
      contents:
        local: mount-inplace-var.sh
      mode: 0755
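
For reference, converting a Butane config like this one to Ignition might look like the sketch below. The wrapper function is made up for illustration. The config references local files (contents_local and local), so Butane needs --files-dir to know where to find them. The sketch assumes the unit and the script sit in the same directory as mount.bu.

```shell
#!/bin/bash
# Convert a Butane config to an Ignition config.
# $1: Butane source, $2: Ignition output,
# $3: directory holding the files referenced via "local" (defaults to ".").
convert_butane() {
    local src=$1 dest=$2 files_dir=${3:-.}
    butane --pretty --strict --files-dir "$files_dir" --output "$dest" "$src"
}
```

Usage would be `convert_butane mount.bu mount.ign`. With --strict, Butane fails on warnings instead of silently producing a config.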

After converting the Butane config to Ignition, I ran cosa run -c --kargs systemd.volatile=overlay --ignition mount.ign, which fails with the same error:

Aug 14 19:43:12 systemd[1]: Starting ignition-ostree-mount-var.service - Mount OSTree /var...
Aug 14 19:43:12 ignition-ostree-mount-var[1055]: /sysroot//ostree/boot.1/fedora-coreos/c0aa9551920e67be6bc1d55f459e8dce9f26379dc2dbfdbd3a6beee30b6b51f459e8dce9f26379dc2dbfdbd3a6beee30b6b51ed/0 is not a symlink
Aug 14 19:43:12 systemd[1]: ignition-ostree-mount-var.service: Main process exited, code=exited, status=1E
Aug 14 19:43:12 systemd[1]: ignition-ostree-mount-var.service: Failed with result 'exit-code'.
Aug 14 19:43:12 systemd[1]: Failed to start ignition-ostree-mount-var.service - Mount OSTree /var.
Aug 14 19:43:12 systemd[1]: ignition-ostree-mount-var.service: Triggering OnFailure= dependencies.

Which is valid because the file does not exist.

ls -lisa /sysroot//ostree/boot.1/fedora-coreos/c0aa9551920e67be6bc1d55f459e8dce9f26379dc2dbfdbd3a6beee30b6b51f459e8dce9f26379dc2dbfdbd3a6beee30b6b51ed/0

ls: cannot access '/sysroot//ostree/boot.1/fedora-coreos/c0aa9551920e67be6bc1d55f459e8dce9f26379dc2dbfdbd3a6beee30b6b51f459e8dce9f26379dc2dbfdbd3a6beee30b6b51ed/0'
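
Since the "is not a symlink" message shows up even though the path is missing entirely, a small check that tells the two cases apart can help when debugging from the emergency shell. This is a minimal sketch and the function name is made up.

```shell
#!/bin/bash
# Report whether a path is a symlink (and where it points), exists as
# something else, or is missing altogether.
describe_path() {
    local path=$1
    if [ -L "$path" ]; then
        echo "symlink -> $(readlink "$path")"
    elif [ -e "$path" ]; then
        echo "exists but is not a symlink"
    else
        echo "missing"
    fi
}
```

Calling it on each parent directory of the failing path in turn shows at which level the tree disappears under the overlay.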

What I don't understand, though, is why the file mount-inplace-var.service does not exist when running find / -name mount-inplace-var* . I thought that Ignition files might get parsed at a later point in time, but looking at journalctl -b I found the following prior to the failed ignition-ostree-mount-var.service:

[    3.838341] systemd[1]: Starting dracut-initqueue.service - dracut initqueue hook...
         Starting dracut-initqueue.…ice - dracut initqueue hook...
[    3.855547] ignition[788]: Ignition 2.16.2
[    3.856123] ignition[788]: Stage: kargs
[    3.856636] ignition[788]: reading system config file "/usr/lib/ignition/base.d/00-core.ign"
[    3.857653] ignition[788]: reading system config file "/usr/lib/ignition/base.d/30-afterburn-sshkeys-c"
[    3.858900] ignition[788]: reading system config file "/usr/lib/ignition/base.d/40-core-passwd.ign"
[    3.860259] ignition[788]: no config dir at "/usr/lib/ignition/base.platform.d/qemu"

40-core-passwd.ign contains a default password for core that I specified earlier. So I added the contents of mount.ign to the initrd as well by modifying 40-core-passwd.ign and rebuilding the image.

{
  "ignition": {
    "version": "3.4.0"
  },
  "passwd": {
    "users": [
      {
        "name": "core",
        "passwordHash": "$y$j9T$BQOELCJiutwQDykkREezY0$N31zasZ15aVmYTah/YjBhpLrrUOZY7LQA99HS61yAVC"
      }
    ]
  },
  "storage": {
    "files": [
      {
        "path": "/usr/sbin/mount-inplace-var",
        "contents": {
          "compression": "gzip",
          "source": "data:;base64,H4sIAAAAAAAC/4RUXWvbMBR91684dUybjDpuurd1LmODwWDspfQplKDYN7aILBlJTue1+e9DtuPUbdaZEMT1ueec+yFPzuK1UPGa24JZcoio1qhERRsuJGNSP5JZ7bhJYttYo7WLd9ywuqoOYVMr/8pRmcV6R0byJnoJZRvuuJzO8MQAgNJCIwi/BLg9v+4iv4XDgu0ZExssEU4QKcICD3h+xhOWS4QLnCUoda0cHh5wfn4M1ofoDfY3cAWplrPVRHBveU6fEF7hc4t77uC3AdsIxnJyK22dIVpxkw8OJ2jIXsIVwiLTZNWFA0/TVmejDWzFU7J4FK4QCtzk9hI0z+coG27yJCgb7LgMLrGuXU8oHKwTUuJRm61tWXRtUNWm0pZsi5q2/z++3yXhBS76RFuQlGlB6RaZsHwtKbn7dn21+NhVqY3Xh1AIpyl3iCuj0zgtMykUzW6Q6RbnH9/bJUIPTxJ0ZScf2sYNXTs8/YyeuMknPXIfDIiNaI+ZVtQeZn50mV61rR262OeF03GXZ+xgBtEfr9G92gd4baQf4S99oMKWjCLpK65LUs6XPSq4s+gn2/qjSurGA1cVd8VxfwfNo5UzRD+9mVc5/3b1FgphobQDh21KKdR27MY67sjL+0vTGQqnhrj0x1N88XzufztugtnIaObRb+ne83oCfbSbCUOp06YZGx7JDV+B91SOoP+Sdwv2Vaisu9VC5QjfuoTTCAfajqK771G09snBiaTgRUq/mfV4NTv1e3WUfqXRf1TGPEGmV+EiYH8DAAD//8hb8b8xBQAA"
        },
        "mode": 493
      }
    ]
  },
  "systemd": {
    "units": [
      {
        "contents": "[Unit]\nDescription=Ensure inplace mount of /var\nDefaultDependencies=no\nConflicts=shutdown.target\nBefore=shutdown.target\nAssertPathExists=/etc/initrd-release\nConditionKernelCommandLine=|systemd.volatile=overlay\n\nOnFailure=emergency.target\nOnFailureJobMode=isolate\n\nRequires=systemd-volatile-root.service\nAfter=systemd-volatile-root.service\n\n[Service]\nType=oneshot\nRemainAfterExit=yes\nExecStart=/usr/sbin/mount-inplace-var mount\n\n[Install]\nWantedBy=initrd-root-fs.target\n",
        "enabled": true,
        "name": "mount-inplace-var.service"
      }
    ]
  }
}
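
To double check what such an embedded config will actually write, the gzip-compressed data URL can be decoded back into the script, and the decimal mode converted to the familiar octal form. This is a sketch with made-up function names, and it assumes a base64 that accepts -d (as GNU coreutils does).

```shell
#!/bin/bash
# Decode an Ignition "data:;base64,..." source that is gzip-compressed
# and print the original file contents.
decode_gzip_data_url() {
    local url=$1
    local prefix='data:;base64,'
    local payload=${url#"$prefix"}
    printf '%s' "$payload" | base64 -d | gunzip -c
}

# Convert a decimal Ignition file mode to octal, e.g. 493 -> 755.
mode_to_octal() {
    printf '%o\n' "$1"
}
```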

The unit still fails when running cosa run -c --kargs systemd.volatile=overlay, and find / -name mount-inplace-var* also returns nothing. This is probably due to ostree-prepare-root.service already covering the lower directory, including my unit file.

I also tried adding it as a dedicated unit through a custom module-setup.sh, without relying on Ignition to do the heavy lifting. This actually installed the unit at /sysroot, but none of the units it depends on are present at this point in time.

This is what I don't fully understand:

  • If the Ignition file gets processed, shouldn't the unit be present?
  • If none of the above (adding it to Ignition in the initrd, passing it as an argument to cosa, or not relying on Ignition) works, where and when should I add the unit file?
  • What am I missing?
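
One thing that may help narrow this down is searching each layer separately instead of all of / at once. The sketch below is only an illustration and its function name is made up. It quotes the pattern so the shell cannot expand it before find sees it, and uses -xdev so that each root is searched without crossing into other mounts.

```shell
#!/bin/bash
# Search each given root for files whose name starts with the given
# prefix. Roots that do not exist are skipped.
# $1: file name prefix, remaining arguments: directories to search.
find_unit_files() {
    local prefix=$1; shift
    local root
    for root in "$@"; do
        [ -d "$root" ] || continue
        # -xdev: stay on the filesystem of $root, do not descend into submounts
        find "$root" -xdev -name "${prefix}*" 2>/dev/null
    done
}
```

For example, `find_unit_files mount-inplace-var / /sysroot /run/systemd/overlay-sysroot` would show in which of those trees, if any, the unit and the script ended up.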

@cgwalters cgwalters mentioned this pull request Aug 15, 2023
@cgwalters
Member

The selinux contexts of files in /usr/etc in the image are all etc_t,

Right, but they don't have to be. At build time (e.g. rpm-ostree) we could force /usr/etc to have labels as if it's etc.

I am honestly not sure I can think of real downsides to that...suddenly we don't need to relabel etc client side by default.

I think doing this is going to be way better than trying to do relabeling at runtime.
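
A rough sketch of the idea, assuming the build tool can look up labels by path: map every path under /usr/etc to the corresponding /etc path and ask the policy for the label of that path instead. The function names are hypothetical, and matchpathcon (from libselinux-utils) is only used here to illustrate the lookup. It is not how rpm-ostree does its labeling.

```shell
#!/bin/bash
# Map a path under /usr/etc to the /etc path whose label it should get.
# Paths outside /usr/etc are returned unchanged.
etc_lookup_path() {
    local path=$1
    case $path in
        /usr/etc|/usr/etc/*) echo "${path#/usr}" ;;
        *) echo "$path" ;;
    esac
}

# Print the context that the installed policy's file contexts assign to
# the mapped path. matchpathcon -n prints the context without the path.
expected_label() {
    matchpathcon -n "$(etc_lookup_path "$1")"
}
```

With this mapping, /usr/etc/passwd would be labeled the way /etc/passwd is, so nothing needs relabeling when the files are later exposed as /etc.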

@raballew raballew closed this Aug 16, 2023
@raballew raballew force-pushed the prepare-root-transient-etc branch from 2222aa9 to 2cc6b53 Compare August 16, 2023 11:27
@cgwalters
Member

@raballew why did you close this?

@raballew
Author

raballew commented Aug 16, 2023

@cgwalters I am not sure. I did a force push to the fork and that seems to have auto-closed the PR. It was not my intention, but I cannot reopen the PR either. This is weird. The reopen pull request button is disabled, so I can only comment.

@cgwalters
Member

That's very strange...I don't have permission to reopen it either, and I'm a repository administrator. I didn't think that was possible.

@cgwalters
Member

Reopening

@cgwalters
Member

Wait, the "reopen and comment" button did turn green for that last comment, but did nothing? I'm pretty sure this is some sort of GitHub glitch.

Anyways, not a fatal problem - can you just open a new one?

@cgwalters
Member

Ahh I see, if you hover over it, the GitHub UI is showing "There are no new commits on the raballew:prepare-root-transient-etc branch" - but I didn't think that was required to reopen a PR? Wonder if it'd work if you force pushed again though?

@raballew raballew reopened this Aug 17, 2023
@raballew
Author

@cgwalters It seems pushing to the branch resolved the issue. Anyhow, as mentioned in #2972 (comment), a transient etc is not possible at the moment.

So, the summary is that I don't think ostree can implement transient
/etc. But systemd can, and in fact there are transient rootfs options
in systemd which we need to research. The systemd.volatile=overlay
option seems to make everything a tmpfs+overlay, and we can maybe use
that.

Issues I foresee with this approach:

  • For the composefs case we would need to be able to expose a
    relabeled /etc, as we don't want to use the relabeled merged
    deploy/etc (since that is persistent and unverified).

  • It is impossible for ignition to modify this etc dir in the initrd.
    Any such changes would have to be made in the image itself, or
    possibly using bind mount tricks.

For the systemd volatile path I have opened #2986 which has its own challenges.

@cgwalters cgwalters added the area/prepare-root Issue relates to ostree-prepare-root label Aug 31, 2023
@cgwalters
Copy link
Member

Closing in favor of #3062 now...

@cgwalters cgwalters closed this Oct 2, 2023