systemd.mount integration #7329
Conversation
(force-pushed from 4ed3050 to 8d27bc6)
concept LGTM at a first glance - not yet tested though ;) hopefully I'll find some time soon. I wonder whether we want some kind of integration for re-generating the cached dataset list? it does sound a bit cumbersome to have to remember to re-run it after creating or renaming a dataset.. OTOH I am not sure how we'd be able to automate that easily.. maybe some kind of (optional) zfs-zed integration?
(force-pushed from 8d27bc6 to 0eff96e)
So, I took a look at `cmd/zed/zed.d/README`, and decided that I wasn't going to try to implement a new zed event (at least right now). However, I did fix up the script a bit (it works with bash and dash, instead of requiring bash). All of `shellcheck.net`'s valid complaints have been addressed.

Also, I changed the semantics around a little bit:

1. The dependencies of `local-fs.target` are reduced to `Wants`. This reduces the chance of unpleasant surprises if a mount fails. Maybe it gets raised to `Requires` after it gets a little wider testing.
2. Instead of clobbering existing `.mount`s, abort instead.
3. `zfs list` is run unconditionally, and its output is always preferred to `/etc/zfs/zfs-list.cache`.

This should further reduce the need to populate `zfs-list.cache` (in fact, just running `systemctl daemon-reload` will re-run generators, producing the desired mount units).

It would be nice if someone running zfs on root could give this script a try (just toss it in `/etc/systemd/system-generators` after replacing the `@…@` substitution variables with the correct values).
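For readers unfamiliar with systemd generators: the script writes ordinary `.mount` units into the generator output directory. A minimal sketch of the kind of unit it might emit, assuming a hypothetical dataset `tank/home` mounted at `/home` (the dataset name, mountpoint, and exact option set are illustrative, not the PR's literal output):

```
#!/bin/sh
# Sketch only: write one .mount unit the way a generator would.
# systemd passes the output directory as the first argument.
dest="${1:-/tmp}"

cat > "${dest}/home.mount" <<'EOF'
[Unit]
Before=local-fs.target

[Mount]
What=tank/home
Where=/home
Type=zfs
Options=zfsutil
EOF

# A Wants= dependency on local-fs.target is expressed by symlinking the unit
# into local-fs.target.wants/ rather than editing the target itself.
mkdir -p "${dest}/local-fs.target.wants"
ln -s ../home.mount "${dest}/local-fs.target.wants/home.mount"
```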
@aerusso Thanks for that, works like a charm! I'm running root on ZFS with homegrown boot environments and a complex fs layout (see below). This is a current Arch Linux with the zfs-git kmod. Up to now all mount points were set to legacy and got mounted via /etc/fstab. I'm still seeing a […]

I don't know much about […]. Thanks again.
@AttilaFueloep I'm glad it's working! Thanks for testing it. My setup is very similar, except root is not ZFS (yet). I also cannot get […]. As for the unclean mounting, I'd guess you have some residual files/directories under […]. Then carefully remove/move anything inside there. The original motivation for this patch was the contamination of these mountpoints when things didn't get set up correctly and some services got started.
On Thu, Mar 22, 2018 at 10:22:50PM +0000, Antonio Russo wrote:
So, I took a look at `cmd/zed/zed.d/README`, and decided that I wasn't going to try to implement a new zed event (at least right now). However, I did fix up the script a bit (it works with bash and dash, instead of requiring bash). All of `shellcheck.net`'s valid complaints have been addressed.
see below
Also, I changed the semantics around a little bit:
1. The dependencies of `local-fs.target` are reduced to `Wants`. This reduces the chance of unpleasant surprises if a mount fails. Maybe it gets raised to `Requires` after it gets a little wider testing.
2. Instead of clobbering existing `.mount`s, abort instead.
should we maybe log this? after all, this means there is a potential
conflict between a manually set up .mount unit and a generated one
(previously generated ones are cleared before the generator is called)
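One way a generator can surface such a conflict, given that syslog is unavailable at generator time, is to write directly to `/dev/kmsg` — the same approach the later revision of this PR takes in its `do_fail` helper. A sketch, with an illustrative variable name (`mountfile`):

```
# Sketch: refuse to overwrite a manually created .mount unit and leave a
# note in the kernel log, since syslog is not available to generators.
if [ -e "/etc/systemd/system/${mountfile}" ] ; then
    printf 'zfs-mount-generator: %s already exists, skipping\n' \
        "${mountfile}" > /dev/kmsg
    exit 0
fi
```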
3. `zfs list` is run unconditionally, and its output is always preferred to `/etc/zfs/zfs-list.cache`.
the last one is problematic IMHO. `zfs list` can take quite a while on a
system with lots of datasets and under load, or might (in case of bugs /
...) even hang altogether.
see `man systemd.generator`:
```
· Generators are run very early at boot and cannot rely on any external services. They may not talk to any other process. That includes simple things such as logging to syslog(3), or systemd itself (this means: no systemctl(1))! Non-essential file systems like /var and /home are mounted after generators have run. Generators can however rely on the most basic kernel functionality to be available, including a mounted /sys, /proc, /dev, /usr.

[...]

· If you are careful, you can implement generators in shell scripts. We do recommend C code however, since generators are executed synchronously and hence delay the entire boot if they are slow.
```
so I think any calls to external binaries, especially ones which could
block in kernel space are a no-go in generators.
I wonder whether it would not be better to drop the call to `zfs list`
altogether and invest some energy into ZED integration to keep the cache
file current? we'd basically need to hook:
- pool creation
- pool import
- pool export (or not?)
- filesystem creation (including receive and clone)
- filesystem destruction (or not?)
- filesystem rename
- filesystem mountpoint property changes
- filesystem canmount property changes
not sure which of those are already available in ZED? if you don't want
to do this I can try to whip something up..
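For context, a rough sketch of what such a ZED hook might look like: ZED exports the event details in environment variables (the review below uses `ZEVENT_POOL` and `ZEVENT_HISTORY_INTERNAL_STR`; `ZEVENT_HISTORY_INTERNAL_NAME` carries the operation name), so a zedlet can dispatch on the operation and rewrite the per-pool cache file. The dispatch list and paths here are illustrative, not the final zedlet:

```
#!/bin/sh
# Sketch of a history-event zedlet that refreshes the per-pool cache file.
. "${ZED_ZEDLET_DIR}/zed-functions.sh"

FSLIST="/etc/zfs/zfs-list.cache"

case "${ZEVENT_HISTORY_INTERNAL_NAME}" in
    create|destroy|rename|clone|receive|set|inherit)
        # Regenerate the whole cache for the affected pool instead of
        # trying to patch individual lines.
        "${ZFS}" list -H -t filesystem -o name,mountpoint,canmount \
            -r "${ZEVENT_POOL}" > "${FSLIST}/${ZEVENT_POOL}"
        ;;
    *)
        exit 0
        ;;
esac
```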
This should further reduce the need to populate `zfs-list.cache` (in fact, just running `systemctl daemon-reload` will re-run generators, producing the desired mount units).
see above for why the fact that systemd will re-run generators on every
reload makes matters worse, not better ;)
e.g. a single upgrade on a Debian testing or unstable might run
`systemctl daemon-reload` tens or hundreds of times (depending on the
number of upgraded packages which contain unit files of some kind)!
I am not sure how the RPM world handles upgraded unit files..
on my system with ~256 datasets, running the generator already takes a
measurable fraction of a second, but I guess most of that is actually
writing the unit files (haven't checked yet)
It would be nice if someone running zfs on root could give this script a try (just toss it in `/etc/systemd/system-generators` after replacing the `@…@` substitution variables with the correct values).
FWIW, I did on Stretch and Sid. it works fine, and allows me to drop a
`Requires` on zfs-mount.service from a bind mount :) so I'd very much
like to see this integrated in some form!
the testing instructions lack a `chmod +x
/etc/systemd/system-generators/zfs-mount-generator` though ;)
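Putting the testing notes together, the manual installation for trying this out amounts to roughly the following (assuming the build-time `@…@` values have already been substituted in the script):

```
# Install the generator and let systemd re-run all generators.
install -d /etc/systemd/system-generators
install -m 0755 zfs-mount-generator /etc/systemd/system-generators/
systemctl daemon-reload
# Inspect the generated units:
systemctl list-units --type=mount
```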
(force-pushed from 600dc6f to 8778d62)
I've reworked this again using @Fabian-Gruenbichler's suggestions:
- […] Done.
- No call to `zfs list` is made from the generator anymore.
- The new patch implements a ZEDLET (`history_event-zfs-list-cacher.sh`) that keeps the per-pool cache files up to date.
I also wonder whether we want to deal with blacklisting datasets (besides `canmount=off`), since there can only be one unit for each mountpoint in systemd, but more than one dataset with the same mountpoint value in ZFS.
. "${ZED_ZEDLET_DIR}/zed-functions.sh" | ||
|
||
zed_exit_if_ignoring_this_event | ||
|
add a `zed_check_cmd "${ZFS}"` with appropriate exit here (and maybe for wc, sort, diff, xargs, grep, ..? all of them are essential on Debian, not sure about other distros/ecosystems?)
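A sketch of what that could look like, assuming `zed-functions.sh` provides `zed_check_cmd` as implied above (the exit code and exact tool list are illustrative):

```
# Bail out early if the tools the zedlet relies on are missing.
zed_check_cmd "${ZFS}" sort diff grep || exit 4
```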
```
# because the history event only reports the command issued, rather than
# the affected ZFS, a rename or recursive destroy can affect a child ZFS.
# Instead of trying to figure out if we're affected, we instead just
# regenerate the cache *for that pool* on every modification.
```
given this limitation, wouldn't it make more sense to regenerate also when new datasets are added (e.g., `zfs create` or even `zpool import`)? e.g., right now when a dataset gets renamed, it gets dropped from the cache file but not added again under the new name.

I'd rather have the distinction between "cache file is curated manually, no ZED interference" and "cache file is managed by ZED and regenerated on every potential change event". it might make sense to introduce a zed.rc variable to distinguish these modes, depending on which one gets made the default.
```
# only act if the mountpoint or canmount setting is altered
printf '%s' "${ZEVENT_HISTORY_INTERNAL_STR}" |
    grep -q '^\(mountpoint\|canmount\)=' || exit 0
;;
```
this seems very roundabout - is there no bash/dash compatible way of checking a substring from the start of the string? (I wish we could just write the whole thing in perl to be honest :-P)
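One bash/dash-compatible way to do that prefix check without spawning `grep` is a `case` pattern match (sketch, reusing the variable from the hunk above):

```
# Only continue if a mountpoint= or canmount= property is being set.
case "${ZEVENT_HISTORY_INTERNAL_STR}" in
    mountpoint=*|canmount=*) ;;  # fall through and update the cache
    *) exit 0 ;;
esac
```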
```
# Get the information for the affected pool
grep "^${ZEVENT_POOL}[/$(printf '\t')]" "${FSLIST}" |
    grep -o '[^'"$(printf '\t')"'*' |
```
This produces `grep: Unmatched [ or [^`. I think the intended semantics was that of `cut -f 1`?
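If so, a minimal replacement for the broken `grep -o` would keep only the first tab-separated field (the dataset name), e.g.:

```
# Extract the dataset names for the affected pool from the cache file.
grep "^${ZEVENT_POOL}[/$(printf '\t')]" "${FSLIST}" | cut -f1
```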
```
# Get the information for the affected pool
grep "^${ZEVENT_POOL}[/$(printf '\t')]" "${FSLIST}" |
    grep -o '[^'"$(printf '\t')"'*' |
    xargs @sbindir@/zfs list -H -oname,mountpoint,canmount \
```
`@sbindir@/zfs` should be `${ZFS}`.
grep "^${ZEVENT_POOL}[/$(printf '\t')]" "${FSLIST}" | | ||
grep -o '[^'"$(printf '\t')"'*' | | ||
xargs @sbindir@/zfs list -H -oname,mountpoint,canmount \ | ||
>"${FSLIST_TMP}" 2>/dev/null || true |
depending on whether we want to react to any changes (see bigger comment above), the whole preceding section could be replaced with

```
${ZFS} list -H -t filesystem -oname,mountpoint,canmount -r ${ZEVENT_POOL} > ${FSLIST_TMP}
```
(force-pushed from ae79617 to 2a71bed)
I'm possibly over-engineering this patch, but I've included the ability to specify regular expressions for datasets that should be automatically added. I.e., if you want everything you import, create, or receive to be tracked with this, you can put a line […]. If you just want everything in some pool, say […].

Also, renamed datasets should now be correctly followed, and […]
Codecov Report
```
@@            Coverage Diff             @@
##            master    #7329     +/-  ##
==========================================
+ Coverage    76.35%   76.39%   +0.03%
==========================================
  Files          329      329
  Lines       104214   104191      -23
==========================================
+ Hits         79576    79599      +23
+ Misses       24638    24592      -46
```

Continue to review the full report at Codecov.
ha, yes, that looks a bit over-engineered to me as well ;) how about the following: […]

that would end up being a lot less code for almost the same result? I wonder whether moving the expressions to two variables in zed.rc (whitelist and blacklist) and implying a whitelist of […]

minor unrelated nit: most other zedlets use 4 spaces for indentation, it probably makes sense to follow that style. the case statements are currently also pretty weirdly indented in parts..
if we want to keep the current "try to update changed parts of existing cache file" approach, I would highly recommend moving to something other than shell (then we could just map the existing cache file to some hash/dict structure, update that as needed, and write it back out). not sure whether that is acceptable given that all the current zedlets are shell scripts? maybe @behlendorf can chime in on that point?
The ZEDLET infrastructure was originally designed so they could be any manner of executable. However, the hope was that the majority of them would be simple enough that they could be written in shell so they'd be trivial to inspect and customize. These are getting a little on the long side but I don't think they're too unwieldy yet!
(force-pushed from f02608c to 59de246)
Here's another stab.
Also, I've tested it out with a broken entry in the cache. It, as expected, doesn't interfere with the boot more than just showing that a unit failed (because the dependency is at `Wants` rather than `Requires`).
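For anyone repeating that experiment, the failed-but-non-fatal unit can be inspected after boot with something like the following (the unit name is illustrative):

```
# Show failed mount units and the reason for the failure.
systemctl --failed --type=mount
journalctl -b -u home.mount
```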
I haven't had a chance to test this, I've only read through the PR but the approach looks reasonable. @Fabian-Gruenbichler do you have any additional concerns?
@aerusso how would you like to proceed? Since the zedlet isn't enabled by default, merging this is low risk. That would make it a little easier to get some additional testing on other distributions.
@behlendorf I'm happy with it, and the biggest weakness is the lack of testing, though I have it running on several machines, and the systemd generator seems robust on them. We can always enable the zedlet later, after wider testing.
On Tue, Apr 03, 2018 at 11:51:12PM +0000, Brian Behlendorf wrote:
behlendorf commented on this pull request.
I haven't had a chance to test this, I've only read through the PR but the approach looks reasonable. @Fabian-Gruenbichler do you have any additional concerns?
@aerusso how would you like to proceed? Since the zedlet isn't enabled by default, merging this is low risk. That would make it a little easier to get some additional testing on other distributions.
I still think some way to disable generation of the mount unit for
certain datasets would be good to have for the zedlet. think of the
following scenario:
```
tank/baz/bar    /bar    noauto
tank/foo/bar    /bar    on
```
depending on the order, the noauto one will be generated, but the auto
one will not! arguably we could say that you need to set canmount to off
in such a case (then this should be documented as a limitation of the
zedlet?) - but having two optional variables for whitelisting and
blacklisting the datasets (which just get mapped to two grep calls)
would IMHO be a nice followup to handle received or cloned datasets more
easily.
no blocker for me though - if you want I could handle that as a
separate PR?
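To make the whitelist/blacklist idea concrete, a hypothetical sketch of the two grep calls it would map to — the zed.rc variable names are invented for illustration and are not part of this PR:

```
# Hypothetical zed.rc knobs: by default accept every dataset, reject none.
: "${ZED_LIST_CACHER_WHITELIST:=.}"
: "${ZED_LIST_CACHER_BLACKLIST:=^$}"

# Filter the freshly generated dataset list before it replaces the cache.
grep -E "${ZED_LIST_CACHER_WHITELIST}" "${FSLIST_TMP}" |
    grep -vE "${ZED_LIST_CACHER_BLACKLIST}" > "${FSLIST_TMP}.filtered" || true
```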
zfs-mount-generator implements the "systemd generator" protocol, producing systemd.mount units from the cached output of zfs list during early boot, integrating with systemd. Each pool has an independent cache of the command `zfs list -H -oname,mountpoint,canmount -tfilesystem -r $pool`, which is kept synchronized by the ZEDLET history_event-zfs-list-cacher.sh. Datasets not in the cache will be loaded later in the boot process by zfs-mount.service, including pools without a cache. Among other things, this allows for complex mount hierarchies. Signed-off-by: Antonio Russo <[email protected]>
(force-pushed from 59de246 to 84bc27c)
@aerusso Thanks for your explanations. I really feel bad for not reading the whole thread thoroughly before posting; everything was there. To my excuse, maybe my broken system made me a bit too nervous. I've already added a toggle of the canmount property of a test filesystem on the root pool to my boot environment script. If I understand correctly, this should trigger the zed script to update the cache file. On a side note, while this works for updating the active BE, it won't for updating a non-active BE, and I've no easy solution here. What would you think of making […]?
Have you considered moving the cache file to /boot (which is presumably shared across BEs) and symlinking it? That way, when you update it, the non-active BE also sees the changes.
If /boot is not shared across BEs in your setup, then I don’t have a great idea either. Though this may not matter, e.g. if the initramfs is doing the mounting anyway.
In my setup /boot is on the zfs root filesystem and therefore not shared. I have separate filesystems for e.g. /home and /var, and the initramfs (initcpio) does not mount them. Before this PR I had all mountpoints set to legacy and handled mounting via /etc/fstab, which is quite inconvenient. So this PR is a substantial improvement. If I could call […]
Should this work on Ubuntu 16.04? I've copied this zedlet into /etc/zfs/zed.d (chown+chmod to root:root 755), but it doesn't fire when I run zpool import/export. Verbose output from zed: […]
You have to `touch` the per-pool cache file first. Notice also that #7453 will change these formats relatively soon.
Ah, I missed the "touch" step. I generated the cache file by running the zfs list command by hand and then the generator started working. Thanks for the heads up, I'll subscribe to that PR. Edit: I copied the latest versions of the generator and zedlet from master today, so I think I'm good.
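For other readers hitting the same issue, the setup described in this exchange amounts to roughly the following (the pool name `tank` is illustrative):

```
# Create the per-pool cache file so the zedlet starts maintaining it...
mkdir -p /etc/zfs/zfs-list.cache
touch /etc/zfs/zfs-list.cache/tank
# ...or seed it by hand, as done above:
zfs list -H -t filesystem -o name,mountpoint,canmount -r tank \
    > /etc/zfs/zfs-list.cache/tank
```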
[ -d "${FSLIST}" ] || exit 0 | ||
|
||
do_fail() { | ||
printf 'zfs-mount-generator.sh: %s\n' "$*" > /dev/kmsg |
@aerusso Shouldn't that read `zfs-mount-generator` instead of `zfs-mount-generator.sh`?
This, of course, would also apply to all occurrences of `zfs-mount-generator.sh` inside this file.
Good catch. I've got a queue of documentation typos, I'll add these to it.
Yeah, it's just cosmetic. I now tend to use `myname=$(basename "$0")` in my scripts after tripping over it as well.
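A sketch of that pattern applied to the `do_fail` helper discussed above (purely illustrative; the PR itself only needs the string fixed):

```
# Derive the script's own name at runtime instead of hard-coding it.
myname=$(basename "$0")

do_fail() {
    printf '%s: %s\n' "${myname}" "$*" > /dev/kmsg
}
```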
May I suggest a minor edit to the man page? The generator code doesn't provide any hint in the message when this fails. I suggest replacing […]
@rugubara How did a zvol wind up in the zfs-list cache? That's a bug in the ZEDLET.
@aerusso, I couldn't get the zedlet working for me yet. I wrote /etc/zfs/zfs-list.cache/pool myself using the zfs list command from the man page.
Just a heads-up, GitHub is still having problems from last night; you might not see new posts, but I believe they are still going through.
Thank you for confirming that I'm not the only one who cannot see all the replies! The man page has a little, tricky caveat: […]

though the real bug is that the zedlet isn't working for you. What happened there?
@aerusso, I forgot to symlink history_event-zfs-list-cacher.sh to /etc/zfs/zed.d. Once symlinked, everything works ok.
Is this something that could go live with 0.7.12 or do we have to wait for 0.8.0?
In case you still have the unmounting problem, the solution is described in #8060 (comment). That fixed it for me.
Implements the "systemd generator" protocol in
zfs-mount-generator
Description
zfs-mount-generator implements the "systemd generator" protocol, producing systemd.mount units from the (possibly cached) output of `zfs list` during early boot, giving full systemd integration. The most visible benefit of this is that /etc/fstab can safely refer to ZFS mount points, because systemd will take care to mount filesystems in the correct order.
Motivation and Context
This PR takes a different approach from #6974, which modified `/etc/fstab` to reflect ZFS mountpoints. Here, instead, ZFS mounts are tracked by directly creating native systemd.mount units at early boot, from the output of `zfs list -H -t filesystem -oname,mountpoint,canmount`. Because pools may not be imported, the output of this command can be saved in `/etc/zfs/zfs-list.cache`. If the pools are for some reason mounted at early boot (e.g., zfs on root), this file can be omitted and the command will be run.

This generator is not required; it does not interfere with `zfs-mount.service`, so anything missing from the cache file (or from an unimported pool) will be mounted as before.

As mentioned before, this allows for complex mount hierarchies (e.g., bind mounts that must happen after zfs mounts are made; any other filesystem mounted on top of any ZFS). Notice that ZFS on root users are most likely to want such features, and will not have to create the `zfs-list.cache` file.
file.How Has This Been Tested?
I've been using several incarnations of this generator for several months, allowing for some maturity in the patches. E.g., a dependency has been reduced from `Requires` to `Wants` to prevent filesystems from being unmounted when zfs systemd units are shuffled around during upgrades.

Types of changes
Checklist:

- […] Commits contain a `Signed-off-by`.