[WIP] Prototype for systemd and fstab integration #4943
Conversation
Signed-off-by: Chunwei Chen <[email protected]>
This is really cool and I think it's a promising approach to get better integrated with systemd and fstab. Just a few questions.
- What cleans up the /dev/zpool directory if the system crashes?
- How are the ready files removed when devices are removed from the system?
#define ZPOOL_PATH "/dev/zpool"

static nvlist_t *
get_config(const char *dev)
The existing zpool_read_label() function already does most of this and is part of libzfs. Rather than have two versions of this function I'd suggest extending zpool_read_label() as needed. The interfaces are slightly different but not really in a significant way.
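For illustration, here is a minimal sketch of what get_config() could look like as a thin wrapper over libzfs. It assumes the two-argument zpool_read_label(); some libzfs versions take an extra num_labels out-parameter, so the call may need adjusting.

```c
/*
 * Sketch only: get_config() rebuilt on top of libzfs's zpool_read_label().
 * Assumes the two-argument zpool_read_label(fd, &config); adjust if your
 * libzfs version also takes an int *num_labels out-parameter.
 */
#include <fcntl.h>
#include <unistd.h>
#include <libzfs.h>

static nvlist_t *
get_config(const char *dev)
{
	nvlist_t *config = NULL;
	int fd;

	if ((fd = open(dev, O_RDONLY)) < 0)
		return (NULL);

	/* Read the vdev labels on the device and unpack the config nvlist. */
	if (zpool_read_label(fd, &config) != 0)
		config = NULL;

	(void) close(fd);
	return (config);
}
```

The caller would remain responsible for freeing the returned nvlist with nvlist_free().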
sprintf(dir, "%llu", (u_longlong_t)vguid);
if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
	fprintf(stderr, "failed to mkdir %s: %s\n", dir, strerror(errno));
	exit(1);
I know it's a prototype, but best to return (1) here and return the error from main().
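A hedged sketch of that pattern (make_vdev_dir() is a hypothetical helper, not a function from this patch):

```c
/*
 * Illustrative only: propagate the failure instead of calling exit(1),
 * so main() can decide the process exit status.  make_vdev_dir() is a
 * hypothetical name, not taken from the patch.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static int
make_vdev_dir(const char *dir)
{
	if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
		(void) fprintf(stderr, "failed to mkdir %s: %s\n",
		    dir, strerror(errno));
		return (1);
	}
	return (0);
}
```

main() can then translate a nonzero return into its exit status.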
[Path]
PathExists=/dev/zpool/%i/ready
Unit=zpool@%i.service
These templates are pretty clever.
RemainAfterExit=yes
ExecStartPre=/sbin/modprobe zfs
ExecStart=@sbindir@/zpool import -EN %i
ExecStop=@sbindir@/zpool export %i
We probably don't want the export.
ExecStop=@sbindir@/zpool export %i

[Install]
WantedBy=local-fs.target
If you added WantedBy=zfs-mount.service and renamed this to [email protected], then this could be a third pool import option. Users would then just need to disable the other zfs-* services if they didn't want to use them. This would allow this same import service to be used for both use cases, which would be nice.
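A hedged sketch of what the [Install] section of the renamed template might look like under that suggestion (the unit name and wiring follow the comment above; nothing here beyond the existing local-fs.target line is taken from the actual patch):

```
# Hypothetical [email protected] install section, per the suggestion above.
[Install]
WantedBy=zfs-mount.service
WantedBy=local-fs.target
```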
@@ -2221,6 +2222,9 @@ zpool_do_import(int argc, char **argv)
		case 'D':
			do_destroyed = B_TRUE;
			break;
		case 'E':
			do_imported = B_TRUE;
			break;
You'll want to add this option to the man page and comment above. I'm surprised to see E was already part of the getopt string above.
	err = 0;
	log_history = B_FALSE;
	goto exit_early;
}
What about the other error cases below?
@behlendorf The whole detection and ready stuff is quite hacky. I'm hoping libzfs can already do this sort of stuff, but since I'm not familiar with it, I just came up with this prototype to mainly show that we can integrate well with systemd.
Is this ready for testing? I'd like to try it out.
@mailinglists35
What are the criteria for a pool being "ready"? For example, if I have a 6-disk raidz2 pool, is it ready when 4 devices arrive, when 6 devices arrive, or something else? The behavior I think is desired is: once 4 devices are ready, start a timer. If the timer expires or all 6 devices are ready, import the pool.

In terms of this being production-ready, it would definitely need to handle corner cases like duplicate pool names (e.g. old devices), etc. I haven't reviewed the code, so I'm not saying it doesn't do that now.

Am I correct in understanding that this always imports every available pool? How would I handle a system where I only want to import certain pools?

I'd personally like to mount all the filesystems upon pool import. I'd rather not have to add everything to /etc/fstab. Is there a clean way to do that?
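Purely as an illustration of the quorum-plus-timer behavior proposed above (the function, its parameters, and the policy itself are hypothetical and not something this PR implements):

```c
/*
 * Hypothetical readiness policy, sketched from the comment above: import as
 * soon as every device is present, or once enough devices for a (degraded)
 * import have been present longer than a grace period.
 */
#include <stdbool.h>
#include <time.h>

static bool
pool_ready(int present, int total, int max_missing,
    time_t first_quorum, time_t grace_sec)
{
	if (present >= total)
		return (true);			/* every device arrived */
	if (present >= total - max_missing)	/* degraded import is possible */
		return (time(NULL) - first_quorum >= grace_sec);
	return (false);				/* keep waiting */
}
```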
I think relying on
Nice. So that's how you solved the hierarchical zpool naming problem.
If zfs has code to mount filesystems upon import automatically (I recall there was something like that, though I don't use zfs anymore), then you could just pull in
Currently it waits for all devices to come up. I do want to have that configurable timeout-and-continue behavior, but I haven't figured out the best way to do it.

Yes, currently when zready finds the same pool name with a different pool guid, it will error out. I'm not sure if this is good enough. Feel free to make suggestions.

No. As I mentioned in the description:

We can add a helper script to make it easy to add or remove entries in fstab, or perhaps use a generator to generate systemd mount units. The point is that we want to let systemd handle the mounting, because systemd handles mounting much better than zfs does.
I can definitely understand wanting better integration with systemd, but going via fstab to get there doesn't seem like an optimal approach. Is there some other reasoning for wanting fstab integration that I'm not grokking, or is this primarily for targeting systemd? What happens when fstab disagrees with the ZFS mountpoint?
Why? Either you are going to generate systemd units for mountpoints directly or you use fstab. There is no other way to have support for hotplug and dependency tracking, unless you reimplement it from scratch.
If I understand correctly, the "ZFS mountpoints" are applied by zfs itself at import time. If that's what you are talking about, these mountpoints are not known to systemd before they appear, so for example they cannot be waited for or triggered via a unit's dependencies. That's the first side of the problem.

The second side of the problem is depending on a particular pool and its devices. systemd can wait for a device to appear before calling mount(8), or it can shut down the mountpoint and all services depending on it if a device disappears, but for that it needs to know the mapping between a mountpoint and a real block device. If that mapping is provided, it waits for the device node to appear and then waits for a udev rule to say "ready" on that device. This is used in btrfs: any device node belonging to a filesystem is marked "not ready" unless all device nodes are present. In zfs, this is a tougher problem because we mount pools, not devices.
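To make the mapping concrete, here is a minimal hand-written mount unit (the device and mountpoint names are placeholders, not from this PR). Because What= names a real block device, systemd ties the mount to the corresponding .device unit: it waits for the node to appear and be marked ready by udev, and can take the mount down again if the device disappears.

```
# Placeholder example only; to be valid, the file would need to be named
# srv-data.mount to match Where=.
[Unit]
Description=Example mount bound to a real block device

[Mount]
What=/dev/sdb1
Where=/srv/data
Type=ext4

[Install]
WantedBy=local-fs.target
```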
@intelfx wrote:
The former requires no user action, supports the standard ZFS workflow dynamically, and avoids requiring double-entry, and potential confusion if the
Seems like this should be doable if we can ensure that pool imports occur early enough and block until the .mount units corresponding to the enabled mountpoints are generated; after that, standard systemd dependency management should work as expected. I'm not certain how feasible this is, though.
It doesn't have to be a real block device though, right? This PR appears to be using a newly implemented device node for this purpose. So, is your only issue here determining which pools should be imported? Could a generator for pools be used here? So, if a user wants to disable importing a pool, they just disable the service? If the service default is disabled, there will be no surprise imports on boot either, though this does require the user to enable any pools they do want imported on boot.
The problem with using a generator is that you can't specify when a generator should fire; all generators fire early on, so you would need to have your pool imported by that time, and guaranteeing that would mean forcing the import in the initrd. That wouldn't be an issue if you have a root zpool and only one pool, but for everyone else you'd be delaying their boot time for no reason. Another problem is that the generator would need to establish proper dependencies with respect to the other, non-zfs mountpoints in fstab. But why bother doing this if systemd already does it for you when you use fstab?

This is not really an issue; we can just have a script to generate fstab. The current way also requires the user to enable zfs-mount.service, so the burden on the user is more or less the same. We could even hook up the script so that when you change a mountpoint a new fstab is automatically generated, but I'm not sure if that's a good idea.
Yes, except that it won't actually work. See below.
This is a big "no". This is inherently racy, and this is something I wanted to get rid of in the first place when I filed issue #4178. We can't rely on something being imported "early enough", because buses can be slow, controllers can be hotplugged, USB disks can be added and removed at runtime, and so on.
It does have to be a real block device, i.e. something to match in udev. This is not the only way, though: this PR requires the user to add a specific parameter in fstab that makes systemd add an artificial dependency on a service that waits for a pool and imports it (instead of adding a "natural" dependency on a block device).
I'm not sure this is actually a terrible thing. For users who want to optimise boot times by having some pools not imported during early boot (I suspect this is the minority), a kernel parameter could be used, as with other generators. Filesystems are going to need to mount before other services are started in any case, so the total time to boot is likely not changed if all pools have filesystems that will be mounted, but I can see other possible issues with trying to do this in initrd. Would it make sense, as an alternative to trying to discover all available pools early (I can see how this may be problematic), to use a cache file similar to Illumos? I will say that I'm not certain this generator approach is actually optimal, but I figure it's worth having the discussion now, rather than after code gets merged.
I think you'll find that systemd already does this for you if you've generated the mount units - they don't explicitly specify dependent filesystems, just a dependent device, and where the mountpoint is.
I'm not entirely opposed to this if it is the best we can do, but I think there are some issues to consider here. What happens if the user edits fstab? Do their changes get clobbered, or does the script stop working on conflict? In the latter case, what is the behaviour when fstab disagrees with the ZFS mountpoint? Which source wins? What about other ZFS props that interact with mounting? We lose the single source of truth (currently the ZFS props) unless the generator script runs automatically and always wins, but I guarantee this will cause confusion.
I think there are two scenarios to consider here - mountpoints required at boot (that's essentially everything in fstab, unless
You specifically mention in #4178 that it does not have to be a real block device, which is what I was getting at.
It is terrible, because you don't know how many pools there are to import during the initrd. You can use an additional config file, or worse, use a cache file like you said. But that has already proven to be painful.
systemd.generator(7) states:
So if
Closed in favor of #7329, which has been merged.
The
My mistake, reopening. Although it probably makes sense to open a new PR with a refreshed version of
So considering what I'm working on for NixOS, I should probably ask: is there a rough timeline for when this may be merged and released?
Agreed, I'd love it if it could make it in time for 0.8 :)
@@ -2443,12 +2452,13 @@ zpool_do_import(int argc, char **argv)
	}

	if (err == 1) {
exit_early:
Looks like a memory leak here?
Closing due to inactivity. If someone has the inclination to pick up this work, please open a new PR.
Recently, I have been cooking up some stuff for better integration with systemd and fstab. This is what I've got so far.
What does this achieve:
So how does this work?
I create a zready command, which udev will fire upon discovering ZFS devices. The zready command will read the label and build up the vdev tree in /dev/zpool/<pool>/. When it figures out that the pool is ready to import, it will create a file called /dev/zpool/<pool>/ready. (Note: the current zready is just a hack I came up with. Since I'm not familiar with the label stuff, it might not be able to handle every case.)

There are two systemd unit files, [email protected] and [email protected]. zpool@<pool>.path will listen on /dev/zpool/<pool>/ready, and zpool@<pool>.service will do the import with a dependency on zpool@<pool>.path. Note: you don't need to do any configuration with these two unit files if you use fstab as described below, but you can still explicitly enable a [email protected] if you want.

To let systemd mount stuff on boot, you only need to add an entry like this in fstab:
rpool/home /home zfs rw,defaults,[email protected]
This will automatically make the mount depend on the service unit.
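For reference, a hedged sketch of roughly what systemd-fstab-generator derives from that line (the real unit is written under /run/systemd/generator/ and contains more boilerplate; the x-systemd.requires option becomes Requires= and After= dependencies on the import service):

```
# Approximate shape of the generated home.mount; not the literal output.
[Unit]
Requires=zpool@rpool.service
After=zpool@rpool.service

[Mount]
What=rpool/home
Where=/home
Type=zfs
Options=rw,defaults
```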
For using zfs as root, each distribution probably has its own way to do it. But if you use systemd in the initrd, you can add
root=rpool/root rootfstype=zfs [email protected]
to the kernel cmdline. systemd will be able to automatically build the dependency just as when using fstab.
Signed-off-by: Chunwei Chen [email protected]