[WIP] Prototype for systemd and fstab integration #4943
Conversation
Signed-off-by: Chunwei Chen <[email protected]>
This is really cool and I think it's a promising approach to get better integrated with systemd and fstab. Just a few questions.
- What cleans up the /dev/zpool directory if the system crashes?
- How are the ready files removed when devices are removed from the system?
#define ZPOOL_PATH "/dev/zpool"

static nvlist_t *
get_config(const char *dev)
The existing zpool_read_label() function already does most of this and is part of libzfs. Rather than have two versions of this function I'd suggest extending zpool_read_label() as needed. The interfaces are slightly different but not really in a significant way.
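For illustration, here is a minimal sketch of what get_config() could look like as a thin wrapper over libzfs. It assumes the two-argument zpool_read_label(); some libzfs versions take an extra num_labels out-parameter, so the call may need adjusting.

```c
/*
 * Sketch only: get_config() rebuilt on top of libzfs's zpool_read_label().
 * Assumes the two-argument zpool_read_label(fd, &config); adjust if your
 * libzfs version also takes an int *num_labels out-parameter.
 */
#include <fcntl.h>
#include <unistd.h>
#include <libzfs.h>

static nvlist_t *
get_config(const char *dev)
{
	nvlist_t *config = NULL;
	int fd;

	if ((fd = open(dev, O_RDONLY)) < 0)
		return (NULL);

	/* Read the vdev labels on the device and unpack the config nvlist. */
	if (zpool_read_label(fd, &config) != 0)
		config = NULL;

	(void) close(fd);
	return (config);
}
```

The caller would remain responsible for freeing the returned nvlist with nvlist_free().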
sprintf(dir, "%llu", (u_longlong_t)vguid);
if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
	fprintf(stderr, "failed to mkdir %s: %s\n", dir, strerror(errno));
	exit(1);
I know it's a prototype, but best to return (1) here and return the error from main().
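A hedged sketch of that pattern (make_vdev_dir() is a hypothetical helper, not a function from this patch):

```c
/*
 * Illustrative only: propagate the failure instead of calling exit(1),
 * so main() can decide the process exit status.  make_vdev_dir() is a
 * hypothetical name, not taken from the patch.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static int
make_vdev_dir(const char *dir)
{
	if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
		(void) fprintf(stderr, "failed to mkdir %s: %s\n",
		    dir, strerror(errno));
		return (1);
	}
	return (0);
}
```

main() can then translate a nonzero return into its exit status.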
[Path]
PathExists=/dev/zpool/%i/ready
Unit=zpool@%i.service
These templates are pretty clever.
RemainAfterExit=yes
ExecStartPre=/sbin/modprobe zfs
ExecStart=@sbindir@/zpool import -EN %i
ExecStop=@sbindir@/zpool export %i
We probably don't want the export.
ExecStop=@sbindir@/zpool export %i

[Install]
WantedBy=local-fs.target
If you added WantedBy=zfs-mount.service and renamed this to [email protected], then this could be a third pool import option. Users would then just need to disable the other zfs-* services if they didn't want to use them. This would allow this same import service to be used for both use cases, which would be nice.
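A hedged sketch of what the [Install] section of the renamed template might look like under that suggestion (the unit name and wiring follow the comment above; nothing here beyond the existing local-fs.target line is taken from the actual patch):

```
# Hypothetical [email protected] install section, per the suggestion above.
[Install]
WantedBy=zfs-mount.service
WantedBy=local-fs.target
```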
@@ -2221,6 +2222,9 @@ zpool_do_import(int argc, char **argv)
		case 'D':
			do_destroyed = B_TRUE;
			break;
		case 'E':
			do_imported = B_TRUE;
			break;
You'll want to add this option to the man page and comment above. I'm surprised to see E was already part of the getopt string above.
	err = 0;
	log_history = B_FALSE;
	goto exit_early;
}
What about the other error cases below?
@behlendorf The whole detection and ready stuff is quite hacky. I'm hoping libzfs can already do this sort of stuff, but since I'm not familiar with it, I just came up with this prototype to mainly show that we can integrate well with systemd.
Is this ready for testing? I'd like to try it out.
@mailinglists35
What are the criteria for a pool being "ready"? For example, if I have a 6-disk raidz2 pool, is it ready when 4 devices arrive, when 6 devices arrive, or something else? The behavior I think is desired is: once 4 devices are ready, start a timer. If the timer expires or all 6 devices are ready, import the pool.

In terms of this being production-ready, it would definitely need to handle corner cases like duplicate pool names (e.g. old devices), etc. I haven't reviewed the code, so I'm not saying it doesn't do that now.

Am I correct in understanding that this always imports every available pool? How would I handle a system where I only want to import certain pools?

I'd personally like to mount all the filesystems upon pool import. I'd rather not have to add everything to /etc/fstab. Is there a clean way to do that?
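Purely as an illustration of the quorum-plus-timer behavior proposed above (the function, its parameters, and the policy itself are hypothetical and not something this PR implements):

```c
/*
 * Hypothetical readiness policy, sketched from the comment above: import as
 * soon as every device is present, or once enough devices for a (degraded)
 * import have been present longer than a grace period.
 */
#include <stdbool.h>
#include <time.h>

static bool
pool_ready(int present, int total, int max_missing,
    time_t first_quorum, time_t grace_sec)
{
	if (present >= total)
		return (true);			/* every device arrived */
	if (present >= total - max_missing)	/* degraded import is possible */
		return (time(NULL) - first_quorum >= grace_sec);
	return (false);				/* keep waiting */
}
```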
I think relying on
Nice. So that's how you solved the hierarchical zpool naming problem.
If zfs has code to mount filesystems upon import automatically (I recall there was something like that, though I don't use zfs anymore), then you could just pull in
Currently it waits for all devices to come up. I do want to have that configurable timeout-and-continue behavior, but I haven't figured out the best way to do it.

Yes, currently when zready finds the same pool name with a different pool guid, it will error out. I'm not sure if this is good enough. Feel free to make suggestions.

No. As I mentioned in the description:

We can add a helper script to make it easy to add or remove entries in fstab, or perhaps use a generator to generate systemd mount units. The point is that we want to let systemd handle the mounting, because systemd handles mounting much better than zfs does.
I can definitely understand wanting better integration with systemd, but going via fstab to get there doesn't seem like an optimal approach. Is there some other reasoning for wanting fstab integration that I'm not grokking, or is this primarily for targeting systemd? What happens when fstab disagrees with the ZFS mountpoint?
Why? Either you are going to generate systemd units for mountpoints directly or you use fstab. There is no other way to have support for hotplug and dependency tracking, unless you reimplement it from scratch.
If I understand correctly, the "ZFS mountpoints" are applied by zfs itself at import time. If that's what you are talking about, these mountpoints are not known to systemd before they appear, so for example they cannot be waited for or triggered via a unit's dependencies. That's the first side of the problem.

The second side of the problem is depending on a particular pool and its devices. systemd can wait for a device to appear before calling mount(8), or it can shut down the mountpoint and all services depending on it if a device disappears, but for that it needs to know the mapping between a mountpoint and a real block device. If that mapping is provided, it waits for the device node to appear and then waits for a udev rule to say "ready" on that device. This is used in btrfs: any device node belonging to a filesystem is marked "not ready" unless all device nodes are present. In zfs, this is a tougher problem because we mount pools, not devices.
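To make the mapping concrete, here is a minimal hand-written mount unit (the device and mountpoint names are placeholders, not from this PR). Because What= names a real block device, systemd ties the mount to the corresponding .device unit: it waits for the node to appear and be marked ready by udev, and can take the mount down again if the device disappears.

```
# Placeholder example only; to be valid, the file would need to be named
# srv-data.mount to match Where=.
[Unit]
Description=Example mount bound to a real block device

[Mount]
What=/dev/sdb1
Where=/srv/data
Type=ext4

[Install]
WantedBy=local-fs.target
```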
@intelfx wrote:
The former requires no user action, supports the standard ZFS workflow dynamically, and avoids requiring double-entry, and potential confusion if the
Seems like this should be doable if we can ensure that pool imports occur early enough and block until the .mount units corresponding to the enabled mountpoints are generated; after that, standard systemd dependency management should work as expected. I'm not certain how feasible this is, though.
It doesn't have to be a real block device though, right? This PR appears to be using a newly implemented device node for this purpose. So, is your only issue here determining which pools should be imported? Could a generator for pools be used here? So, if a user wants to disable importing a pool, they just disable the service? If the service default is disabled, there will be no surprise imports on boot either, though this does require the user to enable any pools they do want imported on boot.
The problem with using a generator is that you can't specify when a generator should fire; all generators fire early on, so you would need to have your pool imported by that time, and guaranteeing that would mean forcing the import in the initrd. That wouldn't be an issue if you have a root zpool and only one pool, but for everyone else you'd be delaying their boot time for no reason. Another problem is that the generator would need to establish proper dependencies with respect to the other, non-zfs mountpoints in fstab. But why bother doing this if systemd already does it for you when you use fstab?

This is not really an issue; we can just have a script to generate fstab. The current way also requires the user to enable zfs-mount.service, so the burden on the user is more or less the same. We could even hook up the script so that when you change a mountpoint a new fstab is automatically generated, but I'm not sure if that's a good idea.
Yes, except that it won't actually work. See below.
This is a big "no". This is inherently racy, and this is something I wanted to get rid of in the first place when I filed issue #4178. We can't rely on something being imported "early enough", because buses can be slow, controllers can be hotplugged, USB disks can be added and removed at runtime, and so on.
It does have to be a real block device, i.e. something to match in udev. This is not the only way, though: this PR requires the user to add a specific parameter in fstab that makes systemd add an artificial dependency on a service that waits for a pool and imports it (instead of adding a "natural" dependency on a block device).
I'm not sure this is actually a terrible thing. For users who want to optimise boot times by having some pools not imported during early boot (I suspect this is the minority), a kernel parameter could be used, as with other generators. Filesystems are going to need to mount before other services are started in any case, so the total time to boot is likely not changed if all pools have filesystems that will be mounted, but I can see other possible issues with trying to do this in initrd. Would it make sense, as an alternative to trying to discover all available pools early (I can see how this may be problematic), to use a cache file similar to Illumos? I will say that I'm not certain this generator approach is actually optimal, but I figure it's worth having the discussion now, rather than after code gets merged.
I think you'll find that systemd already does this for you if you've generated the mount units - they don't explicitly specify dependent filesystems, just a dependent device, and where the mountpoint is.
I'm not entirely opposed to this if it is the best we can do, but I think there are some issues to consider here. What happens if the user edits fstab? Do their changes get clobbered, or does the script stop working on conflict? In the latter case, what is the behaviour when fstab disagrees with the ZFS mountpoint? Which source wins? What about other ZFS props that interact with mounting? We lose the single source of truth (currently the ZFS props) unless the generator script runs automatically and always wins, but I guarantee this will cause confusion.
I think there are two scenarios to consider here - mountpoints required at boot (that's essentially everything in fstab, unless
You specifically mention in #4178 that it does not have to be a real block device, which is what I was getting at.
It is terrible, because you don't know how many pools there are to import during the initrd. You can use an additional config file, or worse, use a cache file like you said. But that has already proven to be painful.
systemd.generator(7) states:
So if
Closed in favor of #7329, which has been merged.
The
My mistake, reopening. Although it probably makes sense to open a new PR with a refreshed version of
So considering what I'm working on for NixOS, I should probably ask: is there a rough timeline for when this may be merged and released?
Agreed, I'd love it if it could make it in time for 0.8 :)
@@ -2443,12 +2452,13 @@ zpool_do_import(int argc, char **argv)
	}

	if (err == 1) {
exit_early:
Looks like a memory leak here?
Closing due to inactivity. If someone has the inclination to pick up this work, please open a new PR.
Recently, I have been cooking up some stuff for better integration with systemd and fstab. This is what I've got so far.
What does this achieve:
So how does this work?
I create a zready command, which udev will fire upon discovering ZFS devices. The zready command will read the label and build up the vdev tree in /dev/zpool/<pool>/. When it figures out that the pool is ready to import, it will create a file called /dev/zpool/<pool>/ready. (Note: the current zready is just a hack I came up with. Since I'm not familiar with the label stuff, it might not be able to handle every case.)

There are two systemd unit files, [email protected] and [email protected]. zpool@<pool>.path will listen on /dev/zpool/<pool>/ready, and zpool@<pool>.service will do the import with a dependency on zpool@<pool>.path. Note: you don't need to do any configuration with these two unit files if you use fstab as described below, but you can still explicitly enable a [email protected] if you want.

To let systemd mount stuff on boot, you only need to add an entry like this in fstab:
rpool/home /home zfs rw,defaults,[email protected]
This will automatically make the mount depend on the service unit.
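For reference, a hedged sketch of roughly what systemd-fstab-generator derives from that line (the real unit is written under /run/systemd/generator/ and contains more boilerplate; the x-systemd.requires option becomes Requires= and After= dependencies on the import service):

```
# Approximate shape of the generated home.mount; not the literal output.
[Unit]
Requires=zpool@rpool.service
After=zpool@rpool.service

[Mount]
What=rpool/home
Where=/home
Type=zfs
Options=rw,defaults
```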
For using zfs as root, each distribution probably has its own way to do it. But if you use systemd in the initrd, you can add
root=rpool/root rootfstype=zfs [email protected]
to the kernel cmdline. systemd will be able to automatically build the dependency just as when using fstab.
Signed-off-by: Chunwei Chen [email protected]