libzfs_init() should busy-wait on module initialization #3427

Closed
wants to merge 1 commit into from

Conversation

ryao
Contributor

@ryao ryao commented May 18, 2015

libzfs_init()'s just-in-time load of the module before using it is
racy because Linux kernel module initialization is asynchronous. This
causes a sporadic failure whenever libzfs_init() is required to load
the kernel modules. This happens during the boot process on EPEL
systems, Fedora and likely others such as Ubuntu.

The general mode of failure is that libzfs_init() is expected to load
the module, but module initialization does not complete before /dev/zfs
is opened, so pool import fails. This could explain the infamous
mountall failure on Ubuntu where pools will import, but things fail to
mount. The likely explanation is that the userland process expected to
mount things fails because the module loses the race with
libzfs_init(); the module then imports the pools by reading
zpool.cache, and nothing mounts because the userland process that was
expected to perform the mount has already failed.

A related issue can also manifest itself in initramfs archives that
mount / on ZFS, which affected Gentoo until 2013 when a busy-wait was
implemented to ensure that the module loaded:

https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=c812c35100771bb527f6b03853fa6d8ef66a48fe
https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=a21728ae287e988a1848435ab27f7ab503def784
https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=32585f117ffbf6d6a0aa317e6876ae7711a7f307

The busy-wait approach was chosen because it imposed minimal latency and
was implementable in shell code. Unfortunately, it was not known at the
time that libzfs_init() had the same problem, so this went unfixed. It
caused sporadic failures in the flocker tutorial, which caught our
attention at ClusterHQ:

https://clusterhq.atlassian.net/browse/FLOC-1834

Subsequent analysis following reproduction in a development environment
concluded that the failures were caused by module initialization losing
the race with libzfs_init(). While all Linux kernel modules needed
early in the boot process suffer from this race, the zfs module's
dependence on additional modules makes it particularly vulnerable to
this issue. The solution chosen here mirrors the solution chosen for
genkernel, with the addition of sched_yield() for greater efficiency.
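
As an illustration only, the following is a minimal sketch of that
busy-wait, not the actual patch; the 10 second timeout and the helper
name are assumptions:

```c
/* Minimal sketch of the described busy-wait: retry open("/dev/zfs")
 * and yield the CPU between attempts so module initialization can
 * make progress; give up after an assumed 10 seconds. */
#include <fcntl.h>
#include <sched.h>
#include <time.h>

static int
wait_for_zfs_device(void)
{
	struct timespec start, now;
	int fd;

	(void) clock_gettime(CLOCK_MONOTONIC, &start);

	for (;;) {
		/* Succeeds once module init has created /dev/zfs. */
		fd = open("/dev/zfs", O_RDWR);
		if (fd >= 0)
			return (fd);

		(void) clock_gettime(CLOCK_MONOTONIC, &now);
		if (now.tv_sec - start.tv_sec >= 10)
			return (-1);	/* timed out */

		(void) sched_yield();
	}
}
```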

This fails to close the race in the scenario where execution of a
virtual machine is paused in exactly the window needed to introduce a
delay between a failed attempt and the subsequent retry that exceeds
the timeout. Closing the race in that situation would require hooking
into udev and/or the kernel hotplug events. That has been left as a
future improvement because it would require significant development
time, and it is quite likely that the busy-wait approach implemented
here would still be required as a fallback on exotic systems where
neither is available. The chosen approach should be sufficient for
achieving >99.999% reliability.

Closes #2556

Signed-off-by: Richard Yao [email protected]
Reviewed-by: Turbo Fredriksson [email protected]

@ryao
Contributor Author

ryao commented May 18, 2015

@behlendorf
Contributor

@ryao @FransUrbo @dun good timing on this. This little bit of code had just come to my attention for similar reasons, specifically the updated init scripts. My inclination was actually to resolve this problem in an entirely different way, but I wanted you guys' input.

I definitely agree with the analysis: there is a race between module load and opening the /dev/zfs device, and it's a race we could very well lose since the device node is created asynchronously. However, perhaps the right thing to do here is nothing, i.e. never auto-load the module. Instead, push this off to whatever init system is in use to ensure the module is loaded prior to running any zfs/zpool commands.

With the notable exception of the mount command, which tries to load a module based on the filesystem type, this is the customary way of handling things under Linux. This code was added long ago only as a convenience, and it makes some dubious assumptions. Here are my thoughts on why we should not auto-load the module:

  • Should be done explicitly in the boot process (although mount(8) does a similar trick).
  • Makes the failure mode in this case deterministic.
  • Assumes the caller has root permissions, which may not be true when delegations are supported.
  • Doesn't cleanly handle the case where ZFS is built directly into the kernel.
  • Daemons such as the ZED probably should not trigger a module load.
  • More divergence from illumos (not a huge deal here since there are other modifications).

So what do you think? What are the arguments for keeping this functionality? Did I miss any for removing it?

@FransUrbo
Contributor

I've said it before: I've never liked the autoloading of the module. NO OTHER system/software that I know of does this (in Linux at least). If I have a filesystem on a device for which I have not loaded a module, fsck and/or mount does not load it for me.

Having this is just a convenience for new users, but it teaches them bad habits. I personally would vote for the auto-loading of modules to be removed.

I prefer to do this outside of ZoL (in the init scripts, [my] initramfs scripts and what not). It's … "cleaner".

HOWEVER, if we decide to keep the auto loading of modules, then this [PR] is the very least we can do.

@ryao
Contributor Author

ryao commented May 18, 2015

I have never liked it either. However, removing it as soon as we do something better could cause problems for people updating to the next release like the hostid change did in 0.6.4. That would increase the support load in IRC and on the mailing list. It could also open us to the understandable criticism that we break things unnecessarily. I would rather fix this and call it deprecated in the next release. We could remove it in the future after we are certain that users have had sufficient time to migrate to the new method of loading the modules.

One way of doing a transition would be to add a DEPRECATED_ZFS_MODULE_LOADING environment variable that allows us to dynamically turn off this functionality in environments where it should not be used, permitting both us and users to verify that things are properly migrated ahead of actual removal. That way we could stage it: in the next version, we would set DEPRECATED_ZFS_MODULE_LOADING=no in our initialization scripts; in the version that follows, the behavior would be off by default and setting DEPRECATED_ZFS_MODULE_LOADING=yes would turn it back on. Then we could remove it entirely at some later point, after we are sure everyone has migrated.

If we implement an environment variable, we should also have a discussion about the right way to load the modules, not just in the init scripts, but in initramfs archives as well.

@ryao
Contributor Author

ryao commented May 18, 2015

@behlendorf I suspect that this patch is something that would be reasonable to backport to 0.6.4 while other options are not. It is your call though.

@behlendorf
Contributor

OK, then since nobody likes it let's get rid of it. I thought you guys might have a reason for keeping it but it sure doesn't sound that way.

I agree that we should phase it out slowly to mitigate any unexpected breakage which might occur, and an environment variable is a totally reasonable way to do this. However, I'd suggest using just ZFS_MODULE_LOADING for the name and making it behave like ZFS_ABORT, where the mere existence of the variable is enough to enable it. That keeps things simple, and this is code we fully intend to remove fairly soon and expect most people not to use.
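
For illustration only, here is a sketch of such an existence-style check; the variable name comes from the suggestion above, while the modprobe invocation and the function name are assumptions rather than the final implementation:

```c
/* Illustration only: gate the legacy just-in-time load on the mere
 * existence of ZFS_MODULE_LOADING; the modprobe path and function
 * name are hypothetical. */
#include <stdlib.h>

static void
maybe_load_module(void)
{
	/* Skip the compatibility modprobe unless the user opted in. */
	if (getenv("ZFS_MODULE_LOADING") == NULL)
		return;

	/* Best effort; failure surfaces later when /dev/zfs is opened. */
	(void) system("/sbin/modprobe zfs");
}
```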

@ryao let me provide some additional comments about this patch, and if you could refresh it to behave as we've described we could take it for an 0.6.4.2.

@dun
Contributor

dun commented May 18, 2015

I should have a pull request ready within the next couple of weeks so the zed will no longer be dependent upon this autoload behavior.

@dweeezil
Contributor

I've not been following this issue too closely, but I wanted to chime in with a reminder, in case it matters in this context, of another contender in the "load the module" race: our standard packaging includes a udev rules file, 90-zfs.rules, which will load the module as soon as a block device partition is identified by libblkid as either "zfs" or "zfs_member".

@behlendorf
Contributor

@dweeezil Good point, that definitely improves the odds the modules will already be loaded. It also looks like we're going to want to update the systemd units to do a modprobe zfs before the import.

@ryao
Contributor Author

ryao commented May 18, 2015

@behlendorf If you update the systemd units to load the module, how will they ensure that /dev/zfs has actually appeared before they run zpool import?

I had this exact problem in genkernel, which is why I implemented the busy wait in shell code.

@behlendorf
Contributor

@ryao I suppose we'll need some sort of polling loop like in the init scripts. Although there might be a better way, like creating another unit file which does the modprobe and then having the import units require the /dev/zfs file? I'm no expert, but you're right, it's a wrinkle we'll need to contend with.

@ryao
Contributor Author

ryao commented May 18, 2015

@behlendorf A colleague had the same idea. Unfortunately, specifying a required file prevents a service from starting when the file is not found, rather than making it wait for the file to appear.

@behlendorf
Contributor

@ryao we'll need to do a little research then. This certainly sounds like it should be a solved problem.

@behlendorf
Contributor

@ryao upon further reflection I think this needs to be broken into two parts.

  • Part 1: When ZFS_MODULE_LOADING is set, it triggers the modprobe for compatibility. This is disabled by default; better to break it now and deal with any potential fallout.
  • Part 2: When /sys/module/zfs exists but /dev/zfs does not, block for up to N seconds waiting for it. When /sys/module/zfs doesn't exist, it can immediately return an error (a rough sketch follows the example command below).

This allows the init scripts and systemd units to safely do the following, which simplifies things:

modprobe zfs; zpool import -a
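
To make Part 2 concrete, here is a hedged sketch under assumed parameters; the 10 second budget, the 1 ms polling interval, and the function name are illustrative rather than the final patch:

```c
/* Sketch of "Part 2": fail fast if the module was never loaded,
 * otherwise poll until /dev/zfs appears or the assumed budget of
 * ~10 seconds expires. */
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

static int
wait_for_dev_zfs(void)
{
	const struct timespec interval = { 0, 1000000 };	/* 1 ms */
	int i, fd;

	/* No /sys/module/zfs means the module is not loaded: error out. */
	if (access("/sys/module/zfs", F_OK) != 0)
		return (-ENOENT);

	/* Module is present; wait for the device node to be created. */
	for (i = 0; i < 10000; i++) {
		fd = open("/dev/zfs", O_RDWR);
		if (fd >= 0)
			return (fd);

		(void) nanosleep(&interval, NULL);
	}

	return (-ETIMEDOUT);
}
```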

@ryao
Contributor Author

ryao commented May 18, 2015

@behlendorf How do you propose that we block?

@behlendorf
Contributor

@ryao either busy-wait or poll/sleep until the device shows up. I'd argue for polling every 1 ms or so, but if you'd prefer to busy-wait that's OK. In practice this is such a tiny optimization that I doubt it matters either way.

@behlendorf
Contributor

@ryao @FransUrbo I'm proposing #3431 which I believe addresses all the concerns raised.

@behlendorf behlendorf closed this May 19, 2015
Successfully merging this pull request may close these issues.

ZFS pool not mounted on boot on Ubuntu 14.04.1 (trusty)