libzfs_init() should busy-wait on module initialization
`libzfs_init()`'s just-in-time load of the module before using it is
racy because Linux kernel module initialization is asynchronous. This
causes sporadic failures whenever `libzfs_init()` is required to load
the kernel modules. This happens during the boot process on EPEL
systems, Fedora, and likely others such as Ubuntu.

The general mode of failure is that `libzfs_init()` is expected to load
the module, but module initialization does not complete before /dev/zfs
is opened, so pool import fails. This could explain the infamous
mountall failure on Ubuntu where pools will import, but things fail to
mount. The likely explanation is that the userland process expected to
mount things fails because module initialization loses the race with
`libzfs_init()`; the module then imports the pools by reading
zpool.cache, and nothing mounts because the userland process expected
to perform the mount has already failed.

A related issue can also manifest itself in initramfs archives that
mount / on ZFS, which affected Gentoo until 2013 when a busy-wait was
implemented to ensure that the module loaded:

https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=c812c35100771bb527f6b03853fa6d8ef66a48fe
https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=a21728ae287e988a1848435ab27f7ab503def784
https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=32585f117ffbf6d6a0aa317e6876ae7711a7f307

The busy-wait approach was chosen because it imposed minimal latency and
was implementable in shell code.  Unfortunately, it was not known at the
time that `libzfs_init()` had the same problem, so this went unfixed. It
caused sporadic failures in the flocker tutorial, which caught our
attention at ClusterHQ:

https://clusterhq.atlassian.net/browse/FLOC-1834

Subsequent analysis following reproduction in a development environment
concluded that the failures were caused by module initialization losing
the race with `libzfs_init()`. While all Linux kernel modules needed
ASAP during the boot process suffer from this race, the zfs module's
dependence on additional modules makes it particularly vulnerable to this
issue. The solution that has been chosen mirrors the solution chosen for
genkernel with the addition of `sched_yield()` for greater efficiency.

This fails to close the race in the scenario where system execution in
a virtual machine is paused in the exact window necessary to introduce
a delay between a failure and the subsequent retry that is greater than
the timeout. Closing the race in that situation would require hooking
into udev and/or the kernel hotplug events. That has been left as a
future improvement because it would require significant development
time, and it is quite likely that the busy-wait approach implemented
here would still be required as a fallback on exotic systems where
neither is available. The chosen approach should be sufficient for
achieving >99.999% reliability.

Closes openzfs#2556

Signed-off-by: Richard Yao <[email protected]>
Reviewed-by: Turbo Fredriksson <[email protected]>
ryao committed May 18, 2015
1 parent 98b2541 commit a554c31
lib/libzfs/libzfs_util.c: 37 additions, 1 deletion
@@ -676,6 +676,7 @@ libzfs_handle_t *
libzfs_init(void)
{
	libzfs_handle_t *hdl;
	hrtime_t begin, delta;

	if (libzfs_load_module("zfs") != 0) {
		(void) fprintf(stderr, gettext("Failed to load ZFS module "
@@ -688,7 +689,42 @@ libzfs_init(void)
		return (NULL);
	}

	if ((hdl->libzfs_fd = open(ZFS_DEV, O_RDWR)) < 0) {
	/*
	 * Linux module loading is asynchronous. It is therefore possible for
	 * us to try to open ZFS_DEV before the module has reached the point in
	 * its initialization where it has created it. We work around this by
	 * yielding the CPU in the hope that the module initialization process
	 * finishes before we regain it. The expectation in these situations is
	 * that the module initialization process will almost always finish
	 * before the second try. However, we retry for up to a second before
	 * giving up. Doing this allows us to implement a busy-wait with
	 * minimal loss of CPU time.
	 *
	 * If a VM that loses this race is paused between the first failure and
	 * the time calculation for more than a second, this will still fail.
	 * That is an incredibly rare situation that will almost never happen
	 * in the field. The solution is to hook into udev's kernel events to
	 * try to find out when module load has finished, but we would still
	 * need the busy-wait fallback for systems that either lack udev or
	 * have not had the udev daemon started. The busy-wait is more than
	 * sufficient for >99.999% reliability, so the implementation of udev
	 * integration has been left as a future improvement.
	 *
	 * XXX: Hook into udev event notification where udev is available.
	 */
	begin = gethrtime();
	do {

		if ((hdl->libzfs_fd = open(ZFS_DEV, O_RDWR)) != -1 ||
		    errno == ENOENT)
			break;

		sched_yield();

		delta = gethrtime() - begin;
	} while (delta < NANOSEC);

@behlendorf commented May 18, 2015:

Let's move this logic into libzfs_load_module() with the following changes (see the sketch after this list):

  • When ZFS_MODULE_LOADING is not defined, simply return (0).
  • When !libzfs_module_loaded(), trigger libzfs_run_process("/sbin/modprobe", argv, 0).
  • Add a loop like this which blocks waiting for the /dev/zfs device to be created. Although I think it would be simpler to structure it as a simple for loop and call usleep(1000) to poll.
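
For illustration only (an editor's sketch, not part of the commit or this thread): one way libzfs_load_module() could be restructured along the lines above. The helper names libzfs_module_loaded() and libzfs_run_process() are taken from the comment, with assumed prototypes; treating ZFS_MODULE_LOADING as an environment variable and the specific return values are also assumptions.

	/* Editor's sketch; helper prototypes are assumed, not verified. */
	#include <errno.h>
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define	ZFS_DEV	"/dev/zfs"

	extern int libzfs_module_loaded(const char *module);		/* assumed */
	extern int libzfs_run_process(const char *path, char **argv, int flags); /* assumed */

	static int
	libzfs_load_module_sketch(char *module)
	{
		char *argv[3] = { "/sbin/modprobe", module, NULL };

		/* If module loading has not been requested, do nothing. */
		if (getenv("ZFS_MODULE_LOADING") == NULL)
			return (0);

		if (!libzfs_module_loaded(module) &&
		    libzfs_run_process("/sbin/modprobe", argv, 0) != 0)
			return (ENOEXEC);

		/* Module init is asynchronous: poll for /dev/zfs, ~1 ms at a time. */
		for (int i = 0; i < 1000; i++) {
			int fd = open(ZFS_DEV, O_RDWR);

			if (fd >= 0) {
				(void) close(fd);
				return (0);
			}
			if (errno != ENOENT)
				return (errno);
			(void) usleep(1000);
		}

		return (ENOENT);
	}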

@ryao (author) commented May 18, 2015:

"When ZFS_MODULE_LOADING is not defined simply return (0)" will mean that we are essentially disabling this in the next release. That would be roughly the equivalent of removing it outright in terms of the community support load. That is not quite what I had in mind when I suggested we deprecate this functionality.

As for "Add a loop like this which blocks waiting for the /dev/zfs device to be created", I would have done that if I knew a sane way of doing it across all systems using ZoL.

Doing this on systems using udev would involve using libudev to monitor for an add event on the misc character device type, such that we would be woken up for every event. I am not sure how to do this on systems using mdev. If I ignore them and assume udev is around, it is doable, but it is not clear it would be better than a busy-wait CPU-time-wise, and it would certainly annoy Gentoo users that use mdev (or even static /dev nodes). I realize that we are already somewhat dependent on udev because I have had complaints from those users about it. Going with libudev would generate much louder complaints. We could try using inotify, but that involves its own set of wake-ups.

Assuming that we do settle on a way of blocking, we would need to implement a timer to handle the situation where a bug causes /dev/zfs to never appear, and to spin anyway for the case where a VM running this is paused for long enough at just the right time to bypass the safety mechanism. Doing two blocking spins with timers, each set to half the total time, should be sufficient unless someone decides to step through machine execution in gdb.

The busy-wait, in comparison, is portable, easy to implement, and quite possibly more efficient time-wise. The use of sched_yield() should enable us to spin only once when /dev/zfs loses the race, although more iterations are possible. Developing this further when we will likely remove it in the future seems like a poor use of time.
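
For illustration only (an editor's sketch, not part of this thread): roughly what the libudev monitoring approach described above might look like on a udev-based system. The "misc" subsystem filter, the timeout handling, and the error paths are assumptions, and a real implementation would also need to check whether /dev/zfs already exists before waiting.

	/* Editor's sketch of a libudev-based wait for /dev/zfs; details assumed. */
	#include <libudev.h>
	#include <poll.h>
	#include <string.h>

	static int
	wait_for_dev_zfs(int timeout_ms)
	{
		struct udev *udev;
		struct udev_monitor *mon;
		struct pollfd pfd;
		int ret = -1;

		if ((udev = udev_new()) == NULL)
			return (-1);

		if ((mon = udev_monitor_new_from_netlink(udev, "udev")) == NULL) {
			udev_unref(udev);
			return (-1);
		}

		/* Only wake up for "misc" devices; /dev/zfs is a misc char device. */
		(void) udev_monitor_filter_add_match_subsystem_devtype(mon, "misc", NULL);
		(void) udev_monitor_enable_receiving(mon);

		pfd.fd = udev_monitor_get_fd(mon);
		pfd.events = POLLIN;

		/* Note: in this sketch the timeout restarts on every unrelated event. */
		while (poll(&pfd, 1, timeout_ms) > 0) {
			struct udev_device *dev = udev_monitor_receive_device(mon);
			const char *node;

			if (dev == NULL)
				continue;
			node = udev_device_get_devnode(dev);
			if (node != NULL && strcmp(node, "/dev/zfs") == 0) {
				udev_device_unref(dev);
				ret = 0;
				break;
			}
			udev_device_unref(dev);
		}

		udev_monitor_unref(mon);
		udev_unref(udev);
		return (ret);
	}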

@ryao (author) commented May 18, 2015:
If we do link to libudev anyway, we are going to have to check whether it is an old version that is GPL-encumbered or not. Otherwise, we would risk linking a CDDL binary to GPLv2 code.

@behlendorf commented May 18, 2015:
Yes, I was hoping we could disable this in the next release. One of the reasons I'd like to do this is so that the ZED doesn't automatically pull in the modules. It shouldn't trigger this behavior, and having libzfs_init() do this somewhat complicates the ZED error handling.

Is your main concern that we won't be able to get the various modprobe tweaks into the packaging and that this will cause a support issue? I admit that's possible, but perhaps that's all the more reason to make the change now, so there's enough time before the tag for those updates to be made and soak.

"waiting for the /dev/zfs device to be created"

Actually, I didn't have anything nearly so sophisticated in mind. I just thought it might be cleaner to poll with usleep() rather than busy-wait. Both would be very portable. I didn't mean we should attempt to tie into udev. Roughly something like this:

        for (int i = 0; i < 1000; i++) {
                fd = open(DEV_ZFS, ...);
                if (fd == -1 && errno != ENOENT)
                        return (errno);

                usleep(1000);
        }

	if ((hdl->libzfs_fd == -1)) {
		(void) fprintf(stderr, gettext("Unable to open %s: %s.\n"),
		    ZFS_DEV, strerror(errno));

@behlendorf commented May 18, 2015:
While we're here, please remove the fprintf() here and above. This is a library call, and it should never log anything to stderr or stdout. That's a long-standing mistake. These error messages should be moved to the zfs/zpool commands.

		if (errno == ENOENT)
