From d6fbda44af70f65cbb57dff281a38d54657a8c74 Mon Sep 17 00:00:00 2001
From: Richard Yao
Date: Mon, 18 May 2015 10:11:13 -0400
Subject: [PATCH] libzfs_init() should busy-wait on module initialization

`libzfs_init()`'s just-in-time load of the module before using it is racy
because Linux kernel module initialization is asynchronous. This causes a
sporadic failure whenever `libzfs_init()` is required to load the kernel
modules. This happens during the boot process on EPEL systems, Fedora and
likely others such as Ubuntu.

The general mode of failure is that `libzfs_init()` is expected to load the
module, module initialization does not complete before /dev/zfs is opened,
and pool import fails. This could explain the infamous mountall failure on
Ubuntu where pools will import, but things fail to mount. The general
explanation is that the userland process expected to mount things fails
because the module loses the race with `libzfs_init()`, the module then
loads the pools by reading the zpool.cache, and nothing mounts because the
userland process expected to perform the mount has already failed.

A related issue can also manifest itself in initramfs archives that mount /
on ZFS, which affected Gentoo until 2013, when a busy-wait was implemented
to ensure that the module loaded:

https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=c812c35100771bb527f6b03853fa6d8ef66a48fe
https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=a21728ae287e988a1848435ab27f7ab503def784
https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=32585f117ffbf6d6a0aa317e6876ae7711a7f307

The busy-wait approach was chosen because it imposed minimal latency and was
implementable in shell code. Unfortunately, it was not known at the time
that `libzfs_init()` had the same problem, so this went unfixed. It caused
sporadic failures in the Flocker tutorial, which caught our attention at
ClusterHQ:

https://clusterhq.atlassian.net/browse/FLOC-1834

Subsequent analysis following reproduction in a development environment
concluded that the failures were caused by module initialization losing the
race with `libzfs_init()`. While all Linux kernel modules needed ASAP during
the boot process suffer from this race, the zfs module's dependence on
additional modules makes it particularly vulnerable to this issue.

The solution chosen here mirrors the genkernel solution, with the addition
of `sched_yield()` for greater efficiency. It still fails to close the race
in the scenario where a virtual machine's execution is paused in exactly the
window needed to introduce a delay greater than the timeout between a
failure and the subsequent retry. Closing the race in that situation would
require hooking into udev and/or the kernel hotplug events. That has been
left as a future improvement because it would require significant
development time, and the busy-wait approach implemented here would quite
likely still be required as a fallback on exotic systems where neither is
available. The chosen approach should be sufficient for achieving >99.999%
reliability.
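
For illustration, the busy-wait can be sketched as a small standalone
program. This is only an approximation of the patch below: the use of
clock_gettime() in place of the libspl gethrtime(), the TIMEOUT_NS macro,
and the hard-coded /dev/zfs path are choices made for the example, not part
of libzfs.

#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define	TIMEOUT_NS	1000000000LL	/* retry budget: one second */

static long long
monotonic_ns(void)
{
	struct timespec ts;

	(void) clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((long long)ts.tv_sec * 1000000000LL + ts.tv_nsec);
}

int
main(void)
{
	long long begin = monotonic_ns();
	int fd = -1;

	/* Retry the open, yielding the CPU, until success or timeout. */
	do {
		if ((fd = open("/dev/zfs", O_RDWR)) != -1)
			break;
		sched_yield();
	} while (monotonic_ns() - begin < TIMEOUT_NS);

	if (fd == -1) {
		perror("open /dev/zfs");
		return (1);
	}

	(void) printf("opened /dev/zfs after %lld ns\n",
	    monotonic_ns() - begin);
	return (0);
}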
Closes zfsonlinux/zfs#2556
Signed-off-by: Richard Yao
Reviewed-by: Turbo Fredriksson
---
 lib/libzfs/libzfs_util.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/lib/libzfs/libzfs_util.c b/lib/libzfs/libzfs_util.c
index d340fa49ded9..33b393a2fe1c 100644
--- a/lib/libzfs/libzfs_util.c
+++ b/lib/libzfs/libzfs_util.c
@@ -676,6 +676,7 @@ libzfs_handle_t *
 libzfs_init(void)
 {
 	libzfs_handle_t *hdl;
+	hrtime_t begin, delta;
 
 	if (libzfs_load_module("zfs") != 0) {
 		(void) fprintf(stderr, gettext("Failed to load ZFS module "
@@ -688,7 +689,41 @@ libzfs_init(void)
 		return (NULL);
 	}
 
-	if ((hdl->libzfs_fd = open(ZFS_DEV, O_RDWR)) < 0) {
+	/*
+	 * Linux module loading is asynchronous. It is therefore possible for
+	 * us to try to open ZFS_DEV before the module has reached the point in
+	 * its initialization where it has created it. We work around this by
+	 * yielding the CPU in the hope that the module initialization process
+	 * finishes before we regain it. The expectation in these situations is
+	 * that the module initialization process will almost always finish
+	 * before the second try. However, we retry for up to a second before
+	 * giving up. Doing this allows us to implement a busy-wait with
+	 * minimal loss of CPU time.
+	 *
+	 * If a VM that loses this race is paused between the first failure and
+	 * time calculation for more than a second, this will still fail. That
+	 * is an incredibly rare situation that will almost never happen in the
+	 * field. The solution is to hook into udev's kernel events to try to
+	 * find out when module load has finished, but we would still need the
+	 * busy-wait fallback for systems that either lack udev or have not had
+	 * the udev daemon started. The busy-wait is more than sufficient for
+	 * >99.999% reliability, so the implementation of udev integration has
+	 * been left as a future improvement.
+	 *
+	 * XXX: Hook into udev event notification where udev is available.
+	 */
+	begin = gethrtime();
+	do {
+
+		if ((hdl->libzfs_fd = open(ZFS_DEV, O_RDWR)) != -1)
+			break;
+
+		sched_yield();
+
+		delta = gethrtime() - begin;
+	} while (delta < NANOSEC);
+
+	if (hdl->libzfs_fd == -1) {
 		(void) fprintf(stderr, gettext("Unable to open %s: %s.\n"),
 		    ZFS_DEV, strerror(errno));
 		if (errno == ENOENT)
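
As a sketch of the udev integration that the XXX comment above leaves for
later, a libudev monitor on the "misc" subsystem could be polled until
/dev/zfs appears. Everything here is hypothetical: the helper name, the
timeout handling, and the assumption that a libudev monitor is the mechanism
that would ultimately be used.

#include <libudev.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for /dev/zfs to appear. */
static int
wait_for_zfs_dev(int timeout_ms)
{
	struct udev *udev;
	struct udev_monitor *mon;
	struct pollfd pfd;
	int ret = -1;

	if ((udev = udev_new()) == NULL)
		return (-1);

	if ((mon = udev_monitor_new_from_netlink(udev, "udev")) == NULL) {
		udev_unref(udev);
		return (-1);
	}

	/* /dev/zfs is a misc character device. */
	(void) udev_monitor_filter_add_match_subsystem_devtype(mon,
	    "misc", NULL);
	(void) udev_monitor_enable_receiving(mon);

	/* The module may already be loaded; check before waiting. */
	if (access("/dev/zfs", F_OK) == 0) {
		ret = 0;
		goto out;
	}

	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	while (ret != 0 && poll(&pfd, 1, timeout_ms) > 0) {
		struct udev_device *dev;
		const char *node;

		if ((dev = udev_monitor_receive_device(mon)) == NULL)
			continue;
		node = udev_device_get_devnode(dev);
		if (node != NULL && strcmp(node, "/dev/zfs") == 0)
			ret = 0;
		udev_device_unref(dev);
	}
out:
	udev_monitor_unref(mon);
	udev_unref(udev);
	return (ret);
}

A complete version would also recompute the remaining timeout across poll()
calls and fall back to the busy-wait above when udev is unavailable or its
daemon has not been started.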