libzfs_init() should busy-wait on module initialization
`libzfs_init()`'s just-in-time load of the module before using it is
racy because Linux kernel module initialization is asynchronous. This
causes sporadic failures whenever `libzfs_init()` is required to load
the kernel modules. This happens during the boot process on EPEL
systems, Fedora, and likely others such as Ubuntu.

The general mode of failure is that `libzfs_init()` is expected to load
the module, but module initialization does not complete before /dev/zfs
is opened, so pool import fails. This could explain the infamous
mountall failure on Ubuntu where pools will import, but things fail to
mount. The likely explanation is that the userland process expected to
mount things fails because module initialization loses the race with
`libzfs_init()`; the module then imports the pools by reading
zpool.cache, and nothing mounts because the userland process expected
to perform the mount has already failed.

A related issue can also manifest itself in initramfs archives that
mount / on ZFS, which affected Gentoo until 2013 when a busy-wait was
implemented to ensure that the module loaded:

https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=c812c35100771bb527f6b03853fa6d8ef66a48fe
https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=a21728ae287e988a1848435ab27f7ab503def784
https://gitweb.gentoo.org/proj/genkernel.git/commit/defaults/initrd.scripts?id=32585f117ffbf6d6a0aa317e6876ae7711a7f307

The busy-wait approach was chosen because it imposed minimal latency and
was implementable in shell code.  Unfortunately, it was not known at the
time that `libzfs_init()` had the same problem, so this went unfixed. It
caused sporadic failures in the flocker tutorial, which caught our
attention at ClusterHQ:

https://clusterhq.atlassian.net/browse/FLOC-1834

Subsequent analysis following reproduction in a development environment
concluded that the failures were caused by module initialization losing
the race with `libzfs_init()`. While all Linux kernel modules needed
ASAP during the boot process suffer from this race, the zfs module's
dependence on additional modules makes it particularly vulnerable to this
issue. The solution that has been chosen mirrors the solution chosen for
genkernel with the addition of `sched_yield()` for greater efficiency.

This fails to close the race in the scenario where system execution in
a virtual machine is paused in the exact window necessary to introduce
a delay between a failure and the subsequent retry that is greater than
the timeout. Closing the race in that situation would require hooking
into udev and/or the kernel hotplug events. That has been left as a
future improvement because it would require significant development
time, and it is quite likely that the busy-wait approach implemented
here would still be required as a fallback on exotic systems where
neither is available. The chosen approach should be sufficient for
achieving >99.999% reliability.

Closes openzfs#2556

Signed-off-by: Richard Yao <[email protected]>
Reviewed-by: Turbo Fredriksson <[email protected]>
ryao committed May 18, 2015
1 parent 98b2541 commit a554c31
lib/libzfs/libzfs_util.c: 37 additions, 1 deletion
@@ -676,6 +676,7 @@ libzfs_handle_t *
libzfs_init(void)
{
	libzfs_handle_t *hdl;
	hrtime_t begin, delta;

	if (libzfs_load_module("zfs") != 0) {
		(void) fprintf(stderr, gettext("Failed to load ZFS module "
@@ -688,7 +689,42 @@ libzfs_init(void)
		return (NULL);
	}

	if ((hdl->libzfs_fd = open(ZFS_DEV, O_RDWR)) < 0) {
	/*
	 * Linux module loading is asynchronous. It is therefore possible for
	 * us to try to open ZFS_DEV before the module has reached the point in
	 * its initialization where it has created it. We work around this by
	 * yielding the CPU in the hope that the module initialization process
	 * finishes before we regain it. The expectation in these situations is
	 * that the module initialization process will almost always finish
	 * before the second try. However, we retry for up to a second before
	 * giving up. Doing this allows us to implement a busy-wait with
	 * minimal loss of CPU time.
	 *
	 * If a VM that loses this race is paused between the first failure and
	 * the time calculation for more than a second, this will still fail.
	 * That is an incredibly rare situation that will almost never happen
	 * in the field. The solution is to hook into udev's kernel events to
	 * try to find out when module load has finished, but we would still
	 * need the busy-wait fallback for systems that either lack udev or
	 * have not had the udev daemon started. The busy-wait is more than
	 * sufficient for >99.999% reliability, so the implementation of udev
	 * integration has been left as a future improvement.
	 *
	 * XXX: Hook into udev event notification where udev is available.
	 */
	begin = gethrtime();
	do {

		if ((hdl->libzfs_fd = open(ZFS_DEV, O_RDWR)) != -1 ||
		    errno == ENOENT)
			break;

		sched_yield();

		delta = gethrtime() - begin;
	} while (delta < NANOSEC);

@behlendorf commented May 18, 2015:

Let's move this logic into libzfs_load_module() with the following changes (see the sketch after this list):

  • When ZFS_MODULE_LOADING is not defined, simply return (0).
  • When !libzfs_module_loaded(), trigger libzfs_run_process("/sbin/modprobe", argv, 0).
  • Add a loop like this which blocks waiting for the /dev/zfs device to be created. Although I think it would be simpler to structure it as a simple for loop and call usleep(1000) to poll.
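
For illustration only (an editor's sketch, not part of the commit or this thread): one way libzfs_load_module() could be restructured along the lines above. The helper names libzfs_module_loaded() and libzfs_run_process() are taken from the comment, with assumed prototypes; treating ZFS_MODULE_LOADING as an environment variable and the specific return values are also assumptions.

	/* Editor's sketch; helper prototypes are assumed, not verified. */
	#include <errno.h>
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define	ZFS_DEV	"/dev/zfs"

	extern int libzfs_module_loaded(const char *module);		/* assumed */
	extern int libzfs_run_process(const char *path, char **argv, int flags); /* assumed */

	static int
	libzfs_load_module_sketch(char *module)
	{
		char *argv[3] = { "/sbin/modprobe", module, NULL };

		/* If module loading has not been requested, do nothing. */
		if (getenv("ZFS_MODULE_LOADING") == NULL)
			return (0);

		if (!libzfs_module_loaded(module) &&
		    libzfs_run_process("/sbin/modprobe", argv, 0) != 0)
			return (ENOEXEC);

		/* Module init is asynchronous: poll for /dev/zfs, ~1 ms at a time. */
		for (int i = 0; i < 1000; i++) {
			int fd = open(ZFS_DEV, O_RDWR);

			if (fd >= 0) {
				(void) close(fd);
				return (0);
			}
			if (errno != ENOENT)
				return (errno);
			(void) usleep(1000);
		}

		return (ENOENT);
	}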

@ryao (author) commented May 18, 2015:

"When ZFS_MODULE_LOADING is not defined simply return (0)" will mean that we are essentially disabling this in the next release. That would be roughly the equivalent of removing it outright in terms of the community support load. That is not quite what I had in mind when I suggested we deprecate this functionality.

As for "Add a loop like this which blocks waiting for the /dev/zfs device to be created", I would have done that if I knew a sane way of doing it across all systems using ZoL.

Doing this on systems using udev would involve using libudev to monitor for an add event on the misc character device type, such that we would be woken up for every event. I am not sure how to do this on systems using mdev. If I ignore them and assume udev is around, it is doable, but it is not clear it would be better than a busy-wait CPU-time-wise, and it would certainly annoy Gentoo users that use mdev (or even static /dev nodes). I realize that we are already somewhat dependent on udev because I have had complaints from those users about it. Going with libudev would generate much louder complaints. We could try using inotify, but that involves its own set of wake-ups.

Assuming that we do settle on a way of blocking, we would need to implement a timer to handle the situation where a bug causes /dev/zfs to never appear, and to spin anyway for the case where a VM running this is paused for long enough at just the right time to bypass the safety mechanism. Doing two blocking spins with timers, each set to half the total time, should be sufficient unless someone decides to step through machine execution in gdb.

The busy-wait, in comparison, is portable, easy to implement, and quite possibly more efficient time-wise. The use of sched_yield() should enable us to spin only once when /dev/zfs loses the race, although more iterations are possible. Developing this further when we will likely remove it in the future seems like a poor use of time.
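
For illustration only (an editor's sketch, not part of this thread): roughly what the libudev monitoring approach described above might look like on a udev-based system. The "misc" subsystem filter, the timeout handling, and the error paths are assumptions, and a real implementation would also need to check whether /dev/zfs already exists before waiting.

	/* Editor's sketch of a libudev-based wait for /dev/zfs; details assumed. */
	#include <libudev.h>
	#include <poll.h>
	#include <string.h>

	static int
	wait_for_dev_zfs(int timeout_ms)
	{
		struct udev *udev;
		struct udev_monitor *mon;
		struct pollfd pfd;
		int ret = -1;

		if ((udev = udev_new()) == NULL)
			return (-1);

		if ((mon = udev_monitor_new_from_netlink(udev, "udev")) == NULL) {
			udev_unref(udev);
			return (-1);
		}

		/* Only wake up for "misc" devices; /dev/zfs is a misc char device. */
		(void) udev_monitor_filter_add_match_subsystem_devtype(mon, "misc", NULL);
		(void) udev_monitor_enable_receiving(mon);

		pfd.fd = udev_monitor_get_fd(mon);
		pfd.events = POLLIN;

		/* Note: in this sketch the timeout restarts on every unrelated event. */
		while (poll(&pfd, 1, timeout_ms) > 0) {
			struct udev_device *dev = udev_monitor_receive_device(mon);
			const char *node;

			if (dev == NULL)
				continue;
			node = udev_device_get_devnode(dev);
			if (node != NULL && strcmp(node, "/dev/zfs") == 0) {
				udev_device_unref(dev);
				ret = 0;
				break;
			}
			udev_device_unref(dev);
		}

		udev_monitor_unref(mon);
		udev_unref(udev);
		return (ret);
	}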

@ryao (author) commented May 18, 2015:
If we do link to libudev anyway, we are going to have to check whether it is an old version that is GPL-encumbered or not. Otherwise, we would risk linking a CDDL binary to GPLv2 code.

@behlendorf commented May 18, 2015:
Yes, I was hoping we could disable this in the next release. One of the reasons I'd like to do this is so that the ZED doesn't automatically pull in the modules. It shouldn't trigger this behavior, and having libzfs_init() do this somewhat complicates the ZED error handling.

Is your main concern that we won't be able to get the various modprobe tweaks into the packaging and that this will cause a support issue? I admit that's possible, but perhaps that's all the more reason to make the change now, so there's enough time before the tag for those updates to be made and soak.

"waiting for the /dev/zfs device to be created"

Actually, I didn't have anything nearly so sophisticated in mind. I just thought it might be cleaner to poll with usleep() rather than busy-wait. Both would be very portable. I didn't mean we should attempt to tie into udev. Roughly something like this:

        for (int i = 0; i < 1000; i++) {
                fd = open(DEV_ZFS, ...);
                if (fd == -1 && errno != ENOENT)
                        return (errno);

                usleep(1000);
        }

	if ((hdl->libzfs_fd == -1)) {
		(void) fprintf(stderr, gettext("Unable to open %s: %s.\n"),
		    ZFS_DEV, strerror(errno));

@behlendorf commented May 18, 2015:
While we're here, please remove the fprintf() here and above. This is a library call, and it should never log anything to stderr or stdout. That's a long-standing mistake. These error messages should be moved to the zfs/zpool commands.

		if (errno == ENOENT)
