
Fix parallel mount race #8878

Merged: 1 commit merged into openzfs:master from the parallel1 branch on Jul 9, 2019
Conversation

@kusumi (Member) commented Jun 10, 2019

Motivation and Context

#8833
#8450

Description

new PR after update

Strategy of parallel mount is as follows.

1) Initial thread dispatching selects sets of mount points that don't
 have dependencies on other sets, hence threads can/should go lock-less
 and shouldn't race with other threads for other sets. Each thread
 dispatched corresponds to a top-level directory which may or may not
 have datasets mounted on subdirectories.

2) Subsequent recursive thread dispatching for each thread from 1) is
 to mount datasets for each set of mount points. The mount points within
 each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching can spawn more than
one thread for datasets with the same mount point, which puts those
threads in a race. This appeared as a mount issue on ZoL because ZoL has
different timing around mount(2) execution due to the fork(2)/exec(2) of
mount(8) when mounting.
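
To make the dispatching strategy above concrete, here is a minimal, self-contained sketch of the chunking idea. This is not the actual zfs_foreach_mountpoint() code; the helper and loop below are illustrative stand-ins. With the pre-fix containment check, the sorted list /a /a /a/b is split into two independently dispatched chunks, which is exactly the race described above.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative stand-in for libzfs_path_contains(): true if path2 lives
 * strictly under path1 (this mirrors the pre-fix logic).
 */
static bool
path_contains(const char *path1, const char *path2)
{
	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
}

int
main(void)
{
	/* Already sorted by mountpoint, as the dispatcher expects. */
	const char *mounts[] = { "/a", "/a", "/a/b" };
	const int n = sizeof (mounts) / sizeof (mounts[0]);

	/*
	 * Initial dispatching: each chunk [i, end) would be handed to its
	 * own worker thread.  Because the helper above does not treat "/a"
	 * as containing "/a", the two identical mountpoints end up in two
	 * different chunks, i.e. two threads racing for the same mount point.
	 */
	for (int i = 0; i < n; ) {
		int end = i + 1;
		while (end < n && path_contains(mounts[i], mounts[end]))
			end++;
		printf("one thread handles indices [%d, %d)\n", i, end);
		i = end;
	}
	return (0);
}

With the fix discussed later in this PR (treating identical paths as containing each other), the same loop produces a single chunk and hence a single thread.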

original PR (workaround)

Parallel mount behaves differently on ZoL because each pthread forks
another process to exec `/bin/mount`, whereas illumos and FreeBSD
directly use the `mount(2)` and `nmount(2)` syscalls, respectively.

This can cause parallel mount to mount datasets in an incorrect order
depending on when `/bin/mount` runs, and results in shadowing
dataset(s) whose mountpoint is a descendant of another's.

Having a global lock around `/bin/mount` (and also `/bin/umount`) call
adjusts the timing (eliminates the timing difference caused by
fork/exec) and makes its behavior similar to illumos and FreeBSD.

Note that this isn't to "fix" a race condition by proper locking.
Parallel mount by design doesn't guarantee ordering of threads when
multiple datasets have the same mount point. Adjusting is all it
can do. See #8833 for details.

How Has This Been Tested?

The test script mentioned in #8833.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@kusumi kusumi added the Status: Code Review Needed Ready for review and testing label Jun 10, 2019
@kusumi kusumi requested a review from behlendorf June 10, 2019 06:12
@sebroy commented Jun 10, 2019

Given that the bulk of the work for each mount thread is doing the mount, I would expect that putting a global lock around the mount effectively undoes most of the performance gain that would be achieved from the parallel mount algorithm. Is the performance of parallel mounting with this change similar to mounting with ZFS_SERIAL_MOUNT?

That said, I don't understand how the fix addresses the problem. The problematic scenario includes two pools that share some mount point parentage in their datasets as described in #8833.

/tmp # zfs list -o name,mounted,canmount,mountpoint
NAME     MOUNTED  CANMOUNT  MOUNTPOINT
test0         no        on  none
test0/a      yes        on  /a
test1         no       off  /a
test1/b      yes        on  /a/b

If I understand the problem description, the problem is that test1/b somehow gets mounted before test0/a even though test0/a's mountpoint is the parent of test1/b's mountpoint. This would appear to be happening because test0/a and test1 have the same top-level root mountpoint, and these are being kicked off in parallel, resulting in test1/b sometimes being mounted before test0/a. Is that right?

If my understanding is correct, then how does adding a lock around the mount address this? Given slightly different timing, could test1/b not be mounted and all locks dropped before the thread that attempts to mount test0/a gets a chance to run?

I could be wrong here (if so, my apologies, I'm trying to get my head fully around this), but if I'm not, could a fix instead be to not include any datasets whose canmount property is off in the list of datasets to mount?

@kusumi (Member Author) commented Jun 10, 2019

The problem I'm seeing here is that the initial delegation in zfs_foreach_mountpoint() can spawn more than one thread running concurrently for datasets with the same mount point. After that, the subsequent delegation in zfs_mount_task() is serialized.

In case of below,

NAME   MOUNTED  CANMOUNT  MOUNTPOINT
p1          no        on  none
p1/f1      yes        on  /mntpt
p2          no       off  /mntpt
p2/f2      yes        on  /mntpt/f2

mounting p1/f1 is dispatched from initial delegation, while mounting p2/f2 is dispatched from subsequent delegation from initial delegation for p2, so these two go parallel.

That being said, as mentioned in #8833, since p2/f2 is mounted at the cost of an extra pthread_create(3) or pthread_cond_signal(3) call, p1/f1 usually seems to win. That's what I observed on FreeBSD, and it is also the expected order, because otherwise p2/f2 is shadowed.

But when the extra complexity of fork(2)+exec(2) comes in, the mounts can go out of order, and that's what has been observed on ZoL. Having a global lock eliminates that, and keeps the behavior similar to that of FreeBSD (and probably illumos too).

Regarding the performance penalty, there can of course be one, but the mount(2) syscall itself has its own locking in kernel space. So even if ZoL didn't invoke /bin/mount, a bunch of threads trying to invoke mount(2) concurrently wouldn't actually run concurrently.

So to answer your question,

This would appear to be happening because test0/a and test1 have the same top-level root mountpoint, and these are being kicked off in parallel, resulting in test1/b sometimes being mounted before test0/a. Is that right?

Right.

If my understanding is correct, then how does adding a lock around the mount address this? Given slightly different timing, could test1/b not be mounted and all locks dropped before the thread that attempts to mount test0/a gets a chance to run?

I didn't say this change "fixes" or "addresses" the problem. Please read #8833 and this PR carefully. Instead, I described it as preventing the race by adjusting the timing so that it works similarly to FreeBSD (and probably illumos too).

As mentioned earlier, I think the real problem is the fact that two threads for the same mount point can run concurrently on initial delegation, which is caused by libzfs_path_contains() returning false for the same directory when determining the non-descendant index during initial delegation.

I didn't go so far as to change the parallel mount behavior itself, hence this commit.

@kusumi (Member Author) commented Jun 10, 2019

I was missing one of your questions.

I could be wrong here (if so, my apologies, I'm trying to get my head fully around this), but if I'm not, could a fix instead be to not include any datasets whose canmount property is off in the list of datasets to mount?

As mentioned in #8833, I don't think this is directly related to the canmount property. I mean that zfs_foreach_mountpoint() has nothing to do with the canmount property, while the problem I've pointed out is in that function.

@sebroy commented Jun 10, 2019

The problem I'm seeing here is that the initial delegation in zfs_foreach_mountpoint() can spawn >1 threads that run concurrently for datasets with the same mount point.

That's right, I came to the same conclusion. I also came to a separate conclusion that there should only be one such thread kicked off, because there should only be one filesystem for that top-level mount point that has a canmount property of on. Pruning the array of handles of filesystems that have canmount set to off addresses this issue, I think.

Circling back on the proposed fix in this PR, I don't understand how this fix would work. Furthermore, at the start of my initial comment in this PR, I pointed out that the fix undoes almost all of the benefits of parallel mounting, as now no mount can be done concurrently with another one. Am I mistaken? If not, how can we go forward with this?

@kusumi (Member Author) commented Jun 11, 2019

Circling back on the proposed fix in this PR, I don't understand how this fix would work. Furthermore, at the start of my initial comment in this PR, I point out that the fix undoes almost all of the benefits of parallel mounting, as now no mount can be done concurrent to another one. Am I mistaken? If not, how can we go forward with this?

As already mentioned, threads can't run concurrently all the way through mount(2). The locking around /bin/mount, which also eventually invokes mount(2), is to emulate that at a higher level, to catch the timing problem caused by fork/exec, which itself stems from the design issue in parallel mount (and we both agree on this). Maybe we can measure this (I haven't).

After all, this change is a workaround. For a real fix, the author of parallel mount should probably rewrite the initial dispatching logic that can race. The problem here is that threads can race even though this feature doesn't provide (and shouldn't provide) synchronization among threads.

@behlendorf (Contributor) commented:

I'm still trying to wrap my head around this as well. From the above conversation, it sounds like there's agreement on the root problem, and that this proposed change doesn't address that core issue. It merely adds some locking which happens to perturb the timing in such a way as to avoid the issue for this particular test case.

could a fix instead be to not include any datasets whose canmount property is off in the list of datasets to mount?

This sounds like a good idea and I think it would resolve this specific case. However, I don't believe it solves the general case. You can imagine other ways in which to interleave two (or more!) pools' datasets such that they would not be mounted in the correct order. Part of this confusion I think comes from this text in the man page.

One example of setting canmount=off is to have two datasets with the same
mountpoint, so that the children of both datasets appear in the same
directory, but might have different inherited characteristics.

@sebroy since you authored this feature please correct me if I'm wrong about the intended design. But it's my understanding that:

  1. The example given above assumes (but doesn't explicitly state) that the datasets in question must be part of the same pool. In which case, this should work fine. And,
  2. Since the processing of the root of each pool's datasets is handled concurrently, arbitrarily interleaving dataset mountpoints between pools was never intended to be a supported use case.

Assuming the above is true, and I haven't misunderstood the intent of parallel mounts, perhaps the best thing to do here would be to instead update the documentation to make this limitation clear. Of course, if someone wants to add support for this kind of arbitrary interleaving of datasets that would be wonderful.

@kusumi (Member Author) commented Jun 12, 2019

@behlendorf
I just wrote and tried the real fix I mentioned earlier, which I had in mind but never tried until this point. It is intended to fix the parallel mount behavior instead of working around the timing issue.

As mentioned earlier, I think the real problem is the fact that two threads for the same mount point can run concurrently on initial delegation, which is caused by libzfs_path_contains() returning false for the same directory on determining non-descendant index on initial delegation.

It's just two lines of code, and only affects zfs mount behavior. It works for this specific case at least, but I haven't tried it for anything else.

# git diff
diff --git a/lib/libzfs/libzfs_mount.c b/lib/libzfs/libzfs_mount.c
index 649c232aa..2b363f437 100644
--- a/lib/libzfs/libzfs_mount.c
+++ b/lib/libzfs/libzfs_mount.c
@@ -1307,6 +1307,8 @@ mountpoint_cmp(const void *arga, const void *argb)
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
+       if (!strcmp(path1, path2))
+               return B_TRUE;
        return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
 }

It changes the initial dispatching algorithm so that zfs_foreach_mountpoint() doesn't spawn more than one thread for the same mount point.

From what I understand by looking at the code, the initial dispatching of threads is to select sets of mount points that don't have dependencies on each other, hence it can/should go lock-less and shouldn't race with other sets of mount points. Subsequent dispatching of threads is for each set of mounts with dependencies (i.e. child directories), hence those are serialized.

Edited:
After reading comments around the code, I think this is the way it's intended to work. Force pushed this one. Let's see how tests go.

@Rudd-O (Contributor) commented Jun 12, 2019

Sometimes I wonder if it isn't easier to use a systemd generator to establish which things to mount, then let systemd mount everything in parallel.

@behlendorf behlendorf requested a review from sebroy June 12, 2019 23:39
@kusumi (Member Author) commented Jun 13, 2019

It probably is easier for systemd-based distros. systemd also seems to support mount dependencies, whereas in zfs(8) such mounts are serialized.

@c0d3z3r0 (Contributor) commented:

This seems to be related to #8450.

@kusumi (Member Author) commented Jun 16, 2019

Most likely the same issue as #8833, and basically the same analysis:
threads are in a race condition, and any of the below fixes it.

  1. small delay
  2. serialize fork/exec
  3. fix algorithm (apparently preferred way)

@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 415ecae to 762913b Compare June 16, 2019 16:54
@kusumi kusumi changed the title Fix parallel mount failure caused by ZoL specific behavior Fix parallel mount race Jun 16, 2019
@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 221f336 to b6f1221 Compare June 17, 2019 06:51
@behlendorf (Contributor) left a comment:

This makes sense to me! @sebroy, would you mind reviewing the proposed fix?

 */
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
-	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
+	return (strcmp(path1, path2) == 0) ||
+	    (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
Contributor:

nit: can you add parentheses around the entire return, e.g.

return ((strcmp(path1, path2) == 0) ||
    (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));

@@ -19,7 +19,8 @@ dist_pkgdata_SCRIPTS = \
 zfs_mount_all_mountpoints.ksh \
 zfs_mount_encrypted.ksh \
 zfs_mount_remount.ksh \
-zfs_multi_mount.ksh
+zfs_multi_mount.ksh \
+zfs_mount_test_race.sh
Contributor:

Can you place this after zfs_mount_remount.ksh in order to keep the tests in alphabetical order? It'd be nice to order the linux.run file similarly too; it appears to also be out of order.

}
log_onexit cleanup

log_note "Testing github.com/zfsonlinux/zfs/issues/8833"
Contributor:

Since the issue was referenced in the above comment, I think a human-readable explanation would be better here. How about the following, based on your existing log_pass message:

-log_note "Testing github.com/zfsonlinux/zfs/issues/8833"
+log_note "Verify parallel mount ordering is consistent"

log_must mkfile $MINVDEVSIZE $DISK1
log_must mkfile $MINVDEVSIZE $DISK2
Contributor:

We've been gradually switching to the lighter weight log_must truncate -s $MINVDEVSIZE for one-off file vdevs.

Member Author:

Will fix the things you pointed out. I think we also want to explicitly rm these two tmp files in the ksh cleanup function; I forgot to remove those.

log_must zfs set mountpoint=$MNTPT $TESTPOOL2

log_must zfs list -o name,mounted,canmount,mountpoint
log_must mount | grep zfs
Contributor:

log_must ismounted $TESTPOOL1/$TESTFS1 can be used for this check, here and below.

 */
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
-	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
+	return (strcmp(path1, path2) == 0) ||

Can you explain how this fixes the issue at hand? If we have a list of mountpoints like this:

1. /a (canmount=off)
2. /a
3. /a/b
4. /a/c

How will this prevent the thread mounting (1) (which incidentally won't mount /a) from dispatching mounts for (3) and (4)?

Member Author:

If libzfs_path_contains() returns true for identical paths (i.e. two identical top-level directories during initial dispatching), then there's no chance the race can happen, because they are never handled by more than one thread (see zfs_foreach_mountpoint()).

@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 0231bfb to 817ad17 Compare June 20, 2019 08:49
@kusumi
Copy link
Member Author

kusumi commented Jun 20, 2019

@behlendorf
Updated + rebased.

Note that I'm seeing a build failure from the recent 30af21b, for which I submitted a fix in the separate PR #8939. I'm also seeing a test failure at the setup stage, possibly due to the same commit, with the error message below (not sure).
cannot create 'testpool': invalid feature 'redaction_bookmarks'

So even though the branch pushed here is rebased on the current origin/master, I've dropped 30af21b for my local testing of zfs_mount_test_race.sh. This race fix is purely a userspace matter with limited code coverage, so it won't be affected.

@behlendorf (Contributor) left a comment:

Thanks. Since the CI bots didn't hit the issue you fixed in #8939, it'd be nice to still have this PR rebased against master.

-'zfs_multi_mount', 'zfs_mount_all_fail', 'zfs_mount_all_mountpoints']
+'zfs_mount_all_001_pos', 'zfs_mount_all_fail', 'zfs_mount_all_mountpoints',
+'zfs_mount_encrypted', 'zfs_mount_remount', 'zfs_mount_test_race',
+'zfs_multi_mount']
Contributor:

It looks like reordering these tests exposed some missing cleanup code which caused zfs_mount_encrypted and zfs_mount_remount to fail.

Member Author:

Then I guess we just want to add zfs_mount_test_race at the end of this list? This was reordered alphabetically per your suggestion.

Contributor:

It would be nice to resolve whatever the cleanup issue is, since we should be able to reorder these tests. But if that's not going to be straightforward, I'm fine with preserving the ordering for now.

Member Author:

Updated

TESTPOOL2=zfs_mount_test_race_tp2
TESTFS2=zfs_mount_test_race_tf2

log_must zfs unmount -a # unmount zfs mounts from previous tests
Contributor:

Rather than unmounting and remounting let's use the __ZFS_POOL_RESTRICT environment variable which exists for this purpose. You should be able to export __ZFS_POOL_RESTRICT="$TESTPOOL1 $TESTPOOL2" which will restrict zfs mount to only those pools.

Member Author:

Will take a look at __ZFS_POOL_RESTRICT.

Member Author:

Updated


DISK1="$TMPDIR/zfs_mount_test_race_disk1"
TESTPOOL1=zfs_mount_test_race_tp1
TESTFS1=zfs_mount_test_race_tf1
Contributor:

Do we need to redefine TESTPOOL{1,2} and TESTFS{1,2}? They're already defined in default.cfg.

Member Author:

I used unique names since the cleanup does rm -rf /$TESTPOOL{1,2}, and also to be defensive against other tests in the zfs_mount category.

Contributor:

I see. There shouldn't be hidden dependencies between tests, so using the original names should be fine, but if you'd prefer the custom names just in case, that's fine with me.

Member Author:

I remembered why I used unique names for TESTPOOL{1,2}. If the test runs with the default names, it shouldn't rm -rf /$TESTPOOL{1,2} on cleanup, to be defensive toward other tests in the zfs_mount category; other tests could have created something there to use across several tests.

But then this test won't succeed if /testpool{1,2} already has dirent(s) in it. For example, running this test twice in a row locally fails on the second attempt with the error below, because a directory already exists under /testpool2.

So yes, it's safer to just go with unique names.

ERROR: zpool create -f testpool2 /var/tmp/zfs_mount_test_race_disk2 exited 1
mountpoint '/testpool2' exists and is not empty
use '-m' option to provide a different default
# tree /testpool2/
/testpool2/
`-- testfs2

1 directory, 0 files

log_must zfs unmount -a # unmount zfs mounts from previous tests
log_must rm -rf $MNTPT
log_must rm -rf /$TESTPOOL1
log_must rm -rf /$TESTPOOL2
Contributor:

Is this actually needed because a previous test didn't clean up, or is it just defensive?

Member Author:

Just being defensive given unique names for TESTPOOL{1,2}. I can drop it if this is redundant.

Contributor:

Then let's drop it.

Member Author:

Updated

rm -rf /$TESTPOOL2
rm -f $DISK1
rm -f $DISK2
log_must zfs mount -a # remount unmounted zfs mounts
Contributor:

Shouldn't be needed if you use __ZFS_POOL_RESTRICT, but it should be unset. See zfs_mount_all_mountpoints.ksh as an example.

Member Author:

Updated

@kusumi (Member Author) commented Jun 21, 2019

The branch is actually rebased on origin/master as of yesterday, though I had problems with it in my local testing, at both compile time and runtime (hence I dropped that commit locally for testing).

I'll re-rebase anyway when I push the next test code update.

@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from f41c755 to 16899b2 Compare June 23, 2019 07:32
@behlendorf behlendorf requested a review from sebroy June 28, 2019 23:24
log_must zfs set mountpoint=none $TESTPOOL1
log_must zfs set mountpoint=$MNTPT $TESTPOOL1/$TESTFS1

# Note that unmount can fail (due to race condition on `zfs mount -a`) with or

Can you expand on this comment and explain what you mean? How can an unmount failure be caused by a race condition in zfs mount -a?

Member Author:

How can an unmount failure be caused by a race condition in zfs mount -a?

????
That's what has been discussed since the original issue reported in #8833.
When threads are racing, the mount order is undefined.
When a thread for a mountpoint deeper in the directory tree wins (which is a bug), that's not what zfs unmount -a expects, hence the unmount fails.

@sebroy commented Jul 2, 2019:

@kusumi Please be patient with me. I'm asking clarifying questions in good faith. I legitimately did not understand how a race condition in the mount code could cause unmount to fail, even after having read the issue. Furthermore, people reading this code in the future won't know to go look at that issue. Comments should be self-contained and self-explanatory.

This comment is evidently trying to explain why canmount has to be set to off on $TESTPOOL2, which has a mountpoint of $MNTPT. This has to be done because you can't mount two filesystems on the same mountpoint. In the test, $TESTPOOL1/$TESTFS1 has the same mountpoint. Both cannot be mounted; hence, one of them has to have canmount set to off.

Member Author:

I legitimately did not understand how a race condition in the mount code could cause unmount to fail, even after having read the issue.

The unmount assumes the mounts were done in the correct order, so if a mount point deeper in the directory tree is shadowed by another one due to the undefined mount order caused by the race among pthreads, then unmount will fail, because it can't see a mount that is supposed to exist.

(See my first comment and the example I pasted in #8833, as well as my first comment in this thread. Also see #8450, which is likely the same problem.)

Furthermore, people reading this code in the future won't know to go look at that issue.

I did add a bit more description (basically my previous reply to your question) in my previous push, but I won't/can't repeat all the implementation details I explained in this thread and in the commit message (plus this test file already has lots of comments with ASCII art). That's why it has a reference to the GitHub URL.

This comment is evidently trying to explain why canmount has to be set to off on $TESTPOOL2, which has a mountpoint of $MNTPT. This has to be done because you can't mount two filesystems on the same mountpoint. In the test, $TESTPOOL1/$TESTFS1 has the same mountpoint. Both cannot be mounted; hence, one of them has to have canmount set to off.

Right, that's why the test turns it off, for the convenience of the mount layout used in this test case; but whether or not that property is set, the race in the initial thread dispatching does happen. See #8450, which doesn't use canmount.


# Note that unmount can fail (due to race condition on `zfs mount -a`) with or
# without `canmount=off`. The race has nothing to do with canmount property.
log_must zfs set canmount=off $TESTPOOL2

Can you also add a test that does the exact same thing, except sets canmount=off on TESTPOOL1 instead, just in case the code is accidentally dependent on the ordering of the filesystems passed in?

Member Author:

just in case the code is accidentally dependent on the ordering of the filesystems passed in?

I don't get what "accidentally" is supposed to mean, but the entire point of this race fix in the initial thread dispatching has nothing to do with the canmount property. If canmount=off on a different top-level directory causes a difference or a problem, that's another issue.


@kusumi From my understanding, the canmount property has to be set to off on the filesystems with duplicate mount points in order to hit this bug. Can you show me a case where this bug gets hit and canmount isn't set to off on any of the filesystems?

Essentially, this has to do with two or more filesystems having the same mountpoint, all but one of which does not get mounted (canmount=off). If a thread handling the mounting of a filesystem whose canmount property is set to off spawns child filesystem mounts before others, then those child mounts will fail because their parent filesystems will not have been mounted. This updated fix addresses this issue because the threads with duplicate mountpoints will no longer be spawned in parallel, and the one legitimate filesystem (the one that will be mounted) will be guaranteed to be mounted before the threads that mount its child filesystems get spawned.

I'm not trying to criticize your code or tests here, I'm providing you with feedback so that we can solidify the code. This is supposed to be constructive.

Member Author:

Can you show me a case where this bug gets hit and canmount isn't set to off on any of the filesystems?

This example does fail to unmount without canmount=off. In this test and in the example in #8833, canmount=off is set on TESTPOOL2 or p2, so while setting mountpoint=/... does fail to mount TESTPOOL2 or p2 without canmount=off, the race condition between the other two mounts (which are mounted by different threads without this fix, but run in a single thread with this fix) still exists.

Also, #8450 isn't using the canmount property either.

@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 36cb976 to 0192b86 Compare July 2, 2019 08:50
@kusumi (Member Author) commented Jul 2, 2019

@behlendorf
rebased + force pushed again. please merge.

@kusumi (Member Author) commented Jul 2, 2019

Looking at #8450 (another report that is likely the same issue) more carefully, the race is likely happening with / involved, so it probably needs special handling, something like the below. It needs to count /var/data/... as a descendant of /.

Unfortunately I can't reproduce #8450, so I can't test this one.

diff --git a/lib/libzfs/libzfs_mount.c b/lib/libzfs/libzfs_mount.c
index 8c7a7eabe..e70acd425 100644
--- a/lib/libzfs/libzfs_mount.c
+++ b/lib/libzfs/libzfs_mount.c
@@ -1341,6 +1341,8 @@ mountpoint_cmp(const void *arga, const void *argb)
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
+       if (strcmp(path1, "/") == 0 || strcmp(path2, "/") == 0)
+               return B_TRUE;
        return ((strcmp(path1, path2) == 0) ||
            (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
 }
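
For what it's worth, the reason the unpatched check misses this case can be seen by evaluating the expression by hand: for path1 = "/", path2[strlen(path1)] inspects the second character of path2, which is a letter rather than '/'. Below is a tiny standalone demonstration using a local copy of the original expression (illustration only, not the libzfs source file).

#include <stdio.h>
#include <string.h>

/* Local copy of the original (pre-fix) check, for demonstration only. */
static int
original_contains(const char *path1, const char *path2)
{
	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
}

int
main(void)
{
	/*
	 * strstr("/var/data", "/") points at the start of path2, but
	 * "/var/data"[strlen("/")] is 'v', not '/', so the check reports
	 * "not contained" even though everything descends from "/".
	 */
	printf("\"/\" contains \"/var/data\" -> %d\n",
	    original_contains("/", "/var/data"));	/* prints 0 */
	/* The same expression also misses identical paths (the #8833 case). */
	printf("\"/a\" contains \"/a\"      -> %d\n",
	    original_contains("/a", "/a"));		/* prints 0 */
	return (0);
}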

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 3, 2019
@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 0261f37 to 685f2ef Compare July 4, 2019 10:08
@kusumi (Member Author) commented Jul 4, 2019

Added the above (strcmp(path1, "/") == 0 ||) and rebased, so you might want to wait until the tests complete before merging. Now that there are two known patterns, I've also added the text below to the commit message.

In terms of testing, I much prefer the #8450 layout to the #8833 one due to its simplicity (no use of canmount), except that it requires a zfs mount at / (which fails in my environment because zfs userspace doesn't allow a mountpoint directory with dirent(s) in it, so basically I have not reproduced #8450).

There are currently two known patterns of `zfs_foreach_mountpoint(...,
handles, ...)` input which cause the race condition.

1) #8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directory.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of #8833, ThreadA won the race. ThreadB was created because
 "/a" wasn't considered as `"/a" contains "/a"`.

2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list which contains "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of #8450, ThreadA won the race. ThreadB was created because
 "/var/data" wasn't considered as `"/" contains "/var/data"`. In other
 words, if there is at least one "/" in the input list, it must be
 single threaded since every directory is a child of "/", meaning they
 all directly or indirectly depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result create another thread
when mounts should be done in a single thread. Fix a conditional in
libzfs_path_contains() to consider above two cases.

Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878

Signed-off-by: Tomohiro Kusumi <[email protected]>
@kusumi (Member Author) commented Jul 6, 2019

Test results are good.
Updated/revised commit message a bit, rebased and force pushed.

@codecov bot commented Jul 6, 2019

Codecov Report

Merging #8878 into master will increase coverage by 0.07%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #8878      +/-   ##
==========================================
+ Coverage   78.62%    78.7%   +0.07%     
==========================================
  Files         388      388              
  Lines      119992   119993       +1     
==========================================
+ Hits        94349    94445      +96     
+ Misses      25643    25548      -95
Flag Coverage Δ
#kernel 79.46% <ø> (+0.02%) ⬆️
#user 66.52% <100%> (+0.34%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@c0d3z3r0 (Contributor) commented Jul 7, 2019

Looking at #8450 (another one which is likely the same issue) more carefully, the race is likely happening with / involved, so it probably needs a special handling, something like below. It wants to count /var/data/... as a descendant of /.

Unfortunately I can't reproduce #8450, so I can't test this one.

diff --git a/lib/libzfs/libzfs_mount.c b/lib/libzfs/libzfs_mount.c
index 8c7a7eabe..e70acd425 100644
--- a/lib/libzfs/libzfs_mount.c
+++ b/lib/libzfs/libzfs_mount.c
@@ -1341,6 +1341,8 @@ mountpoint_cmp(const void *arga, const void *argb)
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
+       if (strcmp(path1, "/") == 0 || strcmp(path2, "/") == 0)
+               return B_TRUE;
        return ((strcmp(path1, path2) == 0) ||
            (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
 }

@kusumi #8450 is resolved by this patch ;)

root@debian:[~]# zfs create -o mountpoint=/var/test dpool/var_test
root@debian:[~]# zfs create -o mountpoint=/var/test/testlower dpool/var_test/testlower
root@debian:[~]# mount | grep test
dpool/var_test on /var/test type zfs (rw,relatime,xattr,posixacl)
dpool/var_test/testlower on /var/test/testlower type zfs (rw,relatime,xattr,posixacl)
root@debian:[~]# zfs mount | grep test
dpool/var_test                  /var/test
dpool/var_test/testlower        /var/test/testlower

root@debian:[~]# touch /var/test/testfile
root@debian:[~]# touch /var/test/testlower/testfilelower
root@debian:[~]# zfs unmount dpool/var_test/testlower
root@debian:[~]# zfs unmount dpool/var_test

root@debian:[~]# zfs mount -a
root@debian:[~]# zfs mount | grep test
dpool/var_test                  /var/test
dpool/var_test/testlower        /var/test/testlower
root@debian:[~]# ls /var/test
testfile  testlower
root@debian:[~]# ls /var/test/testlower/
testfilelower

@kusumi (Member Author) commented Jul 7, 2019

@c0d3z3r0 Great.

By the way, I dropped the strcmp(path2, "/") == 0 conditional from my final diff, since it is redundant given the way this API is designed (path1 is always the parent side, so testing path1 is enough). It shouldn't matter, though.

The final diff looks like this:

 /*
- * Return true if path2 is a child of path1.
+ * Return true if path2 is a child of path1 or path2 equals path1 or
+ * path1 is "/" (path2 is always a child of "/").
  */
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
-       return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
+       return (strcmp(path1, path2) == 0 || strcmp(path1, "/") == 0 ||
+           (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
 }
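
If anyone wants to sanity-check the final predicate in isolation, here is a throwaway harness with a local copy of the fixed function (the boolean_t typedef below is a stand-in so the snippet builds outside the ZFS tree; this is an illustration, not part of the PR).

#include <stdio.h>
#include <string.h>

typedef int boolean_t;	/* stand-in for the ZFS typedef */

/* Local copy of the fixed predicate, for standalone testing only. */
static boolean_t
libzfs_path_contains(const char *path1, const char *path2)
{
	return (strcmp(path1, path2) == 0 || strcmp(path1, "/") == 0 ||
	    (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
}

int
main(void)
{
	/* #8833 pattern: identical top-level mountpoints. */
	printf("%d\n", libzfs_path_contains("/a", "/a"));	/* 1 */
	/* #8450 pattern: "/" contains every absolute path. */
	printf("%d\n", libzfs_path_contains("/", "/var/data"));	/* 1 */
	/* Ordinary parent/child still works. */
	printf("%d\n", libzfs_path_contains("/a", "/a/b"));	/* 1 */
	/* A sibling with a common prefix is still rejected. */
	printf("%d\n", libzfs_path_contains("/mnt", "/mnt2"));	/* 0 */
	return (0);
}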

@behlendorf (Contributor) commented:

@c0d3z3r0 thank you for verifying the fix!

@behlendorf (Contributor) left a comment:

I've resubmitted the two failed builders which hit an unrelated issue. Once they pass I'll go ahead and merge this. @kusumi thanks for resolving this issue.

@behlendorf behlendorf merged commit ab5036d into openzfs:master Jul 9, 2019
TulsiJain pushed a commit to TulsiJain/zfs that referenced this pull request Jul 20, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
TulsiJain pushed a commit to TulsiJain/zfs that referenced this pull request Jul 20, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 13, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 22, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 23, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 17, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 18, 2019
tonyhutter pushed a commit that referenced this pull request Sep 26, 2019