Fix parallel mount race #8878
Conversation
Given that the bulk of the work for each mount thread is doing the mount, I would expect that putting a global lock around the mount effectively undoes most of the performance gain that would be achieved from the parallel mount algorithm. Is the performance of parallel mounting with this change similar to mounting without the parallel mount algorithm? That said, I don't understand how the fix addresses the problem. The problematic scenario includes two pools that share some mount point parentage in their datasets, as described in #8833.
If I understand the problem description, the problem is that two threads can be dispatched for the same mount point, so the mount order between them is undefined. If my understanding is correct, then how does adding a lock around the mount address this? Given slightly different timing, couldn't the wrong thread still win? I could be wrong here (if so, my apologies, I'm trying to get my head fully around this), but if I'm not, could a fix instead be to not include any datasets whose canmount property is off?
The problem I'm seeing here is the initial delegation in zfs_foreach_mountpoint(). Consider the case below.
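(The following is a sketch of the layout from #8833, reconstructed from that issue and the test added in this PR; the pool/dataset names p1, p2, f1, f2, the path /a, and the $DISK variables are illustrative.)

zpool create -m none p1 $DISK1
zfs create p1/f1
zfs set mountpoint=/a p1/f1      # p1/f1 -> /a
zpool create -m none p2 $DISK2
zfs set canmount=off p2          # p2 itself is never mounted
zfs set mountpoint=/a p2         # p2    -> /a
zfs create p2/f2                 # p2/f2 -> /a/f2 (inherited)
# sorted mountpoint list seen by `zfs mount -a`: /a (p1/f1), /a (p2), /a/f2 (p2/f2)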
Mounting p1/f1 is dispatched from the initial delegation, while mounting p2/f2 is dispatched from a subsequent delegation under the initial delegation for p2, so these two run in parallel. That being said, as mentioned in #8833, p2/f2 does get mounted, at the cost of an extra lock around the mount. Regarding the performance penalty, the lock can of course be a penalty, but it only serializes the mount step itself. So, to answer your question: yes, some parallelism is lost with this workaround.
Right.
I didn't say this change "fixes" or "addresses" the problem; please read #8833 and this PR carefully. Instead, I explained it as preventing the race by adjusting the timing so that it works similarly to FreeBSD (and probably illumos too). As mentioned earlier, I think the real problem is the fact that two threads for the same mount point can run concurrently on initial delegation, which is caused by the initial dispatching logic. I didn't go as far as to change the parallel mount behavior itself, hence this commit.
I was missing one of your questions.
As mentioned in #8833, I don't think this is directly related to the canmount property. I mean the race happens with or without canmount=off.
That's right, I came to the same conclusion. I also came to a separate conclusion that there should only be one such thread kicked off, because there should only be one filesystem for that top-level mount point that has canmount set to on. Circling back on the proposed fix in this PR, I don't understand how this fix would work. Furthermore, at the start of my initial comment in this PR, I pointed out that the fix undoes almost all of the benefits of parallel mounting, as now no mount can be done concurrently with another one. Am I mistaken? If not, how can we go forward with this?
As already mentioned, with this change threads can't run concurrently all the way through the mount itself. After all, this change is a workaround. For a real fix, the author of parallel mount should probably rewrite the initial dispatching logic that can race. The problem here is that threads can race when this feature doesn't provide (and shouldn't provide) synchronization among threads.
I'm still trying to wrap my head around this as well. From the above conversation, it sounds like there's agreement on the root problem, and that this proposed change doesn't address that core issue; it merely adds some locking which happens to perturb the timing in such a way as to avoid the issue for this particular test case.
This sounds like a good idea and I think it would resolve this specific case. However, I don't believe it solves the general case. You can imagine other ways in which to interleave two (or more!) pools' datasets such that they would not be mounted in the correct order. Part of this confusion, I think, comes from this text in the man page.
@sebroy, since you authored this feature, please correct me if I'm wrong about the intended design, but it's my understanding that:
Assuming the above is true, and I haven't misunderstood the intent of parallel mounts, perhaps the best thing to do here would be to instead update the documentation to make this limitation clear. Of course, if someone wants to add support for this kind of arbitrary interleaving of datasets, that would be wonderful.
@behlendorf
It's just two lines of code, and it only affects zfs mount related code. It works for this specific case at least, but I haven't tested it with anything else.
It changes the initial dispatching algorithm so that mount points sharing the same top level directory are handled by a single thread. From what I understand by looking at the code, the initial dispatching of threads is to select sets of mount points that don't have dependencies on each other, hence it should/can go lock-less and shouldn't race with other sets of mount points. Subsequent dispatching of threads is for each set of mounts with dependencies (i.e. child directories), hence these are serialized.
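For illustration only, here is a simplified standalone sketch of that initial partitioning (this is not the actual zfs_foreach_mountpoint() code; it just borrows the fixed predicate from this PR to show how entries sharing a top level directory end up in one set):

/*
 * Simplified sketch of the initial dispatching: walk the sorted list and
 * group each top level entry with all of its descendants; each group would
 * be handed to one thread, and entries within a group are mounted in order
 * (parent before child).
 */
#include <stdio.h>
#include <string.h>

static int
path_contains(const char *p1, const char *p2)   /* fixed predicate from this PR */
{
        return (strcmp(p1, p2) == 0) ||
            (strstr(p2, p1) == p2 && p2[strlen(p1)] == '/');
}

int
main(void)
{
        /* sorted, like the #8833 case plus an unrelated top level dir */
        const char *mnts[] = { "/a", "/a", "/a/b", "/data", "/data/x" };
        int n = 5, i = 0;

        while (i < n) {
                int j = i + 1;

                while (j < n && path_contains(mnts[i], mnts[j]))
                        j++;
                printf("one thread handles entries %d..%d (top: %s)\n",
                    i, j - 1, mnts[i]);
                /*
                 * Without the strcmp() term, the two "/a" entries would
                 * fall into different groups, i.e. two racing threads.
                 */
                i = j;
        }
        return (0);
}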
Sometimes I wonder if it isn't easier to use a systemd generator to establish which things to mount, then let systemd mount everything in parallel.
It probably is for systemd-based distros. systemd also seems to support mount dependencies, whereas in zfs(8) dependent mounts are serialized.
This seems to be related to #8450.
Most likely the same issue as #8833, and also basically the same analysis.
Force-pushed from 415ecae to 762913b.
Force-pushed from 221f336 to b6f1221.
This makes sense to me! @sebroy, would you mind reviewing the proposed fix?
lib/libzfs/libzfs_mount.c (outdated)
 */
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
-	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
+	return (strcmp(path1, path2) == 0) ||
+	    (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
nit: can you add parentheses around the entire return statement, e.g.
return ((strcmp(path1, path2) == 0) ||
(strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
@@ -19,7 +19,8 @@ dist_pkgdata_SCRIPTS = \
 	zfs_mount_all_mountpoints.ksh \
 	zfs_mount_encrypted.ksh \
 	zfs_mount_remount.ksh \
-	zfs_multi_mount.ksh
+	zfs_multi_mount.ksh \
+	zfs_mount_test_race.sh
Can you place this after zfs_mount_remount.ksh in order to keep the tests in alphabetical order? It'd be nice to order the linux.run file similarly too; it appears to also be out of order.
}
log_onexit cleanup

log_note "Testing github.com/zfsonlinux/zfs/issues/8833"
Since the issue was referenced in the above comment, I think a human readable explanation would be better here. How about the following, based on your existing log_pass message:

log_note "Verify parallel mount ordering is consistent"
log_note "Testing github.com/zfsonlinux/zfs/issues/8833" | ||
|
||
log_must mkfile $MINVDEVSIZE $DISK1 | ||
log_must mkfile $MINVDEVSIZE $DISK2 |
We've been gradually switching to the lighter weight log_must truncate -s $MINVDEVSIZE for one-off file vdevs.
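For instance, the two setup lines above could become something like this (a sketch; $MINVDEVSIZE and $DISK{1,2} as already defined in the test):

log_must truncate -s $MINVDEVSIZE $DISK1
log_must truncate -s $MINVDEVSIZE $DISK2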
Will fix the things you pointed out. I think we also want to explicitly rm these two tmp files in the ksh cleanup function; I forgot to remove those.
log_must zfs set mountpoint=$MNTPT $TESTPOOL2

log_must zfs list -o name,mounted,canmount,mountpoint
log_must mount | grep zfs
log_must ismounted $TESTPOOL1/$TESTFS1 can be used for this check, here and below.
lib/libzfs/libzfs_mount.c (outdated)
 */
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
-	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
+	return (strcmp(path1, path2) == 0) ||
+	    (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
Can you explain how this fixes the issue at hand? If we have a list of mountpoints like this:
1. /a (canmount=off)
2. /a
3. /a/b
4. /a/c
How will this prevent the thread mounting (1) (which incidentally won't mount /a) from dispatching mounts for (3) and (4)?
If libzfs_path_contains() returns true for two identical paths (i.e. two identical top level dirs on initial dispatching), then there's no chance the race can happen, because they never run multi-threaded (read zfs_foreach_mountpoint()).
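As a standalone illustration (not the libzfs build; the predicate body is copied from the diff above), the updated check treats the two identical "/a" entries as one group, so (1)-(4) from the list above are handled by a single thread in sorted order:

#include <assert.h>
#include <string.h>

/* copy of the proposed libzfs_path_contains() logic, for illustration */
static int
path_contains(const char *path1, const char *path2)
{
        return (strcmp(path1, path2) == 0) ||
            (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
}

int
main(void)
{
        assert(path_contains("/a", "/a"));      /* (1) vs (2): same dir, one set */
        assert(path_contains("/a", "/a/b"));    /* (2) vs (3): parent/child      */
        assert(path_contains("/a", "/a/c"));    /* (2) vs (4): parent/child      */
        assert(!path_contains("/a/b", "/a/c")); /* siblings remain independent   */
        return (0);
}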
Force-pushed from 0231bfb to 817ad17.
@behlendorf Note that I'm seeing a build failure from the recent 30af21b, for which I PR'd a fix in #8939 as a separate PR. I'm also seeing a test failure at the setup stage, possibly due to the same commit, with the below error message (not sure). So even though the branch being pushed here is rebased on the current origin/master, I've dropped 30af21b for my local testing of this change.
Thanks. Since the CI bots didn't hit the issue you fixed in #8939, it'd be nice to still have this PR rebased against master.
tests/runfiles/linux.run (outdated)
-    'zfs_multi_mount', 'zfs_mount_all_fail', 'zfs_mount_all_mountpoints']
+    'zfs_mount_all_001_pos', 'zfs_mount_all_fail', 'zfs_mount_all_mountpoints',
+    'zfs_mount_encrypted', 'zfs_mount_remount', 'zfs_mount_test_race',
+    'zfs_multi_mount']
It looks like reordering these tests exposed some missing cleanup code, which caused zfs_mount_encrypted and zfs_mount_remount to fail.
Then I guess we just want to add zfs_mount_test_race at the end of this list? This was reordered in alphabetical order per your suggestion.
It would be nice to resolve whatever the cleanup issue is, since we should be able to reorder these tests. But if that's not going to be straightforward, I'm fine with preserving the ordering for now.
Updated
TESTPOOL2=zfs_mount_test_race_tp2
TESTFS2=zfs_mount_test_race_tf2

log_must zfs unmount -a # unmount zfs mounts from previous tests
Rather than unmounting and remounting, let's use the __ZFS_POOL_RESTRICT environment variable, which exists for this purpose. You should be able to export __ZFS_POOL_RESTRICT="$TESTPOOL1 $TESTPOOL2", which will restrict zfs mount to only those pools.
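A sketch of how that could look in the test (the placement and the surrounding assertions are illustrative; the pattern follows zfs_mount_all_mountpoints.ksh):

# Restrict `zfs mount -a` / `zfs unmount -a` to the pools created by this test.
export __ZFS_POOL_RESTRICT="$TESTPOOL1 $TESTPOOL2"
log_must zfs unmount -a
log_must zfs mount -a
# ... verify the resulting mount order ...
unset __ZFS_POOL_RESTRICT    # drop the restriction again, e.g. in cleanup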
Will take a look at __ZFS_POOL_RESTRICT.
Updated
DISK1="$TMPDIR/zfs_mount_test_race_disk1"
TESTPOOL1=zfs_mount_test_race_tp1
TESTFS1=zfs_mount_test_race_tf1
Do we need to redefine TESTPOOL{1,2} and TESTFS{1,2}? They're already defined in default.cfg.
I used unique names since the test does rm -rf /$TESTPOOL{1,2}, and also to be defensive against other tests in the zfs_mount category.
I see. There shouldn't be hidden dependencies between tests, so using the original names should be fine, but if you'd prefer the custom names just in case, that's fine with me.
I remembered why I used unique names for TESTPOOL{1,2}. If the test ran with the default names, it would not want to rm -rf /$TESTPOOL{1,2} on cleanup, to be defensive toward other tests in the zfs_mount category; other tests could have created something there to use across several tests. But then this test won't succeed if /testpool{1,2} already has dirent(s) in it. For example, running this test twice in a row locally fails on the second attempt with the output below, because a directory already exists under /testpool2. So yes, it's safer to just go with unique names.
ERROR: zpool create -f testpool2 /var/tmp/zfs_mount_test_race_disk2 exited 1
mountpoint '/testpool2' exists and is not empty
use '-m' option to provide a different default
# tree /testpool2/
/testpool2/
`-- testfs2
1 directory, 0 files
log_must zfs unmount -a # unmount zfs mounts from previous tests
log_must rm -rf $MNTPT
log_must rm -rf /$TESTPOOL1
log_must rm -rf /$TESTPOOL2
Is this actually needed because a previous test didn't clean up, or is it just defensive?
Just being defensive, given the unique names for TESTPOOL{1,2}. I can drop it if this is redundant.
Then let's drop it.
Updated
rm -rf /$TESTPOOL2
rm -f $DISK1
rm -f $DISK2
log_must zfs mount -a # remount unmounted zfs mounts
Shouldn't be needed if you use __ZFS_POOL_RESTRICT, but it should be unset. See zfs_mount_all_mountpoints.ksh as an example.
Updated
tests/zfs-tests/tests/functional/cli_root/zfs_mount/zfs_mount_test_race.sh (resolved review comment)
The branch is actually rebased on origin/master as of yesterday, though I had problems with it in my local testing, on both compile and runtime (hence I dropped that commit locally for local testing). I'll re-rebase anyway when pushing the test code update next time.
Force-pushed from f41c755 to 16899b2.
log_must zfs set mountpoint=none $TESTPOOL1
log_must zfs set mountpoint=$MNTPT $TESTPOOL1/$TESTFS1

# Note that unmount can fail (due to race condition on `zfs mount -a`) with or
Can you expand on this comment and explain what you mean? How can an unmount failure be caused by a race condition in zfs mount -a?
How can an unmount failure be caused by a race condition in zfs mount -a?
????
That's what's been discussed since the original issue reported in #8833.
When threads are under a race condition, the mount order is undefined. When a thread for a mountpoint deeper in the directory tree wins (which is a bug), that's not what zfs unmount -a expects, hence the unmount fails.
@kusumi Please be patient with me. I'm asking clarifying questions in good faith. I legitimately did not understand how a race condition in the mount code could cause unmount to fail, even after having read the issue. Furthermore, people reading this code in the future won't know to go look at that issue. Comments should be self-contained and self-explanatory.
This comment is evidently trying to explain why canmount has to be set to off on $TESTPOOL2, which has a mountpoint of $MNTPT. This has to be done because you can't mount two filesystems on the same mountpoint. In the test, $TESTPOOL1/$TESTFS1 has the same mountpoint. Both cannot be mounted; hence, one of them has to have canmount set to off.
I legitimately did not understand how a race condition in the mount code could cause unmount to fail, even after having read the issue.
The unmount code assumes the mounts were done in the correct order, so if a mount point deeper in the directory tree is shadowed by another one due to the undefined mount order caused by the race among pthreads, then unmount will fail, because it can't see a mount that is supposed to exist.
(See my first comment and the example I pasted in #8833, as well as my first comment in this thread. Also see #8450, which is likely the same problem.)
Furthermore, people reading this code in the future won't know to go look at that issue.
I did add a bit more description (basically my previous comment answering your question) in my previous push, but apparently I won't/can't explain all the implementation details I covered in this thread and in the commit message (plus this test file already has lots of comments with ASCII art). That's why it has a reference to the GitHub URL.
This comment is evidently trying to explain why canmount has to be set to off on $TESTPOOL2, which has a mountpoint of $MNTPT. This has to be done because you can't mount two filesystems on the same mountpoint. In the test, $TESTPOOL1/$TESTFS1 has the same mountpoint. Both cannot be mounted; hence, one of them has to have canmount set to off.
Right, that's why the test turns it off "for convenience of the mount layout used in this test case", but whether that property is set or not, the race itself in initial thread dispatching does happen. See #8450; it's not using canmount.
# Note that unmount can fail (due to race condition on `zfs mount -a`) with or
# without `canmount=off`. The race has nothing to do with canmount property.
log_must zfs set canmount=off $TESTPOOL2
Can you also add a test that does the exact same thing, except sets canmount=off on TESTPOOL1 instead, just in case the code is accidentally dependent on the ordering of the filesystems passed in?
just in case the code is accidentally dependent on the ordering of the filesystems passed in?
I don't get what "accidentally" is supposed to mean, but the entire point of this race fix on initial thread dispatching has nothing to do with the canmount property. If canmount=off on a different top level dir causes a difference or a problem, that's another issue.
@kusumi From my understanding, the canmount property has to be set to off on the filesystems with duplicate mount points in order to hit this bug. Can you show me a case where this bug gets hit and canmount isn't set to off on any of the filesystems?
Essentially, this has to do with two or more filesystems having the same mountpoint, all but one of which do not get mounted (canmount=off). If a thread handling the mounting of a filesystem whose canmount property is set to off spawns child filesystem mounts before others, then those child mounts will fail because their parent filesystems will not have been mounted. This updated fix addresses the issue because the threads with duplicate mountpoints will no longer be spawned in parallel, and the one legitimate filesystem (the one that will be mounted) will be guaranteed to be mounted before the threads that mount its child filesystems get spawned.
I'm not trying to criticize your code or tests here, I'm providing you with feedback so that we can solidify the code. This is supposed to be constructive.
Can you show me a case where this bug gets hit and canmount isn't set to off on any of the filesystems?
This example does fail to unmount even without canmount=off. In this test, and in the example in #8833, canmount=off is set on TESTPOOL2 or p2, so while TESTPOOL2 or p2 does fail to mount on set mountpoint=/... without set canmount=off, the race condition in the other two mounts (which are mounted by different threads without this fix, but run in a single thread with this fix) still exists. Also, #8450 isn't using the canmount property either.
Force-pushed from 36cb976 to 0192b86.
@behlendorf
Looking at #8450 (another one which is likely the same issue) more carefully, the race is likely happening because the input list contains "/". Unfortunately I can't reproduce #8450, so I can't test this one.
Force-pushed from 0261f37 to 685f2ef.
Added handling for the above. In terms of testing, I much prefer the #8450 layout to the #8833 one, due to its simplicity without the use of canmount.
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that don't have dependencies on other sets, hence threads can/should run lock-less and shouldn't race with other threads for other sets. Each thread dispatched corresponds to top level directory which may or may not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1) is to mount datasets for each set of mount points. The mount points within each set have dependencies (i.e. child directories), so child directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in zfs_foreach_mountpoint() can be multi-threaded when it needs to be single-threaded, and this puts threads under race condition. This race appeared as mount/unmount issues on ZoL for ZoL having different timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8). `zfs unmount -a` which expects proper mount order can't unmount if the mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list with two same top level directories. There is a race between two POSIX threads A and B,
 * ThreadA for "/a" for test1 and "/a/b"
 * ThreadB for "/a" for test0/a
and in case of openzfs#8833, ThreadA won the race. Two threads were created because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list containing "/". There is a race between two POSIX threads A and B,
 * ThreadA for "/" and "/var/data/test"
 * ThreadB for "/var/data"
and in case of openzfs#8450, ThreadA won the race. Two threads were created because "/var/data" wasn't considered as `"/" contains "/var/data"`. In other words, if there is (at least one) "/" in the input list, the initial thread dispatching must be single-threaded since every directory is a child of "/", meaning they all directly or indirectly depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly determine "path1-contains-path2", and as a result the initial thread dispatching creates another thread when it needs to be single-threaded. Fix a conditional in libzfs_path_contains() to consider above two.

Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
Signed-off-by: Tomohiro Kusumi <[email protected]>
Test results are good.
Codecov Report
@@ Coverage Diff @@
## master #8878 +/- ##
==========================================
+ Coverage 78.62% 78.7% +0.07%
==========================================
Files 388 388
Lines 119992 119993 +1
==========================================
+ Hits 94349 94445 +96
+ Misses 25643 25548 -95
Continue to review full report at Codecov.
@kusumi #8450 is resolved by this patch ;)
@c0d3z3r0 Great. By the way, I dropped the earlier workaround approach; the final diff looks like this.
@c0d3z3r0 thank you for verifying the fix!
I've resubmitted the two failed builders which hit an unrelated issue. Once they pass I'll go ahead and merge this. @kusumi thanks for resolving this issue.
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that don't have dependencies on other sets, hence threads can/should run lock-less and shouldn't race with other threads for other sets. Each thread dispatched corresponds to top level directory which may or may not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1) is to mount datasets for each set of mount points. The mount points within each set have dependencies (i.e. child directories), so child directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in zfs_foreach_mountpoint() can be multi-threaded when it needs to be single-threaded, and this puts threads under race condition. This race appeared as mount/unmount issues on ZoL for ZoL having different timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8). `zfs unmount -a` which expects proper mount order can't unmount if the mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) #8833 case where input is `/a /a /a/b` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list with two same top level directories. There is a race between two POSIX threads A and B,
 * ThreadA for "/a" for test1 and "/a/b"
 * ThreadB for "/a" for test0/a
and in case of #8833, ThreadA won the race. Two threads were created because "/a" wasn't considered as `"/a" contains "/a"`.

2) #8450 case where input is `/ /var/data /var/data/test` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list containing "/". There is a race between two POSIX threads A and B,
 * ThreadA for "/" and "/var/data/test"
 * ThreadB for "/var/data"
and in case of #8450, ThreadA won the race. Two threads were created because "/var/data" wasn't considered as `"/" contains "/var/data"`. In other words, if there is (at least one) "/" in the input list, the initial thread dispatching must be single-threaded since every directory is a child of "/", meaning they all directly or indirectly depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly determine "path1-contains-path2", and as a result the initial thread dispatching creates another thread when it needs to be single-threaded. Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8450
Closes #8833
Closes #8878
Motivation and Context
#8833
#8450
Description
new PR after update
original PR (workaround)
How Has This Been Tested?
The test script mentioned in #8833.