
Fix parallel mount race #8878

Merged: 1 commit merged into openzfs:master from the parallel1 branch on Jul 9, 2019
Conversation

@kusumi (Member) commented Jun 10, 2019

Motivation and Context

#8833
#8450

Description

new PR after update

Strategy of parallel mount is as follows.

1) Initial thread dispatching selects sets of mount points that don't
 have dependencies on other sets, hence threads can/should go lock-less
 and shouldn't race with other threads for other sets. Each thread
 dispatched corresponds to a top-level directory which may or may not
 have datasets mounted on subdirectories.

2) Subsequent recursive thread dispatching for each thread from 1) is
 to mount datasets for each set of mount points. The mount points within
 each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching can spawn more than
one thread for datasets with the same mount point, which puts those
threads in a race. This appeared as a mount issue on ZoL because ZoL has
different timing around mount(2) execution due to the fork(2)/exec(2) of
mount(8) when mounting.
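
To make the dispatching strategy above concrete, here is a minimal, self-contained sketch of the chunking idea. This is not the actual zfs_foreach_mountpoint() code; the helper and loop below are illustrative stand-ins. With the pre-fix containment check, the sorted list /a /a /a/b is split into two independently dispatched chunks, which is exactly the race described above.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative stand-in for libzfs_path_contains(): true if path2 lives
 * strictly under path1 (this mirrors the pre-fix logic).
 */
static bool
path_contains(const char *path1, const char *path2)
{
	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
}

int
main(void)
{
	/* Already sorted by mountpoint, as the dispatcher expects. */
	const char *mounts[] = { "/a", "/a", "/a/b" };
	const int n = sizeof (mounts) / sizeof (mounts[0]);

	/*
	 * Initial dispatching: each chunk [i, end) would be handed to its
	 * own worker thread.  Because the helper above does not treat "/a"
	 * as containing "/a", the two identical mountpoints end up in two
	 * different chunks, i.e. two threads racing for the same mount point.
	 */
	for (int i = 0; i < n; ) {
		int end = i + 1;
		while (end < n && path_contains(mounts[i], mounts[end]))
			end++;
		printf("one thread handles indices [%d, %d)\n", i, end);
		i = end;
	}
	return (0);
}

With the fix discussed later in this PR (treating identical paths as containing each other), the same loop produces a single chunk and hence a single thread.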

original PR (workaround)

Parallel mount behaves differently on ZoL because each pthread forks
another process to exec `/bin/mount`, whereas illumos and FreeBSD
directly use the `mount(2)` and `nmount(2)` syscalls, respectively.

This can cause parallel mount to mount datasets in an incorrect order
depending on when `/bin/mount` runs, and results in shadowing
dataset(s) whose mountpoint is a descendant of another's.

Having a global lock around `/bin/mount` (and also `/bin/umount`) call
adjusts the timing (eliminates the timing difference caused by
fork/exec) and makes its behavior similar to illumos and FreeBSD.

Note that this isn't to "fix" a race condition by proper locking.
Parallel mount by design doesn't guarantee ordering of threads when
multiple datasets have the same mount point. Adjusting is all it
can do. See #8833 for details.

How Has This Been Tested?

The test script mentioned in #8833.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@kusumi kusumi added the Status: Code Review Needed Ready for review and testing label Jun 10, 2019
@kusumi kusumi requested a review from behlendorf June 10, 2019 06:12
@sebroy commented Jun 10, 2019

Given that the bulk of the work for each mount thread is doing the mount, I would expect that putting a global lock around the mount effectively undoes most of the performance gain that would be achieved from the parallel mount algorithm. Is the performance of parallel mounting with this change similar to mounting with ZFS_SERIAL_MOUNT?

That said, I don't understand how the fix addresses the problem. The problematic scenario includes two pools that share some mount point parentage in their datasets as described in #8833.

/tmp # zfs list -o name,mounted,canmount,mountpoint
NAME     MOUNTED  CANMOUNT  MOUNTPOINT
test0         no        on  none
test0/a      yes        on  /a
test1         no       off  /a
test1/b      yes        on  /a/b

If I understand the problem description, the problem is that test1/b somehow gets mounted before test0/a even though test0/a's mountpoint is the parent of test1/b's mountpoint. This would appear to be happening because test0/a and test1 have the same top-level root mountpoint, and these are being kicked off in parallel, resulting in test1/b sometimes being mounted before test0/a. Is that right?

If my understanding is correct, then how does adding a lock around the mount address this? Given slightly different timing, could test1/b not be mounted and all locks dropped before the thread that attempts to mount test0/a gets a chance to run?

I could be wrong here (if so, my apologies, I'm trying to get my head fully around this), but if I'm not, could a fix instead be to not include any datasets whose canmount property is off in the list of datasets to mount?

@kusumi (Member Author) commented Jun 10, 2019

The problem I'm seeing here is that the initial delegation in zfs_foreach_mountpoint() can spawn more than one thread running concurrently for datasets with the same mount point. After that, the subsequent delegation in zfs_mount_task() is serialized.

In case of below,

NAME   MOUNTED  CANMOUNT  MOUNTPOINT
p1          no        on  none
p1/f1      yes        on  /mntpt
p2          no       off  /mntpt
p2/f2      yes        on  /mntpt/f2

mounting p1/f1 is dispatched from initial delegation, while mounting p2/f2 is dispatched from subsequent delegation from initial delegation for p2, so these two go parallel.

That being said, as mentioned in #8833, since p2/f2 is mounted at the cost of an extra pthread_create(3) or pthread_cond_signal(3) call, p1/f1 usually seems to win. That's what I observed on FreeBSD, and it is also the expected order, because otherwise p2/f2 is shadowed.

But when the extra complexity of fork(2)+exec(2) comes in, the mounts can go out of order, and that's what has been observed on ZoL. Having a global lock eliminates that, and keeps the behavior similar to that of FreeBSD (and probably illumos too).

Regarding the performance penalty, there can of course be one, but the mount(2) syscall itself has its own locking in kernel space. So even if ZoL didn't invoke /bin/mount, a bunch of threads trying to invoke mount(2) concurrently wouldn't actually run concurrently.

So to answer your question,

This would appear to be happening because test0/a and test1 have the same top-level root mountpoint, and these are being kicked off in parallel, resulting in test1/b sometimes being mounted before test0/a. Is that right?

Right.

If my understanding is correct, then how does adding a lock around the mount address this? Given slightly different timing, could test1/b not be mounted and all locks dropped before the thread that attempts to mount test0/a gets a chance to run?

I didn't say this change "fixes" or "addresses" the problem. Please read #8833 and this PR carefully. Instead, I described it as preventing the race by adjusting the timing so that it works similarly to FreeBSD (and probably illumos too).

As mentioned earlier, I think the real problem is the fact that two threads for the same mount point can run concurrently on initial delegation, which is caused by libzfs_path_contains() returning false for the same directory when determining the non-descendant index during initial delegation.

I didn't go so far as to change the parallel mount behavior itself, hence this commit.

@kusumi (Member Author) commented Jun 10, 2019

I was missing one of your questions.

I could be wrong here (if so, my apologies, I'm trying to get my head fully around this), but if I'm not, could a fix instead be to not include any datasets whose canmount property is off in the list of datasets to mount?

As mentioned in #8833, I don't think this is directly related to the canmount property. I mean that zfs_foreach_mountpoint() has nothing to do with the canmount property, while the problem I've pointed out is in that function.

@sebroy commented Jun 10, 2019

The problem I'm seeing here is that the initial delegation in zfs_foreach_mountpoint() can spawn >1 threads that run concurrently for datasets with the same mount point.

That's right, I came to the same conclusion. I also came to a separate conclusion that there should only be one such thread kicked off, because there should only be one filesystem for that top-level mount point that has a canmount property of on. Pruning the array of handles of filesystems that have canmount set to off addresses this issue, I think.

Circling back on the proposed fix in this PR, I don't understand how this fix would work. Furthermore, at the start of my initial comment in this PR, I pointed out that the fix undoes almost all of the benefits of parallel mounting, as now no mount can be done concurrently with another one. Am I mistaken? If not, how can we go forward with this?

@kusumi (Member Author) commented Jun 11, 2019

Circling back on the proposed fix in this PR, I don't understand how this fix would work. Furthermore, at the start of my initial comment in this PR, I point out that the fix undoes almost all of the benefits of parallel mounting, as now no mount can be done concurrent to another one. Am I mistaken? If not, how can we go forward with this?

As already mentioned, threads can't run concurrently all the way through mount(2). The locking around /bin/mount, which also eventually invokes mount(2), is to emulate that at a higher level, to catch the timing problem caused by fork/exec, which itself stems from the design issue in parallel mount (and we both agree on this). Maybe we can measure this (I haven't).

After all, this change is a workaround. For a real fix, the author of parallel mount should probably rewrite the initial dispatching logic that can race. The problem here is that threads can race even though this feature doesn't provide (and shouldn't provide) synchronization among threads.

@behlendorf (Contributor) commented:

I'm still trying to wrap my head around this as well. From the above conversation, it sounds like there's agreement on the root problem, and that this proposed change doesn't address that core issue. It merely adds some locking which happens to perturb the timing in such a way as to avoid the issue for this particular test case.

could a fix instead be to not include any datasets whose canmount property is off in the list of datasets to mount?

This sounds like a good idea and I think it would resolve this specific case. However, I don't believe it solves the general case. You can imagine other ways in which to interleave two (or more!) pools' datasets such that they would not be mounted in the correct order. Part of this confusion I think comes from this text in the man page.

One example of setting canmount=off is to have two datasets with the same
mountpoint, so that the children of both datasets appear in the same
directory, but might have different inherited characteristics.

@sebroy since you authored this feature please correct me if I'm wrong about the intended design. But it's my understanding that:

  1. The example given above assumes (but doesn't explicitly state) that the datasets in question must be part of the same pool. In which case, this should work fine. And,
  2. Since the processing of the root of each pool's datasets is handled concurrently, arbitrarily interleaving dataset mountpoints between pools was never intended to be a supported use case.

Assuming the above is true, and I haven't misunderstood the intent of parallel mounts, perhaps the best thing to do here would be to instead update the documentation to make this limitation clear. Of course, if someone wants to add support for this kind of arbitrary interleaving of datasets that would be wonderful.

@kusumi (Member Author) commented Jun 12, 2019

@behlendorf
I just wrote and tried the real fix I mentioned earlier, which I had in mind but never tried until this point. It is intended to fix the parallel mount behavior instead of working around the timing issue.

As mentioned earlier, I think the real problem is the fact that two threads for the same mount point can run concurrently on initial delegation, which is caused by libzfs_path_contains() returning false for the same directory on determining non-descendant index on initial delegation.

It's just two lines of code, and only affects zfs mount behavior. It works for this specific case at least, but I haven't tried it for anything else.

# git diff
diff --git a/lib/libzfs/libzfs_mount.c b/lib/libzfs/libzfs_mount.c
index 649c232aa..2b363f437 100644
--- a/lib/libzfs/libzfs_mount.c
+++ b/lib/libzfs/libzfs_mount.c
@@ -1307,6 +1307,8 @@ mountpoint_cmp(const void *arga, const void *argb)
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
+       if (!strcmp(path1, path2))
+               return B_TRUE;
        return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
 }

It changes the initial dispatching algorithm so that zfs_foreach_mountpoint() doesn't spawn more than one thread for the same mount point.

From what I understand by looking at the code, the initial dispatching of threads is to select sets of mount points that don't have dependencies on each other, hence it can/should go lock-less and shouldn't race with other sets of mount points. Subsequent dispatching of threads is for each set of mounts with dependencies (i.e. child directories), hence those are serialized.

Edited:
After reading comments around the code, I think this is the way it's intended to work. Force pushed this one. Let's see how tests go.

@Rudd-O (Contributor) commented Jun 12, 2019

Sometimes I wonder if it isn't easier to use a systemd generator to establish which things to mount, then let systemd mount everything in parallel.

@behlendorf behlendorf requested a review from sebroy June 12, 2019 23:39
@kusumi (Member Author) commented Jun 13, 2019

It probably is easier for systemd-based distros. systemd also seems to support mount dependencies, whereas in zfs(8) such mounts are serialized.

@c0d3z3r0 (Contributor) commented:

This seems to be related to #8450.

@kusumi (Member Author) commented Jun 16, 2019

Most likely the same issue as #8833, and basically the same analysis:
threads are in a race condition, and any of the below fixes it.

  1. small delay
  2. serialize fork/exec
  3. fix algorithm (apparently preferred way)

@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 415ecae to 762913b Compare June 16, 2019 16:54
@kusumi kusumi changed the title Fix parallel mount failure caused by ZoL specific behavior Fix parallel mount race Jun 16, 2019
@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 221f336 to b6f1221 Compare June 17, 2019 06:51
@behlendorf (Contributor) left a comment:

This makes sense to me! @sebroy, would you mind reviewing the proposed fix?

 */
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
-	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
+	return (strcmp(path1, path2) == 0) ||
+	    (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
Contributor:

nit: can you add parentheses around the entire return, e.g.

return ((strcmp(path1, path2) == 0) ||
    (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));

@@ -19,7 +19,8 @@ dist_pkgdata_SCRIPTS = \
 zfs_mount_all_mountpoints.ksh \
 zfs_mount_encrypted.ksh \
 zfs_mount_remount.ksh \
-zfs_multi_mount.ksh
+zfs_multi_mount.ksh \
+zfs_mount_test_race.sh
Contributor:

Can you place this after zfs_mount_remount.ksh in order to keep the tests in alphabetical order? It'd be nice to order the linux.run file similarly too; it appears to also be out of order.

}
log_onexit cleanup

log_note "Testing github.com/zfsonlinux/zfs/issues/8833"
Contributor:

Since the issue was referenced in the above comment, I think a human-readable explanation would be better here. How about the following, based on your existing log_pass message:

-log_note "Testing github.com/zfsonlinux/zfs/issues/8833"
+log_note "Verify parallel mount ordering is consistent"

log_must mkfile $MINVDEVSIZE $DISK1
log_must mkfile $MINVDEVSIZE $DISK2
Contributor:

We've been gradually switching to the lighter weight log_must truncate -s $MINVDEVSIZE for one-off file vdevs.

Member Author:

Will fix the things you pointed out. I think we also want to explicitly rm these two tmp files in the ksh cleanup function; I forgot to remove those.

log_must zfs set mountpoint=$MNTPT $TESTPOOL2

log_must zfs list -o name,mounted,canmount,mountpoint
log_must mount | grep zfs
Contributor:

log_must ismounted $TESTPOOL1/$TESTFS1 can be used for this check, here and below.

 */
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
-	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
+	return (strcmp(path1, path2) == 0) ||

Can you explain how this fixes the issue at hand? If we have a list of mountpoints like this:

1. /a (canmount=off)
2. /a
3. /a/b
4. /a/c

How will this prevent the thread mounting (1) (which incidentally won't mount /a) from dispatching mounts for (3) and (4)?

Member Author:

If libzfs_path_contains() returns true for identical paths (i.e. two identical top-level directories during initial dispatching), then there's no chance the race can happen, because they are never handled by more than one thread (see zfs_foreach_mountpoint()).

@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 0231bfb to 817ad17 Compare June 20, 2019 08:49
@kusumi
Copy link
Member Author

kusumi commented Jun 20, 2019

@behlendorf
Updated + rebased.

Note that I'm seeing a build failure from the recent 30af21b, for which I submitted a fix in the separate PR #8939. I'm also seeing a test failure at the setup stage, possibly due to the same commit, with the error message below (not sure).
cannot create 'testpool': invalid feature 'redaction_bookmarks'

So even though the branch pushed here is rebased on the current origin/master, I've dropped 30af21b for my local testing of zfs_mount_test_race.sh. This race fix is purely a userspace matter with limited code coverage, so it won't be affected.

@behlendorf (Contributor) left a comment:

Thanks. Since the CI bots didn't hit the issue you fixed in #8939, it'd be nice to still have this PR rebased against master.

-'zfs_multi_mount', 'zfs_mount_all_fail', 'zfs_mount_all_mountpoints']
+'zfs_mount_all_001_pos', 'zfs_mount_all_fail', 'zfs_mount_all_mountpoints',
+'zfs_mount_encrypted', 'zfs_mount_remount', 'zfs_mount_test_race',
+'zfs_multi_mount']
Contributor:

It looks like reordering these tests exposed some missing cleanup code which caused zfs_mount_encrypted and zfs_mount_remount to fail.

Member Author:

Then I guess we just want to add zfs_mount_test_race at the end of this list? This was reordered alphabetically per your suggestion.

Contributor:

It would be nice to resolve whatever the cleanup issue is, since we should be able to reorder these tests. But if that's not going to be straightforward, I'm fine with preserving the ordering for now.

Member Author:

Updated

TESTPOOL2=zfs_mount_test_race_tp2
TESTFS2=zfs_mount_test_race_tf2

log_must zfs unmount -a # unmount zfs mounts from previous tests
Contributor:

Rather than unmounting and remounting let's use the __ZFS_POOL_RESTRICT environment variable which exists for this purpose. You should be able to export __ZFS_POOL_RESTRICT="$TESTPOOL1 $TESTPOOL2" which will restrict zfs mount to only those pools.

Member Author:

Will take a look at __ZFS_POOL_RESTRICT.

Member Author:

Updated


DISK1="$TMPDIR/zfs_mount_test_race_disk1"
TESTPOOL1=zfs_mount_test_race_tp1
TESTFS1=zfs_mount_test_race_tf1
Contributor:

Do we need to redefine TESTPOOL{1,2} and TESTFS{1,2}? They're already defined in default.cfg.

Member Author:

I used unique names since the cleanup does rm -rf /$TESTPOOL{1,2}, and also to be defensive against other tests in the zfs_mount category.

Contributor:

I see. There shouldn't be hidden dependencies between tests, so using the original names should be fine, but if you'd prefer the custom names just in case, that's fine with me.

Member Author:

I remembered why I used unique names for TESTPOOL{1,2}. If the test runs with the default names, it shouldn't rm -rf /$TESTPOOL{1,2} on cleanup, to be defensive toward other tests in the zfs_mount category; other tests could have created something there to use across several tests.

But then this test won't succeed if /testpool{1,2} already has dirent(s) in it. For example, running this test twice in a row locally fails on the second attempt with the error below, because a directory already exists under /testpool2.

So yes, it's safer to just go with unique names.

ERROR: zpool create -f testpool2 /var/tmp/zfs_mount_test_race_disk2 exited 1
mountpoint '/testpool2' exists and is not empty
use '-m' option to provide a different default
# tree /testpool2/
/testpool2/
`-- testfs2

1 directory, 0 files

log_must zfs unmount -a # unmount zfs mounts from previous tests
log_must rm -rf $MNTPT
log_must rm -rf /$TESTPOOL1
log_must rm -rf /$TESTPOOL2
Contributor:

Is this actually needed because a previous test didn't clean up, or is it just defensive?

Member Author:

Just being defensive given unique names for TESTPOOL{1,2}. I can drop it if this is redundant.

Contributor:

Then let's drop it.

Member Author:

Updated

rm -rf /$TESTPOOL2
rm -f $DISK1
rm -f $DISK2
log_must zfs mount -a # remount unmounted zfs mounts
Contributor:

Shouldn't be needed if you use __ZFS_POOL_RESTRICT, but it should be unset. See zfs_mount_all_mountpoints.ksh as an example.

Member Author:

Updated

@kusumi (Member Author) commented Jun 21, 2019

The branch is actually rebased on origin/master as of yesterday, though I had problems with it in my local testing, at both compile time and runtime (hence I dropped that commit locally for testing).

I'll re-rebase anyway when I push the next test code update.

@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from f41c755 to 16899b2 Compare June 23, 2019 07:32
@behlendorf behlendorf requested a review from sebroy June 28, 2019 23:24
log_must zfs set mountpoint=none $TESTPOOL1
log_must zfs set mountpoint=$MNTPT $TESTPOOL1/$TESTFS1

# Note that unmount can fail (due to race condition on `zfs mount -a`) with or

Can you expand on this comment and explain what you mean? How can an unmount failure be caused by a race condition in zfs mount -a?

Member Author:

How can an unmount failure be caused by a race condition in zfs mount -a?

????
That's what has been discussed since the original issue reported in #8833.
When threads are racing, the mount order is undefined.
When a thread for a mountpoint deeper in the directory tree wins (which is a bug), that's not what zfs unmount -a expects, hence the unmount fails.

@sebroy commented Jul 2, 2019:

@kusumi Please be patient with me. I'm asking clarifying questions in good faith. I legitimately did not understand how a race condition in the mount code could cause unmount to fail, even after having read the issue. Furthermore, people reading this code in the future won't know to go look at that issue. Comments should be self-contained and self-explanatory.

This comment is evidently trying to explain why canmount has to be set to off on $TESTPOOL2, which has a mountpoint of $MNTPT. This has to be done because you can't mount two filesystems on the same mountpoint. In the test, $TESTPOOL1/$TESTFS1 has the same mountpoint. Both cannot be mounted; hence, one of them has to have canmount set to off.

Member Author:

I legitimately did not understand how a race condition in the mount code could cause unmount to fail, even after having read the issue.

The unmount assumes the mounts were done in the correct order, so if a mount point deeper in the directory tree is shadowed by another one due to the undefined mount order caused by the race among pthreads, then unmount will fail, because it can't see a mount that is supposed to exist.

(See my first comment and the example I pasted in #8833, as well as my first comment in this thread. Also see #8450, which is likely the same problem.)

Furthermore, people reading this code in the future won't know to go look at that issue.

I did add a bit more description (basically my previous reply to your question) in my previous push, but I won't/can't repeat all the implementation details I explained in this thread and in the commit message (plus this test file already has lots of comments with ASCII art). That's why it has a reference to the GitHub URL.

This comment is evidently trying to explain why canmount has to be set to off on $TESTPOOL2, which has a mountpoint of $MNTPT. This has to be done because you can't mount two filesystems on the same mountpoint. In the test, $TESTPOOL1/$TESTFS1 has the same mountpoint. Both cannot be mounted; hence, one of them has to have canmount set to off.

Right, that's why the test turns it off, for the convenience of the mount layout used in this test case; but whether or not that property is set, the race in the initial thread dispatching does happen. See #8450, which doesn't use canmount.


# Note that unmount can fail (due to race condition on `zfs mount -a`) with or
# without `canmount=off`. The race has nothing to do with canmount property.
log_must zfs set canmount=off $TESTPOOL2

Can you also add a test that does the exact same thing, except sets canmount=off on TESTPOOL1 instead, just in case the code is accidentally dependent on the ordering of the filesystems passed in?

Member Author:

just in case the code is accidentally dependent on the ordering of the filesystems passed in?

I don't get what "accidentally" is supposed to mean, but the entire point of this race fix in the initial thread dispatching has nothing to do with the canmount property. If canmount=off on a different top-level directory causes a difference or a problem, that's another issue.


@kusumi From my understanding, the canmount property has to be set to off on the filesystems with duplicate mount points in order to hit this bug. Can you show me a case where this bug gets hit and canmount isn't set to off on any of the filesystems?

Essentially, this has to do with two or more filesystems having the same mountpoint, all but one of which does not get mounted (canmount=off). If a thread handling the mounting of a filesystem whose canmount property is set to off spawns child filesystem mounts before others, then those child mounts will fail because their parent filesystems will not have been mounted. This updated fix addresses this issue because the threads with duplicate mountpoints will no longer be spawned in parallel, and the one legitimate filesystem (the one that will be mounted) will be guaranteed to be mounted before the threads that mount its child filesystems get spawned.

I'm not trying to criticize your code or tests here, I'm providing you with feedback so that we can solidify the code. This is supposed to be constructive.

Member Author:

Can you show me a case where this bug gets hit and canmount isn't set to off on any of the filesystems?

This example does fail to unmount without canmount=off. In this test and in the example in #8833, canmount=off is set on TESTPOOL2 or p2, so while setting mountpoint=/... does fail to mount TESTPOOL2 or p2 without canmount=off, the race condition between the other two mounts (which are mounted by different threads without this fix, but run in a single thread with this fix) still exists.

Also, #8450 isn't using the canmount property either.

@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 36cb976 to 0192b86 Compare July 2, 2019 08:50
@kusumi (Member Author) commented Jul 2, 2019

@behlendorf
rebased + force pushed again. please merge.

@kusumi (Member Author) commented Jul 2, 2019

Looking at #8450 (another report that is likely the same issue) more carefully, the race is likely happening with / involved, so it probably needs special handling, something like the below. It needs to count /var/data/... as a descendant of /.

Unfortunately I can't reproduce #8450, so I can't test this one.

diff --git a/lib/libzfs/libzfs_mount.c b/lib/libzfs/libzfs_mount.c
index 8c7a7eabe..e70acd425 100644
--- a/lib/libzfs/libzfs_mount.c
+++ b/lib/libzfs/libzfs_mount.c
@@ -1341,6 +1341,8 @@ mountpoint_cmp(const void *arga, const void *argb)
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
+       if (strcmp(path1, "/") == 0 || strcmp(path2, "/") == 0)
+               return B_TRUE;
        return ((strcmp(path1, path2) == 0) ||
            (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
 }
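
For what it's worth, the reason the unpatched check misses this case can be seen by evaluating the expression by hand: for path1 = "/", path2[strlen(path1)] inspects the second character of path2, which is a letter rather than '/'. Below is a tiny standalone demonstration using a local copy of the original expression (illustration only, not the libzfs source file).

#include <stdio.h>
#include <string.h>

/* Local copy of the original (pre-fix) check, for demonstration only. */
static int
original_contains(const char *path1, const char *path2)
{
	return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
}

int
main(void)
{
	/*
	 * strstr("/var/data", "/") points at the start of path2, but
	 * "/var/data"[strlen("/")] is 'v', not '/', so the check reports
	 * "not contained" even though everything descends from "/".
	 */
	printf("\"/\" contains \"/var/data\" -> %d\n",
	    original_contains("/", "/var/data"));	/* prints 0 */
	/* The same expression also misses identical paths (the #8833 case). */
	printf("\"/a\" contains \"/a\"      -> %d\n",
	    original_contains("/a", "/a"));		/* prints 0 */
	return (0);
}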

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 3, 2019
@kusumi kusumi force-pushed the parallel1 branch 2 times, most recently from 0261f37 to 685f2ef Compare July 4, 2019 10:08
@kusumi (Member Author) commented Jul 4, 2019

Added the above (strcmp(path1, "/") == 0 ||) and rebased, so you might want to wait until the tests complete before merging. Now that there are two known patterns, I've also added the text below to the commit message.

In terms of testing, I much prefer the #8450 layout to the #8833 one due to its simplicity (no use of canmount), except that it requires a zfs mount at / (which fails in my environment because zfs userspace doesn't allow a mountpoint directory with dirent(s) in it, so basically I have not reproduced #8450).

There are currently two known patterns of `zfs_foreach_mountpoint(...,
handles, ...)` input which cause the race condition.

1) #8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directory.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of #8833, ThreadA won the race. ThreadB was created because
 "/a" wasn't considered as `"/a" contains "/a"`.

2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list which contains "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of #8450, ThreadA won the race. ThreadB was created because
 "/var/data" wasn't considered as `"/" contains "/var/data"`. In other
 words, if there is at least one "/" in the input list, it must be
 single threaded since every directory is a child of "/", meaning they
 all directly or indirectly depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result create another thread
when mounts should be done in a single thread. Fix a conditional in
libzfs_path_contains() to consider above two cases.

Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878

Signed-off-by: Tomohiro Kusumi <[email protected]>
@kusumi (Member Author) commented Jul 6, 2019

Test results are good.
Updated/revised commit message a bit, rebased and force pushed.

@codecov bot commented Jul 6, 2019

Codecov Report

Merging #8878 into master will increase coverage by 0.07%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #8878      +/-   ##
==========================================
+ Coverage   78.62%    78.7%   +0.07%     
==========================================
  Files         388      388              
  Lines      119992   119993       +1     
==========================================
+ Hits        94349    94445      +96     
+ Misses      25643    25548      -95
Flag Coverage Δ
#kernel 79.46% <ø> (+0.02%) ⬆️
#user 66.52% <100%> (+0.34%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@c0d3z3r0 (Contributor) commented Jul 7, 2019

Looking at #8450 (another one which is likely the same issue) more carefully, the race is likely happening with / involved, so it probably needs a special handling, something like below. It wants to count /var/data/... as a descendant of /.

Unfortunately I can't reproduce #8450, so I can't test this one.

diff --git a/lib/libzfs/libzfs_mount.c b/lib/libzfs/libzfs_mount.c
index 8c7a7eabe..e70acd425 100644
--- a/lib/libzfs/libzfs_mount.c
+++ b/lib/libzfs/libzfs_mount.c
@@ -1341,6 +1341,8 @@ mountpoint_cmp(const void *arga, const void *argb)
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
+       if (strcmp(path1, "/") == 0 || strcmp(path2, "/") == 0)
+               return B_TRUE;
        return ((strcmp(path1, path2) == 0) ||
            (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
 }

@kusumi #8450 is resolved by this patch ;)

root@debian:[~]# zfs create -o mountpoint=/var/test dpool/var_test
root@debian:[~]# zfs create -o mountpoint=/var/test/testlower dpool/var_test/testlower
root@debian:[~]# mount | grep test
dpool/var_test on /var/test type zfs (rw,relatime,xattr,posixacl)
dpool/var_test/testlower on /var/test/testlower type zfs (rw,relatime,xattr,posixacl)
root@debian:[~]# zfs mount | grep test
dpool/var_test                  /var/test
dpool/var_test/testlower        /var/test/testlower

root@debian:[~]# touch /var/test/testfile
root@debian:[~]# touch /var/test/testlower/testfilelower
root@debian:[~]# zfs unmount dpool/var_test/testlower
root@debian:[~]# zfs unmount dpool/var_test

root@debian:[~]# zfs mount -a
root@debian:[~]# zfs mount | grep test
dpool/var_test                  /var/test
dpool/var_test/testlower        /var/test/testlower
root@debian:[~]# ls /var/test
testfile  testlower
root@debian:[~]# ls /var/test/testlower/
testfilelower

@kusumi (Member Author) commented Jul 7, 2019

@c0d3z3r0 Great.

By the way, I dropped the strcmp(path2, "/") == 0 conditional from my final diff, since it is redundant given the way this API is designed (path1 is always the parent side, so testing path1 is enough). It shouldn't matter, though.

The final diff looks like this:

 /*
- * Return true if path2 is a child of path1.
+ * Return true if path2 is a child of path1 or path2 equals path1 or
+ * path1 is "/" (path2 is always a child of "/").
  */
 static boolean_t
 libzfs_path_contains(const char *path1, const char *path2)
 {
-       return (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/');
+       return (strcmp(path1, path2) == 0 || strcmp(path1, "/") == 0 ||
+           (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
 }
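
If anyone wants to sanity-check the final predicate in isolation, here is a throwaway harness with a local copy of the fixed function (the boolean_t typedef below is a stand-in so the snippet builds outside the ZFS tree; this is an illustration, not part of the PR).

#include <stdio.h>
#include <string.h>

typedef int boolean_t;	/* stand-in for the ZFS typedef */

/* Local copy of the fixed predicate, for standalone testing only. */
static boolean_t
libzfs_path_contains(const char *path1, const char *path2)
{
	return (strcmp(path1, path2) == 0 || strcmp(path1, "/") == 0 ||
	    (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
}

int
main(void)
{
	/* #8833 pattern: identical top-level mountpoints. */
	printf("%d\n", libzfs_path_contains("/a", "/a"));	/* 1 */
	/* #8450 pattern: "/" contains every absolute path. */
	printf("%d\n", libzfs_path_contains("/", "/var/data"));	/* 1 */
	/* Ordinary parent/child still works. */
	printf("%d\n", libzfs_path_contains("/a", "/a/b"));	/* 1 */
	/* A sibling with a common prefix is still rejected. */
	printf("%d\n", libzfs_path_contains("/mnt", "/mnt2"));	/* 0 */
	return (0);
}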

@behlendorf (Contributor) commented:

@c0d3z3r0 thank you for verifying the fix!

@behlendorf (Contributor) left a comment:

I've resubmitted the two failed builders which hit an unrelated issue. Once they pass I'll go ahead and merge this. @kusumi thanks for resolving this issue.

@behlendorf behlendorf merged commit ab5036d into openzfs:master Jul 9, 2019
TulsiJain pushed a commit to TulsiJain/zfs that referenced this pull request Jul 20, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
TulsiJain pushed a commit to TulsiJain/zfs that referenced this pull request Jul 20, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 13, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 22, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 23, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 17, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 18, 2019
tonyhutter pushed a commit that referenced this pull request Sep 26, 2019