dsl_dataset_handoff_check fails with EBUSY, perhaps due to z_zvol hold #7863

Closed · prakashsurya opened this issue Sep 4, 2018 · 9 comments
Labels: Component: ZVOL (ZFS Volumes)

@prakashsurya (Member)

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  18.04
Linux Kernel          4.15.0-33-generic
Architecture          x86_64
ZFS Version           https://github.com/delphix/zfs/commit/0ef8832
SPL Version           https://github.com/delphix/zfs/commit/0ef8832

Describe the problem you're observing

When running an application that heavily uses zfs send/recv, we'll sometimes see lzc_receive_resumable return EBUSY. Using trace-bpfcc, we've determined that the EBUSY is coming from the dsl_dataset_handoff_check function (I think via dsl_dataset_clone_swap_check_impl). Additionally, we've only seen this failure occur when the application is receiving a ZVOL.

Before I continue, I want to make it clear that I'm not sure whether the root cause of the EBUSY lies in ZFS or somewhere else (e.g. an application bug), and unfortunately I haven't yet reproduced the problem without using this proprietary application to trigger the EBUSY condition. Regardless, I wanted to (perhaps prematurely) open a ZFS bug, since the evidence I have so far leads me to think this might end up being a problem in ZFS. If it's determined this isn't due to a ZFS defect, feel free to close this out.

Here's what my reproducer looks like:

$ cat ~/reproducer.sh
#!/bin/bash -x

while true; do
        sudo trace-bpfcc -t -K \
                'r::dsl_dataset_handoff_check (retval == 16)' \
                'p::dsl_dataset_handoff_check "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
                'p::dsl_dataset_long_hold "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
                'p::dsl_dataset_long_rele "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
                > ~/LX-1302.out &

        # Let BPF have a few seconds to start up before starting the test
        sleep 10

        # Run proprietary application, which may fail due to EBUSY
        sudo ./gradlew --daemon -x jFR :server:replication:dxosTest --tests '*sourceLaunchesNewJobOnStartup'
        rc=$?

        sudo pkill trace-bpfcc
        [[ $rc -eq 0 ]] || break
done

After a few iterations of the loop, it fails, and looking at the output from trace-bpfcc I see the following:

... snip ...

203.5538 12849  12849  txg_sync     dsl_dataset_handoff_check ds = 0xffff88ffe1c8c800, owner = 0x0
        dsl_dataset_handoff_check+0x1 [kernel]
        dmu_recv_end_check+0x19e [kernel]
        dsl_sync_task_sync+0x7a [kernel]
        dsl_pool_sync+0x2af [kernel]
        spa_sync+0x4ef [kernel]
        txg_sync_thread+0x223 [kernel]
        thread_generic_wrapper+0x74 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

203.5948 12834  12834  z_zvol       dsl_dataset_long_hold ds = 0xffff88ffe1c8c800, owner = 0xffff89004de44400
        dsl_dataset_long_hold+0x1 [kernel]
        dsl_dataset_own+0x3b [kernel]
        dmu_objset_own+0x92 [kernel]
        zvol_first_open+0x98 [kernel]
        zvol_open+0x187 [kernel]
        __blkdev_get+0xda [kernel]
        blkdev_get+0x129 [kernel]
        device_add_disk+0x3f9 [kernel]
        add_disk+0xe [kernel]
        zvol_create_minor_impl+0x45b [kernel]
        zvol_create_minors_impl+0x12d [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x210 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

204.2233 12834  12834  z_zvol       dsl_dataset_long_rele ds = 0xffff88ffe1c8c800, owner = 0xffff89004de44400
        dsl_dataset_long_rele+0x1 [kernel]
        dmu_objset_disown+0x30 [kernel]
        zvol_last_close+0x26 [kernel]
        zvol_release+0xeb [kernel]
        __blkdev_put+0x1dc [kernel]
        blkdev_put+0x4c [kernel]
        device_add_disk+0x40e [kernel]
        add_disk+0xe [kernel]
        zvol_create_minor_impl+0x45b [kernel]
        zvol_create_minors_impl+0x12d [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x210 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

206.2021 12834  12834  z_zvol       dsl_dataset_long_hold ds = 0xffff88ffe1c8c800, owner = 0xffffffffc052e7b0
        dsl_dataset_long_hold+0x1 [kernel]
        dsl_dataset_own+0x3b [kernel]
        dmu_objset_own+0x92 [kernel]
        zvol_create_minor_impl+0x139 [kernel]
        zvol_create_minors_impl+0x12d [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x210 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

206.2097 12849  12849  txg_sync     dsl_dataset_handoff_check ds = 0xffff88ffe1c8c800, owner = 0x0
        dsl_dataset_handoff_check+0x1 [kernel]
        dmu_recv_end_check+0x19e [kernel]
        dsl_sync_task_sync+0x7a [kernel]
        dsl_pool_sync+0x2af [kernel]
        spa_sync+0x4ef [kernel]
        txg_sync_thread+0x223 [kernel]
        thread_generic_wrapper+0x74 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

206.2097 12849  12849  txg_sync     dsl_dataset_handoff_check 
        kretprobe_trampoline+0x0 [kernel]
        dmu_recv_end_check+0x19e [kernel]
        dsl_sync_task_sync+0x7a [kernel]
        dsl_pool_sync+0x2af [kernel]
        spa_sync+0x4ef [kernel]
        txg_sync_thread+0x223 [kernel]
        thread_generic_wrapper+0x74 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

206.2230 12834  12834  z_zvol       dsl_dataset_long_rele ds = 0xffff88ffe1c8c800, owner = 0xffffffffc052e7b0
        dsl_dataset_long_rele+0x1 [kernel]
        dmu_objset_disown+0x30 [kernel]
        zvol_create_minor_impl+0x411 [kernel]
        zvol_create_minors_impl+0x12d [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x210 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

From this, we can see that the call to dsl_dataset_handoff_check by txg_sync at time 206.2097 failed with EBUSY. Looking just above that call, we can see z_zvol call dsl_dataset_long_hold on this same dataset at time 206.2021, and z_zvol doesn't drop this hold until time 206.2230, when it calls dsl_dataset_long_rele.

Since I don't fully understand how/when the ZVOL minors are created, I could be mistaken, but I think it's this hold by z_zvol that's causing the EBUSY condition and the resultant failure in txg_sync (and, in turn, the application failure).

@prakashsurya (Member Author)

With help from @jgallag88, we've managed to reproduce what we think is this same issue, this time without relying on our closed-source application to trigger it...

First, we have to make the following change to the ZFS kernel module:

--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -733,6 +733,9 @@ dsl_dataset_long_hold(dsl_dataset_t *ds, void *tag)
 void
 dsl_dataset_long_rele(dsl_dataset_t *ds, void *tag)
 {
+#ifdef _KERNEL
+   mdelay(200);
+#endif
    (void) refcount_remove(&ds->ds_longholds, tag);
 }

With this change in place, the following script will quickly cause a failure:

#!/bin/bash -ex

set -o pipefail

function cleanup
{
        if pgrep trace-bpfcc &>/dev/null; then
                sudo pkill trace-bpfcc
        fi

        if zfs list rpool/zvol-recv &>/dev/null; then
                while ! sudo zfs destroy -r rpool/zvol-recv; do
                        sleep 1
                done
        fi

        if zfs list rpool/zvol-send &>/dev/null; then
                sudo zfs destroy -r rpool/zvol-send
        fi

        if [[ -d "$DIR" ]]; then
                rm -rf "$DIR"
        fi
}

trap cleanup EXIT

DIR=$(mktemp -d -p /tmp zfs-recv-test.XXXXXXX)

sudo stdbuf -oL -eL trace-bpfcc -t -K \
        'r::dsl_dataset_handoff_check (retval == 16)' \
        'p::dsl_dataset_handoff_check "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
        'p::dsl_dataset_long_hold "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
        'p::dsl_dataset_long_rele "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
        > ~/LX-1302.out &
sleep 5

sudo zfs create -V 1G rpool/zvol-send
sudo zfs snapshot rpool/zvol-send@snap1
sudo zfs snapshot rpool/zvol-send@snap2

sudo zfs send rpool/zvol-send@snap1 \
        >"$DIR/zvol-send.snap1"

sudo zfs send -i rpool/zvol-send@snap1 rpool/zvol-send@snap2 \
        >"$DIR/zvol-send.snap2"

while true; do
        sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
        sleep 0.1
        sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"

        while [[ ! -e /dev/zvol/rpool/zvol-recv ]]; do
                sleep .1
        done

        sudo udevadm settle
        sudo zfs destroy -r rpool/zvol-recv
done

In case it's not obvious, the actual reproducer is the following lines from that script:

while true; do
        sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
        sleep 0.1
        sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"
done

The rest is just setup/teardown required to run this loop.

When running this, the second zfs recv will fail with EBUSY; e.g.

+ sudo zfs recv -F rpool/zvol-recv@snap1                      
+ sleep 0.1                    
+ sudo zfs recv -F rpool/zvol-recv@snap2
cannot receive incremental stream: dataset is busy

Looking at the trace-bpfcc output, we can see the following:

8.858035 362    362    z_zvol       dsl_dataset_long_hold ds = 0xffff95c6f03ae000, owner = 0xffff95c6aa2a5400
        dsl_dataset_long_hold+0x1 [kernel]
        dsl_dataset_own+0x3b [kernel]
        dmu_objset_own+0x92 [kernel]
        zvol_first_open+0xc5 [kernel]
        zvol_open+0x3fd [kernel]
        __blkdev_get+0xda [kernel]
        blkdev_get+0x129 [kernel]
        device_add_disk+0x3f9 [kernel]
        add_disk+0xe [kernel]
        zvol_create_minor_impl+0x4ce [kernel]
        zvol_create_minors_impl+0x131 [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x22d [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

8.860954 407    407    txg_sync     dsl_dataset_handoff_check ds = 0xffff95c6f03ae000, owner = 0x0
        dsl_dataset_handoff_check+0x1 [kernel]
        dmu_recv_end_check+0x1d0 [kernel]
        dsl_sync_task_sync+0x88 [kernel]
        dsl_pool_sync+0x3da [kernel]
        spa_sync+0x62f [kernel]
        txg_sync_thread+0x28c [kernel]
        thread_generic_wrapper+0x7d [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

8.860994 407    407    txg_sync     dsl_dataset_handoff_check 
        kretprobe_trampoline+0x0 [kernel]
        dmu_recv_end_check+0x1d0 [kernel]
        dsl_sync_task_sync+0x88 [kernel]
        dsl_pool_sync+0x3da [kernel]
        spa_sync+0x62f [kernel]
        txg_sync_thread+0x28c [kernel]
        thread_generic_wrapper+0x7d [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

8.983752 362    362    z_zvol       dsl_dataset_long_rele ds = 0xffff95c6f03ae000, owner = 0xffff95c6aa2a5400
        dsl_dataset_long_rele+0x1 [kernel]
        dmu_objset_disown+0x30 [kernel]
        zvol_last_close+0x4e [kernel]
        zvol_release+0x16a [kernel]
        __blkdev_put+0x1dc [kernel]
        blkdev_put+0x4c [kernel]
        device_add_disk+0x40e [kernel]
        add_disk+0xe [kernel]
        zvol_create_minor_impl+0x4ce [kernel]
        zvol_create_minors_impl+0x131 [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x22d [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

Just as when using our closed-source application to trigger this, it looks like the hold taken by the z_zvol thread causes the dsl_dataset_handoff_check function to fail, which in turn causes the zfs recv operation to fail.

@prakashsurya (Member Author)

Also worth noting: if I modify the reproducer to sleep for 10 seconds between the zfs recv calls, instead of 0.1 seconds, the failure no longer occurs (at least, I haven't seen it occur yet); i.e.

sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
sleep 10
sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"

@prakashsurya (Member Author)

@behlendorf @bprotopopov if either of you has a chance to look at this, I'd appreciate any insight you might have. So far, I think this might be related to the change that introduced the z_zvol thread, here.

I tried to test this theory by making the ZVOL minor creation synchronous with this change:

--- a/module/zfs/dmu_send.c
+++ b/module/zfs/dmu_send.c
@@ -4217,7 +4217,7 @@ dmu_recv_end_sync(void *arg, dmu_tx_t *tx)
        drc->drc_newsnapobj =
            dsl_dataset_phys(drc->drc_ds)->ds_prev_snap_obj;
    }
-   zvol_create_minors(dp->dp_spa, drc->drc_tofs, B_TRUE);
+   zvol_create_minors(dp->dp_spa, drc->drc_tofs, B_FALSE);

    /*
     * Release the hold from dmu_recv_begin.  This must be done before

but now I think I've reintroduced the deadlock that the taskq was intended to fix.

@bprotopopov (Contributor) commented Sep 6, 2018 via email

@prakashsurya (Member Author) commented Sep 6, 2018

@bprotopopov what is the default behavior? I don't think they're being exposed read-only, but I'd have to go back and double-check to be sure. I'm not doing anything specific w.r.t. that, so it should just be doing whatever the default behavior is.

I'm not 100% sure of this, since I'm still learning how this part of the code works, but I think the minor that's being created and causing the EBUSY is for the rpool/zvol-recv ZVOL, not for either of the snapshots of that dataset.

Specifically, when we do the first zfs recv, I believe it will create the rpool/zvol-recv dataset and start the ZVOL minor creation. Then, when the second zfs recv calls into dsl_dataset_handoff_check (via dsl_dataset_clone_swap_check_impl) at the end of the zfs recv ioctl, it detects the hold from the z_zvol thread (that's still attempting to create the minor from the first zfs recv) and returns EBUSY.

I'm speculating at this point, since I haven't actually tested this, but if my analysis so far is correct, I think this same issue can be triggered if we were to change the loop to something like this:

sudo zfs create -V 1G rpool/zvol-recv
sleep .1
sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"

i.e. we just need to do an operation that'll start the asynchronous ZVOL minor creation, and then do another operation that'll call dsl_dataset_clone_swap_check_impl on that same dataset. The dsl_dataset_clone_swap_check_impl call may race with the ZVOL minor creation (since that creation code path will take a hold on the dataset), sometimes resulting in the EBUSY error.

I also wouldn't be surprised if this could fail:

sudo zfs create -V 1G rpool/zvol-recv
sleep .1
sudo zfs destroy rpool/zvol-recv

where the zfs destroy fails with EBUSY because of the hold from the z_zvol thread.

@bprotopopov (Contributor) commented Sep 6, 2018 via email

@prakashsurya (Member Author)

The code in question is designed to allocate and deallocate device minor numbers and to create and destroy device nodes for zvols. The latter happens via udev, which works asynchronously in Linux. So for instance, a script that creates a zvol and then writes to it via the device node will fail intermittently, because the device might or might not be created in time for the write.

Yea. This is consistent with the behavior we see. udev is another source of the same EBUSY issues for us.

I would expect that retrying the operation that fails with EBUSY should succeed; the retries could be automated with back-off up to a configurable max time (in the application code or zfs code).

Right. We've worked around this limitation so far by adding logic that will do the following each time we create a ZVOL:

  1. wait for the device link to appear
  2. call udevadm settle

This way, by the time these two operations finish, we know we can safely use the ZVOL without hitting an EBUSY error due to holds from z_zvol or udev.
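
As a rough illustration, here's a minimal shell sketch of that workaround (the volume name and timeout are hypothetical, not values from our setup):

#!/bin/bash
#
# Hypothetical helper: block until a newly created ZVOL is safe to use.
zvol="rpool/zvol-recv"
link="/dev/zvol/$zvol"
timeout=30

# 1. Wait for the udev-created device link to appear.
for ((i = 0; i < timeout * 10; i++)); do
        [[ -e "$link" ]] && break
        sleep 0.1
done
[[ -e "$link" ]] || { echo "timed out waiting for $link" >&2; exit 1; }

# 2. Wait for udev to finish processing its queued events (and drop any
#    transient opens of the device) before touching the ZVOL again.
sudo udevadm settle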

IMO, the fact that one can't use a ZVOL immediately after it's created (e.g. can't even immediately destroy it) without risking an EBUSY error is broken behavior, even if this has been the existing behavior for a while.

Further, even if we made the creation of the minor node synchronous, I think we still have the problem of udev taking a hold asynchronously, and that hold causing EBUSY errors.

Part of me thinks that the ZFS userspace command/libraries should not return after creating a ZVOL, until the ZVOL can be reliably used/destroyed/etc. E.g. libzfs could create the ZVOL, then wait for its device link to appear, and then call udevadm settle; and only return after all of those steps complete. This way, after calling zfs create, one could immediately call zfs destroy without risking the EBUSY issue.

Anyways, @bprotopopov thanks for chiming in, I really appreciate it! In the short term, we're working around this problem by retrying/waiting as you suggest. In the long term, it'd be great not to have to do that, but I'll have to think more about this to try and come up with a better solution.
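
For reference, here's a minimal sketch of the retry/back-off approach mentioned above (the stream file, target dataset, and retry limits are hypothetical):

#!/bin/bash
#
# Hypothetical wrapper: re-attempt a zfs recv that may transiently fail
# with EBUSY while z_zvol/udev still hold the dataset.
stream=/tmp/zvol-send.snap2
dataset=rpool/zvol-recv@snap2
max_tries=10

for ((try = 1; try <= max_tries; try++)); do
        # Reading from a file, so each retry replays the stream from the start.
        if sudo zfs recv -F "$dataset" < "$stream"; then
                exit 0
        fi
        # Linear back-off before the next attempt. A real wrapper would also
        # check that the failure was actually "dataset is busy" before retrying.
        sleep "$try"
done

echo "zfs recv still failing after $max_tries attempts" >&2
exit 1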

@loli10K added the Component: ZVOL (ZFS Volumes) label Sep 6, 2018
@behlendorf (Contributor)

Thanks for digging into this; sorry I'm a little late to the discussion. Your analysis looks entirely reasonable to me as the root cause. And we have definitely noticed exactly these kinds of issues, even in the ZTS, due to the asynchronous nature of udev.

Part of me thinks that the ZFS userspace command/libraries should not return after creating a ZVOL.

This. We should do this. The good news is that these days virtually all the required code to handle this is already in place. Previously, doing this would have been awkward, but now the libzfs library already links against libudev, so it should be straightforward to block waiting on the event, or to retry in the EBUSY case. Making this the default behavior would be a nice improvement, though I would like to see an option to allow it to be async for those systems which can take advantage of it.

@bprotopopov (Contributor) commented Sep 7, 2018 via email

ahrens added a commit to ahrens/zfs that referenced this issue Jan 24, 2020
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() tries to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see openzfs#3681), we create the minor nodes from open
context, with no locks held, rather than from syncing context as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, but there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Signed-off-by: Matthew Ahrens <[email protected]>
External issue: DLPX-65948
Closes openzfs#7863
ahrens added a commit to ahrens/zfs that referenced this issue Feb 14, 2020
…eceiveTest failed due to "DelphixFatalException: unexpected ZFS error: EBUSY on..."

Cherry-pick of the following commit from master.  For the backport, I
had to move some of the code around because the zvol os-specific
restructuring is not in 6.0.

async zvol minor node creation interferes with receive

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-65948
Closes openzfs#7863
Closes openzfs#9885
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021