dsl_dataset_handoff_check fails with EBUSY, perhaps due to z_zvol hold #7863

Closed · prakashsurya opened this issue Sep 4, 2018 · 9 comments
Labels: Component: ZVOL (ZFS Volumes)

@prakashsurya (Member)

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  18.04
Linux Kernel          4.15.0-33-generic
Architecture          x86_64
ZFS Version           https://github.com/delphix/zfs/commit/0ef8832
SPL Version           https://github.com/delphix/zfs/commit/0ef8832

Describe the problem you're observing

When running an application that heavily uses zfs send/recv, we'll sometimes see lzc_receive_resumable return EBUSY. Using trace-bpfcc, we've determined that the EBUSY is coming from the dsl_dataset_handoff_check function (I think via dsl_dataset_clone_swap_check_impl). Additionally, we've only seen this failure occur when the application is receiving a ZVOL.

Before I continue, I want to make it clear that I'm not sure whether the root cause of the EBUSY lies in ZFS or somewhere else (e.g. an application bug), and unfortunately I haven't yet reproduced the problem without using this proprietary application to trigger the EBUSY condition. Regardless, I wanted to (perhaps prematurely) open a ZFS bug, since the evidence I have so far leads me to think this might end up being a problem in ZFS. If it's determined this isn't due to a ZFS defect, feel free to close this out.

Here's what my reproducer looks like:

$ cat ~/reproducer.sh
#!/bin/bash -x

while true; do
        sudo trace-bpfcc -t -K \
                'r::dsl_dataset_handoff_check (retval == 16)' \
                'p::dsl_dataset_handoff_check "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
                'p::dsl_dataset_long_hold "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
                'p::dsl_dataset_long_rele "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
                > ~/LX-1302.out &

        # Let BPF have a few seconds to start up before starting the test
        sleep 10

        # Run proprietary application, which may fail due to EBUSY
        sudo ./gradlew --daemon -x jFR :server:replication:dxosTest --tests '*sourceLaunchesNewJobOnStartup'
        rc=$?

        sudo pkill trace-bpfcc
        [[ $rc -eq 0 ]] || break
done

After a few iterations of the loop, it fails, and looking at the output from trace-bpfcc I see the following:

... snip ...

203.5538 12849  12849  txg_sync     dsl_dataset_handoff_check ds = 0xffff88ffe1c8c800, owner = 0x0
        dsl_dataset_handoff_check+0x1 [kernel]
        dmu_recv_end_check+0x19e [kernel]
        dsl_sync_task_sync+0x7a [kernel]
        dsl_pool_sync+0x2af [kernel]
        spa_sync+0x4ef [kernel]
        txg_sync_thread+0x223 [kernel]
        thread_generic_wrapper+0x74 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

203.5948 12834  12834  z_zvol       dsl_dataset_long_hold ds = 0xffff88ffe1c8c800, owner = 0xffff89004de44400
        dsl_dataset_long_hold+0x1 [kernel]
        dsl_dataset_own+0x3b [kernel]
        dmu_objset_own+0x92 [kernel]
        zvol_first_open+0x98 [kernel]
        zvol_open+0x187 [kernel]
        __blkdev_get+0xda [kernel]
        blkdev_get+0x129 [kernel]
        device_add_disk+0x3f9 [kernel]
        add_disk+0xe [kernel]
        zvol_create_minor_impl+0x45b [kernel]
        zvol_create_minors_impl+0x12d [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x210 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

204.2233 12834  12834  z_zvol       dsl_dataset_long_rele ds = 0xffff88ffe1c8c800, owner = 0xffff89004de44400
        dsl_dataset_long_rele+0x1 [kernel]
        dmu_objset_disown+0x30 [kernel]
        zvol_last_close+0x26 [kernel]
        zvol_release+0xeb [kernel]
        __blkdev_put+0x1dc [kernel]
        blkdev_put+0x4c [kernel]
        device_add_disk+0x40e [kernel]
        add_disk+0xe [kernel]
        zvol_create_minor_impl+0x45b [kernel]
        zvol_create_minors_impl+0x12d [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x210 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

206.2021 12834  12834  z_zvol       dsl_dataset_long_hold ds = 0xffff88ffe1c8c800, owner = 0xffffffffc052e7b0
        dsl_dataset_long_hold+0x1 [kernel]
        dsl_dataset_own+0x3b [kernel]
        dmu_objset_own+0x92 [kernel]
        zvol_create_minor_impl+0x139 [kernel]
        zvol_create_minors_impl+0x12d [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x210 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

206.2097 12849  12849  txg_sync     dsl_dataset_handoff_check ds = 0xffff88ffe1c8c800, owner = 0x0
        dsl_dataset_handoff_check+0x1 [kernel]
        dmu_recv_end_check+0x19e [kernel]
        dsl_sync_task_sync+0x7a [kernel]
        dsl_pool_sync+0x2af [kernel]
        spa_sync+0x4ef [kernel]
        txg_sync_thread+0x223 [kernel]
        thread_generic_wrapper+0x74 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

206.2097 12849  12849  txg_sync     dsl_dataset_handoff_check 
        kretprobe_trampoline+0x0 [kernel]
        dmu_recv_end_check+0x19e [kernel]
        dsl_sync_task_sync+0x7a [kernel]
        dsl_pool_sync+0x2af [kernel]
        spa_sync+0x4ef [kernel]
        txg_sync_thread+0x223 [kernel]
        thread_generic_wrapper+0x74 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

206.2230 12834  12834  z_zvol       dsl_dataset_long_rele ds = 0xffff88ffe1c8c800, owner = 0xffffffffc052e7b0
        dsl_dataset_long_rele+0x1 [kernel]
        dmu_objset_disown+0x30 [kernel]
        zvol_create_minor_impl+0x411 [kernel]
        zvol_create_minors_impl+0x12d [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x210 [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

... snip ...

From this, we can see that the call to dsl_dataset_handoff_check by txg_sync at time 206.2097 failed with EBUSY. Looking just above that call, we can see z_zvol call dsl_dataset_long_hold on this same dataset at time 206.2021, and z_zvol doesn't drop this hold until time 206.2230, when it calls dsl_dataset_long_rele.

Since I don't fully understand how/when the ZVOL minors are created, I could be mistaken, but I think it's this hold by z_zvol that's causing the EBUSY condition and the resultant failure in txg_sync (and, in turn, the application failure).

@prakashsurya (Member Author)

With help from @jgallag88, we've managed to reproduce what we think is this same issue, this time without relying on our closed-source application to trigger it...

First, we have to make the following change to the ZFS kernel module:

--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -733,6 +733,9 @@ dsl_dataset_long_hold(dsl_dataset_t *ds, void *tag)
 void
 dsl_dataset_long_rele(dsl_dataset_t *ds, void *tag)
 {
+#ifdef _KERNEL
+   mdelay(200);
+#endif
    (void) refcount_remove(&ds->ds_longholds, tag);
 }

With this change in place, the following script will quickly cause a failure:

#!/bin/bash -ex

set -o pipefail

function cleanup
{
        if pgrep trace-bpfcc &>/dev/null; then
                sudo pkill trace-bpfcc
        fi

        if zfs list rpool/zvol-recv &>/dev/null; then
                while ! sudo zfs destroy -r rpool/zvol-recv; do
                        sleep 1
                done
        fi

        if zfs list rpool/zvol-send &>/dev/null; then
                sudo zfs destroy -r rpool/zvol-send
        fi

        if [[ -d "$DIR" ]]; then
                rm -rf "$DIR"
        fi
}

trap cleanup EXIT

DIR=$(mktemp -d -p /tmp zfs-recv-test.XXXXXXX)

sudo stdbuf -oL -eL trace-bpfcc -t -K \
        'r::dsl_dataset_handoff_check (retval == 16)' \
        'p::dsl_dataset_handoff_check "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
        'p::dsl_dataset_long_hold "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
        'p::dsl_dataset_long_rele "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
        > ~/LX-1302.out &
sleep 5

sudo zfs create -V 1G rpool/zvol-send
sudo zfs snapshot rpool/zvol-send@snap1
sudo zfs snapshot rpool/zvol-send@snap2

sudo zfs send rpool/zvol-send@snap1 \
        >"$DIR/zvol-send.snap1"

sudo zfs send -i rpool/zvol-send@snap1 rpool/zvol-send@snap2 \
        >"$DIR/zvol-send.snap2"

while true; do
        sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
        sleep 0.1
        sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"

        while [[ ! -e /dev/zvol/rpool/zvol-recv ]]; do
                sleep .1
        done

        sudo udevadm settle
        sudo zfs destroy -r rpool/zvol-recv
done

In case it's not obvious, the actual reproducer is the following lines from that script:

while true; do
        sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
        sleep 0.1
        sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"
done

The rest is just setup/teardown required to run this loop.

When running this, the second zfs recv will fail with EBUSY; e.g.

+ sudo zfs recv -F rpool/zvol-recv@snap1                      
+ sleep 0.1                    
+ sudo zfs recv -F rpool/zvol-recv@snap2
cannot receive incremental stream: dataset is busy

Looking at the trace-bpfcc output, we can see the following:

8.858035 362    362    z_zvol       dsl_dataset_long_hold ds = 0xffff95c6f03ae000, owner = 0xffff95c6aa2a5400
        dsl_dataset_long_hold+0x1 [kernel]
        dsl_dataset_own+0x3b [kernel]
        dmu_objset_own+0x92 [kernel]
        zvol_first_open+0xc5 [kernel]
        zvol_open+0x3fd [kernel]
        __blkdev_get+0xda [kernel]
        blkdev_get+0x129 [kernel]
        device_add_disk+0x3f9 [kernel]
        add_disk+0xe [kernel]
        zvol_create_minor_impl+0x4ce [kernel]
        zvol_create_minors_impl+0x131 [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x22d [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

8.860954 407    407    txg_sync     dsl_dataset_handoff_check ds = 0xffff95c6f03ae000, owner = 0x0
        dsl_dataset_handoff_check+0x1 [kernel]
        dmu_recv_end_check+0x1d0 [kernel]
        dsl_sync_task_sync+0x88 [kernel]
        dsl_pool_sync+0x3da [kernel]
        spa_sync+0x62f [kernel]
        txg_sync_thread+0x28c [kernel]
        thread_generic_wrapper+0x7d [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

8.860994 407    407    txg_sync     dsl_dataset_handoff_check 
        kretprobe_trampoline+0x0 [kernel]
        dmu_recv_end_check+0x1d0 [kernel]
        dsl_sync_task_sync+0x88 [kernel]
        dsl_pool_sync+0x3da [kernel]
        spa_sync+0x62f [kernel]
        txg_sync_thread+0x28c [kernel]
        thread_generic_wrapper+0x7d [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

8.983752 362    362    z_zvol       dsl_dataset_long_rele ds = 0xffff95c6f03ae000, owner = 0xffff95c6aa2a5400
        dsl_dataset_long_rele+0x1 [kernel]
        dmu_objset_disown+0x30 [kernel]
        zvol_last_close+0x4e [kernel]
        zvol_release+0x16a [kernel]
        __blkdev_put+0x1dc [kernel]
        blkdev_put+0x4c [kernel]
        device_add_disk+0x40e [kernel]
        add_disk+0xe [kernel]
        zvol_create_minor_impl+0x4ce [kernel]
        zvol_create_minors_impl+0x131 [kernel]
        zvol_task_cb+0x6d [kernel]
        taskq_thread+0x22d [kernel]
        kthread+0x121 [kernel]
        ret_from_fork+0x35 [kernel]

Just as when using our closed-source application to trigger this, it looks like the hold taken by the z_zvol thread causes the dsl_dataset_handoff_check function to fail, which in turn causes the zfs recv operation to fail.

@prakashsurya (Member Author)

Also worth noting: if I modify the reproducer to sleep for 10 seconds between the zfs recv calls, instead of 0.1 seconds, the failure no longer occurs (at least, I haven't seen it occur yet); i.e.

sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
sleep 10
sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"

@prakashsurya (Member Author)

@behlendorf @bprotopopov if either of you has a chance to look at this, I'd appreciate any insight you might have. So far, I think this might be related to the change that introduced the z_zvol thread, here.

I tried to test this theory by making the ZVOL minor creation synchronous with this change:

--- a/module/zfs/dmu_send.c
+++ b/module/zfs/dmu_send.c
@@ -4217,7 +4217,7 @@ dmu_recv_end_sync(void *arg, dmu_tx_t *tx)
        drc->drc_newsnapobj =
            dsl_dataset_phys(drc->drc_ds)->ds_prev_snap_obj;
    }
-   zvol_create_minors(dp->dp_spa, drc->drc_tofs, B_TRUE);
+   zvol_create_minors(dp->dp_spa, drc->drc_tofs, B_FALSE);

    /*
     * Release the hold from dmu_recv_begin.  This must be done before

but now I think I've reintroduced the deadlock that the taskq was intended to fix.

@bprotopopov (Contributor) commented Sep 6, 2018 via email

@prakashsurya (Member Author) commented Sep 6, 2018

@bprotopopov what is the default behavior? I don't think they're being exposed read-only, but I'd have to go back and double-check to be sure. I'm not doing anything specific w.r.t. that, so it should just be doing whatever the default behavior is.

I'm not 100% sure of this, since I'm still learning how this part of the code works, but I think the minor that's being created and causing the EBUSY is for the rpool/zvol-recv ZVOL, not for either of the snapshots of that dataset.

Specifically, when we do the first zfs recv, I believe it will create the rpool/zvol-recv dataset and start the ZVOL minor creation. Then, when the second zfs recv calls into dsl_dataset_handoff_check (via dsl_dataset_clone_swap_check_impl) at the end of the zfs recv ioctl, it detects the hold from the z_zvol thread (that's still attempting to create the minor from the first zfs recv) and returns EBUSY.

I'm speculating at this point, since I haven't actually tested this, but if my analysis so far is correct, I think this same issue can be triggered if we were to change the loop to something like this:

sudo zfs create -V 1G rpool/zvol-recv
sleep .1
sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"

i.e. we just need to do an operation that'll start the asynchronous ZVOL minor creation, and then do another operation that'll call dsl_dataset_clone_swap_check_impl on that same dataset. The dsl_dataset_clone_swap_check_impl call may race with the ZVOL minor creation (since that creation code path will take a hold on the dataset), sometimes resulting in the EBUSY error.

I also wouldn't be surprised if this could fail:

sudo zfs create -V 1G rpool/zvol-recv
sleep .1
sudo zfs destroy rpool/zvol-recv

where the zfs destroy fails with EBUSY because of the hold from the z_zvol thread.

@bprotopopov (Contributor) commented Sep 6, 2018 via email

@prakashsurya (Member Author)

The code in question is designed to allocate and deallocate device minor numbers and to create and destroy device nodes for zvols. The latter happens via udev, which works asynchronously in Linux. So for instance, a script that creates a zvol and then writes to it via the device node will fail intermittently, because the device might or might not be created in time for the write.

Yea. This is consistent with the behavior we see. udev is another source of the same EBUSY issues for us.

I would expect that retrying the operation that fails with EBUSY should succeed; the retries could be automated with back-off up to a configurable max time (in the application code or zfs code).

Right. We've worked around this limitation so far by adding logic that will do the following each time we create a ZVOL:

  1. wait for the device link to appear
  2. call udevadm settle

This way, by the time these two operations finish, we know we can safely use the ZVOL without hitting an EBUSY error due to holds from z_zvol or udev.
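
As a rough illustration, here's a minimal shell sketch of that workaround (the volume name and timeout are hypothetical, not values from our setup):

#!/bin/bash
#
# Hypothetical helper: block until a newly created ZVOL is safe to use.
zvol="rpool/zvol-recv"
link="/dev/zvol/$zvol"
timeout=30

# 1. Wait for the udev-created device link to appear.
for ((i = 0; i < timeout * 10; i++)); do
        [[ -e "$link" ]] && break
        sleep 0.1
done
[[ -e "$link" ]] || { echo "timed out waiting for $link" >&2; exit 1; }

# 2. Wait for udev to finish processing its queued events (and drop any
#    transient opens of the device) before touching the ZVOL again.
sudo udevadm settle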

IMO, the fact that one can't use a ZVOL immediately after it's created (e.g. can't even immediately destroy it) without risking an EBUSY error is broken behavior, even if this has been the existing behavior for a while.

Further, even if we made the creation of the minor node synchronous, I think we still have the problem of udev taking a hold asynchronously, and that hold causing EBUSY errors.

Part of me thinks that the ZFS userspace command/libraries should not return after creating a ZVOL, until the ZVOL can be reliably used/destroyed/etc. E.g. libzfs could create the ZVOL, then wait for its device link to appear, and then call udevadm settle; and only return after all of those steps complete. This way, after calling zfs create, one could immediately call zfs destroy without risking the EBUSY issue.

Anyways, @bprotopopov thanks for chiming in, I really appreciate it! In the short term, we're working around this problem by retrying/waiting as you suggest. In the long term, it'd be great not to have to do that, but I'll have to think more about this to try and come up with a better solution.
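
For reference, here's a minimal sketch of the retry/back-off approach mentioned above (the stream file, target dataset, and retry limits are hypothetical):

#!/bin/bash
#
# Hypothetical wrapper: re-attempt a zfs recv that may transiently fail
# with EBUSY while z_zvol/udev still hold the dataset.
stream=/tmp/zvol-send.snap2
dataset=rpool/zvol-recv@snap2
max_tries=10

for ((try = 1; try <= max_tries; try++)); do
        # Reading from a file, so each retry replays the stream from the start.
        if sudo zfs recv -F "$dataset" < "$stream"; then
                exit 0
        fi
        # Linear back-off before the next attempt. A real wrapper would also
        # check that the failure was actually "dataset is busy" before retrying.
        sleep "$try"
done

echo "zfs recv still failing after $max_tries attempts" >&2
exit 1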

@loli10K added the Component: ZVOL (ZFS Volumes) label Sep 6, 2018
@behlendorf (Contributor)

Thanks for digging into this; sorry I'm a little late to the discussion. Your analysis looks entirely reasonable to me as the root cause. And we have definitely noticed exactly these kinds of issues, even in the ZTS, due to the asynchronous nature of udev.

Part of me thinks that the ZFS userspace command/libraries should not return after creating a ZVOL.

This. We should do this. The good news is that these days virtually all the required code to handle this is already in place. Previously, doing this would have been awkward, but now the libzfs library already links against libudev, so it should be straightforward to block waiting on the event, or to retry in the EBUSY case. Making this the default behavior would be a nice improvement, though I would like to see an option to allow it to be async for those systems which can take advantage of it.

@bprotopopov (Contributor) commented Sep 7, 2018 via email

ahrens added a commit to ahrens/zfs that referenced this issue Jan 24, 2020
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() tries to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see openzfs#3681), we create the minor nodes from open
context, with no locks held, rather than from syncing context as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, but there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Signed-off-by: Matthew Ahrens <[email protected]>
External issue: DLPX-65948
Closes openzfs#7863
ahrens added a commit to ahrens/zfs that referenced this issue Feb 14, 2020
…eceiveTest failed due to "DelphixFatalException: unexpected ZFS error: EBUSY on..."

Cherry-pick of the following commit from master.  For the backport, I
had to move some of the code around because the zvol os-specific
restructuring is not in 6.0.

async zvol minor node creation interferes with receive

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-65948
Closes openzfs#7863
Closes openzfs#9885
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021