dsl_dataset_handoff_check fails with EBUSY, perhaps due to z_zvol hold #7863
With help from @jgallag88, we've managed to reproduce what we think to be this same issue, but this time we're not relying on our closed source application to trigger it... First, we have to make the following change to the ZFS kernel module:
With this change in place, the following script will quickly cause a failure:
In case it's not obvious, the actual reproducer is the following lines from that script:
The rest is just setup/teardown required to run this loop. When running this, the second zfs recv intermittently fails with EBUSY.

Looking at the trace-bpfcc output from a failing run, just like when using our closed source application to trigger this, it looks like the hold taken by the z_zvol thread is what's causing the EBUSY.

Also worth noting, if I modify the reproducer to sleep for 10 seconds in between the two zfs recv commands, I can no longer reproduce the failure.
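Since the script itself isn't reproduced here, the following is only a hypothetical sketch of the shape of that reproducer (run with the module change above in place); the pool/dataset names, sizes, and stream paths are assumptions rather than the original script:

DIR=$(mktemp -d)

# one-time setup: a source zvol and two send streams
sudo zfs create -V 1G rpool/zvol-send
sudo zfs snapshot rpool/zvol-send@snap1
sudo zfs snapshot rpool/zvol-send@snap2
sudo zfs send rpool/zvol-send@snap1 > "$DIR/zvol-send.snap1"
sudo zfs send -i @snap1 rpool/zvol-send@snap2 > "$DIR/zvol-send.snap2"

while true; do
    # the two back-to-back receives are the actual reproducer;
    # the second one intermittently fails with EBUSY
    sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
    sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"

    # teardown before the next iteration
    sudo zfs destroy -r rpool/zvol-recv
done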
@behlendorf @bprotopopov if either of you have a chance to look at this, I'd appreciate any insight you might have. So far, I think this might be related to the change that introduced the z_zvol thread (commit a0bd735). I tried to test this theory by making the ZVOL minor creation synchronous with this change:

--- a/module/zfs/dmu_send.c
+++ b/module/zfs/dmu_send.c
@@ -4217,7 +4217,7 @@ dmu_recv_end_sync(void *arg, dmu_tx_t *tx)
 		drc->drc_newsnapobj =
 		    dsl_dataset_phys(drc->drc_ds)->ds_prev_snap_obj;
 	}
-	zvol_create_minors(dp->dp_spa, drc->drc_tofs, B_TRUE);
+	zvol_create_minors(dp->dp_spa, drc->drc_tofs, B_FALSE);
 	/*
 	 * Release the hold from dmu_recv_begin. This must be done before

but now I think I've reintroduced the deadlock that the taskq was intending to fix.
@prakashsurya, are the snapshots being automatically exposed as read-only devices?
@bprotopopov what is the default behavior? I don't think they're being exposed read-only, but I'd have to go back and double check to be sure. I'm not doing anything specific w.r.t. that, so it should just be doing whatever is the default behavior.

I'm not 100% sure of this, since I'm still learning how this part of the code works, but I think the minor that's being created and causing the EBUSY is for the rpool/zvol-recv ZVOL, not for either of the snapshots of that filesystem.

Specifically, when we do the first zfs recv, I believe it will create the rpool/zvol-recv dataset and start the ZVOL minor creation. Then, when the second zfs recv calls into dsl_dataset_handoff_check (via dsl_dataset_clone_swap_check_impl) at the end of the zfs recv ioctl, it detects the hold from the z_zvol thread (which is still attempting to create the minor from the first zfs recv) and returns EBUSY.

I'm speculating at this point, since I haven't actually tested this, but if my analysis so far is correct, I think this same issue can be triggered if we were to change the loop to something like this:

sudo zfs create -V 1G rpool/zvol-recv
sleep .1
sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"

i.e. we just need to do an operation that'll start the asynchronous ZVOL minor creation, and then do another operation that'll call dsl_dataset_clone_swap_check_impl on that same dataset. The dsl_dataset_clone_swap_check_impl call may race with the ZVOL minor creation (since that creation code path will take a hold on the dataset), sometimes resulting in the EBUSY error.

I also wouldn't be surprised if this could fail:

sudo zfs create -V 1G rpool/zvol-recv
sleep .1
sudo zfs destroy rpool/zvol-recv

where the zfs destroy fails with EBUSY because of the hold from the z_zvol thread.
@prakashsurya, there is a zvol property that controls exposing snapshots as devices. It is not set by default, but I think it could be inherited.
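The property in question is presumably snapdev (with volmode also affecting how volumes are exposed); a quick way to check what's in effect on the receive target, with the dataset name being an assumption:

# shows the current values and whether each was set locally, inherited, or left at the default
zfs get snapdev,volmode rpool/zvol-recv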
It's been a while since I looked at this, but I think your analysis sounds reasonable. The code in question is designed to allocate and deallocate device minor numbers and to create and destroy device nodes for zvols. The latter happens via udev, which works asynchronously in Linux. So, for instance, a script that creates a zvol and then writes to it via the device node will fail intermittently, because the device might or might not be created in time for the write.
I would expect that retrying the operation that fails with EBUSY should succeed; the retries could be automated with back-off up to a configurable max time (in the application code or ZFS code).
Making minor creation synchronous is risky because of possible deadlocks. At the time I looked at this, I thought that because of the complexity of dealing with the deadlocks, the asynchronous solution was a good alternative.
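As a rough illustration of that suggestion, script-level retries with back-off around the failing receive might look something like this (a sketch only; the dataset, stream path, and timing values are assumptions):

max_attempts=10
delay=0.1
attempt=1
# retry the receive that intermittently fails with EBUSY, doubling the delay each time
until sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
        echo "zfs recv still failing after $attempt attempts" >&2
        exit 1
    fi
    sleep "$delay"
    delay=$(awk -v d="$delay" 'BEGIN { print d * 2 }')
    attempt=$((attempt + 1))
done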
Yea. This is consistent with the behavior we see.
Right. We've worked around this limitation so far by adding logic that will do the following each time we create a ZVOL:

This way, by the time these two operations finish, we know we can safely use the ZVOL without hitting an EBUSY error due to holds from z_zvol.

IMO, the fact that one can't use a ZVOL immediately after it's created (e.g. can't even immediately destroy it) without risking an EBUSY error is broken behavior, even if this has been the existing behavior for a while. Further, even if we made the creation of the minor node synchronous, I think we'd still have the problem of udev asynchronously creating the device links.

Part of me thinks that the ZFS userspace commands/libraries should not return after creating a ZVOL until the ZVOL can be reliably used/destroyed/etc. E.g. libzfs could create the ZVOL, then wait for its device link to appear, before returning to the caller.

Anyways, @bprotopopov thanks for chiming in, I really appreciate it! In the short term, we're working around this problem by retrying/waiting as you suggest. In the long term, it'd be great not to have to do that, but I'll have to think more about this to try and come up with a better solution.
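The exact two steps aren't captured above; a minimal sketch of one plausible form of this kind of wait, assuming the thing being waited on is the udev-created /dev/zvol link (the dataset name and timeout are assumptions):

vol="rpool/zvol-recv"
timeout=30

# flush pending udev events first, if udevadm is available
command -v udevadm >/dev/null && sudo udevadm settle --timeout="$timeout"

# then poll until the device link for the zvol shows up
elapsed=0
while [ ! -e "/dev/zvol/$vol" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
        echo "timed out waiting for /dev/zvol/$vol" >&2
        exit 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
done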
Thanks for digging in to this, sorry I'm a little late to the discussion. Your analysis looks entirely reasonable to me as the root cause. And we have definitely noticed exactly these kinds of issues, even in the ZTS, due to the asynchronous nature of udev.

"Part of me thinks that the ZFS userspace commands/libraries should not return after creating a ZVOL."

This. We should do this. The good news is that these days virtually all the required code to handle this is already in place. Previously doing this would have been awkward, but now the libzfs library already links against libudev, so it should be straightforward to block waiting on the event, or retry in the EBUSY case. Making this the default behavior would be a nice improvement, though I would like to see an option to allow it to be async for those systems which can take advantage of it.
I concur - this would definitely be a usability improvement.
…eceiveTest failed due to "DelphixFatalException: unexpected ZFS error: EBUSY on..."

Cherry-pick of the following commit from master. For the backport, I had to move some of the code around because the zvol os-specific restructuring is not in 6.0.
async zvol minor node creation interferes with receive

When we finish a zfs receive, dmu_recv_end_sync() calls zvol_create_minors(async=TRUE). This kicks off some other threads that create the minor device nodes (in /dev/zvol/poolname/...). These async threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which both call dmu_objset_own(), which puts a "long hold" on the dataset. Since the zvol minor node creation is asynchronous, this can happen after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have completed.

After the first receive ioctl has completed, userland may attempt to do another receive into the same dataset (e.g. the next incremental stream). This second receive and the asynchronous minor node creation can interfere with one another in several different ways, because they both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does dsl_dataset_handoff_check(), which can fail with EBUSY if the async minor node creation already has a "long hold" on this dataset. This causes the second receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to open the device while the second receive's async attempt at minor node creation owns the dataset (via zvol_prefetch_minors_impl). This causes the minor node (/dev/zd*) to exist, but the udev-generated /dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the first receive's zvol_create_minor() tries to own the dataset while the second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor node. To avoid the lock ordering problems that the asynchrony was introduced to fix (see openzfs#3681), we create the minor nodes from open context, with no locks held, rather than from syncing context as was originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g. when running the recv, snapshot, create, or clone subcommands of zfs). We only need recursion when importing/opening a pool and when loading encryption keys. The existing recursive, asynchronous, prefetching code is preserved for use in these cases.

Channel programs may need to create zvol minor nodes when creating a snapshot of a zvol with the snapdev property set. We figure out what snapshots are created when running the Lua program in syncing context. In this case we need to remember what snapshots were created, and then try to create their minor nodes from open context, after the Lua code has completed.

There are additional zvol use cases that asynchronously own the dataset, which can cause similar problems. E.g. changing the volmode or snapdev properties. These are less problematic because they are not recursive and don't touch datasets that are not involved in the operation, but there is still potential for interference with subsequent operations. In the future, these cases should be similarly converted to create the zvol minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset, so they do not have this problem. However, it may make sense to also convert these operations to happen synchronously from open context, in the future.

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-65948
Closes openzfs#7863
Closes openzfs#9885
System information
Describe the problem you're observing
When running an application that heavily uses zfs send/recv, we'll sometimes see lzc_receive_resumable return EBUSY. Using trace-bpfcc, we've determined that EBUSY is coming from the dsl_dataset_handoff_check function (I think, via dsl_dataset_clone_swap_check_impl). Additionally, we've only seen this failure occur when the application is receiving a ZVOL.

Before I continue, I want to make it clear, I'm not sure if the root cause of the EBUSY is due to ZFS or something else (e.g. an application bug), and unfortunately I haven't yet reproduced the problem without using this proprietary application to trigger the EBUSY condition. Regardless, I wanted to (perhaps prematurely) open a ZFS bug since the evidence that I have so far leads me to think this might end up being a problem in ZFS. If it's determined this isn't due to a ZFS defect, feel free to close this out.
Here's what my reproducer looks like:
After a few iterations of the loop, it fails, and looking at the output from trace-bpfcc I see the following:

From this, we can see the call to dsl_dataset_handoff_check by txg_sync at time 206.2097 failed with EBUSY. Looking just above that call, we can see z_zvol call dsl_dataset_long_hold for this same dataset, at time 206.2021. And then z_zvol doesn't drop this hold until time 206.2230, when it calls dsl_dataset_long_rele.

Since I don't fully understand how/when the ZVOL minors are created I could be mistaken, but I think it's this hold by z_zvol that's causing the EBUSY condition, and resultant failure in txg_sync (and application failure).
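For reference, a trace-bpfcc invocation along roughly these lines could be used to watch the calls discussed above; the exact probe syntax and output formatting here are assumptions, not the command actually used:

# kernel return probe on dsl_dataset_handoff_check, plus entry probes on the
# long-hold/release functions taken by the z_zvol thread
sudo trace-bpfcc -t \
    'r::dsl_dataset_handoff_check (retval != 0) "handoff_check returned %d", retval' \
    'p::dsl_dataset_long_hold' \
    'p::dsl_dataset_long_rele'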