lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() #3681
I am seriously considering performing the zvol_*_minor() related functions, invoked from dsl_dataset.c, spa.c, and zfs_ioctl.c, through a taskq, in order to isolate them from any locking conditions in the callers.
@bprotopopov what about changing the mutex to a rwlock?
@behlendorf I believe zvol_first_open() would then have to take the rwlock as a writer, whereas the txg_sync_thread would take it as a reader, still resulting in the same deadlock, or am I missing something? I have prototyped the asynchronous (in a taskq) zvol minor operations, and I think the code is much simplified as far as the analysis of the possible deadlocks. I also think with this approach, there is no longer a need to try-lock the spa_namespace_lock in the zvol_open() path (because everybody now takes the zvol_state_lock first and then other locks). Given that the udev-related side effects of the minor operations are asynchronous as well, would this approach seem like a more straightforward way to get things done?
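For concreteness, here is a minimal sketch of the prototyped dispatch path; zvol_task_t, zvol_task_cb(), zvol_dispatch(), and zvol_taskq are illustrative names, not the actual patch:

```c
#include <sys/taskq.h>
#include <sys/kmem.h>

/*
 * Sketch only: the minor operation is packaged up and handed to a
 * taskq, so the callback runs with no caller locks held and can
 * always take zvol_state_lock first.
 */
typedef enum {
	ZVOL_ASYNC_CREATE,
	ZVOL_ASYNC_REMOVE,
	ZVOL_ASYNC_RENAME
} zvol_async_op_t;

typedef struct zvol_task {
	zvol_async_op_t	zt_op;
	char		zt_name1[MAXNAMELEN];	/* dataset or old name */
	char		zt_name2[MAXNAMELEN];	/* new name for renames */
} zvol_task_t;

static taskq_t *zvol_taskq;

static void
zvol_task_cb(void *arg)
{
	zvol_task_t *task = arg;

	/* No locks are held here, so zvol_state_lock is taken first. */
	switch (task->zt_op) {
	case ZVOL_ASYNC_CREATE:
		(void) zvol_create_minors(task->zt_name1);
		break;
	case ZVOL_ASYNC_REMOVE:
		zvol_remove_minors(task->zt_name1);
		break;
	case ZVOL_ASYNC_RENAME:
		zvol_rename_minors(task->zt_name1, task->zt_name2);
		break;
	}
	kmem_free(task, sizeof (zvol_task_t));
}

static void
zvol_dispatch(zvol_async_op_t op, const char *name1, const char *name2)
{
	zvol_task_t *task;

	task = kmem_zalloc(sizeof (zvol_task_t), KM_SLEEP);
	task->zt_op = op;
	(void) strlcpy(task->zt_name1, name1, sizeof (task->zt_name1));
	if (name2 != NULL)
		(void) strlcpy(task->zt_name2, name2,
		    sizeof (task->zt_name2));

	(void) taskq_dispatch(zvol_taskq, zvol_task_cb, task, TQ_SLEEP);
}
```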
@bprotopopov perhaps the race would be smaller, but I think you're right. It would still exist. I'm definitely not opposed to doing the minor creation asynchronously, I was just tossing out another possible idea. I'd love to see your patch for this once you're happy with it. In fact, what would be ideal is if you could build up a small patch stack with this change and your other zvol improvements. Then we could get them all reviewed in one go and merged.
@behlendorf sure thing. Hope to be done by the end of the week.
@behlendorf, hope you can help: what is the proper way to build SPL and ZFS to enable asserts and other debug checks? I did make distclean, ./autogen.sh, ./configure --enable-debug, make rpm-utils && make rpm-kmod, and yum localinstall the RPMs for SPL first, then the same thing for ZFS. Yet I get a panic. What am I missing?
I'm not sure I follow the question. Passing --enable-debug as you did will enable the assertions, and it looks like you're hitting one, so that issue will need to be resolved.
Well, the assertion looks like a mismatch of type sizes between the SPL and ZFS modules, e.g. as if one were built with and one without debug support. Thanks for confirming the procedure; it must be something in my environment.
…() and dsl_pool_sync()...zvol_rename_minors() Execute zvol_*_minor(s) functions asynchronously in taskqs (one taskq per pool) in order to avoid having to deal with the locking state of the caller.
@behlendorf, wanted to clarify regarding the other zvol improvements you mentioned to be included with this bugfix; I can only think of the zvol minors scalability issue (#2217). I rebased that fix as well and will put up the pull request shortly. Please let me know if that's what you meant.
Thanks for linking to your patches and working on this. I think this improvement will help us close out a lot of long-standing issues. I took a quick look at the patch and I have a few questions.
Hi, @behlendorf, sorry for the confusion, so I remembered why a taskq per pool (barring limitations of concurrency due to the zvol_state_lock): the taskqs must be single-threaded to guarantee ordering of request processing, and it felt a bit unsettling to limit all the minor-related work for all the pools to a single thread. At present, the minor-related work can originate from many user contexts, and at least some of that processing happens concurrently (e.g. dmu_objset_find() - the actual reading from disk). So, I did not want to serialize I/O for multiple pools, because that could result in various undesirable dependencies between pools (e.g. hung I/O on one pool hanging minor processing on other pools, etc.). I am not sure how to achieve this concurrency while guaranteeing the ordering of request processing without multiple queues. It also seemed that for a single pool, it should be OK to have a single thread executing requests sequentially, since the consecutive dmu_objset_find()s will likely co-benefit from I/O caching. In my testing, I did not notice any issues with performance.
@behlendorf, I have moved the make_async_arg() logic into zvol_async_dispatch(). I held off on KM_SLEEP vs. KM_PUSHPAGE, and I think it is a good idea to keep a queue per pool as discussed above. What's your take?
@bprotopopov I see, that makes sense. For example, we wouldn't want it to be possible for a minor removal to potentially be handled before its matching minor create, so we're limited to a single thread to keep things simple and sequential. I suspect a single thread for the entire system would be more than enough, but it's completely reasonable to create one per-pool just in case. However, in that case I'd suggest hanging the taskq directly off the spa_t, creating it right away in spa_activate(). The KM_SLEEP change is safe to make, so I'd go ahead with it. If it weren't, nothing would work in the latest code!
@behlendorf, clarification: should I replace all KM_PUSHPAGE with KM_SLEEP in zvol.c?
@behlendorf, I looked into following your advice regarding creating queues in spa_activate(); the function is expected to succeed (no return value), yet taskq_create() could return NULL - many places in the code check for a non-NULL return. So, I could make taskq creation "best effort" and log a WARNING message if the creation fails.
I think it will be extremely rare that we cannot create a taskq at pool activation, and if we can't, there are other, more pressing issues to attend to.
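A minimal sketch of such best-effort creation in spa_activate(); the "z_zvol" name and the spa_zvol_taskq field are assumptions:

```c
#include <sys/taskq.h>

/*
 * Sketch: per-pool, single-threaded zvol taskq created in
 * spa_activate().  The single thread guarantees minor ops for this
 * pool run strictly in dispatch order while pools stay independent.
 * spa_activate() returns void, so a failed taskq_create() is logged
 * rather than failing pool activation.
 */
spa->spa_zvol_taskq = taskq_create("z_zvol", 1, maxclsyspri,
    1, INT_MAX, TASKQ_PREPOPULATE);
if (spa->spa_zvol_taskq == NULL)
	cmn_err(CE_WARN, "pool '%s': failed to create zvol taskq",
	    spa_name(spa));
```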
…n() and dsl_pool_sync()...zvol_rename_minors() Remove trylock of spa_namespace_lock as it is no longer needed; the code paths that take zvol_state_lock no longer take spa_namespace_lock prior to taking zvol_state_lock
…n() and dsl_pool_sync()...zvol_rename_minors() Create/destroy per-pool zvol asynchronous work taskqs in spa_activate()/spa_deactivate(), remove explicit calls to create/destroy taskqs at pool create/import/export/destroy. Remove the zvol taskq list and rwlock in zvol.c
So, if one creates task queues in spa_activate(), one later has to get at those queues at work dispatch time. One simple way of doing so (spa_open()) seems to deadlock (again) in the zvol_rename/zvol_create paths while trying to get the spa_namespace_lock in spa_open_common(). On the other hand, I am not sure there is a way of getting at the taskq some other way without modifying the callers of the zvol_*_minor() functions. I have the commits shown above. The original code, with creation of taskqs on demand, dealt with this explicitly by keeping a list of queues for opened pools. Getting at those queues on demand via the spa seems to be problematic.
Yes.
Right, it should be exceptionally unlikely. And in fact this assumption is already made by
I assumed the relevant spa_t pointer would already be available for all the zvol_*_minor() callers. If it is, then passing it as an argument seems entirely reasonable to me and is what I had in mind.
Unfortunately, I found that associating task queues with the pool structures has a major issue. It does not work for the cases of pool export/destruction, where we also need to remove minors for those pools after they are exported/destroyed (both imply removal from the pool namespace). The removal implies that a valid pointer to the pool structure cannot be obtained. In fact, were the tests to not deadlock, I would have quickly found that zvol_async_remove_minors() is guaranteed to fail, because spa_destroy() has called spa_remove() on the pool that should contain the pointer to the task queue. I think that, given the ideas we have tried, it is advisable to go with the original approach of explicit tracking of per-pool queues in the zvol code (commit e390e94). The additional complexity seems minimal compared to the concerns raised by interacting with the pool calls and structures.
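For reference, a sketch of what that explicit tracking looks like; the zvol_queue_t registry and lookup below are illustrative, not the exact code of e390e94:

```c
#include <sys/list.h>
#include <sys/rwlock.h>
#include <sys/taskq.h>

/*
 * Sketch only: zvol.c keeps its own registry keyed by pool name, so
 * dispatching needs neither a spa_t nor the spa_namespace_lock, and
 * a queue can outlive pool export/destroy long enough to run the
 * final remove-minors work.
 */
typedef struct zvol_queue {
	list_node_t	zq_node;
	char		zq_pool[MAXNAMELEN];
	taskq_t		*zq_taskq;
} zvol_queue_t;

static list_t zvol_queues;		/* protected by zvol_queues_lock */
static krwlock_t zvol_queues_lock;

static taskq_t *
zvol_queue_lookup(const char *pool)
{
	zvol_queue_t *zq;
	taskq_t *tq = NULL;

	rw_enter(&zvol_queues_lock, RW_READER);
	for (zq = list_head(&zvol_queues); zq != NULL;
	    zq = list_next(&zvol_queues, zq)) {
		if (strcmp(zq->zq_pool, pool) == 0) {
			tq = zq->zq_taskq;
			break;
		}
	}
	rw_exit(&zvol_queues_lock);

	return (tq);
}
```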
zfsonlinux issue #2217 - zvol minor operations: check snapdev property before traversing snapshots of a dataset zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Create a per-pool zvol taskq for asynchronous zvol tasks. There are a few key design decisions to be aware of. * Each taskq must be single threaded to ensure tasks are always processed in the order in which they were dispatched. * There is a taskq per-pool in order to keep the pools independent. This way if one pool is suspended it will not impact another. * The preferred location to dispatch a zvol minor task is a sync task. In this context there is easy access to the spa_t and minimal error handling is required because the sync task must succeed. Support for asynchronous zvol minor operations addresses issue #3681. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2217 Closes #3678 Closes #3681
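A sketch of the sync-task dispatch point described above (paraphrasing the design; exact signatures may differ):

```c
/*
 * From a sync task, the spa_t is directly at hand via the tx, and
 * minimal error handling is needed because the sync task itself
 * must succeed; the minor work is queued asynchronously.
 */
static void
dsl_dir_rename_sync_minors(dmu_tx_t *tx, const char *oldname,
    const char *newname)
{
	dsl_pool_t *dp = dmu_tx_pool(tx);

	zvol_rename_minors(dp->dp_spa, oldname, newname, B_TRUE);
}
```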
zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Remove trylock of spa_namespace_lock as it is no longer needed when zvol minor operations are performed in a separate context with no prior locking state; the spa_namespace_lock is no longer held when bdev->bd_mutex or zvol_state_lock might be taken in the code paths originating from the zvol minor operation callbacks. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3681
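For context, the removed logic looked roughly like this (paraphrased, not an exact quote of the old zvol_first_open()):

```c
/*
 * Because __blkdev_get() calls fops->open() with bdev->bd_mutex
 * held, blocking on spa_namespace_lock here could invert the usual
 * lock order, so the code had to trylock and bounce the open with
 * -ERESTARTSYS until both locks could be taken.  With minor ops now
 * running in their own context, this dance is unnecessary.
 */
if (!mutex_owned(&spa_namespace_lock)) {
	if (!mutex_tryenter(&spa_namespace_lock))
		return (-ERESTARTSYS);
	locked = 1;
}
```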
This issue still remains unresolved. I just encountered this error last night on ZoL 0.7.11 during send and receive; this is a full-SSD system running CentOS 7.5 x64.
It seems to me that the right direction here is to move the minor management code outside of any external locking domains; @behlendorf, what do you think about running minor management tasks in simple threads instead of pool taskqs?
Adding a dedicated thread would be fine, but it's not clear to me from the stacks above that it actually would resolve the issue. We already do this all asynchronously in a taskq thread, and in most cases we don't block waiting. Could you explain further what you had in mind?
@behlendorf, I am referring to the pool config lock taken prior to running minor management code. An external thread would not need to do that, and the lock order inversion would not occur, unless I am missing something.
Could you propose a patch with what you're proposing? I think that would probably be clearest.
Will try to get to this as soon as I can.
@beloncfy, could you please clarify how you see clear evidence of a deadlock (between the two contexts whose traces are shown)?
There was zero I/O read and write to my ZFS, and my server load average increased to > 200 without any CPU usage; the zfs list command was hanging there as well. It happened a few times on my other servers as well, with different versions of ZoL. I guess it was triggered by the zfs receive; my servers were doing send and receive quite aggressively (every 2 minutes) and running concurrently (more than 5 volumes simultaneously). There are some VMs running on them. I have tried reducing the send and receive frequency to 20 minutes and the concurrency to 1 volume per job, and so far so good.
@beloncfy, I get a sense that there are more threads that might be involved; can you dump thread traces with crash, perhaps, and attach them to the ticket?
When we finish a zfs receive, dmu_recv_end_sync() calls zvol_create_minors(async=TRUE). This kicks off some other threads that create the minor device nodes (in /dev/zvol/poolname/...). These async threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which both call dmu_objset_own(), which puts a "long hold" on the dataset. Since the zvol minor node creation is asynchronous, this can happen after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have completed.

After the first receive ioctl has completed, userland may attempt to do another receive into the same dataset (e.g. the next incremental stream). This second receive and the asynchronous minor node creation can interfere with one another in several different ways, because they both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does dsl_dataset_handoff_check(), which can fail with EBUSY if the async minor node creation already has a "long hold" on this dataset. This causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to open the device while the second receive's async attempt at minor node creation owns the dataset (via zvol_prefetch_minors_impl). This causes the minor node (/dev/zd*) to exist, but the udev-generated /dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the first receive's zvol_create_minor() tries to own the dataset while the second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor node. To avoid the lock ordering problems that the asynchrony was introduced to fix (see #3681), we create the minor nodes from open context, with no locks held, rather than from syncing context as was originally done.

Implementation notes: We generally do not need to traverse children or prefetch anything (e.g. when running the recv, snapshot, create, or clone subcommands of zfs). We only need recursion when importing/opening a pool and when loading encryption keys. The existing recursive, asynchronous, prefetching code is preserved for use in these cases.

Channel programs may need to create zvol minor nodes when creating a snapshot of a zvol with the snapdev property set. We figure out what snapshots are created when running the LUA program in syncing context. In this case we need to remember what snapshots were created, and then try to create their minor nodes from open context, after the LUA code has completed.

There are additional zvol use cases that asynchronously own the dataset, which can cause similar problems, e.g. changing the volmode or snapdev properties. These are less problematic because they are not recursive and don't touch datasets that are not involved in the operation, but there is still potential for interference with subsequent operations. In the future, these cases should be similarly converted to create the zvol minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset, so they do not have this problem. However, it may make sense to also convert these operations to happen synchronously from open context, in the future.

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-65948
Closes #7863
Closes #9885
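A hedged sketch of the resulting control flow; the function shapes and names below are assumptions, not the exact diff:

```c
/*
 * The key point: the minor node is now created from open context,
 * after the receive's sync task completes, with no locks held, so
 * neither the #3681 lock inversion nor the async long-hold races
 * can occur.
 */
static int
zfs_ioc_recv_impl(const char *tofs, dmu_recv_cookie_t *drc)
{
	int error;

	error = dmu_recv_end(drc, NULL);	/* sync task runs inside */
	if (error == 0)
		error = zvol_create_minor(tofs);	/* open context */

	return (error);
}
```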
…eceiveTest failed due to "DelphixFatalException: unexpected ZFS error: EBUSY on..." Cherry-pick of the above commit ("async zvol minor node creation interferes with receive") from master. For the backport, I had to move some of the code around because the zvol os-specific restructuring is not in 6.0.

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-65948
Closes openzfs#7863
Closes openzfs#9885
With this change all ZVOL updates are initiated from the SPA sync context instead of a mix of the sync and open contexts. The updates are queued to be applied by a dedicated thread in the original order. This should ensure that ZVOLs always accurately reflect the corresponding datasets. ZFS ioctl operations wait on the mentioned thread to complete its work. Thus, the illusion of a synchronous ZVOL update is preserved. At the same time, the SPA sync thread never blocks on ZVOL related operations, avoiding problems like the one reported in bug 203864. This change is based on earlier work in the same direction: D7179 and D14669 by Anthoine Bourgeois. D7179 tried to perform ZVOL operations in the open context and that opened races between them. D14669 uses a design very similar to this change but with different implementation details. This change also heavily borrows from similar code in ZoL, but there are many differences too. See: - openzfs/zfs@a0bd735 - openzfs/zfs#3681 - openzfs/zfs#2217 PR: 203864 MFC after: 5 weeks Sponsored by: CyberSecure Differential Revision: https://reviews.freebsd.org/D23478 git-svn-id: svn+ssh://svn.freebsd.org/base/head@362047 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
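For illustration, the same "enqueue from sync context, wait from open context" shape expressed with the ZoL/SPL taskq API; the FreeBSD change uses its own dedicated thread, and zvol_tq/zvol_apply_update are assumed names:

```c
/*
 * Sketch: spa_sync() only queues the update and never waits on ZVOL
 * work; the ioctl path waits for that specific update to be applied,
 * preserving the illusion of a synchronous ZVOL update for userland.
 */
taskqid_t id;

/* Syncing context: enqueue only. */
id = taskq_dispatch(zvol_tq, zvol_apply_update, task, TQ_SLEEP);

/* Open context (the ioctl): wait for the worker to apply it. */
taskq_wait_id(zvol_tq, id);
```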
There appears to be a problem with the following two code paths executed concurrently:
#0 [ffff8802a6e7ba20] schedule at ffffffff815299d0
#1 [ffff8802a6e7baf8] __mutex_lock_slowpath at ffffffff8152b3a6
#2 [ffff8802a6e7bb68] mutex_lock at ffffffff8152aecb
#3 [ffff8802a6e7bb88] zvol_rename_minors at ffffffffa02fe634 [zfs]
#4 [ffff8802a6e7bbd8] dsl_dir_rename_sync at ffffffffa027c04d [zfs]
#5 [ffff8802a6e7bc48] dsl_sync_task_sync at ffffffffa02850f2 [zfs]
#6 [ffff8802a6e7bc78] dsl_pool_sync at ffffffffa027d31b [zfs]
#7 [ffff8802a6e7bcf8] spa_sync at ffffffffa0293587 [zfs]
#8 [ffff8802a6e7bdc8] txg_sync_thread at ffffffffa02a890b [zfs]
#9 [ffff8802a6e7beb8] thread_generic_wrapper at ffffffffa0169978 [spl]
#10 [ffff8802a6e7bee8] kthread at ffffffff8109e78e
#11 [ffff8802a6e7bf48] kernel_thread at ffffffff8100c28a
#0 [ffff88020e53d9b8] schedule at ffffffff815299d0
#1 [ffff88020e53da90] cv_wait_common at ffffffffa0171e65 [spl]
#2 [ffff88020e53db20] __cv_wait at ffffffffa0171f75 [spl]
#3 [ffff88020e53db30] rrw_enter_read at ffffffffa028df7b [zfs]
#4 [ffff88020e53db60] rrw_enter at ffffffffa028e0d0 [zfs]
#5 [ffff88020e53db70] dsl_pool_config_enter at ffffffffa027c24d [zfs]
#6 [ffff88020e53db80] dsl_pool_hold at ffffffffa027c2da [zfs]
#7 [ffff88020e53dbc0] dmu_objset_own at ffffffffa025e1a0 [zfs]
#8 [ffff88020e53dc20] zvol_open at ffffffffa02fd4fb [zfs]
#9 [ffff88020e53dc90] __blkdev_get at ffffffff811cbc4e
#10 [ffff88020e53dcf0] blkdev_get at ffffffff811cbf70
#11 [ffff88020e53dd00] blkdev_open at ffffffff811cbff1
#12 [ffff88020e53dd30] __dentry_open at ffffffff8118b1ba
#13 [ffff88020e53dd90] nameidata_to_filp at ffffffff8118b524
#14 [ffff88020e53ddb0] do_filp_open at ffffffff811a12d0
#15 [ffff88020e53df20] do_sys_open at ffffffff8118af77
#16 [ffff88020e53df70] sys_open at ffffffff8118b080
#17 [ffff88020e53df80] system_call_fastpath at ffffffff8100b0f2
The first code path takes the dsl pool config lock and then tries to take zvol_state_lock, while the second takes zvol_state_lock and then tries to get the dsl pool config lock.
It is interesting that zvol_first_open(), called from zvol_open(), takes care to obtain spa_namespace_lock, and fails if it cannot, in order to avoid the lock order inversion between spa_namespace_lock and zvol_state_lock.
Yet txg_sync_thread does not hold spa_namespace_lock when calling dsl_pool_sync() and, further down, when trying to get zvol_state_lock. That is why I believe this defensive check is not effective in this particular case.
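Summarizing the two traces side by side:

```c
/*
 * txg_sync_thread (trace 1)          zvol_open() (trace 2)
 * -------------------------          ---------------------
 * dsl_pool_config_enter()            mutex_enter(&zvol_state_lock)
 * ...                                dmu_objset_own()
 * zvol_rename_minors():                dsl_pool_hold()
 *   mutex_enter(&zvol_state_lock)        dsl_pool_config_enter()
 *
 * Each thread holds the lock the other is blocked on: an ABBA
 * deadlock between the dsl pool config lock and zvol_state_lock,
 * which the spa_namespace_lock trylock in zvol_first_open() cannot
 * prevent.
 */
```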