lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() #3681
I am seriously considering performing the zvol_*_minor() related functions, invoked from dsl_dataset.c, spa.c, and zfs_ioctl.c, through a taskq, in order to isolate them from any locking conditions in the callers.
@bprotopopov what about changing the mutex to a rwlock?
@behlendorf I believe zvol_first_open() would then have to take the rwlock as a writer, whereas the txg_sync_thread would take it as a reader, still resulting in the same deadlock, or am I missing something? I have prototyped the asynchronous (in a taskq) zvol minor operations, and I think the code is much simplified as far as the analysis of the possible deadlocks. I also think with this approach, there is no longer a need to try-lock the spa_namespace_lock in the zvol_open() path (because everybody now takes the zvol_state_lock first and then other locks). Given that the udev-related side effects of the minor operations are asynchronous as well, would this approach seem like a more straightforward way to get things done?
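For concreteness, here is a minimal sketch of the prototyped dispatch path; zvol_task_t, zvol_task_cb(), zvol_dispatch(), and zvol_taskq are illustrative names, not the actual patch:

```c
#include <sys/taskq.h>
#include <sys/kmem.h>

/*
 * Sketch only: the minor operation is packaged up and handed to a
 * taskq, so the callback runs with no caller locks held and can
 * always take zvol_state_lock first.
 */
typedef enum {
	ZVOL_ASYNC_CREATE,
	ZVOL_ASYNC_REMOVE,
	ZVOL_ASYNC_RENAME
} zvol_async_op_t;

typedef struct zvol_task {
	zvol_async_op_t	zt_op;
	char		zt_name1[MAXNAMELEN];	/* dataset or old name */
	char		zt_name2[MAXNAMELEN];	/* new name for renames */
} zvol_task_t;

static taskq_t *zvol_taskq;

static void
zvol_task_cb(void *arg)
{
	zvol_task_t *task = arg;

	/* No locks are held here, so zvol_state_lock is taken first. */
	switch (task->zt_op) {
	case ZVOL_ASYNC_CREATE:
		(void) zvol_create_minors(task->zt_name1);
		break;
	case ZVOL_ASYNC_REMOVE:
		zvol_remove_minors(task->zt_name1);
		break;
	case ZVOL_ASYNC_RENAME:
		zvol_rename_minors(task->zt_name1, task->zt_name2);
		break;
	}
	kmem_free(task, sizeof (zvol_task_t));
}

static void
zvol_dispatch(zvol_async_op_t op, const char *name1, const char *name2)
{
	zvol_task_t *task;

	task = kmem_zalloc(sizeof (zvol_task_t), KM_SLEEP);
	task->zt_op = op;
	(void) strlcpy(task->zt_name1, name1, sizeof (task->zt_name1));
	if (name2 != NULL)
		(void) strlcpy(task->zt_name2, name2,
		    sizeof (task->zt_name2));

	(void) taskq_dispatch(zvol_taskq, zvol_task_cb, task, TQ_SLEEP);
}
```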
@bprotopopov perhaps the race would be smaller, but I think you're right. It would still exist. I'm definitely not opposed to doing the minor creation asynchronously, I was just tossing out another possible idea. I'd love to see your patch for this once you're happy with it. In fact, what would be ideal is if you could build up a small patch stack with this change and your other zvol improvements. Then we could get them all reviewed in one go and merged.
@behlendorf sure thing. Hope to be done by the end of the week.
@behlendorf, hope you can help: what is the proper way to build SPL and ZFS to enable asserts and other debug checks? I did make distclean, ./autogen.sh, ./configure --enable-debug, make rpm-utils && make rpm-kmod, and yum localinstall the RPMs for SPL first, then the same thing for ZFS. Yet I get a panic. What am I missing?
I'm not sure I follow the question. Passing --enable-debug as you did will enable the assertions, and it looks like you're hitting one, so that issue will need to be resolved.
Well, the assertion looks like a mismatch of type sizes between the SPL and ZFS modules, e.g. as if one were built with and one without debug support. Thanks for confirming the procedure; it must be something in my environment.
…() and dsl_pool_sync()...zvol_rename_minors() Execute zvol_*_minor(s) functions asynchronously in taskqs (one taskq per pool) in order to avoid having to deal with the locking state of the caller.
@behlendorf, wanted to clarify regarding the other zvol improvements you mentioned to be included with this bugfix; I can only think of the zvol minors scalability issue (#2217). I rebased that fix as well and will put up the pull request shortly. Please let me know if that's what you meant.
Thanks for linking to your patches and working on this. I think this improvement will help us close out a lot of long-standing issues. I took a quick look at the patch and I have a few questions.
Hi, @behlendorf, sorry for the confusion, so I remembered why a taskq per pool (barring limitations of concurrency due to the zvol_state_lock): the taskqs must be single-threaded to guarantee ordering of request processing, and it felt a bit unsettling to limit all the minor-related work for all the pools to a single thread. At present, the minor-related work can originate from many user contexts, and at least some of that processing happens concurrently (e.g. dmu_objset_find() - the actual reading from disk). So, I did not want to serialize I/O for multiple pools, because that could result in various undesirable dependencies between pools (e.g. hung I/O on one pool hanging minor processing on other pools, etc.). I am not sure how to achieve this concurrency while guaranteeing the ordering of request processing without multiple queues. It also seemed that for a single pool, it should be OK to have a single thread executing requests sequentially, since the consecutive dmu_objset_find()s will likely co-benefit from I/O caching. In my testing, I did not notice any issues with performance.
@behlendorf, I have moved the make_async_arg() logic into zvol_async_dispatch(). I held off on KM_SLEEP vs. KM_PUSHPAGE, and I think it is a good idea to keep a queue per pool as discussed above. What's your take?
@bprotopopov I see, that makes sense. For example, we wouldn't want it to be possible for a minor removal to potentially be handled before its matching minor create, so we're limited to a single thread to keep things simple and sequential. I suspect a single thread for the entire system would be more than enough, but it's completely reasonable to create one per-pool just in case. However, in that case I'd suggest hanging the taskq directly off the spa_t, creating it right away in spa_activate(). The KM_SLEEP change is safe to make, so I'd go ahead with it. If it weren't, nothing would work in the latest code!
@behlendorf, clarification: should I replace all KM_PUSHPAGE with KM_SLEEP in zvol.c?
@behlendorf, I looked into following your advice regarding creating queues in spa_activate(); the function is expected to succeed (no return value), yet taskq_create() could return NULL - many places in the code check for a non-NULL return. So, I could make taskq creation "best effort" and log a WARNING message if the creation fails.
I think it will be extremely rare that we cannot create a taskq at pool activation, and if we can't, there are other, more pressing issues to attend to.
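A minimal sketch of such best-effort creation in spa_activate(); the "z_zvol" name and the spa_zvol_taskq field are assumptions:

```c
#include <sys/taskq.h>

/*
 * Sketch: per-pool, single-threaded zvol taskq created in
 * spa_activate().  The single thread guarantees minor ops for this
 * pool run strictly in dispatch order while pools stay independent.
 * spa_activate() returns void, so a failed taskq_create() is logged
 * rather than failing pool activation.
 */
spa->spa_zvol_taskq = taskq_create("z_zvol", 1, maxclsyspri,
    1, INT_MAX, TASKQ_PREPOPULATE);
if (spa->spa_zvol_taskq == NULL)
	cmn_err(CE_WARN, "pool '%s': failed to create zvol taskq",
	    spa_name(spa));
```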
…n() and dsl_pool_sync()...zvol_rename_minors() Remove trylock of spa_namespace_lock as it is no longer needed; the code paths that take zvol_state_lock no longer take spa_namespace_lock prior to taking zvol_state_lock
…n() and dsl_pool_sync()...zvol_rename_minors() Create/destroy per-pool zvol asynchronous work taskqs in spa_activate()/spa_deactivate(), remove explicit calls to create/destroy taskqs at pool create/import/export/destroy. Remove the zvol taskq list and rwlock in zvol.c
So, if one creates task queues in spa_activate(), one later has to get at those queues at work dispatch time. One simple way of doing so (spa_open()) seems to deadlock (again) in the zvol_rename/zvol_create paths while trying to get the spa_namespace_lock in spa_open_common(). On the other hand, I am not sure there is a way of getting at the taskq some other way without modifying the callers of the zvol_*_minor() functions. I have the commits shown above. The original code, with creation of taskqs on demand, dealt with this explicitly by keeping a list of queues for opened pools. Getting at those queues on demand via the spa seems to be problematic.
Yes.
Right, it should be exceptionally unlikely. And in fact this assumption is already made by
I assumed the relevant spa_t pointer would already be available for all the zvol_*_minor() callers. If it is, then passing it as an argument seems entirely reasonable to me and is what I had in mind.
Unfortunately, I found that associating task queues with the pool structures has a major issue. It does not work for the cases of pool export/destruction, where we also need to remove minors for those pools after they are exported/destroyed (both imply removal from the pool namespace). The removal implies that a valid pointer to the pool structure cannot be obtained. In fact, were the tests to not deadlock, I would have quickly found that zvol_async_remove_minors() is guaranteed to fail, because spa_destroy() has called spa_remove() on the pool that should contain the pointer to the task queue. I think that, given the ideas we have tried, it is advisable to go with the original approach of explicit tracking of per-pool queues in the zvol code (commit e390e94). The additional complexity seems minimal compared to the concerns raised by interacting with the pool calls and structures.
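For reference, a sketch of what that explicit tracking looks like; the zvol_queue_t registry and lookup below are illustrative, not the exact code of e390e94:

```c
#include <sys/list.h>
#include <sys/rwlock.h>
#include <sys/taskq.h>

/*
 * Sketch only: zvol.c keeps its own registry keyed by pool name, so
 * dispatching needs neither a spa_t nor the spa_namespace_lock, and
 * a queue can outlive pool export/destroy long enough to run the
 * final remove-minors work.
 */
typedef struct zvol_queue {
	list_node_t	zq_node;
	char		zq_pool[MAXNAMELEN];
	taskq_t		*zq_taskq;
} zvol_queue_t;

static list_t zvol_queues;		/* protected by zvol_queues_lock */
static krwlock_t zvol_queues_lock;

static taskq_t *
zvol_queue_lookup(const char *pool)
{
	zvol_queue_t *zq;
	taskq_t *tq = NULL;

	rw_enter(&zvol_queues_lock, RW_READER);
	for (zq = list_head(&zvol_queues); zq != NULL;
	    zq = list_next(&zvol_queues, zq)) {
		if (strcmp(zq->zq_pool, pool) == 0) {
			tq = zq->zq_taskq;
			break;
		}
	}
	rw_exit(&zvol_queues_lock);

	return (tq);
}
```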
zfsonlinux issue #2217 - zvol minor operations: check snapdev property before traversing snapshots of a dataset zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Create a per-pool zvol taskq for asynchronous zvol tasks. There are a few key design decisions to be aware of. * Each taskq must be single threaded to ensure tasks are always processed in the order in which they were dispatched. * There is a taskq per-pool in order to keep the pools independent. This way if one pool is suspended it will not impact another. * The preferred location to dispatch a zvol minor task is a sync task. In this context there is easy access to the spa_t and minimal error handling is required because the sync task must succeed. Support for asynchronous zvol minor operations addresses issue #3681. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2217 Closes #3678 Closes #3681
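A sketch of the sync-task dispatch point described above (paraphrasing the design; exact signatures may differ):

```c
/*
 * From a sync task, the spa_t is directly at hand via the tx, and
 * minimal error handling is needed because the sync task itself
 * must succeed; the minor work is queued asynchronously.
 */
static void
dsl_dir_rename_sync_minors(dmu_tx_t *tx, const char *oldname,
    const char *newname)
{
	dsl_pool_t *dp = dmu_tx_pool(tx);

	zvol_rename_minors(dp->dp_spa, oldname, newname, B_TRUE);
}
```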
zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Remove trylock of spa_namespace_lock as it is no longer needed when zvol minor operations are performed in a separate context with no prior locking state; the spa_namespace_lock is no longer held when bdev->bd_mutex or zvol_state_lock might be taken in the code paths originating from the zvol minor operation callbacks. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3681
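For context, the removed logic looked roughly like this (paraphrased, not an exact quote of the old zvol_first_open()):

```c
/*
 * Because __blkdev_get() calls fops->open() with bdev->bd_mutex
 * held, blocking on spa_namespace_lock here could invert the usual
 * lock order, so the code had to trylock and bounce the open with
 * -ERESTARTSYS until both locks could be taken.  With minor ops now
 * running in their own context, this dance is unnecessary.
 */
if (!mutex_owned(&spa_namespace_lock)) {
	if (!mutex_tryenter(&spa_namespace_lock))
		return (-ERESTARTSYS);
	locked = 1;
}
```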
This issue still remains unresolved. I just encountered this error last night on ZoL 0.7.11 during send and receive; this is a full-SSD system running CentOS 7.5 x64.
It seems to me that the right direction here is to move the minor management code outside of any external locking domains; @behlendorf, what do you think about running minor management tasks in simple threads instead of pool taskqs?
Adding a dedicated thread would be fine, but it's not clear to me from the stacks above that it actually would resolve the issue. We already do this all asynchronously in a taskq thread, and in most cases we don't block waiting. Could you explain further what you had in mind?
@behlendorf, I am referring to the pool config lock taken prior to running minor management code. An external thread would not need to do that, and the lock order inversion would not occur, unless I am missing something.
Could you propose a patch with what you're proposing? I think that would probably be clearest.
Will try to get to this as soon as I can.
@beloncfy, could you please clarify how you see clear evidence of a deadlock (between the two contexts whose traces are shown)?
There was zero I/O read and write to my ZFS, and my server load average increased to > 200 without any CPU usage; the zfs list command was hanging there as well. It happened a few times on my other servers as well, with different versions of ZoL. I guess it was triggered by the zfs receive; my servers were doing send and receive quite aggressively (every 2 minutes) and running concurrently (more than 5 volumes simultaneously). There are some VMs running on them. I have tried reducing the send and receive frequency to 20 minutes and the concurrency to 1 volume per job, and so far so good.
@beloncfy, I get a sense that there are more threads that might be involved; can you dump thread traces with crash, perhaps, and attach them to the ticket?
When we finish a zfs receive, dmu_recv_end_sync() calls zvol_create_minors(async=TRUE). This kicks off some other threads that create the minor device nodes (in /dev/zvol/poolname/...). These async threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which both call dmu_objset_own(), which puts a "long hold" on the dataset. Since the zvol minor node creation is asynchronous, this can happen after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have completed.

After the first receive ioctl has completed, userland may attempt to do another receive into the same dataset (e.g. the next incremental stream). This second receive and the asynchronous minor node creation can interfere with one another in several different ways, because they both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does dsl_dataset_handoff_check(), which can fail with EBUSY if the async minor node creation already has a "long hold" on this dataset. This causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to open the device while the second receive's async attempt at minor node creation owns the dataset (via zvol_prefetch_minors_impl). This causes the minor node (/dev/zd*) to exist, but the udev-generated /dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the first receive's zvol_create_minor() tries to own the dataset while the second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor node. To avoid the lock ordering problems that the asynchrony was introduced to fix (see #3681), we create the minor nodes from open context, with no locks held, rather than from syncing context as was originally done.

Implementation notes: We generally do not need to traverse children or prefetch anything (e.g. when running the recv, snapshot, create, or clone subcommands of zfs). We only need recursion when importing/opening a pool and when loading encryption keys. The existing recursive, asynchronous, prefetching code is preserved for use in these cases.

Channel programs may need to create zvol minor nodes when creating a snapshot of a zvol with the snapdev property set. We figure out what snapshots are created when running the LUA program in syncing context. In this case we need to remember what snapshots were created, and then try to create their minor nodes from open context, after the LUA code has completed.

There are additional zvol use cases that asynchronously own the dataset, which can cause similar problems, e.g. changing the volmode or snapdev properties. These are less problematic because they are not recursive and don't touch datasets that are not involved in the operation, but there is still potential for interference with subsequent operations. In the future, these cases should be similarly converted to create the zvol minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset, so they do not have this problem. However, it may make sense to also convert these operations to happen synchronously from open context, in the future.

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-65948
Closes #7863
Closes #9885
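A hedged sketch of the resulting control flow; the function shapes and names below are assumptions, not the exact diff:

```c
/*
 * The key point: the minor node is now created from open context,
 * after the receive's sync task completes, with no locks held, so
 * neither the #3681 lock inversion nor the async long-hold races
 * can occur.
 */
static int
zfs_ioc_recv_impl(const char *tofs, dmu_recv_cookie_t *drc)
{
	int error;

	error = dmu_recv_end(drc, NULL);	/* sync task runs inside */
	if (error == 0)
		error = zvol_create_minor(tofs);	/* open context */

	return (error);
}
```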
…eceiveTest failed due to "DelphixFatalException: unexpected ZFS error: EBUSY on..." Cherry-pick of the above commit ("async zvol minor node creation interferes with receive") from master. For the backport, I had to move some of the code around because the zvol os-specific restructuring is not in 6.0.

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-65948
Closes openzfs#7863
Closes openzfs#9885
With this change all ZVOL updates are initiated from the SPA sync context instead of a mix of the sync and open contexts. The updates are queued to be applied by a dedicated thread in the original order. This should ensure that ZVOLs always accurately reflect the corresponding datasets. ZFS ioctl operations wait on the mentioned thread to complete its work. Thus, the illusion of a synchronous ZVOL update is preserved. At the same time, the SPA sync thread never blocks on ZVOL related operations, avoiding problems like the one reported in bug 203864. This change is based on earlier work in the same direction: D7179 and D14669 by Anthoine Bourgeois. D7179 tried to perform ZVOL operations in the open context and that opened races between them. D14669 uses a design very similar to this change but with different implementation details. This change also heavily borrows from similar code in ZoL, but there are many differences too. See: - openzfs/zfs@a0bd735 - openzfs/zfs#3681 - openzfs/zfs#2217 PR: 203864 MFC after: 5 weeks Sponsored by: CyberSecure Differential Revision: https://reviews.freebsd.org/D23478 git-svn-id: svn+ssh://svn.freebsd.org/base/head@362047 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
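For illustration, the same "enqueue from sync context, wait from open context" shape expressed with the ZoL/SPL taskq API; the FreeBSD change uses its own dedicated thread, and zvol_tq/zvol_apply_update are assumed names:

```c
/*
 * Sketch: spa_sync() only queues the update and never waits on ZVOL
 * work; the ioctl path waits for that specific update to be applied,
 * preserving the illusion of a synchronous ZVOL update for userland.
 */
taskqid_t id;

/* Syncing context: enqueue only. */
id = taskq_dispatch(zvol_tq, zvol_apply_update, task, TQ_SLEEP);

/* Open context (the ioctl): wait for the worker to apply it. */
taskq_wait_id(zvol_tq, id);
```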
There appears to be a problem with the following two code paths executed concurrently:
#0 [ffff8802a6e7ba20] schedule at ffffffff815299d0
#1 [ffff8802a6e7baf8] __mutex_lock_slowpath at ffffffff8152b3a6
#2 [ffff8802a6e7bb68] mutex_lock at ffffffff8152aecb
#3 [ffff8802a6e7bb88] zvol_rename_minors at ffffffffa02fe634 [zfs]
#4 [ffff8802a6e7bbd8] dsl_dir_rename_sync at ffffffffa027c04d [zfs]
#5 [ffff8802a6e7bc48] dsl_sync_task_sync at ffffffffa02850f2 [zfs]
#6 [ffff8802a6e7bc78] dsl_pool_sync at ffffffffa027d31b [zfs]
#7 [ffff8802a6e7bcf8] spa_sync at ffffffffa0293587 [zfs]
#8 [ffff8802a6e7bdc8] txg_sync_thread at ffffffffa02a890b [zfs]
#9 [ffff8802a6e7beb8] thread_generic_wrapper at ffffffffa0169978 [spl]
#10 [ffff8802a6e7bee8] kthread at ffffffff8109e78e
#11 [ffff8802a6e7bf48] kernel_thread at ffffffff8100c28a
#0 [ffff88020e53d9b8] schedule at ffffffff815299d0
#1 [ffff88020e53da90] cv_wait_common at ffffffffa0171e65 [spl]
#2 [ffff88020e53db20] __cv_wait at ffffffffa0171f75 [spl]
#3 [ffff88020e53db30] rrw_enter_read at ffffffffa028df7b [zfs]
#4 [ffff88020e53db60] rrw_enter at ffffffffa028e0d0 [zfs]
#5 [ffff88020e53db70] dsl_pool_config_enter at ffffffffa027c24d [zfs]
#6 [ffff88020e53db80] dsl_pool_hold at ffffffffa027c2da [zfs]
#7 [ffff88020e53dbc0] dmu_objset_own at ffffffffa025e1a0 [zfs]
#8 [ffff88020e53dc20] zvol_open at ffffffffa02fd4fb [zfs]
#9 [ffff88020e53dc90] __blkdev_get at ffffffff811cbc4e
#10 [ffff88020e53dcf0] blkdev_get at ffffffff811cbf70
#11 [ffff88020e53dd00] blkdev_open at ffffffff811cbff1
#12 [ffff88020e53dd30] __dentry_open at ffffffff8118b1ba
#13 [ffff88020e53dd90] nameidata_to_filp at ffffffff8118b524
#14 [ffff88020e53ddb0] do_filp_open at ffffffff811a12d0
#15 [ffff88020e53df20] do_sys_open at ffffffff8118af77
#16 [ffff88020e53df70] sys_open at ffffffff8118b080
#17 [ffff88020e53df80] system_call_fastpath at ffffffff8100b0f2
The first code path takes the dsl pool config lock and then tries to take zvol_state_lock, while the second takes zvol_state_lock and then tries to get the dsl pool config lock.
It is interesting that zvol_first_open(), called from zvol_open(), takes care to obtain spa_namespace_lock, and fails if it cannot, in order to avoid the lock order inversion between spa_namespace_lock and zvol_state_lock.
Yet txg_sync_thread does not hold spa_namespace_lock when calling dsl_pool_sync() and, further down, when trying to get zvol_state_lock. That is why I believe this defensive check is not effective in this particular case.
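Summarizing the two traces side by side:

```c
/*
 * txg_sync_thread (trace 1)          zvol_open() (trace 2)
 * -------------------------          ---------------------
 * dsl_pool_config_enter()            mutex_enter(&zvol_state_lock)
 * ...                                dmu_objset_own()
 * zvol_rename_minors():                dsl_pool_hold()
 *   mutex_enter(&zvol_state_lock)        dsl_pool_config_enter()
 *
 * Each thread holds the lock the other is blocked on: an ABBA
 * deadlock between the dsl pool config lock and zvol_state_lock,
 * which the spa_namespace_lock trylock in zvol_first_open() cannot
 * prevent.
 */
```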