I was sent the following backtraces for analysis following a deadlock that was encountered on SoftNAS during a long-running send:
Aug 20 10:02:48 ip-10-0-5-75 kernel: INFO: task txg_sync:1157 blocked for more than 120 seconds.
Aug 20 10:02:48 ip-10-0-5-75 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 20 10:02:48 ip-10-0-5-75 kernel: txg_sync D ffff8800280c2680 0 1157 2 0x00000000
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff880399affad0 0000000000000246 ffffffff81007ca2 0000000000000000
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff880399affad8 0000000000000000 0000000000000200 0000000000000200
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff8803a0961098 ffff880399afffd8 000000000000fb88 ffff8803a0961098
Aug 20 10:02:48 ip-10-0-5-75 kernel: Call Trace:
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff814eb14b>] mutex_lock+0x2b/0x50
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294619>] zvol_rename_minors+0x79/0x180 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa020f319>] dsl_dataset_rename_snapshot_sync_impl+0x199/0x2d0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02113cf>] dsl_dataset_rename_snapshot_sync+0xbf/0x1a0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02211e2>] dsl_sync_task_sync+0xf2/0x100 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02199fb>] dsl_pool_sync+0x2eb/0x440 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa022d3ab>] spa_sync+0x40b/0xae0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0241749>] txg_sync_thread+0x369/0x5e0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff810527f9>] ? set_user_nice+0xc9/0x130
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02413e0>] ? txg_sync_thread+0x0/0x5e0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01023b8>] thread_generic_wrapper+0x68/0x80 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0102350>] ? thread_generic_wrapper+0x0/0x80 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81090626>] kthread+0x96/0xa0
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: INFO: task zfs:23467 blocked for more than 120 seconds.
Aug 20 10:02:48 ip-10-0-5-75 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 20 10:02:48 ip-10-0-5-75 kernel: zfs D ffff880028084680 0 23467 23462 0x00000000
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff8803148718d8 0000000000000286 ffffffff810074fd ffff880314871898
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffffffff81007ca2 0000000000000000 0000000000000000 00000000ffffffff
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff8803ac1145f8 ffff880314871fd8 000000000000fb88 ffff8803ac1145f8
Aug 20 10:02:48 ip-10-0-5-75 kernel: Call Trace:
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff810074fd>] ? xen_force_evtchn_callback+0xd/0x10
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007c8f>] ? xen_restore_fl_direct_end+0x0/0x1
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff814ec58c>] ? _spin_unlock_irqrestore+0x1c/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81090b9e>] ? prepare_to_wait_exclusive+0x4e/0x80
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff810074fd>] ? xen_force_evtchn_callback+0xd/0x10
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa010a635>] cv_wait_common+0x105/0x1c0 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa023d6c3>] ? spa_lookup+0x53/0x60 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa010a745>] __cv_wait+0x15/0x20 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa022810b>] rrw_enter_read+0x3b/0x150 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0228240>] rrw_enter+0x20/0x30 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa021898d>] dsl_pool_config_enter+0x1d/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0218a1a>] dsl_pool_hold+0x4a/0x60 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007c8f>] ? xen_restore_fl_direct_end+0x0/0x1
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01fd16b>] dmu_objset_hold+0x2b/0xb0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa021cd8f>] dsl_prop_get+0x3f/0x90 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8115f977>] ? kfree+0x127/0x320
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa021cdfe>] dsl_prop_get_integer+0x1e/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0293fa7>] __zvol_create_minor+0xa7/0x640 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa00fed0b>] ? kmem_free_debug+0x4b/0x150 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294567>] zvol_create_minor+0x27/0x40 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa029458e>] zvol_create_minors_cb+0xe/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01faace>] dmu_objset_find_impl+0x36e/0x3e0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294580>] ? zvol_create_minors_cb+0x0/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01fa91e>] dmu_objset_find_impl+0x1be/0x3e0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294580>] ? zvol_create_minors_cb+0x0/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294580>] ? zvol_create_minors_cb+0x0/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01fab92>] dmu_objset_find+0x52/0x80 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02930b3>] zvol_create_minors+0x33/0x40 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0268e7e>] zfs_ioc_snapshot+0x26e/0x2d0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa026a1d7>] zfsdev_ioctl+0x1e7/0x500 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81189012>] vfs_ioctl+0x22/0xa0
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff811891b4>] do_vfs_ioctl+0x84/0x580
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81189731>] sys_ioctl+0x81/0xa0
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
The txg_sync task called rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG) in dsl_sync_task_sync() and then zvol_rename_minors(), which immediately called mutex_lock(&zvol_state_lock). Simultaneously, a userland process started a snapshot and took mutex_lock(&zvol_state_lock) before creating minors; it then called dsl_pool_config_enter(), which did rrw_enter(&dp->dp_config_rwlock, RW_READER, tag). The two threads acquire the same pair of locks in opposite order, so each ends up blocked on the lock the other holds.
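As a minimal illustration of that inversion, here is a user-space sketch that uses a pthread rwlock and mutex to stand in for dp_config_rwlock and zvol_state_lock. The thread functions are illustrative stand-ins for the kernel paths above, not ZFS code; running both threads together can hang the same way the traces show.

```c
#include <pthread.h>

static pthread_rwlock_t dp_config_rwlock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  zvol_state_lock  = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the txg_sync path: write lock first, then the zvol mutex. */
static void *
txg_sync_path(void *arg)
{
	pthread_rwlock_wrlock(&dp_config_rwlock); /* dsl_sync_task_sync() */
	pthread_mutex_lock(&zvol_state_lock);     /* zvol_rename_minors() */
	/* ... rename the minors ... */
	pthread_mutex_unlock(&zvol_state_lock);
	pthread_rwlock_unlock(&dp_config_rwlock);
	return (NULL);
}

/* Stand-in for the snapshot ioctl path: zvol mutex first, then read lock. */
static void *
snapshot_ioctl_path(void *arg)
{
	pthread_mutex_lock(&zvol_state_lock);     /* zvol_create_minor() */
	pthread_rwlock_rdlock(&dp_config_rwlock); /* dsl_pool_config_enter() */
	/* ... look up volume properties ... */
	pthread_rwlock_unlock(&dp_config_rwlock);
	pthread_mutex_unlock(&zvol_state_lock);
	return (NULL);
}

/*
 * Run both paths concurrently; with unlucky timing each thread ends up
 * waiting on the lock the other holds, i.e. the hang reported above.
 */
int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, txg_sync_path, NULL);
	pthread_create(&b, NULL, snapshot_ioctl_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return (0);
}
```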
The system where this happened is running the latest version of SoftNAS, which has a CentOS 6.5 userland, a Linux 3.14.8 kernel, and ZoL 0.6.3 with some light patches on top.
This is a Linux-specific regression because Illumos does not have anything for renaming zvol snapshots (I am not sure the snapshots are even accessible there). I have a rough plan of action sketched out in my head. The general idea is to make the rename functionality asynchronous: keep a couple of lists of zvols to rename (syncing/quiescing style, similar to the txg commit) and then kick the system taskq to run the rename code whenever work is batched. That ought to avoid this deadlock. A rough sketch of the idea follows.
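Purely to illustrate that plan (this is not the actual patch), here is a sketch of dispatching the rename to the SPL system taskq. The names zvol_rename_task_t, zvol_rename_minors_task, and zvol_rename_minors_async are hypothetical, and the per-rename dispatch shown here simplifies away the dual-list batching described above.

```c
#include <sys/kmem.h>
#include <sys/taskq.h>
#include <sys/zvol.h>

/*
 * Hypothetical per-rename work item; the real change may batch these on
 * syncing/quiescing lists instead of dispatching them one at a time.
 */
typedef struct zvol_rename_task {
	char	zrt_oldname[MAXNAMELEN];
	char	zrt_newname[MAXNAMELEN];
} zvol_rename_task_t;

/*
 * Runs in system_taskq context, where no dsl_pool config lock is held,
 * so taking zvol_state_lock here cannot deadlock against txg_sync.
 */
static void
zvol_rename_minors_task(void *arg)
{
	zvol_rename_task_t *zrt = arg;

	zvol_rename_minors(zrt->zrt_oldname, zrt->zrt_newname);
	kmem_free(zrt, sizeof (zvol_rename_task_t));
}

/*
 * Hypothetical replacement for the direct zvol_rename_minors() call in
 * dsl_dataset_rename_snapshot_sync_impl(): just queue the work and return.
 */
void
zvol_rename_minors_async(const char *oldname, const char *newname)
{
	zvol_rename_task_t *zrt;

	zrt = kmem_alloc(sizeof (zvol_rename_task_t), KM_PUSHPAGE);
	(void) strlcpy(zrt->zrt_oldname, oldname, sizeof (zrt->zrt_oldname));
	(void) strlcpy(zrt->zrt_newname, newname, sizeof (zrt->zrt_newname));

	(void) taskq_dispatch(system_taskq, zvol_rename_minors_task,
	    zrt, TQ_PUSHPAGE);
}
```

Because the syncing thread only queues work and never takes zvol_state_lock while holding dp_config_rwlock as writer, the lock-order inversion above cannot occur; the trade-off is that the /dev minor nodes are renamed slightly after the txg that renamed the snapshot.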
ryao changed the title from "zfs send can cause zfs snapshot to deadlock the txg_sync thread on zvols" to "zfs rename can cause zfs snapshot to deadlock the txg_sync thread on zvols" on Sep 2, 2014.