zpool command hangs when attempting to create a new pool on a zvol #612

Closed · ryao opened this issue Mar 20, 2012 · 18 comments
@ryao (Contributor) commented Mar 20, 2012

I ran zpool create -o ashift=13 tank /dev/zvol/rpool/KVM/gfreebsd to try to create a pool for a VM with 0.6.0-rc7, but the command froze. I assume that this is related to issue #342 and possibly also issue #469.

@behlendorf (Contributor)

Yes, creating a new zfs pool on top of a zvol isn't handled. I've yet to look into exactly why it fails, but I can imagine a ton of potential deadlocks. I should probably merge a patch to detect when something is trying to do this and flat out prohibit it.
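
A check along those lines could live in the userspace tooling: refuse any vdev path that resolves into the zvol namespace. The sketch below is only a hypothetical illustration of that idea (the path_is_zvol() helper and the "/dev/zd" prefix test are assumptions, not the check that was eventually merged):

/*
 * Hypothetical userspace sketch of "detect and prohibit": refuse a vdev
 * whose path resolves to a zvol device node.  Not the check ZFS uses.
 */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
path_is_zvol(const char *path)
{
    char resolved[PATH_MAX];

    /* Resolve symlinks such as /dev/zvol/rpool/foo -> /dev/zdN. */
    if (realpath(path, resolved) == NULL)
        return (0);

    return (strncmp(resolved, "/dev/zd", 7) == 0);
}

int
main(int argc, char **argv)
{
    if (argc > 1 && path_is_zvol(argv[1])) {
        fprintf(stderr, "refusing to use zvol %s as a vdev\n", argv[1]);
        return (1);
    }
    printf("%s looks usable as a vdev\n", argc > 1 ? argv[1] : "(no path)");
    return (0);
}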

@ryao (Contributor, Author) commented Mar 21, 2012

In a similar vein, I suppose I could do zpool add rpool file /example, which would add a vdev backed by a file stored on the pool itself. That should definitely be prevented, assuming it is not already.

@fajarnugraha (Contributor)

As Richard mentioned, a VM-hosting environment where the VM's storage is backed by zvols and the VM itself also uses ZFS (e.g. Solaris derivatives) is a valid (albeit niche) use case for this.

The admin might want to create the pool on the host, populate the content (zfs send/receive, rsync, whatever), export the pool, then start up the guest whose root is on that pool. The copy is faster when done directly on the host than, for example, booting a livecd in the guest and transferring the files over ssh.

Anyway, as a workaround I managed to do that using VirtualBox + raw-disk VMDK mapping + scst_local. Still faster than using ssh.

@behlendorf (Contributor)

Has anyone tried this recently? The -rc11 release addresses many of the deadlocks which originally concerned me. No promises it will work now, but it might. Has anyone tried this on OI or FreeBSD?

@fajarnugraha (Contributor)

Still the same result using Ubuntu's daily ppa, v0.6.0.81-rc11

@lundman (Contributor) commented Dec 18, 2012

My basic understanding of the deadlock at the moment is that one thread is in this stack:

mutex_lock+0x2a/0x50
blkdev_get+0x210/0x300
mntput+0x24/0x40
blkdev_get_by_path+0x3d/0x90
vdev_disk_open+0xe8/0x3d0 [zfs]
vdev_open+0xf5/0x480 [zfs]
zvol_is_zvol+0x2a/0x50 [zfs]
vdev_open_children+0x4f/0x110 [zfs]
vdev_root_open+0x51/0xf0 [zfs]
vdev_open+0xf5/0x480 [zfs]
vdev_create+0x27/0x90 [zfs]
spa_create+0x1e2/0x7d0 [zfs]
zfs_ioc_pool_create+0x1dc/0x260 [zfs]

and the other gets stuck here;

mutex_lock+0x2a/0x50
spa_open_common+0x61/0x3d0 [zfs]
find_get_page+0x1e/0xa0
getcomponent.part.5+0x177/0x190 [zfs]
spa_open+0x13/0x20 [zfs]
dsl_dir_open_spa+0x533/0x610 [zfs]
dsl_dataset_hold+0x40/0x2d0 [zfs]
dsl_dataset_own+0x31/0x80 [zfs]
dmu_objset_own+0x37/0xe0 [zfs]
zone_statistics+0x99/0xc0
zvol_open+0x102/0x260 [zfs]
blkdev_get+0x1ad/0x300
blkdev_open+0x5f/0x90

Looking at zvol_open(), the call that it actually seems to hang inside is;

    if (!mutex_owned(&zvol_state_lock)) {
        mutex_enter(&zvol_state_lock);
        drop_mutex = 1;
    }

    if (zv->zv_open_count == 0) {
        error = zvol_first_open(zv);       // HERE

So the hang occurs in the case where we are holding the mutex and this is considered our "first open" of the device, i.e. the number of open references goes from 0 to 1. It is worth noting that zvol_open() is called at least twice before we attempt to create a pool inside it.

If I stop zvol_first_open() from being called, for example by never decreasing zv->zv_open_count, it avoids the deadlock. Another amusing workaround is to open the device for reading, but not read from it, before creating the pool;

exec 3</dev/zd0
zpool create -f data /dev/zd0
exec 3<&-

Also works.

Possible attacks are then:

  • can zvol_first_open() possibly be called just once, rather than for each initial open reference? (Solaris appears the same, so maybe not this one.)
  • can we detect that the zvol is in use, thereby leaving the function early, and/or skipping zvol_first_open(), and/or holding the mutex?
  • can we ensure that zvol_open() is run, and completed, before the other thread starts the zfs_ioc_pool_create() path?
  • Is it possible the ENXIO exit in zvol_open() in Solaris detects this situation?
        zv = zfsdev_get_soft_state(getminor(*devp), ZSST_ZVOL);
        if (zv == NULL) {
                mutex_exit(&zfsdev_state_lock);
                return (ENXIO);
        }

What does zfsdev_get_soft_state() do?

@behlendorf (Contributor)

Actually, according to your stacks this is caused by a lock inversion between the spa_namespace_lock and the bdev->bd_mutex. Here are the cleaned up and annotated stacks. The first stack is clearly from zpool create; do you have the process name for the second stack?

blkdev_get+0x210/0x300  -------> Takes bdev->bd_mutex
blkdev_get_by_path+0x3d/0x90
vdev_disk_open+0xe8/0x3d0 [zfs]
vdev_open+0xf5/0x480 [zfs]
vdev_open_children+0x4f/0x110 [zfs]
vdev_root_open+0x51/0xf0 [zfs]
vdev_open+0xf5/0x480 [zfs]
vdev_create+0x27/0x90 [zfs]
spa_create+0x1e2/0x7d0 [zfs] --------> Takes spa_namespace_lock
zfs_ioc_pool_create+0x1dc/0x260 [zfs]
spa_open_common+0x61/0x3d0 [zfs] -------> Takes spa_namespace_lock
spa_open+0x13/0x20 [zfs]
dsl_dir_open_spa+0x533/0x610 [zfs]
dsl_dataset_hold+0x40/0x2d0 [zfs]
dsl_dataset_own+0x31/0x80 [zfs]
dmu_objset_own+0x37/0xe0 [zfs]
zvol_open+0x102/0x260 [zfs]
__blkdev_get+0x000/0x000 -------> Takes bdev->bd_mutex
blkdev_get+0x1ad/0x300
blkdev_open+0x5f/0x90
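
To make the inversion concrete, here is a minimal userspace sketch of the same ABBA pattern (pthreads only, not ZFS code; the mutex names are just labels for the two locks, the sleeps force the unlucky interleaving, and the trylock lets the demo report the deadlock instead of hanging forever):

/*
 * Minimal userspace illustration of the ABBA inversion described above.
 * Not ZFS code: the mutex names are only labels.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t spa_namespace_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t bd_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Mimics zpool create: spa_create() takes spa_namespace_lock, then
 * vdev_disk_open() -> blkdev_get() needs bdev->bd_mutex. */
static void *
pool_create_thread(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&spa_namespace_lock);
    sleep(1);                        /* let the "blkid" thread take bd_mutex */
    puts("zpool create: waiting for bd_mutex");
    pthread_mutex_lock(&bd_mutex);   /* blocks until the other side gives up */
    pthread_mutex_unlock(&bd_mutex);
    pthread_mutex_unlock(&spa_namespace_lock);
    return (NULL);
}

/* Mimics blkid: __blkdev_get() takes bdev->bd_mutex, then zvol_open() ->
 * spa_open_common() needs spa_namespace_lock. */
static void *
blkid_thread(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&bd_mutex);
    sleep(1);
    if (pthread_mutex_trylock(&spa_namespace_lock) != 0)
        puts("blkid: spa_namespace_lock is held -> ABBA deadlock");
    else
        pthread_mutex_unlock(&spa_namespace_lock);
    pthread_mutex_unlock(&bd_mutex);
    return (NULL);
}

int
main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, pool_create_thread, NULL);
    pthread_create(&b, NULL, blkid_thread, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return (0);
}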

@lundman (Contributor) commented Dec 18, 2012

I was actually going to ask you that, as I am not entirely sure where it is from. But now that we both want to know, I shall focus on that.

@ryao (Contributor, Author) commented Dec 19, 2012

I have opened a pull request #1157 with a patch to fix this issue. It makes vdev_uses_zvols() unconditionally return B_TRUE, which is necessary until someone figures out how to make vdev_uses_zvols() work on Linux.

Making vdev_uses_zvols() work is not straightforward when we have cases such as a pool backed by a file on a filesystem on a zvol (or on a dataset). This needs more testing, but this patch makes the simple case of creating a pool on a zvol work.

@lundman (Contributor) commented Dec 19, 2012

From irc, I get a slightly different deadlock:

Dec 19 09:16:37 zfsdev kernel: [59692.928370] zvol: open 1 1 : 1
Dec 19 09:16:37 zfsdev kernel: [59692.928373] zvol: open 2
Dec 19 09:16:37 zfsdev kernel: [59692.928377] zvol: open 3
Dec 19 09:16:37 zfsdev kernel: [59692.928380] zvol: open 4
Dec 19 09:16:37 zfsdev kernel: [59692.928383] zvol: open exit
Dec 19 09:16:37 zfsdev kernel: [59692.970843] zvol: open enter
Dec 19 09:16:37 zfsdev kernel: [59692.970847] zvol: open 1 1 : 0
Dec 19 09:16:37 zfsdev kernel: [59692.979143]  not zvol '(null)'
Dec 19 09:16:37 zfsdev kernel: [59692.979149] uses zvol '/dev/zd0'!

Dec 19 09:19:24 zfsdev kernel: [59860.039275] SysRq : Show Blocked State
Dec 19 09:19:24 zfsdev kernel: [59860.039353]   task                        PC stack   pid father
Dec 19 09:19:24 zfsdev kernel: [59860.039391] zpool           D ffff8800603139c0     0 16959   1328 0x00000000
Dec 19 09:19:24 zfsdev kernel: [59860.039394]  ffff88001aa7f968 0000000000000086 ffff88004f3b1700 ffff88001aa7ffd8
Dec 19 09:19:24 zfsdev kernel: [59860.039397]  ffff88001aa7ffd8 ffff88001aa7ffd8 ffff88005d262e00 ffff88004f3b1700
Dec 19 09:19:24 zfsdev kernel: [59860.039399]  0000000000000000 ffff88005c4e63d8 ffff88004f3b1700 ffff88005c4e63dc
Dec 19 09:19:24 zfsdev kernel: [59860.039402] Call Trace:
Dec 19 09:19:24 zfsdev kernel: [59860.039410]  [<ffffffff81682189>] schedule+0x29/0x70
Dec 19 09:19:24 zfsdev kernel: [59860.039412]  [<ffffffff8168244e>] schedule_preempt_disabled+0xe/0x10
Dec 19 09:19:24 zfsdev kernel: [59860.039415]  [<ffffffff81680f67>] __mutex_lock_slowpath+0xd7/0x150
Dec 19 09:19:24 zfsdev kernel: [59860.039418]  [<ffffffff81680a3a>] mutex_lock+0x2a/0x50
Dec 19 09:19:24 zfsdev kernel: [59860.039422]  [<ffffffff811b9edb>] __blkdev_get+0x6b/0x4b0
Dec 19 09:19:24 zfsdev kernel: [59860.039424]  [<ffffffff811ba530>] blkdev_get+0x210/0x300
Dec 19 09:19:24 zfsdev kernel: [59860.039427]  [<ffffffff811a0444>] ? mntput+0x24/0x40
Dec 19 09:19:24 zfsdev kernel: [59860.039430]  [<ffffffff8118c662>] ? path_put+0x22/0x30
Dec 19 09:19:24 zfsdev kernel: [59860.039432]  [<ffffffff811ba75d>] blkdev_get_by_path+0x3d/0x90
Dec 19 09:19:24 zfsdev kernel: [59860.039464]  [<ffffffffa05af308>] vdev_disk_open+0xe8/0x3d0 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039485]  [<ffffffffa05ad085>] vdev_open+0xf5/0x480 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039505]  [<ffffffffa05add4f>] vdev_open_children+0x4f/0x110 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039526]  [<ffffffffa05b6361>] vdev_root_open+0x51/0xf0 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039546]  [<ffffffffa05ad085>] vdev_open+0xf5/0x480 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039567]  [<ffffffffa05ad437>] vdev_create+0x27/0x90 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039587]  [<ffffffffa059b7c2>] spa_create+0x1e2/0x7d0 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039608]  [<ffffffffa05cf76c>] zfs_ioc_pool_create+0x1dc/0x260 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039628]  [<ffffffffa05cf96d>] zfsdev_ioctl+0xdd/0x1d0 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039631]  [<ffffffff81193ad9>] do_vfs_ioctl+0x99/0x590
Dec 19 09:19:24 zfsdev kernel: [59860.039633]  [<ffffffff8119eb06>] ? alloc_fd+0xc6/0x110
Dec 19 09:19:24 zfsdev kernel: [59860.039636]  [<ffffffff8116decb>] ? kmem_cache_free+0x7b/0x100
Dec 19 09:19:24 zfsdev kernel: [59860.039638]  [<ffffffff8118cac3>] ? putname+0x33/0x50
Dec 19 09:19:24 zfsdev kernel: [59860.039640]  [<ffffffff81194069>] sys_ioctl+0x99/0xa0
Dec 19 09:19:24 zfsdev kernel: [59860.039643]  [<ffffffff8168b329>] system_call_fastpath+0x16/0x1b
Dec 19 09:19:24 zfsdev kernel: [59860.039644] blkid           D ffff8800602139c0     0 16964   8539 0x00000004
Dec 19 09:19:24 zfsdev kernel: [59860.039647]  ffff88004f1677d8 0000000000000082 ffff88001aa6ae00 ffff88004f167fd8
Dec 19 09:19:24 zfsdev kernel: [59860.039649]  ffff88004f167fd8 ffff88004f167fd8 ffff88004f3b5c00 ffff88001aa6ae00
Dec 19 09:19:24 zfsdev kernel: [59860.039651]  0000000000000de4 ffffffffa0626dc0 ffff88001aa6ae00 ffffffffa0626dc4
Dec 19 09:19:24 zfsdev kernel: [59860.039653] Call Trace:
Dec 19 09:19:24 zfsdev kernel: [59860.039656]  [<ffffffff81682189>] schedule+0x29/0x70
Dec 19 09:19:24 zfsdev kernel: [59860.039658]  [<ffffffff8168244e>] schedule_preempt_disabled+0xe/0x10
Dec 19 09:19:24 zfsdev kernel: [59860.039661]  [<ffffffff81680f67>] __mutex_lock_slowpath+0xd7/0x150
Dec 19 09:19:24 zfsdev kernel: [59860.039663]  [<ffffffff81680a3a>] mutex_lock+0x2a/0x50
Dec 19 09:19:24 zfsdev kernel: [59860.039683]  [<ffffffffa059daa1>] spa_open_common+0x61/0x3d0 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039700]  [<ffffffffa05802b7>] ? getcomponent.part.5+0x177/0x190 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039720]  [<ffffffffa059de23>] spa_open+0x13/0x20 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039738]  [<ffffffffa0582203>] dsl_dir_open_spa+0x533/0x610 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039754]  [<ffffffffa0576f20>] dsl_dataset_hold+0x40/0x2d0 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039771]  [<ffffffffa057b7f1>] dsl_dataset_own+0x31/0x80 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039785]  [<ffffffffa0563667>] dmu_objset_own+0x37/0xe0 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039788]  [<ffffffff8130f3f0>] ? disk_map_sector_rcu+0x80/0x80
Dec 19 09:19:24 zfsdev kernel: [59860.039809]  [<ffffffffa05f4819>] zvol_open+0xa9/0x2d0 [zfs]
Dec 19 09:19:24 zfsdev kernel: [59860.039811]  [<ffffffff8130fd0b>] ? get_gendisk+0x3b/0x110
Dec 19 09:19:24 zfsdev kernel: [59860.039813]  [<ffffffff811b9f40>] __blkdev_get+0xd0/0x4b0
Dec 19 09:19:24 zfsdev kernel: [59860.039815]  [<ffffffff811ba4cd>] blkdev_get+0x1ad/0x300
Dec 19 09:19:24 zfsdev kernel: [59860.039818]  [<ffffffff816830ae>] ? _raw_spin_lock+0xe/0x20
Dec 19 09:19:24 zfsdev kernel: [59860.039820]  [<ffffffff811ba67f>] blkdev_open+0x5f/0x90
Dec 19 09:19:24 zfsdev kernel: [59860.039823]  [<ffffffff811802df>] __dentry_open+0x21f/0x330
Dec 19 09:19:24 zfsdev kernel: [59860.039824]  [<ffffffff811ba620>] ? blkdev_get+0x300/0x300
Dec 19 09:19:24 zfsdev kernel: [59860.039827]  [<ffffffff8118042a>] vfs_open+0x3a/0x40
Dec 19 09:19:24 zfsdev kernel: [59860.039829]  [<ffffffff811813d8>] nameidata_to_filp+0x58/0xb0
Dec 19 09:19:24 zfsdev kernel: [59860.039830]  [<ffffffff8118fd0f>] do_last+0x49f/0xa10
Dec 19 09:19:24 zfsdev kernel: [59860.039833]  [<ffffffff812ec8bc>] ? apparmor_file_alloc_security+0x2c/0x60
Dec 19 09:19:24 zfsdev kernel: [59860.039835]  [<ffffffff81191569>] path_openat+0xd9/0x430
Dec 19 09:19:24 zfsdev kernel: [59860.039837]  [<ffffffff811919e1>] do_filp_open+0x41/0xa0
Dec 19 09:19:24 zfsdev kernel: [59860.039839]  [<ffffffff8119eb06>] ? alloc_fd+0xc6/0x110
Dec 19 09:19:24 zfsdev kernel: [59860.039841]  [<ffffffff81181525>] do_sys_open+0xf5/0x230
Dec 19 09:19:24 zfsdev kernel: [59860.039843]  [<ffffffff81181681>] sys_open+0x21/0x30
Dec 19 09:19:24 zfsdev kernel: [59860.039845]  [<ffffffff8168b329>] system_call_fastpath+0x16/0x1b
Dec 19 09:19:24 zfsdev kernel: [59860.039861] Sched Debug Version: v0.10, 3.5.0-19-generic #30-Ubuntu
Dec 19 09:19:24 zfsdev kernel: [59860.039863] ktime                                   : 59860039.826824
Dec 19 09:19:24 zfsdev kernel: [59860.039864] sched_clk                               : 59788410.125271
Dec 19 09:19:24 zfsdev kernel: [59860.039865] cpu_clk                                 : 59860039.860364
Dec 19 09:19:24 zfsdev kernel: [59860.039866] jiffies                                 : 4309857305
Dec 19 09:19:24 zfsdev kernel: [59860.039867] sched_clock_stable                      : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039868] 
Dec 19 09:19:24 zfsdev kernel: [59860.039868] sysctl_sched
Dec 19 09:19:24 zfsdev kernel: [59860.039870]   .sysctl_sched_latency                    : 12.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039871]   .sysctl_sched_min_granularity            : 1.500000
Dec 19 09:19:24 zfsdev kernel: [59860.039872]   .sysctl_sched_wakeup_granularity         : 2.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039873]   .sysctl_sched_child_runs_first           : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039874]   .sysctl_sched_features                   : 24119
Dec 19 09:19:24 zfsdev kernel: [59860.039875]   .sysctl_sched_tunable_scaling            : 1 (logaritmic)
Dec 19 09:19:24 zfsdev kernel: [59860.039876] 
Dec 19 09:19:24 zfsdev kernel: [59860.039876] cpu#0, 2530.763 MHz
Dec 19 09:19:24 zfsdev kernel: [59860.039877]   .nr_running                    : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039878]   .load                          : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039879]   .nr_switches                   : 1335832
Dec 19 09:19:24 zfsdev kernel: [59860.039880]   .nr_load_updates               : 344379
Dec 19 09:19:24 zfsdev kernel: [59860.039881]   .nr_uninterruptible            : -915
Dec 19 09:19:24 zfsdev kernel: [59860.039882]   .next_balance                  : 4309.857305
Dec 19 09:19:24 zfsdev kernel: [59860.039883]   .curr->pid                     : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039884]   .clock                         : 59860039.297132
Dec 19 09:19:24 zfsdev kernel: [59860.039885]   .cpu_load[0]                   : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039886]   .cpu_load[1]                   : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039887]   .cpu_load[2]                   : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039888]   .cpu_load[3]                   : 43
Dec 19 09:19:24 zfsdev kernel: [59860.039889]   .cpu_load[4]                   : 497
Dec 19 09:19:24 zfsdev kernel: [59860.039889]   .yld_count                     : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039890]   .sched_count                   : 124283211
Dec 19 09:19:24 zfsdev kernel: [59860.039891]   .sched_goidle                  : 545547
Dec 19 09:19:24 zfsdev kernel: [59860.039892]   .avg_idle                      : 882906
Dec 19 09:19:24 zfsdev kernel: [59860.039893]   .ttwu_count                    : 821796
Dec 19 09:19:24 zfsdev kernel: [59860.039894]   .ttwu_local                    : 671681
Dec 19 09:19:24 zfsdev kernel: [59860.039896] 
Dec 19 09:19:24 zfsdev kernel: [59860.039896] cfs_rq[0]:/autogroup-109
Dec 19 09:19:24 zfsdev kernel: [59860.039897]   .exec_clock                    : 118.693982
Dec 19 09:19:24 zfsdev kernel: [59860.039899]   .MIN_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.039900]   .min_vruntime                  : 563.120404
Dec 19 09:19:24 zfsdev kernel: [59860.039901]   .max_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.039902]   .spread                        : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039903]   .spread0                       : -10659394.826303
Dec 19 09:19:24 zfsdev kernel: [59860.039904]   .nr_spread_over                : 78
Dec 19 09:19:24 zfsdev kernel: [59860.039905]   .nr_running                    : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039905]   .load                          : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039906]   .load_avg                      : 5361.979904
Dec 19 09:19:24 zfsdev kernel: [59860.039908]   .load_period                   : 5.312183
Dec 19 09:19:24 zfsdev kernel: [59860.039909]   .load_contrib                  : 1009
Dec 19 09:19:24 zfsdev kernel: [59860.039910]   .load_tg                       : 1009
Dec 19 09:19:24 zfsdev kernel: [59860.039911]   .se->exec_start                : 59860039.295849
Dec 19 09:19:24 zfsdev kernel: [59860.039912]   .se->vruntime                  : 10659952.287700
Dec 19 09:19:24 zfsdev kernel: [59860.039913]   .se->sum_exec_runtime          : 118.693982
Dec 19 09:19:24 zfsdev kernel: [59860.039915]   .se->statistics.wait_start     : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039915]   .se->statistics.sleep_start    : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039915]   .se->statistics.block_start    : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039915]   .se->statistics.sleep_max      : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039915]   .se->statistics.block_max      : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039917]   .se->statistics.exec_max       : 3.255863
Dec 19 09:19:24 zfsdev kernel: [59860.039921]   .se->statistics.slice_max      : 0.394138
Dec 19 09:19:24 zfsdev kernel: [59860.039924]   .se->statistics.wait_max       : 0.965770
Dec 19 09:19:24 zfsdev kernel: [59860.039928]   .se->statistics.wait_sum       : 4.310078
Dec 19 09:19:24 zfsdev kernel: [59860.039931]   .se->statistics.wait_count     : 499
Dec 19 09:19:24 zfsdev kernel: [59860.039935]   .se->load.weight               : 2
Dec 19 09:19:24 zfsdev kernel: [59860.039939] 
Dec 19 09:19:24 zfsdev kernel: [59860.039939] cfs_rq[0]:/
Dec 19 09:19:24 zfsdev kernel: [59860.039942]   .exec_clock                    : 254361.967691
Dec 19 09:19:24 zfsdev kernel: [59860.039956]   .MIN_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.039957]   .min_vruntime                  : 10659957.946707
Dec 19 09:19:24 zfsdev kernel: [59860.039958]   .max_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.039959]   .spread                        : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039960]   .spread0                       : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039961]   .nr_spread_over                : 394
Dec 19 09:19:24 zfsdev kernel: [59860.039962]   .nr_running                    : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039963]   .load                          : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039964]   .load_avg                      : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039965]   .load_period                   : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039966]   .load_contrib                  : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039967]   .load_tg                       : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039971] 
Dec 19 09:19:24 zfsdev kernel: [59860.039971] rt_rq[0]:/
Dec 19 09:19:24 zfsdev kernel: [59860.039972]   .rt_nr_running                 : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039973]   .rt_throttled                  : 0
Dec 19 09:19:24 zfsdev kernel: [59860.039974]   .rt_time                       : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039975]   .rt_runtime                    : 950.000000
Dec 19 09:19:24 zfsdev kernel: [59860.039976] 
Dec 19 09:19:24 zfsdev kernel: [59860.039976] runnable tasks:
Dec 19 09:19:24 zfsdev kernel: [59860.039976]             task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
Dec 19 09:19:24 zfsdev kernel: [59860.039976] ----------------------------------------------------------------------------------------------------------
Dec 19 09:19:24 zfsdev kernel: [59860.039996] 
Dec 19 09:19:24 zfsdev kernel: [59860.039996] cpu#1, 2530.763 MHz
Dec 19 09:19:24 zfsdev kernel: [59860.039997]   .nr_running                    : 1
Dec 19 09:19:24 zfsdev kernel: [59860.039998]   .load                          : 1024
Dec 19 09:19:24 zfsdev kernel: [59860.039999]   .nr_switches                   : 1072888
Dec 19 09:19:24 zfsdev kernel: [59860.040000]   .nr_load_updates               : 273056
Dec 19 09:19:24 zfsdev kernel: [59860.040001]   .nr_uninterruptible            : 917
Dec 19 09:19:24 zfsdev kernel: [59860.040002]   .next_balance                  : 4309.857281
Dec 19 09:19:24 zfsdev kernel: [59860.040003]   .curr->pid                     : 17267
Dec 19 09:19:24 zfsdev kernel: [59860.040004]   .clock                         : 59860039.076855
Dec 19 09:19:24 zfsdev kernel: [59860.040005]   .cpu_load[0]                   : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040006]   .cpu_load[1]                   : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040006]   .cpu_load[2]                   : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040007]   .cpu_load[3]                   : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040008]   .cpu_load[4]                   : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040009]   .yld_count                     : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040010]   .sched_count                   : 1073405
Dec 19 09:19:24 zfsdev kernel: [59860.040011]   .sched_goidle                  : 482307
Dec 19 09:19:24 zfsdev kernel: [59860.040012]   .avg_idle                      : 1000000
Dec 19 09:19:24 zfsdev kernel: [59860.040013]   .ttwu_count                    : 524718
Dec 19 09:19:24 zfsdev kernel: [59860.040013]   .ttwu_local                    : 416408
Dec 19 09:19:24 zfsdev kernel: [59860.040015] 
Dec 19 09:19:24 zfsdev kernel: [59860.040015] cfs_rq[1]:/autogroup-110
Dec 19 09:19:24 zfsdev kernel: [59860.040016]   .exec_clock                    : 208.161704
Dec 19 09:19:24 zfsdev kernel: [59860.040017]   .MIN_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.040018]   .min_vruntime                  : 549.222450
Dec 19 09:19:24 zfsdev kernel: [59860.040019]   .max_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.040020]   .spread                        : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040021]   .spread0                       : -10659408.724257
Dec 19 09:19:24 zfsdev kernel: [59860.040022]   .nr_spread_over                : 77
Dec 19 09:19:24 zfsdev kernel: [59860.040023]   .nr_running                    : 1
Dec 19 09:19:24 zfsdev kernel: [59860.040024]   .load                          : 1024
Dec 19 09:19:24 zfsdev kernel: [59860.040025]   .load_avg                      : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040026]   .load_period                   : 9.999999
Dec 19 09:19:24 zfsdev kernel: [59860.040027]   .load_contrib                  : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040028]   .load_tg                       : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040029]   .se->exec_start                : 59860039.076855
Dec 19 09:19:24 zfsdev kernel: [59860.040030]   .se->vruntime                  : 59997.076666
Dec 19 09:19:24 zfsdev kernel: [59860.040031]   .se->sum_exec_runtime          : 208.161704
Dec 19 09:19:24 zfsdev kernel: [59860.040032]   .se->statistics.wait_start     : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040033]   .se->statistics.sleep_start    : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040034]   .se->statistics.block_start    : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040035]   .se->statistics.sleep_max      : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040036]   .se->statistics.block_max      : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040037]   .se->statistics.exec_max       : 7.009389
Dec 19 09:19:24 zfsdev kernel: [59860.040038]   .se->statistics.slice_max      : 0.072015
Dec 19 09:19:24 zfsdev kernel: [59860.040039]   .se->statistics.wait_max       : 0.888851
Dec 19 09:19:24 zfsdev kernel: [59860.040040]   .se->statistics.wait_sum       : 4.063314
Dec 19 09:19:24 zfsdev kernel: [59860.040041]   .se->statistics.wait_count     : 350
Dec 19 09:19:24 zfsdev kernel: [59860.040042]   .se->load.weight               : 1024
Dec 19 09:19:24 zfsdev kernel: [59860.040043] 
Dec 19 09:19:24 zfsdev kernel: [59860.040043] cfs_rq[1]:/autogroup-82
Dec 19 09:19:24 zfsdev kernel: [59860.040044]   .exec_clock                    : 7316.458887
Dec 19 09:19:24 zfsdev kernel: [59860.040045]   .MIN_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.040046]   .min_vruntime                  : 4829.808853
Dec 19 09:19:24 zfsdev kernel: [59860.040047]   .max_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.040048]   .spread                        : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040049]   .spread0                       : -10655128.137854
Dec 19 09:19:24 zfsdev kernel: [59860.040050]   .nr_spread_over                : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040051]   .nr_running                    : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040052]   .load                          : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040053]   .load_avg                      : 2559.999744
Dec 19 09:19:24 zfsdev kernel: [59860.040054]   .load_period                   : 9.855210
Dec 19 09:19:24 zfsdev kernel: [59860.040055]   .load_contrib                  : 259
Dec 19 09:19:24 zfsdev kernel: [59860.040056]   .load_tg                       : 259
Dec 19 09:19:24 zfsdev kernel: [59860.040057]   .se->exec_start                : 59859924.150220
Dec 19 09:19:24 zfsdev kernel: [59860.040058]   .se->vruntime                  : 59991.128121
Dec 19 09:19:24 zfsdev kernel: [59860.040059]   .se->sum_exec_runtime          : 7317.560410
Dec 19 09:19:24 zfsdev kernel: [59860.040060]   .se->statistics.wait_start     : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040061]   .se->statistics.sleep_start    : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040062]   .se->statistics.block_start    : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040063]   .se->statistics.sleep_max      : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040064]   .se->statistics.block_max      : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040065]   .se->statistics.exec_max       : 4.766456
Dec 19 09:19:24 zfsdev kernel: [59860.040066]   .se->statistics.slice_max      : 0.471419
Dec 19 09:19:24 zfsdev kernel: [59860.040067]   .se->statistics.wait_max       : 0.870893
Dec 19 09:19:24 zfsdev kernel: [59860.040068]   .se->statistics.wait_sum       : 72.278487
Dec 19 09:19:24 zfsdev kernel: [59860.040069]   .se->statistics.wait_count     : 107622
Dec 19 09:19:24 zfsdev kernel: [59860.040070]   .se->load.weight               : 2
Dec 19 09:19:24 zfsdev kernel: [59860.040071] 
Dec 19 09:19:24 zfsdev kernel: [59860.040071] cfs_rq[1]:/
Dec 19 09:19:24 zfsdev kernel: [59860.040072]   .exec_clock                    : 70105.015872
Dec 19 09:19:24 zfsdev kernel: [59860.040073]   .MIN_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.040074]   .min_vruntime                  : 59997.076666
Dec 19 09:19:24 zfsdev kernel: [59860.040075]   .max_vruntime                  : 0.000001
Dec 19 09:19:24 zfsdev kernel: [59860.040076]   .spread                        : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040077]   .spread0                       : -10599960.870041
Dec 19 09:19:24 zfsdev kernel: [59860.040078]   .nr_spread_over                : 431
Dec 19 09:19:24 zfsdev kernel: [59860.040079]   .nr_running                    : 1
Dec 19 09:19:24 zfsdev kernel: [59860.040080]   .load                          : 1024
Dec 19 09:19:24 zfsdev kernel: [59860.040081]   .load_avg                      : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040082]   .load_period                   : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040083]   .load_contrib                  : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040084]   .load_tg                       : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040085] 
Dec 19 09:19:24 zfsdev kernel: [59860.040085] rt_rq[1]:/
Dec 19 09:19:24 zfsdev kernel: [59860.040086]   .rt_nr_running                 : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040087]   .rt_throttled                  : 0
Dec 19 09:19:24 zfsdev kernel: [59860.040088]   .rt_time                       : 0.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040089]   .rt_runtime                    : 950.000000
Dec 19 09:19:24 zfsdev kernel: [59860.040090] 
Dec 19 09:19:24 zfsdev kernel: [59860.040090] runnable tasks:
Dec 19 09:19:24 zfsdev kernel: [59860.040090]             task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
Dec 19 09:19:24 zfsdev kernel: [59860.040090] ----------------------------------------------------------------------------------------------------------
Dec 19 09:19:24 zfsdev kernel: [59860.040102] R           bash 17267       549.222450       176   120       549.222450       114.070919     46626.148675 /autogroup-110
Dec 19 09:19:24 zfsdev kernel: [59860.040106] 
Dec 19 09:19:44 zfsdev kernel: [59880.336176] INFO: task zpool:16959 blocked for more than 120 seconds.
Dec 19 09:19:44 zfsdev kernel: [59880.336250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 19 09:19:44 zfsdev kernel: [59880.336350] zpool           D ffff8800603139c0     0 16959   1328 0x00000000
Dec 19 09:19:44 zfsdev kernel: [59880.336354]  ffff88001aa7f968 0000000000000086 ffff88004f3b1700 ffff88001aa7ffd8
Dec 19 09:19:44 zfsdev kernel: [59880.336357]  ffff88001aa7ffd8 ffff88001aa7ffd8 ffff88005d262e00 ffff88004f3b1700
Dec 19 09:19:44 zfsdev kernel: [59880.336359]  0000000000000000 ffff88005c4e63d8 ffff88004f3b1700 ffff88005c4e63dc
Dec 19 09:19:44 zfsdev kernel: [59880.336361] Call Trace:
Dec 19 09:19:44 zfsdev kernel: [59880.336370]  [<ffffffff81682189>] schedule+0x29/0x70
Dec 19 09:19:44 zfsdev kernel: [59880.336372]  [<ffffffff8168244e>] schedule_preempt_disabled+0xe/0x10
Dec 19 09:19:44 zfsdev kernel: [59880.336375]  [<ffffffff81680f67>] __mutex_lock_slowpath+0xd7/0x150
Dec 19 09:19:44 zfsdev kernel: [59880.336378]  [<ffffffff81680a3a>] mutex_lock+0x2a/0x50
Dec 19 09:19:44 zfsdev kernel: [59880.336381]  [<ffffffff811b9edb>] __blkdev_get+0x6b/0x4b0
Dec 19 09:19:44 zfsdev kernel: [59880.336383]  [<ffffffff811ba530>] blkdev_get+0x210/0x300
Dec 19 09:19:44 zfsdev kernel: [59880.336386]  [<ffffffff811a0444>] ? mntput+0x24/0x40
Dec 19 09:19:44 zfsdev kernel: [59880.336390]  [<ffffffff8118c662>] ? path_put+0x22/0x30
Dec 19 09:19:44 zfsdev kernel: [59880.336391]  [<ffffffff811ba75d>] blkdev_get_by_path+0x3d/0x90
Dec 19 09:19:44 zfsdev kernel: [59880.336421]  [<ffffffffa05af308>] vdev_disk_open+0xe8/0x3d0 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336441]  [<ffffffffa05ad085>] vdev_open+0xf5/0x480 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336460]  [<ffffffffa05add4f>] vdev_open_children+0x4f/0x110 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336480]  [<ffffffffa05b6361>] vdev_root_open+0x51/0xf0 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336499]  [<ffffffffa05ad085>] vdev_open+0xf5/0x480 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336519]  [<ffffffffa05ad437>] vdev_create+0x27/0x90 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336538]  [<ffffffffa059b7c2>] spa_create+0x1e2/0x7d0 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336558]  [<ffffffffa05cf76c>] zfs_ioc_pool_create+0x1dc/0x260 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336578]  [<ffffffffa05cf96d>] zfsdev_ioctl+0xdd/0x1d0 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336580]  [<ffffffff81193ad9>] do_vfs_ioctl+0x99/0x590
Dec 19 09:19:44 zfsdev kernel: [59880.336582]  [<ffffffff8119eb06>] ? alloc_fd+0xc6/0x110
Dec 19 09:19:44 zfsdev kernel: [59880.336585]  [<ffffffff8116decb>] ? kmem_cache_free+0x7b/0x100
Dec 19 09:19:44 zfsdev kernel: [59880.336587]  [<ffffffff8118cac3>] ? putname+0x33/0x50
Dec 19 09:19:44 zfsdev kernel: [59880.336589]  [<ffffffff81194069>] sys_ioctl+0x99/0xa0
Dec 19 09:19:44 zfsdev kernel: [59880.336591]  [<ffffffff8168b329>] system_call_fastpath+0x16/0x1b
Dec 19 09:19:44 zfsdev kernel: [59880.336593] INFO: task blkid:16964 blocked for more than 120 seconds.
Dec 19 09:19:44 zfsdev kernel: [59880.336654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 19 09:19:44 zfsdev kernel: [59880.336753] blkid           D ffff8800602139c0     0 16964   8539 0x00000004
Dec 19 09:19:44 zfsdev kernel: [59880.336755]  ffff88004f1677d8 0000000000000082 ffff88001aa6ae00 ffff88004f167fd8
Dec 19 09:19:44 zfsdev kernel: [59880.336758]  ffff88004f167fd8 ffff88004f167fd8 ffff88004f3b5c00 ffff88001aa6ae00
Dec 19 09:19:44 zfsdev kernel: [59880.336760]  0000000000000de4 ffffffffa0626dc0 ffff88001aa6ae00 ffffffffa0626dc4
Dec 19 09:19:44 zfsdev kernel: [59880.336762] Call Trace:
Dec 19 09:19:44 zfsdev kernel: [59880.336765]  [<ffffffff81682189>] schedule+0x29/0x70
Dec 19 09:19:44 zfsdev kernel: [59880.336767]  [<ffffffff8168244e>] schedule_preempt_disabled+0xe/0x10
Dec 19 09:19:44 zfsdev kernel: [59880.336769]  [<ffffffff81680f67>] __mutex_lock_slowpath+0xd7/0x150
Dec 19 09:19:44 zfsdev kernel: [59880.336772]  [<ffffffff81680a3a>] mutex_lock+0x2a/0x50
Dec 19 09:19:44 zfsdev kernel: [59880.336792]  [<ffffffffa059daa1>] spa_open_common+0x61/0x3d0 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336808]  [<ffffffffa05802b7>] ? getcomponent.part.5+0x177/0x190 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336827]  [<ffffffffa059de23>] spa_open+0x13/0x20 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336844]  [<ffffffffa0582203>] dsl_dir_open_spa+0x533/0x610 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336860]  [<ffffffffa0576f20>] dsl_dataset_hold+0x40/0x2d0 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336876]  [<ffffffffa057b7f1>] dsl_dataset_own+0x31/0x80 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336889]  [<ffffffffa0563667>] dmu_objset_own+0x37/0xe0 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336892]  [<ffffffff8130f3f0>] ? disk_map_sector_rcu+0x80/0x80
Dec 19 09:19:44 zfsdev kernel: [59880.336911]  [<ffffffffa05f4819>] zvol_open+0xa9/0x2d0 [zfs]
Dec 19 09:19:44 zfsdev kernel: [59880.336913]  [<ffffffff8130fd0b>] ? get_gendisk+0x3b/0x110
Dec 19 09:19:44 zfsdev kernel: [59880.336916]  [<ffffffff811b9f40>] __blkdev_get+0xd0/0x4b0
Dec 19 09:19:44 zfsdev kernel: [59880.336917]  [<ffffffff811ba4cd>] blkdev_get+0x1ad/0x300
Dec 19 09:19:44 zfsdev kernel: [59880.336920]  [<ffffffff816830ae>] ? _raw_spin_lock+0xe/0x20
Dec 19 09:19:44 zfsdev kernel: [59880.336922]  [<ffffffff811ba67f>] blkdev_open+0x5f/0x90
Dec 19 09:19:44 zfsdev kernel: [59880.336925]  [<ffffffff811802df>] __dentry_open+0x21f/0x330
Dec 19 09:19:44 zfsdev kernel: [59880.336927]  [<ffffffff811ba620>] ? blkdev_get+0x300/0x300
Dec 19 09:19:44 zfsdev kernel: [59880.336929]  [<ffffffff8118042a>] vfs_open+0x3a/0x40
Dec 19 09:19:44 zfsdev kernel: [59880.336930]  [<ffffffff811813d8>] nameidata_to_filp+0x58/0xb0
Dec 19 09:19:44 zfsdev kernel: [59880.336932]  [<ffffffff8118fd0f>] do_last+0x49f/0xa10
Dec 19 09:19:44 zfsdev kernel: [59880.336935]  [<ffffffff812ec8bc>] ? apparmor_file_alloc_security+0x2c/0x60
Dec 19 09:19:44 zfsdev kernel: [59880.336937]  [<ffffffff81191569>] path_openat+0xd9/0x430
Dec 19 09:19:44 zfsdev kernel: [59880.336939]  [<ffffffff811919e1>] do_filp_open+0x41/0xa0
Dec 19 09:19:44 zfsdev kernel: [59880.336941]  [<ffffffff8119eb06>] ? alloc_fd+0xc6/0x110
Dec 19 09:19:44 zfsdev kernel: [59880.336943]  [<ffffffff81181525>] do_sys_open+0xf5/0x230
Dec 19 09:19:44 zfsdev kernel: [59880.336944]  [<ffffffff81181681>] sys_open+0x21/0x30
Dec 19 09:19:44 zfsdev kernel: [59880.336947]  [<ffffffff8168b329>] system_call_fastpath+0x16/0x1b

@ryao (Contributor, Author) commented Dec 19, 2012

@lundman Do you have any idea what is invoking blkid on your system? It appears to suffer from the lock inversion issue that @behlendorf highlighted.

@lundman (Contributor) commented Dec 19, 2012

Yes, as we discussed on IRC, udev is issuing

root      6014  3900  0 09:41 ?        00:00:00 /sbin/blkid -o udev -p /dev/zd0

and to confirm that, I sent SIGSTOP to udev, created my pool, then sent SIGCONT, with no issues.

@ryao (Contributor, Author) commented Dec 19, 2012

Ignore my comment about the pull request. There is a separate race here, and @lundman's earlier patch resolves it in the common case (where you are not using something in between like dm-crypt).

@ryao (Contributor, Author) commented Dec 19, 2012

Here is the output of udevadm:

# udevadm monitor
monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent

KERNEL[7299.281584] change   /devices/virtual/block/zd32 (block)
UDEV  [7299.302890] change   /devices/virtual/block/zd32 (block)

Those two events occur when I do zpool create -m none test /dev/zvol/rpool/test. @lundman is seeing a deadlock intermittently on a mechanical disk, while I am not seeing any deadlock on an SSD. It seems there is a window, between the kernel issuing this event and ZFS actually locking the device, in which udev can run blkid on the zvol. That needs to be changed.

@lundman (Contributor) commented Dec 19, 2012

If we care where the udev events come from, they occur before the ioctl() to create the pool, more specifically from;

zpool_main.c;
    nvroot = make_root_vdev(NULL, props, force, !force, B_FALSE, dryrun,
                                        argc - 1, argv + 1);

and make_root_vdev() triggers two events on my system, from the first call and the last;

    if ((newroot = construct_spec(props, argc, argv)) == NULL)
        return (NULL);

and

    /*
     * Run through the vdev specification and label any whole disks found.
     */
    if (!dryrun && make_disks(zhp, newroot) != 0) {
        nvlist_free(newroot);
        return (NULL);
    }

I can easily put a "sleep(1)" here to solve my issues, but that is obviously not the real answer.

@behlendorf (Contributor)

@lundman, @ryao Yeah, this is a tricky one. The usual thing to do in a case like this is to ensure the locks are always taken in the same order. Unfortunately the bdev->bd_mutex is taken unconditionally by the kernel during open(2), so that's outside our control. And the open really does need to take the spa_namespace_lock to safely look up the spa. As for the create, it broadly holds the spa_namespace_lock in order to safely look up and then add the spa.

This lock inversion deadlock extends beyond simply creating a zpool on a zvol. It's entirely possible (although very, very unlikely) that it could occur for any zpool ioctl() operation which is forced to reopen a vdev while holding the spa_namespace_lock and racing with an unrelated open of the zvol. Although based on how often that happens in the wild, it's probably not something to lose any sleep over.

@behlendorf (Contributor)

@lundman @ryao Try behlendorf/zfs@45eed13; it should fix the issue by detecting the lock inversion if it occurs, unwinding the locks, and then reacquiring them safely.

@lundman (Contributor) commented Dec 20, 2012

I have tested this patch, and I can no longer deadlock. Creating the pool, rebooting, and importing all work as expected. Thanks.

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock.  But Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.

To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock().  Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.

When it is contended we risk a lock inversion if we were to
block waiting for the lock.  Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again.  This process can be repeated safely until both
locks are acquired.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Jorgen Lundman <[email protected]>
Closes openzfs#612
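
A userspace sketch of the retry pattern the commit message describes might look like the following. It is not the actual patch: the pthread mutexes merely stand in for spa_namespace_lock and bdev->bd_mutex, and the retry loop stands in for the -ERESTARTSYS return that makes __blkdev_get() drop bd_mutex and call fops->open() again:

/*
 * Userspace sketch of the trylock-and-retry pattern described in the
 * commit message above.  Not the actual kernel patch.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t spa_namespace_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t bd_mutex = PTHREAD_MUTEX_INITIALIZER;

/* The pool-create side takes the locks in the usual order. */
static void *
pool_create(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&spa_namespace_lock);
    sleep(1);                          /* pretend to do vdev open work */
    pthread_mutex_lock(&bd_mutex);
    puts("pool create: opened the zvol vdev with both locks held");
    pthread_mutex_unlock(&bd_mutex);
    pthread_mutex_unlock(&spa_namespace_lock);
    return (NULL);
}

/* The open side holds bd_mutex and may only *try* the namespace lock;
 * on contention it releases bd_mutex and retries instead of blocking. */
static void *
zvol_open_path(void *arg)
{
    (void) arg;
    for (;;) {
        pthread_mutex_lock(&bd_mutex);            /* __blkdev_get() */
        if (pthread_mutex_trylock(&spa_namespace_lock) == 0)
            break;                                /* both locks held safely */
        pthread_mutex_unlock(&bd_mutex);          /* the -ERESTARTSYS path */
        puts("zvol open: namespace lock busy, retrying");
        usleep(100 * 1000);
    }
    puts("zvol open: acquired both locks without inverting the order");
    pthread_mutex_unlock(&spa_namespace_lock);
    pthread_mutex_unlock(&bd_mutex);
    return (NULL);
}

int
main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, pool_create, NULL);
    pthread_create(&b, NULL, zvol_open_path, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return (0);
}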