
NULL dereference in __blkdev_get() #949

Closed
cwedgwood opened this issue Sep 9, 2012 · 21 comments

@cwedgwood
Contributor

# zpool export tank0

wait many seconds

# zpool import
Killed

dmesg:

[  236.605298] BUG: unable to handle kernel NULL pointer dereference at 00000000000002b6
[  236.613715] IP: [<ffffffff8112e308>] __blkdev_get+0x75/0x3da
[  236.619795] PGD 624225067 PUD 625d32067 PMD 0
[  236.624771] Oops: 0000 [#1] SMP
[  236.628347] CPU 2
[  236.630339] Modules linked in: loop sg zfs(O) zunicode(O) zavl(O) zcommon(O) znvpair(O) spl(O) uas coretemp kvm_intel kvm i2c_i801 i2c_core ghash_clmulni_intel mpt2sas lpc_ich mfd_core i7core_edac microcod$
[  236.652669]
[  236.652923] Pid: 4949, comm: zpool Tainted: G           O 3.5.3-cw4-00001-g5138c87 #8 Supermicro X8DTL/X8DTL
[  236.663547] RIP: 0010:[<ffffffff8112e308>]  [<ffffffff8112e308>] __blkdev_get+0x75/0x3da
[  236.672207] RSP: 0018:ffff88032651bb98  EFLAGS: 00010282
[  236.677909] RAX: ffffffffffffff86 RBX: ffff88032ec29800 RCX: 000000000e6fffff
[  236.685502] RDX: 0000000000000000 RSI: 00000000000280da RDI: ffff88032651bbc8
[  236.693116] RBP: ffff88032651bbf8 R08: ffff880631c1d100 R09: 00000031703064ab
[  236.700652] R10: 00000031703064ab R11: 00000031703064ab R12: ffffffffffffff86
[  236.708250] R13: 000000000000001d R14: 0000000000000000 R15: 0000000000000000
[  236.715851] FS:  00007f170a166b40(0000) GS:ffff880333c40000(0000) knlGS:0000000000000000
[  236.724475] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  236.730531] CR2: 00000000000002b6 CR3: 0000000623027000 CR4: 00000000000007e0
[  236.738073] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  236.745713] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  236.753328] Process zpool (pid: 4949, threadinfo ffff88032651a000, task ffff88032de043e0)
[  236.762050] Stack:
[  236.764207]  ffff880333811000 ffff88032651bc54 ffffc90002ee1e40 ffff880333811000
[  236.772318]  ffff88032651bbc8 00000001814ec8ec ffff88032651bbe8 ffff88032ec29800
[  236.780450]  0000000000000000 ffff880625d25f00 ffffffff8112e93b ffff88033368a540
[  236.788568] Call Trace:
[  236.791189]  [<ffffffff8112e93b>] ? blkdev_get+0x2ce/0x2ce
[  236.797082]  [<ffffffff8112e7f6>] blkdev_get+0x189/0x2ce
[  236.802798]  [<ffffffff81115ec0>] ? spin_unlock+0x9/0xb
[  236.808416]  [<ffffffff81116227>] ? unlock_new_inode+0x59/0x5e
[  236.814595]  [<ffffffff8112d901>] ? bdget+0x11d/0x12c
[  236.820001]  [<ffffffff8112e93b>] ? blkdev_get+0x2ce/0x2ce
[  236.825922]  [<ffffffff8112e99f>] blkdev_open+0x64/0x70
[  236.831487]  [<ffffffff811024e8>] do_dentry_open.isra.17+0x16e/0x21d
[  236.838210]  [<ffffffff81103049>] nameidata_to_filp+0x42/0x84
[  236.844362]  [<ffffffff8110eed4>] do_last.isra.47+0x625/0x64b
[  236.850472]  [<ffffffff8110eff2>] path_openat+0xc5/0x2f4
[  236.856174]  [<ffffffff8110f328>] do_filp_open+0x38/0x86
[  236.861865]  [<ffffffff8110c317>] ? getname_flags+0x2a/0xa2
[  236.867779]  [<ffffffff81118e6c>] ? alloc_fd+0xe5/0xf7
[  236.873306]  [<ffffffff811030f9>] do_sys_open+0x6e/0x102
[  236.879009]  [<ffffffff8110f2ad>] ? user_path_at_empty+0x6c/0x9c
[  236.885379]  [<ffffffff811031c4>] sys_openat+0x14/0x16
[  236.890855]  [<ffffffff814f31e9>] system_call_fastpath+0x16/0x1b
[  236.897195] Code: 20 75 11 e8 82 31 0f 00 85 c0 41 89 c6 74 05 e9 62 03 00 00 8b 3b 48 8d 75 cc e8 73 53 11 00 48 85 c0 49 89 c4 0f 84 45 03 00 00 <48> 8b 80 30 03 00 00 4c 89 e7 48 8b 40 58 48 89 45 b0 e8$
[  236.921243] RIP  [<ffffffff8112e308>] __blkdev_get+0x75/0x3da
[  236.927410]  RSP <ffff88032651bb98>
[  236.931158] CR2: 00000000000002b6
[  236.934824] ---[ end trace 6331f0ace31d011f ]---

I thought perhaps there was a race/quirk with zvols, but this happens even in cases where there are none.

@behlendorf
Contributor

I've seen this reported once or twice as well. I also thought it was probably associated with zvols but you seem to have ruled that out.

The really interesting thing here is that this is just a user-space sys_open() on a normal block device. Nothing in the ZFS kernel stack is involved. I suppose it's possible we're racing with a concurrent operation on the bdev and things aren't locked properly, but most of that is handled by the kernel.

If you still have your vmlinux kernel image around, can you resolve the offending line where the NULL deref occurred? That would help narrow down exactly what's getting damaged.

gdb /boot/vmlinux
list *(__blkdev_get+0x75)

@cwedgwood
Contributor Author

I rebuilt w/ CONFIG_DEBUG_INFO=y but I don't think that would have caused any insns to move about.

Assuming gcc didn't shuffle things about much:

1131     restart:
1132
1133            ret = -ENXIO;
1134            disk = get_gendisk(bdev->bd_dev, &partno);
1135            if (!disk)
1136                    goto out;
1137            owner = disk->fops->owner;
1138
1139            disk_block_events(disk);
1140            mutex_lock_nested(&bdev->bd_mutex, for_part);

Line 1137

@cwedgwood
Contributor Author

No, it's 1137, __blkdev_get + 0x75 == 0xffffffff8112e308

        owner = disk->fops->owner;

The insns before are from: if (!disk) goto out;

0xffffffff8112e2f7 <+100>:   callq  0xffffffff8124366f <get_gendisk>
0xffffffff8112e2fc <+105>:   test   %rax,%rax
0xffffffff8112e2ff <+108>:   mov    %rax,%r12
0xffffffff8112e302 <+111>:   je     0xffffffff8112e64d <__blkdev_get+954>

blows up:

0xffffffff8112e308 <+117>:   mov    0x330(%rax),%rax
0xffffffff8112e30f <+124>:   mov    %r12,%rdi
0xffffffff8112e312 <+127>:   mov    0x58(%rax),%rax
0xffffffff8112e316 <+131>:   mov    %rax,-0x50(%rbp)

disk is not null
the faulting insn is the load of disk->fops itself, so disk is a bad pointer

@behlendorf
Contributor

Based on the previous code which must have executed to get to this point, and the value of the NULL deref, I'm inclined to think bdev is actually a valid address... ahh, you just beat me to it. :)

@behlendorf
Contributor

Duplicate issues are #931, #789, #743, #730. These are all NULL dereferences in __blkdev_get() with likely the same root cause. Fixing this one would let us close out five issues, which would be nice!

@cwedgwood
Contributor Author

@behlendorf wrt the ERR_PTR theory:

&((struct gendisk *) 0ull)->fops = 0x330

the faulting address is 0x2b6, and 0x330 - 0x2b6 == 122, so an ERR_PTR of -122 would fit (RAX above is ffffffffffffff86, i.e. -122)

include/asm-generic/errno.h:#define EDQUOT 122 /* Quota exceeded */

but i don't see where a quota error would come from here...

@behlendorf
Contributor

@cwedgwood Then the thing to do is probably to add an IS_ERR() check so it doesn't NULL deref, and then reproduce the issue under SystemTap to get a full trace.


diff --git a/fs/block_dev.c b/fs/block_dev.c
index c2bbe1f..fe4fe0c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1132,7 +1132,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)

        ret = -ENXIO;
        disk = get_gendisk(bdev->bd_dev, &partno);
-       if (!disk)
+       if (IS_ERR(disk))
                goto out;
        owner = disk->fops->owner;

@behlendorf
Contributor

get_gendisk() -> kobj_lookup() -> probe() looks like the most likely candidate. If an ERR_PTR is returned instead of simply NULL on failure, we could see exactly this.

@cwedgwood
Contributor Author

so, i put debugging all over the place ... and i can't trigger it anymore

i'll leave it in for now on the hope it comes back

@behlendorf
Contributor

What environment are you running this in? We should at least inspect the code for the probe() function, however that depends on which block driver you're using.

@cwedgwood
Contributor Author

@behlendorf dual westmere xeon w/ 5 sata disks connected via mpt2sas, ssd for zil, another ssd for l2arc

@behlendorf
Contributor

It appears this would be entirely possible for a ramdisk block device, see the brd_probe() -ENOMEM error case. But it's not clear to me exactly how this would occur with the generic exact_match probe handler for add_disk(), which is used by sd.c.

@cwedgwood
Contributor Author

after recreating the env with copious debugging this race has become hard to hit again

if someone has a way to trigger this frequently i would like to know details

@cwedgwood
Contributor Author

@behlendorf doing zfs destroy ... to remove a slew of zvols i managed to hit this or something similar:

context:

diff --git a/drivers/base/map.c b/drivers/base/map.c
index e87017f..675c972 100644
--- a/drivers/base/map.c
+++ b/drivers/base/map.c
@@ -123,6 +123,10 @@ retry:
        }
        mutex_unlock(domain->lock);
        kobj = probe(dev, index, data);
+       if (IS_ERR(kobj)) {
+               printk(KERN_WARNING "probe returned error %ld\n", PTR_ERR(kobj));
+               WARN_ON(1);
+       }
        /* Currently ->owner protects _only_ ->probe() itself. */
        module_put(owner);
        if (kobj)

and i see:

[  964.300016] probe returned error -2
[  964.303715] ------------[ cut here ]------------
[  964.308570] WARNING: at /home/cw/wk/linux/linux/drivers/base/map.c:128 kobj_lookup+0x11f/0x161()
[  964.317909] Hardware name: X8DTL
[  964.321439] Modules linked in: loop target_core_mod configfs sg zfs(O) coretemp kvm_intel zunicode(O) zavl(O) zcommon(O) znvpair(O) kvm spl(O) i2c_i801 i2c_core i7core_edac ghash_clmulni_intel microcode lpc_ich mfd_core mpt2sas
[  964.343379] Pid: 6292, comm: blkid Tainted: G           O 3.5.4.cw1 #18
[  964.350444] Call Trace:
[  964.353024]  [<ffffffff8103ab29>] warn_slowpath_common+0x83/0x9c
[  964.359433]  [<ffffffffa0b02c90>] ? zvol_find_by_name+0x42/0x42 [zfs]
[  964.366354]  [<ffffffff8103ab5c>] warn_slowpath_null+0x1a/0x1c
[  964.372583]  [<ffffffff812e9f85>] kobj_lookup+0x11f/0x161
[  964.378356]  [<ffffffff81243781>] get_gendisk+0x32/0xaf
[  964.384011]  [<ffffffff8112e358>] __blkdev_get+0x69/0x40f
[  964.389719]  [<ffffffff8110c7ae>] ? lookup_fast+0x128/0x23d
[  964.395746]  [<ffffffff8112e9cc>] ? blkdev_get+0x2ce/0x2ce
[  964.401684]  [<ffffffff8112e887>] blkdev_get+0x189/0x2ce
[  964.407310]  [<ffffffff8111507e>] ? __d_lookup_rcu+0xa2/0xc9
[  964.413443]  [<ffffffff8112e9cc>] ? blkdev_get+0x2ce/0x2ce
[  964.419277]  [<ffffffff8112ea30>] blkdev_open+0x64/0x70
[  964.424971]  [<ffffffff81102520>] do_dentry_open.isra.17+0x16e/0x21d
[  964.431845]  [<ffffffff81103081>] nameidata_to_filp+0x42/0x84
[  964.437926]  [<ffffffff8110ef08>] do_last.isra.47+0x625/0x64b
[  964.444097]  [<ffffffff8110f026>] path_openat+0xc5/0x2f4
[  964.449677]  [<ffffffff8110f35c>] do_filp_open+0x38/0x86
[  964.455396]  [<ffffffff8110c34b>] ? getname_flags+0x2a/0xa2
[  964.461401]  [<ffffffff81118e9c>] ? alloc_fd+0xe5/0xf7
[  964.466871]  [<ffffffff81103131>] do_sys_open+0x6e/0x102
[  964.472675]  [<ffffffff8110f2e1>] ? user_path_at_empty+0x6c/0x9c
[  964.479046]  [<ffffffff811031e6>] sys_open+0x21/0x23
[  964.484403]  [<ffffffff814f3269>] system_call_fastpath+0x16/0x1b
[  964.490860] ---[ end trace bd3516012c40deee ]---

then:

[  964.495704] BUG: unable to handle kernel NULL pointer dereference at 00000000000002b6
[  964.504083] IP: [<ffffffff8112e364>] __blkdev_get+0x75/0x40f
[  964.510089] PGD 32df8b067 PUD 32df8a067 PMD 0 
[  964.514851] Oops: 0000 [#1] SMP 
[  964.518337] CPU 0 
[  964.520302] Modules linked in: loop target_core_mod configfs sg zfs(O) coretemp kvm_intel zunicode(O) zavl(O) zcommon(O) znvpair(O) kvm spl(O) i2c_i801 i2c_core i7core_edac ghash_clmulni_intel microcode lpc_ich mfd_core mpt2sas
[  964.543682] 
[  964.543900] Pid: 6292, comm: blkid Tainted: G        W  O 3.5.4.cw1 #18 Supermicro X8DTL/X8DTL
[  964.553064] RIP: 0010:[<ffffffff8112e364>]  [<ffffffff8112e364>] __blkdev_get+0x75/0x40f
[  964.561659] RSP: 0018:ffff88062614bb98  EFLAGS: 00010282
[  964.567281] RAX: ffffffffffffff86 RBX: ffff88062b86ad00 RCX: 0000000000000000
[  964.574758] RDX: 0000000000000081 RSI: ffff88062614bbd4 RDI: ffff88032dac2001
[  964.582282] RBP: ffff88062614bbf8 R08: 0000000000000000 R09: ffffffff81548d40
[  964.589795] R10: ffffffff81548d40 R11: 000000000000059a R12: ffffffffffffff86
[  964.597294] R13: 000000000000001d R14: 0000000000000000 R15: 0000000000000000
[  964.604832] FS:  00007fc496228740(0000) GS:ffff880333c00000(0000) knlGS:0000000000000000
[  964.613353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  964.619422] CR2: 00000000000002b6 CR3: 000000032df88000 CR4: 00000000000007f0
[  964.626966] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  964.634538] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  964.642129] Process blkid (pid: 6292, threadinfo ffff88062614a000, task ffff88062879ad40)
:

I had debugging for the later as well but clearly not enough.

As mentioned above, 00000000000002b6 is smaller than the struct size, so I think what we're seeing is an ERR_PTR plus a struct member offset.

@cwedgwood
Contributor Author

btw, -2 -> ENOENT

@cwedgwood
Contributor Author

so, in this case it is zvol related, looking at zvol.c we have:

static struct kobject *
zvol_probe(dev_t dev, int *part, void *arg)
{
    zvol_state_t *zv;
    struct kobject *kobj;

    mutex_enter(&zvol_state_lock);
    zv = zvol_find_by_dev(dev);
    kobj = zv ? get_disk(zv->zv_disk) : ERR_PTR(-ENOENT);
    mutex_exit(&zvol_state_lock);

    return kobj;
}

now, why zvol_find_by_dev fails i'm not sure (in this case it might be that destroyed zvols didn't go away (i've seen that), so something accessing them tripped this up)

pretty sure we can't return ERR_PTR(...) from here though as the rest of the code doesn't like it
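A minimal fix along those lines, sketched as a diff against the zvol_probe() quoted above (file path and hunk context are approximate), is to return NULL so the lookup is treated as an ordinary miss:

```diff
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ zvol_probe
     mutex_enter(&zvol_state_lock);
     zv = zvol_find_by_dev(dev);
-    kobj = zv ? get_disk(zv->zv_disk) : ERR_PTR(-ENOENT);
+    kobj = zv ? get_disk(zv->zv_disk) : NULL;
     mutex_exit(&zvol_state_lock);
```

kobj_lookup() already handles a NULL return from probe() (and get_disk() itself returns NULL when the module reference can't be taken), so with this change get_gendisk() returns NULL and __blkdev_get() hits its !disk check and fails cleanly with -ENXIO instead of oopsing.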

@behlendorf
Contributor

Indeed, this does look like an issue in zvol_probe(). We should be returning NULL here instead of an error pointer.

@cwedgwood
Contributor Author

@behlendorf so i made this fix and added some other debugging and of course the race has gone away for now...

i'll leave it in and let you know how it goes

@cwedgwood
Contributor Author

update: i can now reproduce the race and with a small fix it does indeed prevent the export/import crash

@behlendorf
Contributor

That's excellent; can you post the patch so we can get it reviewed and merged? This is one of the issues I absolutely wanted fixed in rc12.

@cwedgwood
Contributor Author

@behlendorf let me know how that looks, i can quickly amend and re-push
