
NULL dereference in __blkdev_get() #949

Closed
cwedgwood opened this issue Sep 9, 2012 · 21 comments

@cwedgwood
Contributor

# zpool export tank0

wait many seconds

# zpool import
Killed

dmesg:

[  236.605298] BUG: unable to handle kernel NULL pointer dereference at 00000000000002b6
[  236.613715] IP: [<ffffffff8112e308>] __blkdev_get+0x75/0x3da
[  236.619795] PGD 624225067 PUD 625d32067 PMD 0
[  236.624771] Oops: 0000 [#1] SMP
[  236.628347] CPU 2
[  236.630339] Modules linked in: loop sg zfs(O) zunicode(O) zavl(O) zcommon(O) znvpair(O) spl(O) uas coretemp kvm_intel kvm i2c_i801 i2c_core ghash_clmulni_intel mpt2sas lpc_ich mfd_core i7core_edac microcod$
[  236.652669]
[  236.652923] Pid: 4949, comm: zpool Tainted: G           O 3.5.3-cw4-00001-g5138c87 #8 Supermicro X8DTL/X8DTL
[  236.663547] RIP: 0010:[<ffffffff8112e308>]  [<ffffffff8112e308>] __blkdev_get+0x75/0x3da
[  236.672207] RSP: 0018:ffff88032651bb98  EFLAGS: 00010282
[  236.677909] RAX: ffffffffffffff86 RBX: ffff88032ec29800 RCX: 000000000e6fffff
[  236.685502] RDX: 0000000000000000 RSI: 00000000000280da RDI: ffff88032651bbc8
[  236.693116] RBP: ffff88032651bbf8 R08: ffff880631c1d100 R09: 00000031703064ab
[  236.700652] R10: 00000031703064ab R11: 00000031703064ab R12: ffffffffffffff86
[  236.708250] R13: 000000000000001d R14: 0000000000000000 R15: 0000000000000000
[  236.715851] FS:  00007f170a166b40(0000) GS:ffff880333c40000(0000) knlGS:0000000000000000
[  236.724475] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  236.730531] CR2: 00000000000002b6 CR3: 0000000623027000 CR4: 00000000000007e0
[  236.738073] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  236.745713] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  236.753328] Process zpool (pid: 4949, threadinfo ffff88032651a000, task ffff88032de043e0)
[  236.762050] Stack:
[  236.764207]  ffff880333811000 ffff88032651bc54 ffffc90002ee1e40 ffff880333811000
[  236.772318]  ffff88032651bbc8 00000001814ec8ec ffff88032651bbe8 ffff88032ec29800
[  236.780450]  0000000000000000 ffff880625d25f00 ffffffff8112e93b ffff88033368a540
[  236.788568] Call Trace:
[  236.791189]  [<ffffffff8112e93b>] ? blkdev_get+0x2ce/0x2ce
[  236.797082]  [<ffffffff8112e7f6>] blkdev_get+0x189/0x2ce
[  236.802798]  [<ffffffff81115ec0>] ? spin_unlock+0x9/0xb
[  236.808416]  [<ffffffff81116227>] ? unlock_new_inode+0x59/0x5e
[  236.814595]  [<ffffffff8112d901>] ? bdget+0x11d/0x12c
[  236.820001]  [<ffffffff8112e93b>] ? blkdev_get+0x2ce/0x2ce
[  236.825922]  [<ffffffff8112e99f>] blkdev_open+0x64/0x70
[  236.831487]  [<ffffffff811024e8>] do_dentry_open.isra.17+0x16e/0x21d
[  236.838210]  [<ffffffff81103049>] nameidata_to_filp+0x42/0x84
[  236.844362]  [<ffffffff8110eed4>] do_last.isra.47+0x625/0x64b
[  236.850472]  [<ffffffff8110eff2>] path_openat+0xc5/0x2f4
[  236.856174]  [<ffffffff8110f328>] do_filp_open+0x38/0x86
[  236.861865]  [<ffffffff8110c317>] ? getname_flags+0x2a/0xa2
[  236.867779]  [<ffffffff81118e6c>] ? alloc_fd+0xe5/0xf7
[  236.873306]  [<ffffffff811030f9>] do_sys_open+0x6e/0x102
[  236.879009]  [<ffffffff8110f2ad>] ? user_path_at_empty+0x6c/0x9c
[  236.885379]  [<ffffffff811031c4>] sys_openat+0x14/0x16
[  236.890855]  [<ffffffff814f31e9>] system_call_fastpath+0x16/0x1b
[  236.897195] Code: 20 75 11 e8 82 31 0f 00 85 c0 41 89 c6 74 05 e9 62 03 00 00 8b 3b 48 8d 75 cc e8 73 53 11 00 48 85 c0 49 89 c4 0f 84 45 03 00 00 <48> 8b 80 30 03 00 00 4c 89 e7 48 8b 40 58 48 89 45 b0 e8$
[  236.921243] RIP  [<ffffffff8112e308>] __blkdev_get+0x75/0x3da
[  236.927410]  RSP <ffff88032651bb98>
[  236.931158] CR2: 00000000000002b6
[  236.934824] ---[ end trace 6331f0ace31d011f ]---

I thought perhaps there was a race/quirk with zvols, but this happens even in cases where there are none.

@behlendorf
Contributor

I've seen this reported once or twice as well. I also thought it was probably associated with zvols but you seem to have ruled that out.

The really interesting thing here is that this is just a user-space sys_open() on a normal block device. Nothing in the ZFS kernel stack is involved. I suppose it's possible we're racing with a concurrent operation on the bdev and things aren't locked properly, but most of that is handled by the kernel.

If you still have your vmlinux kernel image around, can you resolve the offending line where the NULL deref occurred? That would help narrow down exactly what's getting damaged.

gdb /boot/vmlinux
list *(__blkdev_get+0x75)

@cwedgwood
Contributor Author

I rebuilt w/ CONFIG_DEBUG_INFO=y but I don't think that would have caused any insns to move about.

Assuming gcc didn't shuffle things about much:

1131     restart:
1132
1133            ret = -ENXIO;
1134            disk = get_gendisk(bdev->bd_dev, &partno);
1135            if (!disk)
1136                    goto out;
1137            owner = disk->fops->owner;
1138
1139            disk_block_events(disk);
1140            mutex_lock_nested(&bdev->bd_mutex, for_part);

Line 1137

@cwedgwood
Contributor Author

No, it's 1137, __blkdev_get + 0x75 == 0xffffffff8112e308

        owner = disk->fops->owner;

The insns before are from: if (!disk) goto out;

0xffffffff8112e2f7 <+100>:   callq  0xffffffff8124366f <get_gendisk>
0xffffffff8112e2fc <+105>:   test   %rax,%rax
0xffffffff8112e2ff <+108>:   mov    %rax,%r12
0xffffffff8112e302 <+111>:   je     0xffffffff8112e64d <__blkdev_get+954>

blows up:

0xffffffff8112e308 <+117>:   mov    0x330(%rax),%rax
0xffffffff8112e30f <+124>:   mov    %r12,%rdi
0xffffffff8112e312 <+127>:   mov    0x58(%rax),%rax
0xffffffff8112e316 <+131>:   mov    %rax,-0x50(%rbp)

disk is not null
the faulting insn is the load of disk->fops itself, so disk is a bad pointer

@behlendorf
Contributor

Based on the previous code which must have executed to get to this point, and the value of the NULL deref, I'm inclined to think bdev is actually a valid address... ahh, you just beat me to it. :)

@behlendorf
Contributor

Duplicate issues are #931, #789, #743, #730. These are all NULL dereferences in __blkdev_get() with likely the same root cause. Fixing this one would let us close out five issues, which would be nice!

@cwedgwood
Contributor Author

@behlendorf wrt the ERR_PTR theory:

&((struct gendisk *) 0ull)->fops = 0x330

the faulting address is 0x2b6, and 0x330 - 0x2b6 == 122, so an ERR_PTR of -122 would fit (RAX above is ffffffffffffff86, i.e. -122)

include/asm-generic/errno.h:#define EDQUOT 122 /* Quota exceeded */

but i don't see where a quota error would come from here...

@behlendorf
Contributor

@cwedgwood Then the thing to do is probably to add an IS_ERR() check so it doesn't NULL deref, and then reproduce the issue under SystemTap to get a full trace.


diff --git a/fs/block_dev.c b/fs/block_dev.c
index c2bbe1f..fe4fe0c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1132,7 +1132,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)

        ret = -ENXIO;
        disk = get_gendisk(bdev->bd_dev, &partno);
-       if (!disk)
+       if (IS_ERR(disk))
                goto out;
        owner = disk->fops->owner;

@behlendorf
Contributor

get_gendisk() -> kobj_lookup() -> probe() looks like the most likely candidate. If an ERR_PTR is returned instead of simply NULL on failure, we could see exactly this.

@cwedgwood
Contributor Author

so, i put debugging all over the place ... and i can't trigger it anymore

i'll leave it in for now on the hope it comes back

@behlendorf
Contributor

What environment are you running this in? We should at least inspect the code for the probe() function, however that depends on which block driver you're using.

@cwedgwood
Contributor Author

@behlendorf dual westmere xeon w/ 5 sata disks connected via mpt2sas, ssd for zil, another ssd for l2arc

@behlendorf
Contributor

It appears this would be entirely possible for a ramdisk block device, see the brd_probe() -ENOMEM error case. But it's not clear to me exactly how this would occur with the generic exact_match probe handler for add_disk(), which is used by sd.c.

@cwedgwood
Contributor Author

after recreating the env with copious debugging this race has become hard to hit again

if someone has a way to trigger this frequently i would like to know details

@cwedgwood
Contributor Author

@behlendorf doing zfs destroy ... to remove a slew of zvols i managed to hit this or something similar:

context:

diff --git a/drivers/base/map.c b/drivers/base/map.c
index e87017f..675c972 100644
--- a/drivers/base/map.c
+++ b/drivers/base/map.c
@@ -123,6 +123,10 @@ retry:
        }
        mutex_unlock(domain->lock);
        kobj = probe(dev, index, data);
+       if (IS_ERR(kobj)) {
+               printk(KERN_WARNING "probe returned error %ld\n", PTR_ERR(kobj));
+               WARN_ON(1);
+       }
        /* Currently ->owner protects _only_ ->probe() itself. */
        module_put(owner);
        if (kobj)

and i see:

[  964.300016] probe returned error -2
[  964.303715] ------------[ cut here ]------------
[  964.308570] WARNING: at /home/cw/wk/linux/linux/drivers/base/map.c:128 kobj_lookup+0x11f/0x161()
[  964.317909] Hardware name: X8DTL
[  964.321439] Modules linked in: loop target_core_mod configfs sg zfs(O) coretemp kvm_intel zunicode(O) zavl(O) zcommon(O) znvpair(O) kvm spl(O) i2c_i801 i2c_core i7core_edac ghash_clmulni_intel microcode lpc_ich mfd_core mpt2sas
[  964.343379] Pid: 6292, comm: blkid Tainted: G           O 3.5.4.cw1 #18
[  964.350444] Call Trace:
[  964.353024]  [<ffffffff8103ab29>] warn_slowpath_common+0x83/0x9c
[  964.359433]  [<ffffffffa0b02c90>] ? zvol_find_by_name+0x42/0x42 [zfs]
[  964.366354]  [<ffffffff8103ab5c>] warn_slowpath_null+0x1a/0x1c
[  964.372583]  [<ffffffff812e9f85>] kobj_lookup+0x11f/0x161
[  964.378356]  [<ffffffff81243781>] get_gendisk+0x32/0xaf
[  964.384011]  [<ffffffff8112e358>] __blkdev_get+0x69/0x40f
[  964.389719]  [<ffffffff8110c7ae>] ? lookup_fast+0x128/0x23d
[  964.395746]  [<ffffffff8112e9cc>] ? blkdev_get+0x2ce/0x2ce
[  964.401684]  [<ffffffff8112e887>] blkdev_get+0x189/0x2ce
[  964.407310]  [<ffffffff8111507e>] ? __d_lookup_rcu+0xa2/0xc9
[  964.413443]  [<ffffffff8112e9cc>] ? blkdev_get+0x2ce/0x2ce
[  964.419277]  [<ffffffff8112ea30>] blkdev_open+0x64/0x70
[  964.424971]  [<ffffffff81102520>] do_dentry_open.isra.17+0x16e/0x21d
[  964.431845]  [<ffffffff81103081>] nameidata_to_filp+0x42/0x84
[  964.437926]  [<ffffffff8110ef08>] do_last.isra.47+0x625/0x64b
[  964.444097]  [<ffffffff8110f026>] path_openat+0xc5/0x2f4
[  964.449677]  [<ffffffff8110f35c>] do_filp_open+0x38/0x86
[  964.455396]  [<ffffffff8110c34b>] ? getname_flags+0x2a/0xa2
[  964.461401]  [<ffffffff81118e9c>] ? alloc_fd+0xe5/0xf7
[  964.466871]  [<ffffffff81103131>] do_sys_open+0x6e/0x102
[  964.472675]  [<ffffffff8110f2e1>] ? user_path_at_empty+0x6c/0x9c
[  964.479046]  [<ffffffff811031e6>] sys_open+0x21/0x23
[  964.484403]  [<ffffffff814f3269>] system_call_fastpath+0x16/0x1b
[  964.490860] ---[ end trace bd3516012c40deee ]---

then:

[  964.495704] BUG: unable to handle kernel NULL pointer dereference at 00000000000002b6
[  964.504083] IP: [<ffffffff8112e364>] __blkdev_get+0x75/0x40f
[  964.510089] PGD 32df8b067 PUD 32df8a067 PMD 0 
[  964.514851] Oops: 0000 [#1] SMP 
[  964.518337] CPU 0 
[  964.520302] Modules linked in: loop target_core_mod configfs sg zfs(O) coretemp kvm_intel zunicode(O) zavl(O) zcommon(O) znvpair(O) kvm spl(O) i2c_i801 i2c_core i7core_edac ghash_clmulni_intel microcode lpc_ich mfd_core mpt2sas
[  964.543682] 
[  964.543900] Pid: 6292, comm: blkid Tainted: G        W  O 3.5.4.cw1 #18 Supermicro X8DTL/X8DTL
[  964.553064] RIP: 0010:[<ffffffff8112e364>]  [<ffffffff8112e364>] __blkdev_get+0x75/0x40f
[  964.561659] RSP: 0018:ffff88062614bb98  EFLAGS: 00010282
[  964.567281] RAX: ffffffffffffff86 RBX: ffff88062b86ad00 RCX: 0000000000000000
[  964.574758] RDX: 0000000000000081 RSI: ffff88062614bbd4 RDI: ffff88032dac2001
[  964.582282] RBP: ffff88062614bbf8 R08: 0000000000000000 R09: ffffffff81548d40
[  964.589795] R10: ffffffff81548d40 R11: 000000000000059a R12: ffffffffffffff86
[  964.597294] R13: 000000000000001d R14: 0000000000000000 R15: 0000000000000000
[  964.604832] FS:  00007fc496228740(0000) GS:ffff880333c00000(0000) knlGS:0000000000000000
[  964.613353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  964.619422] CR2: 00000000000002b6 CR3: 000000032df88000 CR4: 00000000000007f0
[  964.626966] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  964.634538] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  964.642129] Process blkid (pid: 6292, threadinfo ffff88062614a000, task ffff88062879ad40)
:

I had debugging for the later as well but clearly not enough.

As mentioned above, 00000000000002b6 is smaller than the struct size, so I think what we're seeing is an ERR_PTR plus a struct member offset.

@cwedgwood
Contributor Author

btw, -2 -> ENOENT

@cwedgwood
Contributor Author

so, in this case it is zvol related, looking at zvol.c we have:

static struct kobject *
zvol_probe(dev_t dev, int *part, void *arg)
{
    zvol_state_t *zv;
    struct kobject *kobj;

    mutex_enter(&zvol_state_lock);
    zv = zvol_find_by_dev(dev);
    kobj = zv ? get_disk(zv->zv_disk) : ERR_PTR(-ENOENT);
    mutex_exit(&zvol_state_lock);

    return kobj;
}

now, why zvol_find_by_dev fails i'm not sure (in this case it might be that destroyed zvols didn't go away (i've seen that), so something accessing them tripped this up)

pretty sure we can't return ERR_PTR(...) from here though as the rest of the code doesn't like it
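A minimal fix along those lines, sketched as a diff against the zvol_probe() quoted above (file path and hunk context are approximate), is to return NULL so the lookup is treated as an ordinary miss:

```diff
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ zvol_probe
     mutex_enter(&zvol_state_lock);
     zv = zvol_find_by_dev(dev);
-    kobj = zv ? get_disk(zv->zv_disk) : ERR_PTR(-ENOENT);
+    kobj = zv ? get_disk(zv->zv_disk) : NULL;
     mutex_exit(&zvol_state_lock);
```

kobj_lookup() already handles a NULL return from probe() (and get_disk() itself returns NULL when the module reference can't be taken), so with this change get_gendisk() returns NULL and __blkdev_get() hits its !disk check and fails cleanly with -ENXIO instead of oopsing.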

@behlendorf
Contributor

Indeed, this does look like an issue in zvol_probe(). We should be returning NULL here instead of an error pointer.

@cwedgwood
Contributor Author

@behlendorf so i made this fix and added some other debugging and of course the race has gone away for now...

i'll leave it in and let you know how it goes

@cwedgwood
Contributor Author

update: i can now reproduce the race and with a small fix it does indeed prevent the export/import crash

@behlendorf
Contributor

That's excellent; can you post the patch so we can get it reviewed and merged? This is one of the issues I absolutely wanted fixed in rc12.

@cwedgwood
Contributor Author

@behlendorf let me know how that looks, i can quickly amend and re-push
