This repository has been archived by the owner on Feb 26, 2020. It is now read-only.

SPL PANIC when creating a pool on top of a Ceph RBD #241

Closed
tdb opened this issue May 24, 2013 · 59 comments

@tdb

tdb commented May 24, 2013

I'm trying to create a pool on top of a Ceph RBD. My setup is:

  • Running on a VMware VM
  • Ubuntu precise using linux-generic-lts-raring kernel (3.8.0.22.21)
  • ZFS/SPL 0.6.1 from ppa:zfs-native/stable
  • Ceph 0.61.2 from ceph.com repository

I can create a pool on top of a local disk without any problems. But when I put it on top of a Ceph RBD (block device) I get the following error:

# rbd ls -l
NAME     SIZE PARENT FMT PROT LOCK
cephzfs 1024G          1
# rbd map cephzfs --pool rbd --name client.admin
# ls -la /dev/rbd/rbd/cephzfs /dev/rbd1
brw-rw---- 1 root disk 251, 0 May 24 16:04 /dev/rbd1
lrwxrwxrwx 1 root root     10 May 24 16:04 /dev/rbd/rbd/cephzfs -> ../../rbd1
# zpool create pool1 /dev/rbd/rbd/cephzfs
cannot open 'pool1': dataset does not exist

And this panic:

[10582.132665] VERIFY(shpp->sh_eof == shpp->sh_pool_create_len) failed
[10582.132816] SPLError: 1746:0:(spa_history.c:276:spa_history_log_sync()) SPL PANIC
[10582.132958] SPL: Showing stack for process 1746
[10582.132962] Pid: 1746, comm: txg_sync Tainted: PF          O 3.8.0-22-generic #33~precise1-Ubuntu
[10582.132963] Call Trace:
[10582.132999]  [] spl_debug_dumpstack+0x27/0x40 [spl]
[10582.133006]  [] spl_debug_bug+0x82/0xe0 [spl]
[10582.133045]  [] spa_history_log_sync+0x428/0x650 [zfs]
[10582.133077]  [] dsl_sync_task_group_sync+0x123/0x210 [zfs]
[10582.133107]  [] dsl_pool_sync+0x41b/0x530 [zfs]
[10582.133140]  [] spa_sync+0x3a8/0xa50 [zfs]
[10582.133160]  [] ? ktime_get_ts+0x4c/0xe0
[10582.133195]  [] txg_sync_thread+0x2df/0x540 [zfs]
[10582.133229]  [] ? txg_init+0x250/0x250 [zfs]
[10582.133238]  [] thread_generic_wrapper+0x78/0x90 [spl]
[10582.133246]  [] ? __thread_create+0x310/0x310 [spl]
[10582.133255]  [] kthread+0xc0/0xd0
[10582.133259]  [] ? flush_kthread_worker+0xb0/0xb0
[10582.133272]  [] ret_from_fork+0x7c/0xb0
[10582.133275]  [] ? flush_kthread_worker+0xb0/0xb0

And then the following repeats after that until I reboot:

[10779.414291] INFO: task txg_sync:1746 blocked for more than 120 seconds.
[10779.414442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10779.414588] txg_sync        D ffff88003737b460     0  1746      2 0x00000000
[10779.414596]  ffff88003c517ad8 0000000000000046 0000000000000013 ffff88003fc13f40
[10779.414601]  ffff88003c517fd8 ffff88003c517fd8 ffff88003c517fd8 0000000000013f40
[10779.414604]  ffff88003b2d9740 ffff88003b08c5c0 ffffffff81c15347 0000000000000000
[10779.414607] Call Trace:
[10779.414624]  [] schedule+0x29/0x70
[10779.414652]  [] spl_debug_bug+0xb5/0xe0 [spl]
[10779.414716]  [] spa_history_log_sync+0x428/0x650 [zfs]
[10779.414751]  [] dsl_sync_task_group_sync+0x123/0x210 [zfs]
[10779.414785]  [] dsl_pool_sync+0x41b/0x530 [zfs]
[10779.414818]  [] spa_sync+0x3a8/0xa50 [zfs]
[10779.414825]  [] ? ktime_get_ts+0x4c/0xe0
[10779.414863]  [] txg_sync_thread+0x2df/0x540 [zfs]
[10779.414897]  [] ? txg_init+0x250/0x250 [zfs]
[10779.414906]  [] thread_generic_wrapper+0x78/0x90 [spl]
[10779.414914]  [] ? __thread_create+0x310/0x310 [spl]
[10779.414919]  [] kthread+0xc0/0xd0
[10779.414922]  [] ? flush_kthread_worker+0xb0/0xb0
[10779.414926]  [] ret_from_fork+0x7c/0xb0
[10779.414929]  [] ? flush_kthread_worker+0xb0/0xb0
[10899.176620] INFO: task txg_sync:1746 blocked for more than 120 seconds.
[10899.176758] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10899.176902] txg_sync        D ffff88003737b460     0  1746      2 0x00000000
[10899.176906]  ffff88003c517ad8 0000000000000046 0000000000000013 ffff88003fc13f40
[10899.176910]  ffff88003c517fd8 ffff88003c517fd8 ffff88003c517fd8 0000000000013f40
[10899.176913]  ffff88003b2d9740 ffff88003b08c5c0 ffffffff81c15347 0000000000000000
[10899.176917] Call Trace:
[10899.176926]  [] schedule+0x29/0x70
[10899.176958]  [] spl_debug_bug+0xb5/0xe0 [spl]
[10899.176998]  [] spa_history_log_sync+0x428/0x650 [zfs]
[10899.177030]  [] dsl_sync_task_group_sync+0x123/0x210 [zfs]
[10899.177059]  [] dsl_pool_sync+0x41b/0x530 [zfs]
[10899.177092]  [] spa_sync+0x3a8/0xa50 [zfs]
[10899.177097]  [] ? ktime_get_ts+0x4c/0xe0
[10899.177132]  [] txg_sync_thread+0x2df/0x540 [zfs]
[10899.177166]  [] ? txg_init+0x250/0x250 [zfs]
[10899.177178]  [] thread_generic_wrapper+0x78/0x90 [spl]
[10899.177186]  [] ? __thread_create+0x310/0x310 [spl]
[10899.177191]  [] kthread+0xc0/0xd0
[10899.177194]  [] ? flush_kthread_worker+0xb0/0xb0
[10899.177198]  [] ret_from_fork+0x7c/0xb0
[10899.177202]  [] ? flush_kthread_worker+0xb0/0xb0

I'm happy to provide any further information required or do testing as needed.

Thank you.
Tim.

@hvenzke

hvenzke commented May 28, 2013

Use the real physical device name, /dev/rbd1.
No symlinks with ZFS!

@tdb
Author

tdb commented May 28, 2013

It makes no difference I'm afraid. The panic is identical.

@hvenzke

hvenzke commented May 28, 2013

Well, then the bug is basically in Ceph RBD's logic, since that is what provides the storage.

ZFS on Linux is known to work fine with native DRBD.

Ceph RBD's snapshot features are overkill, as ZFS does that itself.

Can you try making a GFS cluster or a Lustre file system on it?

@tdb
Author

tdb commented May 29, 2013

Ceph RBD works fine with other file systems for me, and ZFS works fine with other underlying storage. So it's hard to be precise about where the problem lies. In any case, ZFS shouldn't panic, surely? That's a bug.

Ceph provides a distributed file system which is why I want to use it. ZFS also has some great features for managing multiple file systems within a single pool including snapshots.

@behlendorf
Contributor

@tdb You're hitting a VERIFY in the code while attempting to sync out the history buffer to disk. For some reason the buffer lengths aren't being correctly updated. Since this only happens on top of a ceph rbd I suspect their block device is behaving slightly differently than the rest of the Linux block drivers. For the purposes of a test you could try commenting out the VERIFY like this, although my suspicion is you'll likely hit another issue quickly. However, that failure may shed some more light on exactly what's going wrong.

diff --git a/module/zfs/spa_history.c b/module/zfs/spa_history.c
index 9fb75f3..2d45266 100644
--- a/module/zfs/spa_history.c
+++ b/module/zfs/spa_history.c
@@ -272,8 +272,8 @@ spa_history_log_sync(void *arg1, void *arg2, dmu_tx_t *tx)
            NV_ENCODE_XDR, KM_PUSHPAGE) == 0);

        mutex_enter(&spa->spa_history_lock);
-       if (hap->ha_log_type == LOG_CMD_POOL_CREATE)
-               VERIFY(shpp->sh_eof == shpp->sh_pool_create_len);
+//     if (hap->ha_log_type == LOG_CMD_POOL_CREATE)
+//             VERIFY(shpp->sh_eof == shpp->sh_pool_create_len);

        /* write out the packed length as little endian */
        le_len = LE_64((uint64_t)reclen);

Related to this, most people usually think about putting ceph on top of zfs, not vice versa. This behavior was recently fixed in master so you might try that. It won't get you features like distributed snapshots but it will bring many of zfs's other benefits.

@tdb
Author

tdb commented Jun 7, 2013

@behlendorf Thanks for the reply. I made the change suggested (against 0.6.1) and saw the following:

# zpool create pool1 /dev/rbd1
cannot open 'pool1': dataset does not exist

So that's the same as before. Checking zpool status afterwards showed a good pool, but zfs status didn't show any filesystems. No panic though.

Then I tried to repeat it. This time I got a panic after creating the pool, and zpool status hung. The panic was:

[  183.924160] divide error: 0000 [#1] SMP
[  183.924349] Modules linked in: coretemp(F) microcode(F) psmouse(F) ppdev(F) vmw_balloon(F) serio_raw(F) i2c_piix4(F) vmwgfx(F) mac_hid(F) ttm(F) shpchp(F) drm(F) parport_pc(F) rbd(F) libceph(F) lp(F) parport(F) zfs(POF) zcommon(POF) znvpair(POF) zavl(POF) zunicode(POF) spl(OF) floppy(F) e1000(F) mptspi(F) mptscsih(F) mptbase(F) btrfs(F) zlib_deflate(F) libcrc32c(F)
[  183.926033] CPU 0
[  183.926100] Pid: 2019, comm: txg_sync Tainted: PF          O 3.8.0-23-generic #34~precise1-Ubuntu VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
[  183.926385] RIP: 0010:[]  [] spa_history_write+0x82/0x1d0 [zfs]
[  183.926631] RSP: 0018:ffff88003c549ab8  EFLAGS: 00010246
[  183.926742] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  183.926878] RDX: 0000000000000000 RSI: 0000000000000020 RDI: 0000000000000000
[  183.927015] RBP: ffff88003c549b28 R08: ffff88003cfb4b40 R09: 0000000000000003
[  183.927151] R10: ffff880037062303 R11: 316462722f766564 R12: ffff88003c496600
[  183.927287] R13: ffff88003be36000 R14: ffff88003cf9a000 R15: 0000000000000008
[  183.927424] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  183.927574] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  183.927690] CR2: 00007f3b12ef0000 CR3: 000000003b141000 CR4: 00000000000007f0
[  183.927924] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  183.928132] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  183.928312] Process txg_sync (pid: 2019, threadinfo ffff88003c548000, task ffff88003bc8ae80)
[  183.928535] Stack:
[  183.928633]  0000000000000002 ffffffffa01e3360 ffff88003cfb4b40 ffff88003c549ba0
[  183.929007]  ffff88003cf9a000 0000000000000008 ffff88003be36000 0000000068163d54
[  183.929382]  ffff88003b8a2cc0 ffff88003b8a2cc0 ffff88003be36000 ffff88003cfb4b40
[  183.929757] Call Trace:
[  183.929903]  [] spa_history_log_sync+0x221/0x610 [zfs]
[  183.930106]  [] dsl_sync_task_group_sync+0x123/0x210 [zfs]
[  183.930312]  [] dsl_pool_sync+0x41b/0x530 [zfs]
[  183.930507]  [] spa_sync+0x3a8/0xa50 [zfs]
[  183.930667]  [] ? ktime_get_ts+0x4c/0xe0
[  183.930852]  [] txg_sync_thread+0x2df/0x540 [zfs]
[  183.931049]  [] ? txg_init+0x250/0x250 [zfs]
[  183.931219]  [] thread_generic_wrapper+0x78/0x90 [spl]
[  183.931397]  [] ? __thread_create+0x310/0x310 [spl]
[  183.931568]  [] kthread+0xc0/0xd0
[  183.936038]  [] ? flush_kthread_worker+0xb0/0xb0
[  183.936149]  [] ret_from_fork+0x7c/0xb0
[  183.936251]  [] ? flush_kthread_worker+0xb0/0xb0
[  183.936360] Code: 55 b0 48 89 fa 48 29 f2 48 01 c2 48 39 55 b8 0f 82 bc 00 00 00 4c 8b 75 b0 41 bf 08 00 00 00 48 29 c8 31 d2 49 8b b5 70 08 00 00 <48> f7 f7 4c 8d 45 c0 4c 89 f7 48 01 ca 48 29 d3 48 83 fb 08 49
[  183.938433] RIP  [] spa_history_write+0x82/0x1d0 [zfs]
[  183.938599]  RSP 
[  183.938710] ---[ end trace f7a46262c37aea79 ]---

If I had a more concrete idea of what was happening I'd be happy to file a bug with Ceph.

@behlendorf
Contributor

Divide by zero, now that's interesting. Can you dump the exact code for your build as follows? It should look something like this, but the exact line might differ. I want to know where that divide by zero occurred.

[behlendo@rhel-6-2-amd64 zfs]$ gdb module/zfs/zfs.ko
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/behlendo/src/git/zfs/module/zfs/zfs.ko...done.
(gdb) 
(gdb)  list *(spa_history_write+0x82)
0x58a62 is in spa_history_write (/home/behlendo/src/git/zfs/module/zfs/../../module/zfs/spa_history.c:129).
124             int err;
125
126             phys_bof = spa_history_log_to_phys(shpp->sh_bof, shpp);
127             firstread = MIN(sizeof (reclen), shpp->sh_phys_max_off - phys_bof);
128
129             if ((err = dmu_read(mos, spa->spa_history, phys_bof, firstread,
130                 buf, DMU_READ_PREFETCH)) != 0)
131                     return (err);
132             if (firstread != sizeof (reclen)) {
133                     if ((err = dmu_read(mos, spa->spa_history,
(gdb) quit

@tdb
Author

tdb commented Jun 8, 2013

I've been building the module using dkms, but it appears to be stripping the module or not building it with symbols in the first place. Is there a way to modify that behaviour? Or am I going to need to ditch that and build it myself?

I've tried setting the relevant things in /etc/default/zfs.

@chrisrd
Contributor

chrisrd commented Jun 8, 2013

Given previously failing VERIFY:

+//     if (hap->ha_log_type == LOG_CMD_POOL_CREATE)
+//             VERIFY(shpp->sh_eof == shpp->sh_pool_create_len);

...static analysis suggests:

static int
spa_history_write(spa_t *spa, void *buf, uint64_t len, spa_history_phys_t *shpp,
    dmu_tx_t *tx)
{
    ...
        phys_eof = spa_history_log_to_phys(shpp->sh_eof, shpp);
    ...
}

static uint64_t
spa_history_log_to_phys(uint64_t log_off, spa_history_phys_t *shpp)
{
        uint64_t phys_len;

        phys_len = shpp->sh_phys_max_off - shpp->sh_pool_create_len;
        return ((log_off - shpp->sh_pool_create_len) % phys_len      <<<< BOOM!
            + shpp->sh_pool_create_len);
}
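
A minimal user-space sketch of the arithmetic quoted above, assuming only the two fields that appear in it. The struct `history_phys` and the helper `history_log_to_phys_checked` are hypothetical names used for illustration; they are not part of ZFS. The sketch shows that when `sh_phys_max_off` equals `sh_pool_create_len` (for example, if both were read back as zero), `phys_len` is zero and the unguarded modulo would trap, which is consistent with the divide error reported earlier.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two spa_history_phys_t fields used here. */
struct history_phys {
	uint64_t sh_pool_create_len;
	uint64_t sh_phys_max_off;
};

/*
 * Same arithmetic as the quoted spa_history_log_to_phys(), but it refuses
 * to divide when the ring length is zero. Returns 0 and stores the
 * physical offset in *phys_off on success; returns -1 if phys_len is zero.
 */
static int
history_log_to_phys_checked(uint64_t log_off, const struct history_phys *shpp,
    uint64_t *phys_off)
{
	uint64_t phys_len = shpp->sh_phys_max_off - shpp->sh_pool_create_len;

	if (phys_len == 0)
		return (-1);	/* the unguarded "% phys_len" would trap here */

	*phys_off = (log_off - shpp->sh_pool_create_len) % phys_len +
	    shpp->sh_pool_create_len;
	return (0);
}
```

This is only a way to reason about the failure; it does not explain why the on-disk values would be wrong in the first place.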

@behlendorf
Contributor

@tdb It depends on your kernel and what the default build options are. For example, the Ubuntu kernels will always strip the symbols. It may also not be needed since @chrisrd has likely spotted the offending line here.

It seems likely that we're somehow reading bogus data from the ceph rbd. It would be useful to see what those values are. If you're still interested in chasing this, can you try the following patch? It will log the offending values to the console before the crash. It would be useful to run it several times to see if the values remain constant or change.

diff --git a/module/zfs/spa_history.c b/module/zfs/spa_history.c
index 9fb75f3..700f364 100644
--- a/module/zfs/spa_history.c
+++ b/module/zfs/spa_history.c
@@ -223,6 +223,13 @@ spa_history_log_sync(void *arg1, void *arg2, dmu_tx_t *tx)
         */
        VERIFY(0 == dmu_bonus_hold(mos, spa->spa_history, FTAG, &dbp));
        shpp = dbp->db_data;
+#ifdef _KERNEL
+       printk("sh_pool_create_len = %llu\n", shpp->sh_pool_create_len);
+       printk("sh_phys_max_off = %llu\n", shpp->sh_phys_max_off);
+       printk("sh_bof = %llu\n", shpp->sh_bof);
+       printk("sh_eof = %llu\n", shpp->sh_eof);
+       printk("sh_records_lost = %llu\n", shpp->sh_records_lost);
+#endif

        dmu_buf_will_dirty(dbp, tx);

@tdb
Author

tdb commented Jun 18, 2013

@behlendorf It looks like, either through fiddling or other updates, I've managed to move the error:

[  422.936633]  rbd1: unknown partition table
[  422.936705] rbd: rbd1: added with size 0x10000000000
[  441.362250] SPL: using hostid 0x007f0101
[  441.470098] SPLError: 1682:0:(zap_micro.c:301:mze_find()) VERIFY3(mze->mze_cd == (&(zn->zn_zap)->zap_u.zap_micro.zap_phys->mz_chunk[(mze)->mze_chunkid])->mze_cd) failed (0 == 1635019877)
[  441.470418] SPLError: 1682:0:(zap_micro.c:301:mze_find()) SPL PANIC
[  441.470544] SPL: Showing stack for process 1682
[  441.470552] Pid: 1682, comm: txg_sync Tainted: PF          O 3.8.0-25-generic #37~precise1-Ubuntu
[  441.470554] Call Trace:
[  441.470579]  [] spl_debug_dumpstack+0x27/0x40 [spl]
[  441.470589]  [] spl_debug_bug+0x82/0xe0 [spl]
[  441.470636]  [] mze_find+0x13a/0x270 [zfs]
[  441.470677]  [] zap_lookup_norm+0x9e/0x1c0 [zfs]
[  441.470685]  [] ? kmem_free_debug+0x4b/0x150 [spl]
[  441.470725]  [] zap_lookup+0x33/0x40 [zfs]
[  441.470765]  [] spa_feature_is_active+0x8a/0xf0 [zfs]
[  441.470799]  [] dsl_scan_active+0x76/0xc0 [zfs]
[  441.470833]  [] dsl_scan_sync+0x4f/0xe30 [zfs]
[  441.470873]  [] ? zio_wait+0x23d/0x480 [zfs]
[  441.470910]  [] ? bpobj_enqueue_cb+0x20/0x20 [zfs]
[  441.470947]  [] spa_sync+0x417/0xcd0 [zfs]
[  441.470968]  [] ? ktime_get_ts+0x4c/0xe0
[  441.471007]  [] txg_sync_thread+0x30a/0x640 [zfs]
[  441.471016]  [] ? kmem_free_debug+0x4b/0x150 [spl]
[  441.471054]  [] ? txg_quiesce_thread+0x540/0x540 [zfs]
[  441.471062]  [] thread_generic_wrapper+0x78/0x90 [spl]
[  441.471070]  [] ? __thread_create+0x310/0x310 [spl]
[  441.471080]  [] kthread+0xc0/0xd0
[  441.471084]  [] ? flush_kthread_worker+0xb0/0xb0
[  441.471096]  [] ret_from_fork+0x7c/0xb0
[  441.471100]  [] ? flush_kthread_worker+0xb0/0xb0

If that's of no use to you, let me know and I'll try to get the machine back how it was. I notice the kernel version has changed, and I'm fairly sure a ceph update got pulled in too.

@behlendorf
Contributor

@tdb This just looks like garbage data from disk as well. One thing which did catch my eye however from the above log was the size of the rbd device. 0x10000000000 is a surprisingly round number for the partition, is this expected? Also are you creating a partition table for zfs manually, or allowing it to partition the device?

[  422.936705] rbd: rbd1: added with size 0x10000000000
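
One way to look at a "garbage" value such as the 1635019877 in the failed VERIFY3 above is to print its bytes in x86 (little-endian) memory order. The helper below is a hypothetical illustration, not ZFS code. For 1635019877 (0x61746C65) the bytes read `elta`, i.e. printable ASCII where a binary field was expected. That is an observation only; it does not identify where the text came from.

```c
#include <stdint.h>

/*
 * Write the four bytes of a 32-bit value, in little-endian (x86 memory)
 * order, into out[0..3] and NUL-terminate the result. Bytes outside the
 * printable ASCII range are shown as '.'. out must hold at least 5 chars.
 */
static void
u32_to_ascii_le(uint32_t value, char out[5])
{
	for (int i = 0; i < 4; i++) {
		unsigned char byte = (unsigned char)((value >> (8 * i)) & 0xff);

		out[i] = (byte >= 0x20 && byte <= 0x7e) ? (char)byte : '.';
	}
	out[4] = '\0';
}
```

The same helper can be applied to any other 32-bit value reported by a failed VERIFY3 to check whether it also looks like text.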

@tdb
Author

tdb commented Jun 19, 2013

@behlendorf I noticed that size too. It's a 1TB (1024G) device, so it's actually correct.

# rbd ls -l
NAME     SIZE PARENT FMT PROT LOCK
cephzfs 1024G          1
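
As a quick check of the arithmetic, the size in the kernel message, 0x10000000000 bytes, is 2^40 bytes. Assuming the `G` in the `rbd ls -l` output means GiB (2^30 bytes), that is exactly the 1024G shown. The helper name `bytes_to_gib` is hypothetical.

```c
#include <stdint.h>

/* Convert a byte count to whole GiB (2^30 bytes), discarding any remainder. */
static uint64_t
bytes_to_gib(uint64_t bytes)
{
	return (bytes >> 30);
}
```

Calling `bytes_to_gib(0x10000000000ULL)` yields 1024, so the round number is simply the requested image size.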

I was giving the raw device to ZFS, rather than creating a partition.

If I use fdisk to put a partition table on the disk, but without adding any partitions, I get the following when creating a pool:

# zpool create pool1 /dev/rbd1
internal error: Invalid argument
Aborted (core dumped)

If I create a partition on it I get the same errors as I mentioned previously (mze_find) when creating a pool on /dev/rbd1p1.

Just for comparison, here's the output creating an ext4 filesystem on the same partition:

root@ubuntu:~# mkfs.ext4 /dev/rbd1p1
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1024 blocks, Stripe width=1024 blocks
67108864 inodes, 268434432 blocks
13421721 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
8192 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
root@ubuntu:~# mount /dev/rbd1p1 /mnt
root@ubuntu:~# df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd1p1    1008G   72M  957G   1% /mnt
root@ubuntu:~#

@behlendorf
Contributor

Strange. Well the only way these failures make sense is if something odd is happening at the block device layer. My next suggestion would be to use blktrace to grab a trace log for the rbd device. That would allow us to look for something unusual in the way the rbd or zfs is behaving.

http://www.cse.unsw.edu.au/~aaronc/iosched/doc/blktrace.html

@tdb
Author

tdb commented Jun 20, 2013

@behlendorf Does this output help?

https://gist.github.com/tdb/2ae734e546be0c5e1d39

@behlendorf
Contributor

@tdb That's exactly the log I wanted to see, but unfortunately it doesn't really show anything strange. All the I/O looks reasonable and is doing what I'd expect a zpool create to do. It's the right size and it's all within the size of the device. However, what is interesting is that it doesn't show any reads before the crash.

That's got me wondering if the rbd driver might be modifying parts of the pages in the bvecs during the write. That could explain this issue, but we'd need to put a debug patch together to see.
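
The idea of checking whether a buffer is modified during a write can be sketched in user space as "checksum before, checksum after". This is not the debug patch mentioned above and it is not kernel code; `buf_checksum` and `buffer_changed` are hypothetical helpers, and the Fletcher-style sum is an arbitrary choice for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Simple Fletcher-style running checksum over a byte buffer. */
static uint64_t
buf_checksum(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t a = 0, b = 0;

	for (size_t i = 0; i < len; i++) {
		a += p[i];	/* running sum of bytes */
		b += a;		/* sum of running sums, position sensitive */
	}
	return ((b << 32) ^ a);
}

/*
 * Compare a checksum taken before handing the buffer to a writer with the
 * buffer's current contents. Returns 1 if they differ, 0 otherwise.
 */
static int
buffer_changed(uint64_t checksum_before, const void *buf, size_t len)
{
	return (buf_checksum(buf, len) != checksum_before);
}
```

In use, the checksum would be taken just before the write is submitted and compared once it completes; a mismatch would suggest the lower layer touched the data.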

@chrisrd
Contributor

chrisrd commented Jun 21, 2013

@tdb Based on little more than the mention of modifying bvecs, this commit which touches drivers/block/rbd.c might be relevant:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d74c6d514fe314b8bdab58b487b25992291577ec

block: Add bio_for_each_segment_all()

__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bv_idx.  Currently, the only usage is to walk all the
bvecs after the bio has been advanced by specifying 0 index.

For immutable bvecs, we need to split these apart;
bio_for_each_segment() is going to have a different implementation.
This will also help document the intent of code that's using it -
bio_for_each_segment_all() is only legal to use for code that owns the
bio.

If your kernel doesn't have that patch already it could be worthwhile trying a kernel that includes it. It looks to have been introduced some time between v3.9 and v3.10-rc1. It may even be worth trying v3.10-rc6, which has pulled in a bunch of rbd.c changes.

@tdb
Author

tdb commented Jun 21, 2013

@chrisrd Using the Ubuntu mainline kernels I tried v3.9.7, but it behaved the same. I checked and it doesn't contain the commit you mentioned above. So I tried v3.10-rc6 and I get the following build error in spl:

Making all in module
make[2]: Entering directory `/var/lib/dkms/spl/0.6.1/build/module'
make -C /lib/modules/3.10.0-031000rc6-generic/build SUBDIRS=`pwd`  CONFIG_SPL=m modules
make[3]: Entering directory `/usr/src/linux-headers-3.10.0-031000rc6-generic'
  CC [M]  /var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-debug.o
  CC [M]  /var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.o
In file included from /var/lib/dkms/spl/0.6.1/build/include/sys/kmem.h:38:0,
                 from /var/lib/dkms/spl/0.6.1/build/include/sys/kstat.h:32,
                 from /var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:28:
/var/lib/dkms/spl/0.6.1/build/include/sys/vmsystm.h:77:8: error: redefinition of ‘struct vmalloc_info’
include/linux/vmalloc.h:173:8: note: originally defined here
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘proc_dir_entry_match’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1126:15: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1129:32: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘proc_dir_entry_find’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1137:16: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1137:37: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘proc_dir_entries’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1150:16: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1150:37: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘spl_proc_init’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1177:2: error: implicit declaration of function ‘create_proc_entry’ [-Werror=implicit-function-declaration]
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1177:21: warning: assignment makes pointer from integer without a cast [enabled by default]
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1181:27: error: dereferencing pointer to incomplete type
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c: In function ‘proc_dir_entry_match’:
/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.c:1130:1: warning: control reaches end of non-void function [-Wreturn-type]
cc1: some warnings being treated as errors
make[5]: *** [/var/lib/dkms/spl/0.6.1/build/module/spl/../../module/spl/spl-proc.o] Error 1
make[4]: *** [/var/lib/dkms/spl/0.6.1/build/module/spl] Error 2
make[3]: *** [_module_/var/lib/dkms/spl/0.6.1/build/module] Error 2
make[3]: Leaving directory `/usr/src/linux-headers-3.10.0-031000rc6-generic'
make[2]: *** [modules] Error 2
make[2]: Leaving directory `/var/lib/dkms/spl/0.6.1/build/module'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/var/lib/dkms/spl/0.6.1/build'
make: *** [all] Error 2

Have spl/zfs been tested with v3.10 yet?

@behlendorf
Contributor

@tdb There are pull requests open for 3.10 support but they are still undergoing review before getting merged. They should be safe to use; the only real questions around them are whether they accidentally break builds on older kernels and whether they are as clean as they can be.

@chrisrd I don't think the referenced commit will help, but it wouldn't hurt to try. We'll probably need to instrument the zfs vdev_disk.c code to see exactly what's happening to the bios.

@tdb
Author

tdb commented Aug 25, 2013

Just a quick update on this. I've tried again with 0.6.2 and the following two kernels:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.10.9-saucy/
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc6-saucy/

Same problem:

Aug 25 00:51:29 ubuntu-12042 kernel: [  142.393672] SPLError: 2851:0:(zap_micro.c:301:mze_find()) VERIFY3(mze->mze_cd == (&(zn->zn_zap)->zap_u.zap_micro.zap_phys->mz_chunk[(mze)->mze_chunkid])->mze_cd) failed (0 == 825307184)
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394034] SPLError: 2851:0:(zap_micro.c:301:mze_find()) SPL PANIC
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394160] SPL: Showing stack for process 2851
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394164] CPU: 0 PID: 2851 Comm: txg_sync Tainted: PF          O 3.11.0-031100rc6-generic #201308181835
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394166] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/22/2012
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394169]  ffff88003c59da00 ffff88003c4ab9c8 ffffffff81720b9b 0000000000000007
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394173]  0000000000000000 ffff88003c4ab9d8 ffffffffa018f4d7 ffff88003c4aba18
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394176]  ffffffffa01907a2 ffffffffa01a4b4d ffff880036998880 ffff88003c59da00
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394179] Call Trace:
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394203]  [] dump_stack+0x46/0x58
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394221]  [] spl_debug_dumpstack+0x27/0x40 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394246]  [] spl_debug_bug+0x82/0xe0 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394314]  [] mze_find+0x13a/0x270 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394359]  [] zap_lookup_norm+0x9e/0x1c0 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394368]  [] ? kmem_free_debug+0x4b/0x150 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394410]  [] zap_lookup+0x33/0x40 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394451]  [] spa_feature_is_active+0x8a/0xf0 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394485]  [] dsl_scan_active+0x76/0xc0 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394520]  [] dsl_scan_sync+0x4f/0xe30 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394559]  [] ? zio_wait+0x23d/0x4a0 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394596]  [] ? bpobj_enqueue_cb+0x20/0x20 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394633]  [] spa_sync+0x48a/0xd60 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394649]  [] ? ktime_get_ts+0x4c/0xe0
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394687]  [] txg_sync_thread+0x30a/0x640 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394696]  [] ? kmem_free_debug+0x4b/0x150 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394733]  [] ? txg_quiesce_thread+0x540/0x540 [zfs]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394742]  [] thread_generic_wrapper+0x78/0x90 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394750]  [] ? __thread_create+0x310/0x310 [spl]
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394759]  [] kthread+0xc0/0xd0
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394763]  [] ? flush_kthread_worker+0xb0/0xb0
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394771]  [] ret_from_fork+0x7c/0xb0
Aug 25 00:51:29 ubuntu-12042 kernel: [  142.394776]  [] ? flush_kthread_worker+0xb0/0xb0

@tdb
Author

tdb commented Nov 21, 2013

Using 0.6.2 and the linux-image-generic-lts-saucy 3.11.0.13.12 kernel on Ubuntu precise I now get the following:

# zpool create pool2 /dev/rbd1
internal error: Invalid argument
Aborted (core dumped)

The core file contains:

#0  0x00007ffa1abad425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffa1abb0b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffa1b383782 in ?? () from /lib/libzfs.so.2
#3  0x00007ffa1b383b70 in zfs_standard_error_fmt () from /lib/libzfs.so.2
#4  0x00007ffa1b364a1e in zfs_open () from /lib/libzfs.so.2
#5  0x000000000040bc98 in zpool_do_create (argc=, argv=) at ../../cmd/zpool/zpool_main.c:1057
#6  0x0000000000404d26 in main (argc=4, argv=0x7fffecdc5178) at ../../cmd/zpool/zpool_main.c:5709

And this in the log:

Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240529] SPLError: 1688:0:(spa.c:6190:spa_sync()) VERIFY3(bpobj_iterate(defer_bpo, spa_free_sync_cb, zio, tx) == 0) failed (22 == 0)
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240786] SPLError: 1688:0:(spa.c:6190:spa_sync()) SPL PANIC
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240899] SPL: Showing stack for process 1688
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240910] CPU: 0 PID: 1688 Comm: txg_sync Tainted: PF          O 3.11.0-13-generic #20~precise2-Ubuntu
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240912] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240915]  0000000000000005 ffff88003c6f9c48 ffffffff8173a05d 0000000000000007
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240919]  0000000000000000 ffff88003c6f9c58 ffffffffa01794d7 ffff88003c6f9c98
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240922]  ffffffffa017a7a2 ffffffffa018ebed ffff88003b804000 0000000000000005
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240925] Call Trace:
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240943]  [] dump_stack+0x46/0x58
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240971]  [] spl_debug_dumpstack+0x27/0x40 [spl]
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.240979]  [] spl_debug_bug+0x82/0xe0 [spl]
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.241024]  [] spa_sync+0x9f7/0xdb0 [zfs]
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.241080]  [] txg_sync_thread+0x364/0x6a0 [zfs]
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.241122]  [] ? txg_quiesce_thread+0x520/0x520 [zfs]
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.241131]  [] thread_generic_wrapper+0x78/0x90 [spl]
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.241139]  [] ? __thread_create+0x310/0x310 [spl]
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.241145]  [] kthread+0xc0/0xd0
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.241149]  [] ? flush_kthread_worker+0xb0/0xb0
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.241158]  [] ret_from_fork+0x7c/0xb0
Nov 21 23:08:22 ubuntu-12042 kernel: [  116.241162]  [] ? flush_kthread_worker+0xb0/0xb0
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.848936] INFO: task txg_sync:1688 blocked for more than 120 seconds.
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849079] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849220] txg_sync        D ffff880036a5ece0     0  1688      2 0x00000000
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849226]  ffff88003c6f9c48 0000000000000046 ffffffff81ae70b3 ffff88003fc14580
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849230]  ffff88003c6f9fd8 ffff88003c6f9fd8 ffff88003c6f9fd8 0000000000014580
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849233]  ffff88003cd69770 ffff88003cd6aee0 0000000000000000 0000000000000000
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849236] Call Trace:
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849252]  [] schedule+0x29/0x70
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849299]  [] spl_debug_bug+0xb5/0xe0 [spl]
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849346]  [] spa_sync+0x9f7/0xdb0 [zfs]
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849387]  [] txg_sync_thread+0x364/0x6a0 [zfs]
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849427]  [] ? txg_quiesce_thread+0x520/0x520 [zfs]
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849445]  [] thread_generic_wrapper+0x78/0x90 [spl]
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849454]  [] ? __thread_create+0x310/0x310 [spl]
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849460]  [] kthread+0xc0/0xd0
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849464]  [] ? flush_kthread_worker+0xb0/0xb0
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849468]  [] ret_from_fork+0x7c/0xb0
Nov 21 23:10:27 ubuntu-12042 kernel: [  240.849471]  [] ? flush_kthread_worker+0xb0/0xb0

Further zpool commands generate the following:

Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182141] SPLError: 2064:0:(zap_micro.c:1292:zap_cursor_retrieve()) VERIFY3(mze->mze_cd == mzep->mze_cd) failed (0 == 1635019877)
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182264] SPLError: 2064:0:(zap_micro.c:1292:zap_cursor_retrieve()) SPL PANIC
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182329] SPL: Showing stack for process 2064
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182335] CPU: 0 PID: 2064 Comm: zpool Tainted: PF          O 3.11.0-13-generic #20~precise2-Ubuntu
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182337] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182338]  ffff88003c25b640 ffff88003bdebac8 ffffffff8173a05d 0000000000000007
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182341]  0000000000000000 ffff88003bdebad8 ffffffffa01794d7 ffff88003bdebb18
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182343]  ffffffffa017a7a2 ffffffffa018ebed ffff88003bdebbf8 ffff88003c25b640
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182345] Call Trace:
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182352]  [] dump_stack+0x46/0x58
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182363]  [] spl_debug_dumpstack+0x27/0x40 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182367]  [] spl_debug_bug+0x82/0xe0 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182400]  [] zap_cursor_retrieve+0x24a/0x480 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182414]  [] ? default_spin_lock_flags+0x9/0x10
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182441]  [] ? zap_unlockdir+0x108/0x1a0 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182466]  [] spa_add_feature_stats+0x213/0x440 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182471]  [] ? kmem_alloc_debug+0x138/0x3b0 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182476]  [] ? kmem_alloc_debug+0x138/0x3b0 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182482]  [] ? nvlist_remove_all+0x8f/0xd0 [znvpair]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182506]  [] ? spa_config_held+0xb9/0xd0 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182531]  [] ? spa_add_l2cache+0x29/0x3f0 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182555]  [] ? spa_add_spares+0x25/0x360 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182579]  [] spa_get_stats+0x10f/0x330 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182584]  [] ? kmem_alloc_debug+0x138/0x3b0 [spl]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182610]  [] zfs_ioc_pool_stats+0x31/0x70 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182636]  [] zfsdev_ioctl+0x53b/0x5b0 [zfs]
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182646]  [] ? ftrace_raw_event_do_sys_open+0x100/0x110
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182651]  [] do_vfs_ioctl+0x7c/0x2f0
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182653]  [] SyS_ioctl+0x91/0xb0
Nov 21 23:17:44 ubuntu-12042 kernel: [  678.182657]  [] system_call_fastpath+0x1a/0x1f

@behlendorf
Contributor

@tdb This was due to an ABI change between the user utilities and kmods. It is unrelated to your original issue. Make sure you rebuild everything such that the utilities exactly match the kmods.
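
One detail in the second panic above looks consistent with a layout mismatch, although it does not prove one: the unexpected `mze_cd` value, 1635019877, is 0x61746c65, and all four of its bytes are printable ASCII. A minimal userspace sketch that shows those bytes in the order they would sit in memory on a little-endian machine such as x86 follows. `decode_le32` is a hypothetical helper written for this note, not a ZFS or SPL function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of ZFS/SPL): write the four bytes of a
 * 32-bit value into out[] in little-endian memory order, replace any
 * non-printable byte with '.', and NUL-terminate the result.
 */
void decode_le32(uint32_t value, char out[5])
{
	assert(out != NULL);
	for (int i = 0; i < 4; i++) {
		unsigned char b = (unsigned char)((value >> (8 * i)) & 0xff);
		out[i] = (b >= 0x20 && b <= 0x7e) ? (char)b : '.';
	}
	out[4] = '\0';
}
```

Calling `decode_le32(1635019877u, buf)` leaves the text `elta` in `buf`. A fragment of text showing up where an integer is expected is the kind of thing you might see when the utilities and the kmods disagree about a structure layout.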

@tdb
Author

tdb commented Nov 22, 2013

@behlendorf Ah, ok, sorry for the noise. I actually just installed the binary packages from launchpad, which built the kmods with dkms. So I would have expected that to stay in sync. Anyway - as you say, not related to this issue.

@rbraddy

rbraddy commented Dec 18, 2013

Is this still an issue or has a resolution been found?

@tdb
Author

tdb commented Dec 18, 2013

@rbraddy I'm not aware of a resolution yet, but @behlendorf can confirm. Are you seeing the same problem? It'd be good to know it's not just me!

@rbraddy

rbraddy commented Dec 18, 2013

Yes, we are seeing the same issue when creating a ZFS storage pool atop a Ceph RBD block device: zpool create fails and the kernel panics.

Creating an ext4 filesystem on RBD works perfectly. RBD is used extensively today by various cloud stacks (e.g., OpenStack, CloudStack and others), so there seems to be no issue with how it presents itself as a block device to those file systems.

Having RBD work well with ZFS is very important, as it addresses one of the major drawbacks of ZFS, a single point of failure on direct-attached storage, and adds the ability to scale out. RADOS is very impressive technology, and combined with ZFS it promises to be the most powerful filesystem around. Ceph's filesystem is not ready for prime time, so it just makes sense for these two technologies to work well together and be supported (the way ZFS is supported underneath Ceph OSDs today).

I agree that it's odd that: a) ZFS is the only major file system that is not working atop RBD today, and b) ZFS panics instead of failing gracefully in the face of whatever incompatibility exists.

Having said that, from what I have seen, ZFS does work a bit differently than many other file systems. In our testing, we also encountered strange behavior from parted when trying to delete an existing ext4 partition that we had initially configured atop RBD, in an attempt to create an empty GPT partition in preparation for use with ZFS. ZFS creates its own partitioning scheme from what I have seen, so this may be a clue. We are still investigating, but at this point we lack the deep kernel expertise required to reconcile the issue between these two complex systems.

In reading through this thread, I see that about six months ago Brian proposed a next step to gather more information, which does not appear to have happened yet. I'm wondering if it makes sense to pursue that line of analysis next:

From @behlendorf : It seems likely that we're somehow reading bogus data from the ceph rbd. It would be useful to see what those values are. If you're still interested in chasing this can you try the following patch. It will log the offending value to the console before the crash. It would be useful to run it several times to see if the values remain constant or change.

diff --git a/module/zfs/spa_history.c b/module/zfs/spa_history.c
index 9fb75f3..700f364 100644
--- a/module/zfs/spa_history.c
+++ b/module/zfs/spa_history.c
@@ -223,6 +223,13 @@ spa_history_log_sync(void *arg1, void *arg2, dmu_tx_t *tx)
         */
        VERIFY(0 == dmu_bonus_hold(mos, spa->spa_history, FTAG, &dbp));
        shpp = dbp->db_data;
+#ifdef _KERNEL
+       printk("sh_pool_create_len = %llu\n", shpp->sh_pool_create_len);
+       printk("sh_phys_max_off = %llu\n", shpp->sh_phys_max_off);
+       printk("sh_bof = %llu\n", shpp->sh_bof);
+       printk("sh_eof = %llu\n", shpp->sh_eof);
+       printk("sh_records_lost = %llu\n", shpp->sh_records_lost);
+#endif

        dmu_buf_will_dirty(dbp, tx);
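
For anyone comparing the logged values across runs, here is a small userspace sketch of what to look for. It is not ZFS code. The field names are taken from the patch above, but the struct name, the `uint64_t` types, and both helper functions are assumptions made for illustration. The only condition it checks is the one from the failed VERIFY in the original panic (`sh_eof == sh_pool_create_len`).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative stand-in for the history header that shpp points at.
 * Field names come from the debugging patch; types and layout are
 * assumptions for this sketch.
 */
struct history_phys_sketch {
	uint64_t sh_pool_create_len;
	uint64_t sh_phys_max_off;
	uint64_t sh_bof;
	uint64_t sh_eof;
	uint64_t sh_records_lost;
};

/* The condition from the failed VERIFY in the original panic. */
int history_create_invariant_holds(const struct history_phys_sketch *p)
{
	assert(p != NULL);
	return p->sh_eof == p->sh_pool_create_len;
}

/* Print the same five values that the debugging patch logs. */
void history_dump(const struct history_phys_sketch *p, FILE *out)
{
	assert(p != NULL && out != NULL);
	fprintf(out, "sh_pool_create_len = %" PRIu64 "\n", p->sh_pool_create_len);
	fprintf(out, "sh_phys_max_off = %" PRIu64 "\n", p->sh_phys_max_off);
	fprintf(out, "sh_bof = %" PRIu64 "\n", p->sh_bof);
	fprintf(out, "sh_eof = %" PRIu64 "\n", p->sh_eof);
	fprintf(out, "sh_records_lost = %" PRIu64 "\n", p->sh_records_lost);
}
```

If the values that break the condition stay the same from run to run, that suggests a deterministic cause. Values that change between runs would be more in line with bogus or corrupted data.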

@dweeezil
Contributor

I just wanted to post a note here to say that I've started actively looking into this problem. I'm occasionally able to reproduce problems similar to the original report, but my general observation is that many other forms of chaos seem to result from running ZFS atop RBD. Unfortunately, I got sidetracked while looking into this and burned a ton of time tracking down the problem described in openzfs/zfs#2010. With that out of the way, hopefully I'm back on track now.

Also, I should mention that this should likely be a ZFS issue rather than an SPL issue.

@hvenzke

hvenzke commented Dec 30, 2013

@dweeezil Tim, some of the logs I have read about ZFS + Ceph RBD suggest that the partition table ZFS uses is not supported by parted?

  1. Exactly which partition type was set before you tried to create a zpool on the Ceph RBD?
  2. Did you try a sliced (diskP2) setup instead of whole-disk (disk)?
  3. Did you try fdisk on the Ceph RBD disk, using type "bf"?

As far as I know, "bf" is the default for ZFS; someone is welcome to correct me. :-)

@dweeezil
Contributor

I'm still trying to get a grip on the actual problem. So far, I'm fairly certain the problem is not simply that the rbd block device behaves differently than other block devices do.

@remsnet For my current testbed, I'm generally creating my ZFS pool on a single pre-created partition on the rbd device (actually, my preferred testbed is to dd a known good pool onto my rbd and test from there). I'm hoping to narrow down the problem a bit more within the next day or so once I get more time to look at it.

The failures I'm seeing when performing normal filesystem operations are many and varied. I'm concerned that zfs+rbd is exceeding Linux's kernel stack limit, but I've not been able to prove it. I do plan on building a 16K stack kernel as part of my further testing to try to rule it out. Using debugfs' stack_trace feature has been very iffy, with wild pointer (NULL or close-to-NULL) dereferences typically occurring in the ftrace_call() function itself. I also plan on doing some instrumenting of rbd by itself to get a handle on its "base" stack utilization. The failures I'm seeing are typical of those you'd see when memory (the stack in particular) is overwritten.
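
One general way to estimate stack utilization is to pre-fill the stack region with a known pattern and later count how much of the pattern is still intact. The sketch below shows that idea in plain userspace C, with the stack modeled as a byte buffer that is consumed from the high end downward. `STACK_POISON`, `stack_fill` and `stack_high_water` are names made up for this illustration; they are not kernel APIs and this is not how the measurements in this thread were taken.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_POISON 0xA5	/* arbitrary fill byte chosen for this sketch */

/* Fill the whole region with the poison byte before it is used. */
void stack_fill(unsigned char *region, size_t size)
{
	assert(region != NULL);
	memset(region, STACK_POISON, size);
}

/*
 * Model of a downward-growing stack: usage starts at the high end
 * (region + size) and moves toward region[0]. Scan up from the low end
 * and count intact poison bytes; everything from the first overwritten
 * byte upward is treated as used. Returns the deepest usage in bytes.
 */
size_t stack_high_water(const unsigned char *region, size_t size)
{
	size_t untouched = 0;

	assert(region != NULL);
	while (untouched < size && region[untouched] == STACK_POISON)
		untouched++;
	return size - untouched;
}
```

A result equal to the full region size means the pattern was overwritten all the way to the bottom, which is what an overrun would look like in this model. The estimate can come out low if the deepest bytes written happen to equal the poison value.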

I'll post more information as I get it.

@behlendorf
Contributor

@dweeezil It's great to see you looking into this. Stack overrun is certainly one possible explanation for this; I could easily believe that the ceph rbd is more stack heavy than other block devices in the kernel. As you said, rebuilding your kernel with 16k stacks would be the easiest way to check for this.

sashalevin pushed a commit to sashalevin/linux-stable-security that referenced this issue Apr 29, 2016
commit 178eda2 upstream.

It has been reported that using ZFSonLinux on rbd will result in memory
corruption. The bug report can be found here:

openzfs/spl#241
http://tracker.ceph.com/issues/7790

The reason is that ZFS will send pages with page_count 0 into rbd, which in
turn sends them to tcp_sendpage. However, tcp_sendpage cannot deal with
page_count 0, as it will do get_page and put_page, and erroneously free the
page.

This type of issue has been noted before, and handled in iscsi, drbd,
etc. So, rbd should also handle this. This fix addresses this issue by falling
back to the slower sendmsg when page_count 0 is detected.

Cc: Sage Weil <[email protected]>
Cc: Yehuda Sadeh <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Signed-off-by: Sasha Levin <[email protected]>
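
The failure mode described in the commit message above can be modeled in a few lines of userspace C. This is a toy sketch only: `toy_page` and the `toy_*` functions are invented for illustration and are not the kernel's `struct page`, `get_page`, `put_page`, or the actual ceph messenger change. It shows why a get/put pair on a page whose count starts at 0 ends up freeing it, and how a guard that falls back to a copying path avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page: just a reference count and a "freed" flag. */
struct toy_page {
	int  refcount;
	bool freed;
};

void toy_get_page(struct toy_page *p)
{
	assert(p != NULL);
	p->refcount++;
}

/* Dropping the count to zero frees the page, as a refcount normally does. */
void toy_put_page(struct toy_page *p)
{
	assert(p != NULL);
	if (--p->refcount == 0)
		p->freed = true;
}

/* Zero-copy style path: pin the page while "sending", then unpin it. */
void toy_send_by_reference(struct toy_page *p)
{
	toy_get_page(p);
	/* ... the page would be handed to the network stack here ... */
	toy_put_page(p);
}

/* Copying path: reads the data but never touches the reference count. */
void toy_send_by_copy(const struct toy_page *p)
{
	assert(p != NULL);
}

/* Guarded send: take the by-reference path only when the count is >= 1. */
void toy_send_guarded(struct toy_page *p)
{
	assert(p != NULL);
	if (p->refcount >= 1)
		toy_send_by_reference(p);
	else
		toy_send_by_copy(p);
}
```

With a count of 0, `toy_send_by_reference` takes the count 0 -> 1 -> 0 and marks the page freed even though its real owner is still using it. `toy_send_guarded` leaves such a page untouched and still uses the faster path for normally counted pages.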
ChronoMonochrome added a commit to ChronoMonochrome/Chrono_Kernel-1 that referenced this issue Feb 11, 2018
commit 7acab58bc1ac49ba542d92bfdfd5c3047f1d2a59
Merge: 4340f8095da0 c2f7eb8029e2
Author: Shilin Victor <[email protected]>
Date:   Sat Feb 10 21:44:03 2018 +0300

    Merge tag 'v3.10.42' into linux-3.10.y

    This is the 3.10.42 stable release

commit c2f7eb8029e23c4f5445340d8fc0d05367538e6d
Author: Greg Kroah-Hartman <[email protected]>
Date:   Sat Jun 7 13:48:31 2014 -0700

    Linux 3.10.42

commit efccdcdb63a7f7cc7cc1816f0d5e2524eb084c72
Author: Thomas Gleixner <[email protected]>
Date:   Tue Jun 3 12:27:08 2014 +0000

    futex: Make lookup_pi_state more robust

    commit 54a217887a7b658e2650c3feff22756ab80c7339 upstream.

    The current implementation of lookup_pi_state has ambiguous handling of
    the TID value 0 in the user space futex.  We can get into the kernel
    even if the TID value is 0, because either there is a stale waiters bit
    or the owner died bit is set or we are called from the requeue_pi path
    or from user space just for fun.

    The current code avoids an explicit sanity check for pid = 0 when
    kernel internal state (waiters) is found for the user space
    address.  This can lead to state leakage and worse under some
    circumstances.

    Handle the cases explicitly:

           Waiter | pi_state | pi->owner | uTID      | uODIED | ?

      [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
      [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid

      [3]  Found  | NULL     | --        | Any       | 0/1    | Invalid

      [4]  Found  | Found    | NULL      | 0         | 1      | Valid
      [5]  Found  | Found    | NULL      | >0        | 1      | Invalid

      [6]  Found  | Found    | task      | 0         | 1      | Valid

      [7]  Found  | Found    | NULL      | Any       | 0      | Invalid

      [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
      [9]  Found  | Found    | task      | 0         | 0      | Invalid
      [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid

     [1] Indicates that the kernel can acquire the futex atomically. We
         came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.

     [2] Valid, if TID does not belong to a kernel thread. If no matching
         thread is found then it indicates that the owner TID has died.

     [3] Invalid. The waiter is queued on a non PI futex

     [4] Valid state after exit_robust_list(), which sets the user space
         value to FUTEX_WAITERS | FUTEX_OWNER_DIED.

     [5] The user space value got manipulated between exit_robust_list()
         and exit_pi_state_list()

     [6] Valid state after exit_pi_state_list() which sets the new owner in
         the pi_state but cannot access the user space value.

     [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.

     [8] Owner and user space value match

     [9] There is no transient state which sets the user space TID to 0
         except exit_robust_list(), but this is indicated by the
         FUTEX_OWNER_DIED bit. See [4]

    [10] There is no transient state which leaves owner and user space
         TID out of sync.

    Signed-off-by: Thomas Gleixner <[email protected]>
    Cc: Kees Cook <[email protected]>
    Cc: Will Drewry <[email protected]>
    Cc: Darren Hart <[email protected]>
    Signed-off-by: Linus Torvalds <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
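
The state table in the commit message above can be read as a pure decision function. The following is a minimal user-space sketch of that reading, not the kernel's lookup_pi_state(); the struct `pi_lookup` and the function `pi_lookup_is_valid` are hypothetical names introduced only for illustration, and case [2] ignores the "TID does not belong to a kernel thread" condition, which cannot be modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the table's input columns (not kernel types). */
struct pi_lookup {
    bool waiter_found;    /* Waiter column: a kernel waiter exists          */
    bool pi_state_found;  /* pi_state column: the waiter has a pi_state     */
    uint32_t owner_tid;   /* pi->owner column: owner TID, 0 means NULL      */
    uint32_t utid;        /* uTID column: TID bits of the user space value  */
    bool uodied;          /* uODIED column: FUTEX_OWNER_DIED bit            */
};

/* Returns true for the rows marked Valid in the table, false otherwise. */
static bool pi_lookup_is_valid(const struct pi_lookup *s)
{
    if (!s->waiter_found)
        return true;                  /* [1], [2] (kernel-thread check omitted) */
    if (!s->pi_state_found)
        return false;                 /* [3] waiter on a non PI futex */

    if (s->owner_tid == 0) {          /* pi_state->owner is NULL */
        if (!s->uodied)
            return false;             /* [7] NULL owner requires OWNER_DIED */
        return s->utid == 0;          /* [4] valid, [5] invalid */
    }

    if (s->utid == 0)
        return s->uodied;             /* [6] valid, [9] invalid */

    return s->utid == s->owner_tid;   /* [8] valid, [10] invalid */
}
```

Checking `utid == 0` before comparing against the owner keeps row [6] from being swallowed by row [10], since 0 never equals a live task's TID.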

commit 9ad5dabd87e8dd5506529e12e4e8c7b25fb88d7a
Author: Thomas Gleixner <[email protected]>
Date:   Tue Jun 3 12:27:07 2014 +0000

    futex: Always cleanup owner tid in unlock_pi

    commit 13fbca4c6ecd96ec1a1cfa2e4f2ce191fe928a5e upstream.

    If the owner died bit is set at futex_unlock_pi, we currently do not
    clean up the user space futex.  So the owner TID of the current owner
    (the unlocker) persists.  That's observable inconsistent state,
    especially when the ownership of the pi state got transferred.

    Clean it up unconditionally.

    Signed-off-by: Thomas Gleixner <[email protected]>
    Cc: Kees Cook <[email protected]>
    Cc: Will Drewry <[email protected]>
    Cc: Darren Hart <[email protected]>
    Signed-off-by: Linus Torvalds <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 63d6ad59dd43f44249150aa8c72eeb01bbe0a599
Author: Thomas Gleixner <[email protected]>
Date:   Tue Jun 3 12:27:06 2014 +0000

    futex: Validate atomic acquisition in futex_lock_pi_atomic()

    commit b3eaa9fc5cd0a4d74b18f6b8dc617aeaf1873270 upstream.

    We need to protect the atomic acquisition in the kernel against rogue
    user space which sets the user space futex to 0, so the kernel side
    acquisition succeeds while there is existing state in the kernel
    associated to the real owner.

    Verify whether the futex has waiters associated with kernel state.  If
    it has, return -EINVAL.  The state is corrupted already, so no point in
    cleaning it up.  Subsequent calls will fail as well.  Not our problem.

    [ tglx: Use futex_top_waiter() and explain why we do not need to try
            restoring the already corrupted user space state. ]

    Signed-off-by: Darren Hart <[email protected]>
    Cc: Kees Cook <[email protected]>
    Cc: Will Drewry <[email protected]>
    Signed-off-by: Thomas Gleixner <[email protected]>
    Signed-off-by: Linus Torvalds <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit b58623fb64ff0454ec20bce7a02275a20c23086d
Author: Thomas Gleixner <[email protected]>
Date:   Tue Jun 3 12:27:06 2014 +0000

    futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)

    commit e9c243a5a6de0be8e584c604d353412584b592f8 upstream.

    If uaddr == uaddr2, then we have broken the rule of only requeueing from
    a non-pi futex to a pi futex with this call.  If we attempt this, then
    dangling pointers may be left for rt_waiter resulting in an exploitable
    condition.

    This change brings futex_requeue() in line with futex_wait_requeue_pi()
    which performs the same check as per commit 6f7b0a2a5c0f ("futex: Forbid
    uaddr == uaddr2 in futex_wait_requeue_pi()")

    [ tglx: Compare the resulting keys as well, as uaddrs might be
            different depending on the mapping ]

    Fixes CVE-2014-3153.

    Reported-by: Pinkie Pie
    Signed-off-by: Will Drewry <[email protected]>
    Signed-off-by: Kees Cook <[email protected]>
    Signed-off-by: Thomas Gleixner <[email protected]>
    Reviewed-by: Darren Hart <[email protected]>
    Signed-off-by: Linus Torvalds <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
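
The check described in the commit message above can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel's futex_requeue(): `struct fake_futex_key`, `keys_match` and `check_requeue_pi` are hypothetical names, and the key stands in for whatever the kernel derives from an address and its mapping.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a futex key: two different user addresses can
 * resolve to the same key when they map the same underlying object. */
struct fake_futex_key {
    uintptr_t object;
    unsigned long offset;
};

static bool keys_match(const struct fake_futex_key *a,
                       const struct fake_futex_key *b)
{
    return a->object == b->object && a->offset == b->offset;
}

/* Returns 0 if the requeue may proceed, -EINVAL if source and target are
 * the same futex while requeue_pi is requested. */
static int check_requeue_pi(const void *uaddr1, const void *uaddr2,
                            const struct fake_futex_key *key1,
                            const struct fake_futex_key *key2,
                            bool requeue_pi)
{
    if (!requeue_pi)
        return 0;               /* this sketch only models the PI case */
    if (uaddr1 == uaddr2)
        return -EINVAL;         /* same user address */
    if (keys_match(key1, key2))
        return -EINVAL;         /* different addresses, same futex */
    return 0;
}
```

The second comparison reflects the tglx note: comparing addresses alone is not enough, because two distinct addresses may refer to the same futex depending on the mapping.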

commit 4237cc8ef3fc3916c337423cbaab818890e628c8
Author: Stanislaw Gruszka <[email protected]>
Date:   Wed Feb 19 13:15:17 2014 +0100

    ath9k: protect tid->sched check

    [ Upstream commit 21f8aaee0c62708654988ce092838aa7df4d25d8 ]

    We check tid->sched without a lock taken on ath_tx_aggr_sleep(). That
    is a race condition which can result in doing list_del(&tid->list) twice
    (second time with poisoned list node) and cause a crash like the one shown below:

    [424271.637220] BUG: unable to handle kernel paging request at 00100104
    [424271.637328] IP: [<f90fc072>] ath_tx_aggr_sleep+0x62/0xe0 [ath9k]
    ...
    [424271.639953] Call Trace:
    [424271.639998]  [<f90f6900>] ? ath9k_get_survey+0x110/0x110 [ath9k]
    [424271.640083]  [<f90f6942>] ath9k_sta_notify+0x42/0x50 [ath9k]
    [424271.640177]  [<f809cfef>] sta_ps_start+0x8f/0x1c0 [mac80211]
    [424271.640258]  [<c10f730e>] ? free_compound_page+0x2e/0x40
    [424271.640346]  [<f809e915>] ieee80211_rx_handlers+0x9d5/0x2340 [mac80211]
    [424271.640437]  [<c112f048>] ? kmem_cache_free+0x1d8/0x1f0
    [424271.640510]  [<c1345a84>] ? kfree_skbmem+0x34/0x90
    [424271.640578]  [<c10fc23c>] ? put_page+0x2c/0x40
    [424271.640640]  [<c1345a84>] ? kfree_skbmem+0x34/0x90
    [424271.640706]  [<c1345a84>] ? kfree_skbmem+0x34/0x90
    [424271.640787]  [<f809dde3>] ? ieee80211_rx_handlers_result+0x73/0x1d0 [mac80211]
    [424271.640897]  [<f80a07a0>] ieee80211_prepare_and_rx_handle+0x520/0xad0 [mac80211]
    [424271.641009]  [<f809e22d>] ? ieee80211_rx_handlers+0x2ed/0x2340 [mac80211]
    [424271.641104]  [<c13846ce>] ? ip_output+0x7e/0xd0
    [424271.641182]  [<f80a1057>] ieee80211_rx+0x307/0x7c0 [mac80211]
    [424271.641266]  [<f90fa6ee>] ath_rx_tasklet+0x88e/0xf70 [ath9k]
    [424271.641358]  [<f80a0f2c>] ? ieee80211_rx+0x1dc/0x7c0 [mac80211]
    [424271.641445]  [<f90f82db>] ath9k_tasklet+0xcb/0x130 [ath9k]

    Bug report:
    https://bugzilla.kernel.org/show_bug.cgi?id=70551

    Reported-and-tested-by: Max Sydorenko <[email protected]>
    Signed-off-by: Stanislaw Gruszka <[email protected]>
    Signed-off-by: John W. Linville <[email protected]>
    [ xl: backported to 3.10: adjusted context ]
    Signed-off-by: Xiangyu Lu <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 3c3fa08f4c7770ad35bb10755fb9b1c80e34dee4
Author: Guennadi Liakhovetski <[email protected]>
Date:   Sat Apr 26 12:51:31 2014 -0300

    media: V4L2: fix VIDIOC_CREATE_BUFS in 64- / 32-bit compatibility mode

    commit 97d9d23dda6f37d90aefeec4ed619d52df525382 upstream.

    If a struct contains 64-bit fields, it is aligned on 64-bit boundaries
    within containing structs in 64-bit compilations. This is the case with
    struct v4l2_window, which contains pointers and is embedded into struct
    v4l2_format, and that one is embedded into struct v4l2_create_buffers.
    Unlike some other structs, used as a part of the kernel ABI as ioctl()
    arguments, that are packed, these structs aren't packed. This isn't a
    problem per se, but the ioctl-compat code for VIDIOC_CREATE_BUFS contains
    a bug, that triggers in such 64-bit builds. That code wrongly assumes,
    that in struct v4l2_create_buffers, struct v4l2_format immediately follows
    the __u32 memory field, which in fact isn't the case. This bug wasn't
    visible until now, because until recently hardly any applications used
    this ioctl() and mostly embedded 32-bit only drivers implemented it. This
    is changing now with addition of this ioctl() to some USB drivers, e.g.
    UVC. This patch fixes the bug by copying parts of struct
    v4l2_create_buffers separately.

    Signed-off-by: Guennadi Liakhovetski <[email protected]>
    Acked-by: Laurent Pinchart <[email protected]>
    Signed-off-by: Mauro Carvalho Chehab <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
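
The alignment effect described in the commit message above can be reproduced with a small user-space sketch. The structs below are hypothetical stand-ins that only mimic the shape of the V4L2 ones (three 32-bit fields followed by an embedded struct holding a pointer); they are not the real definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a struct that contains a pointer and is therefore aligned
 * to the pointer size (8 bytes on typical 64-bit builds). */
struct inner_with_pointer {
    uint32_t type;
    void *clips;
};

/* Stand-in for the containing ioctl argument: three 32-bit fields and
 * then the embedded, unpacked struct. */
struct outer {
    uint32_t index;
    uint32_t count;
    uint32_t memory;
    struct inner_with_pointer format;
};

/* Offset assumed by code that believes `format` directly follows `memory`. */
static size_t naive_format_offset(void)
{
    return offsetof(struct outer, memory) + sizeof(uint32_t);
}

/* Offset the compiler actually uses, including any alignment padding. */
static size_t real_format_offset(void)
{
    return offsetof(struct outer, format);
}
```

On a build with 8-byte pointer alignment the naive offset is 12 while the real one is 16, so code that copies from the naive offset reads 4 bytes of padding first. On a 32-bit build the two typically coincide, which is consistent with the bug going unnoticed there.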

commit 2e008074b2f19ba550393e3a33334fd1dd5da082
Author: Guennadi Liakhovetski <[email protected]>
Date:   Mon Apr 14 10:49:34 2014 -0300

    media: V4L2: ov7670: fix a wrong index, potentially Oopsing the kernel from user-space

    commit cfece5857ca51d1dcdb157017aba226f594e9dcf upstream.

    Commit 75e2bdad8901a0b599e01a96229be922eef1e488 "ov7670: allow
    configuration of image size, clock speed, and I/O method" uses a wrong
    index to iterate an array. Apart from being wrong, it also uses an
    unchecked value from user-space, which can cause access to unmapped
    memory in the kernel, triggered by a normal desktop user with rights to
    use V4L2 devices.

    Signed-off-by: Guennadi Liakhovetski <[email protected]>
    Acked-by: Jonathan Corbet <[email protected]>
    Signed-off-by: Mauro Carvalho Chehab <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 4f792a2972e6f320484abfc940f978177131facc
Author: Antti Palosaari <[email protected]>
Date:   Thu Apr 10 21:18:16 2014 -0300

    media: fc2580: fix tuning failure on 32-bit arch

    commit 8845cc6415ec28ef8d57b3fb81c75ef9bce69c5f upstream.

    There were some frequency calculation overflows which caused tuning
    failure on 32-bit architectures. Use 64-bit numbers where needed in
    order to avoid calculation overflows.

    Thanks to the Finnish person, who asked to remain anonymous, for
    reporting, testing and suggesting the fix.

    Signed-off-by: Antti Palosaari <[email protected]>
    Signed-off-by: Mauro Carvalho Chehab <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
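
The kind of overflow the commit message above refers to can be shown with a short sketch. The function names and the numbers used in the note below are made up for illustration and are not taken from the fc2580 driver.

```c
#include <stdint.h>

/* 32-bit arithmetic: the product wraps modulo 2^32 when it does not fit. */
static uint32_t scale_freq_32(uint32_t freq_hz, uint32_t multiplier)
{
    return freq_hz * multiplier;
}

/* 64-bit arithmetic: widening one operand first keeps the full product. */
static uint64_t scale_freq_64(uint32_t freq_hz, uint32_t multiplier)
{
    return (uint64_t)freq_hz * multiplier;
}
```

For example, scaling 600000000 by 12 gives 7200000000 in 64-bit arithmetic, but 2905032704 when computed in 32 bits, because 7200000000 exceeds 2^32.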

commit 0a4e3565df0c91bf0f7a68dee09e45c9d9b2d360
Author: Alex Williamson <[email protected]>
Date:   Tue Apr 22 10:08:40 2014 -0600

    iommu/amd: Fix interrupt remapping for aliased devices

    commit e028a9e6b8a637af09ac4114083280df4a7045f1 upstream.

    An apparent cut and paste error prevents the correct flags from being
    set on the alias device resulting in MSI on conventional PCI devices
    failing to work.  This also produces error events from the IOMMU like:

    AMD-Vi: Event logged [INVALID_DEVICE_REQUEST device=00:14.4 address=0x000000fdf8000000 flags=0x0a00]

    Where 14.4 is a PCIe-to-PCI bridge with a device behind it trying to
    use MSI interrupts.

    Signed-off-by: Alex Williamson <[email protected]>
    Signed-off-by: Joerg Roedel <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit a757a4e215574f2c92fc990275fa5e02159771e1
Author: Chunwei Chen <[email protected]>
Date:   Wed Apr 23 12:35:09 2014 +0800

    libceph: fix corruption when using page_count 0 page in rbd

    commit 178eda29ca721842f2146378e73d43e0044c4166 upstream.

    It has been reported that using ZFSonLinux on rbd will result in memory
    corruption. The bug report can be found here:

    https://github.com/zfsonlinux/spl/issues/241
    http://tracker.ceph.com/issues/7790

    The reason is that ZFS will send pages with page_count 0 into rbd, which in
    turn sends them to tcp_sendpage. However, tcp_sendpage cannot deal with
    page_count 0, as it will do get_page and put_page, and erroneously free the
    page.

    This type of issue has been noted before, and handled in iscsi, drbd,
    etc. So, rbd should also handle this. This fix addresses the issue by falling
    back to the slower sendmsg when a page with page_count 0 is detected.

    Cc: Sage Weil <[email protected]>
    Cc: Yehuda Sadeh <[email protected]>
    Signed-off-by: Chunwei Chen <[email protected]>
    Reviewed-by: Ilya Dryomov <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
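
The failure mode and the fallback described in the commit message above can be modelled in user space. This is a sketch under stated assumptions, not the libceph code: `struct fake_page` and all function names are hypothetical, and the reference count is a plain integer standing in for page_count.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the reference count matters. */
struct fake_page {
    int refcount;
};

/* Zero-copy path: takes and drops a reference around the send, as
 * get_page()/put_page() would. Returns true if the final drop brings the
 * count to zero, i.e. the page would be freed behind its owner's back. */
static bool sendpage_path(struct fake_page *page)
{
    page->refcount++;               /* models get_page() */
    page->refcount--;               /* models put_page() */
    return page->refcount == 0;
}

/* Copying path: data is copied out, the reference count is never touched. */
static bool sendmsg_path(struct fake_page *page)
{
    (void)page;
    return false;
}

/* The fallback: only use the zero-copy path for pages that already hold
 * at least one reference. Returns true if the page was wrongly freed. */
static bool send_page(struct fake_page *page)
{
    if (page->refcount >= 1)
        return sendpage_path(page);
    return sendmsg_path(page);
}
```

A page that starts with a count of 0 goes 0, 1, 0 on the zero-copy path and is treated as free afterwards, which is the corruption reported in this issue; routing such pages through the copying path avoids touching the count at all.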

commit 534cc5572c710370d2bfe4e6b382950fd52c2c00
Author: Guenter Roeck <[email protected]>
Date:   Thu May 15 09:33:42 2014 -0700

    powerpc: Fix 64 bit builds with binutils 2.24

    commit 7998eb3dc700aaf499f93f50b3d77da834ef9e1d upstream.

    With binutils 2.24, various 64 bit builds fail with relocation errors
    such as

    arch/powerpc/kernel/built-in.o: In function `exc_debug_crit_book3e':
            (.text+0x165ee): relocation truncated to fit: R_PPC64_ADDR16_HI
            against symbol `interrupt_base_book3e' defined in .text section
            in arch/powerpc/kernel/built-in.o
    arch/powerpc/kernel/built-in.o: In function `exc_debug_crit_book3e':
            (.text+0x16602): relocation truncated to fit: R_PPC64_ADDR16_HI
            against symbol `interrupt_end_book3e' defined in .text section
            in arch/powerpc/kernel/built-in.o

    The assembler maintainer says:

     I changed the ABI, something that had to be done but unfortunately
     happens to break the booke kernel code.  When building up a 64-bit
     value with lis, ori, shl, oris, ori or similar sequences, you now
     should use @high and @higha in place of @h and @ha.  @h and @ha
     (and their associated relocs R_PPC64_ADDR16_HI and R_PPC64_ADDR16_HA)
     now report overflow if the value is out of 32-bit signed range.
     ie. @h and @ha assume you're building a 32-bit value. This is needed
     to report out-of-range -mcmodel=medium toc pointer offsets in @toc@h
     and @toc@ha expressions, and for consistency I did the same for all
     other @h and @ha relocs.

    Replacing @h with @high in one strategic location fixes the relocation
    errors. This has to be done conditionally since the assembler either
    supports @h or @high but not both.

    Signed-off-by: Guenter Roeck <[email protected]>
    Signed-off-by: Benjamin Herrenschmidt <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit c99612d30ffcdf6ff41281a84e8df2a56c8b7d20
Author: Harald Freudenberger <[email protected]>
Date:   Wed May 7 16:51:29 2014 +0200

    crypto: s390 - fix aes,des ctr mode concurrency finding.

    commit 3901c1124ec5099254a9396085f7798153a7293f upstream.

    An additional testcase found an issue with the last
    series of patches applied: the fallback solution may
    not save the iv value after operation. This very small
    fix just makes sure the iv is copied back to the
    walk/desc struct.

    Signed-off-by: Harald Freudenberger <[email protected]>
    Signed-off-by: Herbert Xu <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit d1ae1920b53e00849397c3f6f63dace219de46a0
Author: Horia Geanta <[email protected]>
Date:   Fri Apr 18 13:01:42 2014 +0300

    crypto: caam - add allocation failure handling in SPRINTFCAT macro

    commit 27c5fb7a84242b66bf1e0b2fe6bf40d19bcc5c04 upstream.

    GFP_ATOMIC memory allocation could fail.
    In this case, avoid NULL pointer dereference and notify user.

    Cc: Kim Phillips <[email protected]>
    Signed-off-by: Horia Geanta <[email protected]>
    Signed-off-by: Herbert Xu <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit a0d3102153fc5d9cf8bb49b62ac9655f9f63b493
Author: Olof Johansson <[email protected]>
Date:   Fri Apr 11 15:19:41 2014 -0700

    i2c: s3c2410: resume race fix

    commit ce78cc071f5f541480e381cc0241d37590041a9d upstream.

    Don't unmark the device as suspended until after it's been re-setup.

    The main race would be w.r.t. an i2c driver that gets resumed at the same
    time (asynchronously), that is allowed to do a transfer since suspended
    is set to 0 before reinit, but really should have seen the -EIO return
    instead.

    Signed-off-by: Olof Johansson <[email protected]>
    Signed-off-by: Doug Anderson <[email protected]>
    Acked-by: Kukjin Kim <[email protected]>
    Signed-off-by: Wolfram Sang <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit a6b6cde1481125b886f726756a3364be0fb9f93e
Author: Du, Wenkai <[email protected]>
Date:   Thu Apr 10 23:03:19 2014 +0000

    i2c: designware: Mask all interrupts during i2c controller enable

    commit 47bb27e78867997040a228328f2a631c3c7f2c82 upstream.

    There have been "i2c_designware 80860F41:00: controller timed out" errors
    on a number of Baytrail platforms. The issue is caused by an incorrect value in
    the Interrupt Mask Register (DW_IC_INTR_MASK) when the i2c core is being enabled.
    This causes the call to __i2c_dw_enable() to immediately start the transfer, which
    leads to a timeout. There are 3 failure modes observed:

    1. Failure in S3 to S0 resume path

    The default value after reset for DW_IC_INTR_MASK is 0x8ff. When we start
    the first transaction after resuming from system sleep, TX_EMPTY interrupt
    is already unmasked because of the hardware default.

    2. Failure in normal operational path

    This failure happens rarely and is hard to reproduce. Debug trace showed that
    DW_IC_INTR_MASK had value of 0x254 when failure occurred, which meant
    TX_EMPTY was unmasked.

    3. Failure in S0 to S3 suspend path

    This failure also happens rarely and is hard to reproduce. Adding debug trace
    that read DW_IC_INTR_MASK made this failure not reproducible. But from ISR
    call trace we could conclude TX_EMPTY was unmasked when problem occurred.

    The patch masks all interrupts before the controller is enabled to resolve the
    faulty DW_IC_INTR_MASK conditions.

    Signed-off-by: Wenkai Du <[email protected]>
    Acked-by: Mika Westerberg <[email protected]>
    [wsa: improved the comment and removed typo in commit msg]
    Signed-off-by: Wolfram Sang <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 670a6ed522e0fa814d74547cb95fbc4854660474
Author: Wolfram Sang <[email protected]>
Date:   Mon May 5 18:36:21 2014 +0200

    i2c: rcar: bail out on zero length transfers

    commit d7653964c590ba846aa11a8f6edf409773cbc492 upstream.

    This hardware does not support zero length transfers. Instead, the
    driver currently does a one-byte (random) transfer, with undefined
    results for the slaves. We now bail out.

    Signed-off-by: Wolfram Sang <[email protected]>
    Signed-off-by: Wolfram Sang <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit feed5a88f45f26e4fda34838b4929536d4a8e775
Author: Hans de Goede <[email protected]>
Date:   Mon May 5 11:38:09 2014 +0200

    ACPI / blacklist: Add dmi_enable_osi_linux quirk for Asus EEE PC 1015PX

    commit f6e6e1b9fee88c90586787b71dc49bb3ce62bb89 upstream.

    Without this, this EEE PC exports a non-working WMI interface; with it, it
    exports a working "good old" eeepc_laptop interface, fixing brightness control
    not working as well as rfkill being stuck in a permanent wireless blocked
    state.

    This is not an ideal way to fix this, but various attempts to fix this
    otherwise have failed, see:

    References: https://bugzilla.redhat.com/show_bug.cgi?id=1067181
    Reported-and-tested-by: [email protected]
    Signed-off-by: Hans de Goede <[email protected]>
    Signed-off-by: Rafael J. Wysocki <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit a40aac07285bdf77ed67af1423676ff5548ef51b
Author: Levente Kurusa <[email protected]>
Date:   Tue May 6 15:57:48 2014 +0200

    libata: clean up ZPODD when a port is detached

    commit a6f9bf4d2f965b862b95213303d154e02957eed8 upstream.

    When a ZPODD device is unbound via sysfs, the ACPI notify handler
    is not removed. This causes panics as observed in Bug #74601. The
    panic only happens when the wake comes from outside the kernel
    (i.e. inserting media or pressing a button). Add a loop to
    ata_port_detach that walks the port's devices and, for each one
    with zpodd enabled, calls zpodd_exit.

    Reviewed-by: Aaron Lu <[email protected]>
    Signed-off-by: Levente Kurusa <[email protected]>
    Signed-off-by: Tejun Heo <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 019c8ec9e3c6a3616f10be9eda359f1927d1f8b1
Author: Mikulas Patocka <[email protected]>
Date:   Thu Feb 20 18:01:01 2014 -0500

    dm crypt: fix cpu hotplug crash by removing per-cpu structure

    commit 610f2de3559c383caf8fbbf91e9968102dff7ca0 upstream.

    The DM crypt target used per-cpu structures to hold pointers to an
    ablkcipher_request structure.  The code assumed that the work item keeps
    executing on a single CPU, so it didn't use synchronization when
    accessing this structure.

    If a CPU is disabled by writing 0 to /sys/devices/system/cpu/cpu*/online,
    the work item could be moved to another CPU.  This causes dm-crypt
    crashes, like the following, because the code starts using an incorrect
    ablkcipher_request:

     smpboot: CPU 7 is now offline
     BUG: unable to handle kernel NULL pointer dereference at 0000000000000130
     IP: [<ffffffffa1862b3d>] crypt_convert+0x12d/0x3c0 [dm_crypt]
     ...
     Call Trace:
      [<ffffffffa1864415>] ? kcryptd_crypt+0x305/0x470 [dm_crypt]
      [<ffffffff81062060>] ? finish_task_switch+0x40/0xc0
      [<ffffffff81052a28>] ? process_one_work+0x168/0x470
      [<ffffffff8105366b>] ? worker_thread+0x10b/0x390
      [<ffffffff81053560>] ? manage_workers.isra.26+0x290/0x290
      [<ffffffff81058d9f>] ? kthread+0xaf/0xc0
      [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
      [<ffffffff813464ac>] ? ret_from_fork+0x7c/0xb0
      [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120

    Fix this bug by removing the per-cpu definition.  The structure
    ablkcipher_request is accessed via a pointer from convert_context.
    Consequently, if the work item is rescheduled to a different CPU, the
    thread still uses the same ablkcipher_request.

    This change may undermine performance improvements intended by commit
    c0297721 ("dm crypt: scale to multiple cpus") on select hardware.  In
    practice no performance difference was observed on recent hardware.  But
    regardless, correctness is more important than performance.

    Signed-off-by: Mikulas Patocka <[email protected]>
    Signed-off-by: Mike Snitzer <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit aece4fa7368debd14ac07ebaf569587ff02cc596
Author: Michael Neuling <[email protected]>
Date:   Mon Mar 3 14:21:40 2014 +1100

    powerpc/tm: Fix crash when forking inside a transaction

    commit 621b5060e823301d0cba4cb52a7ee3491922d291 upstream.

    When we fork/clone we currently don't copy any of the TM state to the new
    thread.  This results in a TM bad thing (program check) when the new process is
    switched in as the kernel does a tmrechkpt with TEXASR FS not set.  Also, since
    R1 is from userspace, we trigger the bad kernel stack pointer detection.  So we
    end up with something like this:

       Bad kernel stack pointer 0 at c0000000000404fc
       cpu 0x2: Vector: 700 (Program Check) at [c00000003ffefd40]
           pc: c0000000000404fc: restore_gprs+0xc0/0x148
           lr: 0000000000000000
           sp: 0
          msr: 9000000100201030
         current = 0xc000001dd1417c30
         paca    = 0xc00000000fe00800   softe: 0        irq_happened: 0x01
           pid   = 0, comm = swapper/2
       WARNING: exception is not recoverable, can't continue

    The below fixes this by flushing the TM state before we copy the task_struct to
    the clone.  To do this we go through the tmreclaim path, which removes the
    checkpointed registers from the CPU and transitions the CPU out of TM suspend
    mode.  Hence we need to call tmrechkpt after to restore the checkpointed state
    and the TM mode for the current task.

    Making this fail from userspace is simple:
            tbegin
            li      r0, 2
            sc
            <boom>

    Kudos to Adhemerval Zanella Neto for finding this.

    Signed-off-by: Michael Neuling <[email protected]>
    cc: Adhemerval Zanella Neto <[email protected]>
    Signed-off-by: Benjamin Herrenschmidt <[email protected]>
    [Backported to 3.10: context adjust]
    Signed-off-by: Xue Liu <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit c9d6d5c009e96eb257d36b5c43d9c2d94f02cbf8
Author: Andy Grover <[email protected]>
Date:   Wed May 14 15:48:06 2014 -0700

    target: Don't allow setting WC emulation if device doesn't support

    commit 07b8dae38b09bcfede7e726f172e39b5ce8390d9 upstream.

    Just like for pSCSI, if the transport sets get_write_cache, then it is
    not valid to enable write cache emulation for it. Return an error.

    see https://bugzilla.redhat.com/show_bug.cgi?id=1082675

    Reviewed-by: Chris Leech <[email protected]>
    Signed-off-by: Andy Grover <[email protected]>
    Signed-off-by: Nicholas Bellinger <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 8a2629ad0ba7902262df7ae980265d8f93e2dfc2
Author: Sagi Grimberg <[email protected]>
Date:   Tue Apr 29 13:13:45 2014 +0300

    Target/iser: Fix iscsit_accept_np and rdma_cm racy flow

    commit 531b7bf4bd795d9a09eac92504322a472c010bc8 upstream.

    RDMA CM and iSCSI target flows are asynchronous and completely
    uncorrelated. Relying on the fact that iscsi_accept_np will be called
    after CM connection request event and will wait for it is a mistake.

    When logging in to a few targets, this flow is racy and
    unpredictable; with parallel logins to dozens of targets, it
    races and hangs every time.

    The correct synchronizing mechanism in this case is pending on
    a semaphore rather than a wait_for_event. We keep the pending
    interruptible for iscsi_np cleanup stage.

    (Squash patch to remove dead code into parent - nab)

    Reported-by: Slava Shwartsman <[email protected]>
    Signed-off-by: Sagi Grimberg <[email protected]>
    Signed-off-by: Nicholas Bellinger <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 5de94f8f4acfce3ff79592cb1d6c58ae6c0b420e
Author: Sagi Grimberg <[email protected]>
Date:   Tue Apr 29 13:13:44 2014 +0300

    Target/iser: Fix wrong connection requests list addition

    commit 9fe63c88b1d59f1ce054d6948ccd3096496ecedb upstream.

    The call should be list_add_tail($new, $head), not
    the other way around.

    Signed-off-by: Sagi Grimberg <[email protected]>
    Signed-off-by: Nicholas Bellinger <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
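
Why the argument order matters can be seen with a minimal user-space list. This is a sketch assuming a circular doubly linked list in the style of the kernel's list_head; `struct list_node`, `list_init` and `list_length` are hypothetical helpers written for this illustration, and this `list_add_tail` is a local reimplementation, not the kernel header.

```c
#include <stddef.h>

/* Minimal circular doubly linked list node. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* An empty list (or an unlinked node) points at itself. */
static void list_init(struct list_node *node)
{
    node->next = node;
    node->prev = node;
}

/* Insert new_node just before head, i.e. at the tail of head's list. */
static void list_add_tail(struct list_node *new_node, struct list_node *head)
{
    new_node->prev = head->prev;
    new_node->next = head;
    head->prev->next = new_node;
    head->prev = new_node;
}

/* Count the entries reachable from head, excluding head itself. */
static size_t list_length(const struct list_node *head)
{
    size_t n = 0;
    const struct list_node *pos;

    for (pos = head->next; pos != head; pos = pos->next)
        n++;
    return n;
}
```

With the correct order, adding two initialized nodes `a` and `b` to a queue head leaves the queue with two entries, `a` first. With the arguments swapped, each call splices the queue head into the new node's own list, so after the second call only `b` is reachable from the head and `a` is lost.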

commit d0e845f6565ceed9ff36b95ac68cf705b97a5844
Author: Marcel Apfelbaum <[email protected]>
Date:   Thu May 15 12:42:49 2014 -0600

    PCI: shpchp: Check bridge's secondary (not primary) bus speed

    commit 93fa9d32670f5592c8e56abc9928fc194e1e72fc upstream.

    When a new device is added below a hotplug bridge, the bridge's secondary
    bus speed and the device's bus speed must match.  The shpchp driver
    previously checked the bridge's *primary* bus speed, not the secondary bus
    speed.

    This caused hot-add errors like:

      shpchp 0000:00:03.0: Speed of bus ff and adapter 0 mismatch

    Check the secondary bus speed instead.

    [bhelgaas: changelog]
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=75251
    Fixes: 3749c51ac6c1 ("PCI: Make current and maximum bus speeds part of the PCI core")
    Signed-off-by: Marcel Apfelbaum <[email protected]>
    Signed-off-by: Bjorn Helgaas <[email protected]>
    Acked-by: Michael S. Tsirkin <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 55d9b08514ede9334e443337e9bf6181a3f8b114
Author: Arnd Bergmann <[email protected]>
Date:   Wed Apr 23 14:49:17 2014 +0200

    genirq: Provide irq_force_affinity fallback for non-SMP

    commit 4c88d7f9b0d5fb0588c3386be62115cc2eaa8f9f upstream.

    Patch 01f8fa4f01d "genirq: Allow forcing cpu affinity of interrupts" added
    an irq_force_affinity() function, and 30ccf03b4a6 "clocksource: Exynos_mct:
    Use irq_force_affinity() in cpu bringup" subsequently uses it. However, the
    driver can be used with CONFIG_SMP disabled, but the function declaration
    is only available for CONFIG_SMP, leading to this build error:

    drivers/clocksource/exynos_mct.c:431:3: error: implicit declaration of function 'irq_force_affinity' [-Werror=implicit-function-declaration]
       irq_force_affinity(mct_irqs[MCT_L0_IRQ + cpu], cpumask_of(cpu));

    This patch introduces a dummy helper function for the non-SMP case
    that always returns success, to get rid of the build error.
    Since the patches causing the problem are marked for stable backports,
    this one should be as well.

    Signed-off-by: Arnd Bergmann <[email protected]>
    Cc: Krzysztof Kozlowski <[email protected]>
    Acked-by: Kukjin Kim <[email protected]>
    Link: http://lkml.kernel.org/r/5619084.0zmrrIUZLV@wuerfel
    Signed-off-by: Thomas Gleixner <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
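
The shape of such a fallback can be sketched as below. This is a user-space illustration of the pattern, not the kernel header: `struct cpumask` is left opaque, and since CONFIG_SMP is not defined in an ordinary build, the stub branch is the one that compiles here.

```c
#include <stddef.h>

struct cpumask;  /* opaque in this sketch */

#ifdef CONFIG_SMP
/* SMP builds would provide the real implementation elsewhere. */
int irq_force_affinity(unsigned int irq, const struct cpumask *cpumask);
#else
/* Non-SMP fallback: nothing to do with a single CPU, so report success.
 * Callers then compile and behave the same in both configurations. */
static inline int irq_force_affinity(unsigned int irq,
                                     const struct cpumask *cpumask)
{
    (void)irq;
    (void)cpumask;
    return 0;
}
#endif
```

Providing the stub in the same header as the declaration means drivers do not need their own #ifdef around each call site.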

commit 62dcb5801ac032188a20dc45e6d7b682a028adcf
Author: Linus Torvalds <[email protected]>
Date:   Wed May 14 16:33:54 2014 -0700

    x86-64, modify_ldt: Make support for 16-bit segments a runtime option

    commit fa81511bb0bbb2b1aace3695ce869da9762624ff upstream.

    Checkin:

    b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels

    disabled 16-bit segments on 64-bit kernels due to an information
    leak.  However, it does seem that people are genuinely using Wine to
    run old 16-bit Windows programs on Linux.

    A proper fix for this ("espfix64") is coming in the upcoming merge
    window, but as a temporary fix, create a sysctl to allow the
    administrator to re-enable support for 16-bit segments.

    It adds a "/proc/sys/abi/ldt16" sysctl that defaults to zero (off). If
    you hit this issue and care about your old Windows program more than
    you care about a kernel stack address information leak, you can do

       echo 1 > /proc/sys/abi/ldt16

    as root (add it to your startup scripts), and you should be ok.

    The sysctl table is only added if you have COMPAT support enabled on
    x86-64, but I assume anybody who runs old windows binaries very much
    does that ;)

    Signed-off-by: H. Peter Anvin <[email protected]>
    Link: http://lkml.kernel.org/r/CA%2B55aFw9BPoD10U1LfHbOMpHWZkvJTkMcfCs9s3urPr1YyWBxw@mail.gmail.com
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 56ecdc3d9e5b91f411e6f3ba63229d332b54af8e
Author: James Hogan <[email protected]>
Date:   Tue May 13 23:58:24 2014 +0100

    metag: Reduce maximum stack size to 256MB

    commit d71f290b4e98a39f49f2595a13be3b4d5ce8e1f1 upstream.

    Specify the maximum stack size for arches where the stack grows upward
    (parisc and metag) in asm/processor.h rather than hard coding in
    fs/exec.c so that metag can specify a smaller value of 256MB rather than
    1GB.

    This fixes a BUG on metag if the RLIMIT_STACK hard limit is increased
    beyond a safe value by root. E.g. when starting a process after running
    "ulimit -H -s unlimited" it will then attempt to use a stack size of the
    maximum 1GB which is far too big for metag's limited user virtual
    address space (stack_top is usually 0x3ffff000):

    BUG: failure at fs/exec.c:589/shift_arg_pages()!

    Signed-off-by: James Hogan <[email protected]>
    Cc: Helge Deller <[email protected]>
    Cc: "James E.J. Bottomley" <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: John David Anglin <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
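
The per-architecture cap described in the metag stack commit above amounts to clamping the requested limit. A minimal sketch, assuming a helper named `clamp_stack_size()` (hypothetical) and using the 256MB and 1GB figures from the commit text:

```c
#include <assert.h>

/* Values taken from the commit text: 1GB was the old hard-coded cap,
 * 256MB is the new metag limit. The macro names are illustrative. */
#define STACK_SIZE_MAX_OLD   (1024UL * 1024 * 1024)
#define STACK_SIZE_MAX_METAG (256UL * 1024 * 1024)

/* Hypothetical helper: never hand out more than the arch can map. */
static unsigned long clamp_stack_size(unsigned long rlimit_max,
                                      unsigned long arch_max)
{
        assert(arch_max > 0);
        return rlimit_max > arch_max ? arch_max : rlimit_max;
}
```

An "unlimited" hard limit then resolves to the architecture's cap instead of a value larger than the user address space.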

commit 44563045712ee4f385b5f3814d69fc73b5f22288
Author: Mikulas Patocka <[email protected]>
Date:   Thu May 8 15:51:37 2014 -0400

    metag: fix memory barriers

    commit 2425ce84026c385b73ae72039f90d042d49e0394 upstream.

    Volatile access doesn't really imply a compiler barrier. Volatile access
    is only ordered with respect to other volatile accesses; it isn't ordered
    with respect to general memory accesses. Gcc may reorder memory accesses
    around a volatile access, as we can see in this simple example (if we
    compile it with optimization, both increments of *b will be collapsed to
    just one):

    void fn(volatile int *a, long *b)
    {
            (*b)++;
            *a = 10;
            (*b)++;
    }

    Consequently, we need the compiler barrier after a write to the volatile
    variable, to make sure that the compiler doesn't reorder the volatile
    write with something else.

    Signed-off-by: Mikulas Patocka <[email protected]>
    Acked-by: Peter Zijlstra <[email protected]>
    Signed-off-by: James Hogan <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
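
For comparison, here is the example from the metag barrier commit above with a compiler barrier added after the volatile write. This is a sketch using the common GCC/Clang idiom; the `barrier()` macro is defined locally here and is not copied from the metag patch.

```c
#include <assert.h>

/* GCC/Clang compiler barrier: the compiler may not move memory
 * accesses across this statement or cache values over it. */
#define barrier() __asm__ __volatile__("" ::: "memory")

void fn(volatile int *a, long *b)
{
        (*b)++;
        *a = 10;
        barrier();      /* second increment can no longer merge with the first */
        (*b)++;
}
```

The observable result is the same either way; what changes is that the compiler must emit the second update of `*b` after the volatile store.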

commit aece7dc95409f8934281954a7e82ddf55b765913
Author: Charles Keepax <[email protected]>
Date:   Tue May 13 13:45:15 2014 +0100

    ASoC: wm8962: Update register CLASS_D_CONTROL_1 to be non-volatile

    commit 44330ab516c15dda8a1e660eeaf0003f84e43e3f upstream.

    The register CLASS_D_CONTROL_1 is marked as volatile because it contains
    a bit, DAC_MUTE, which is also mirrored in the ADC_DAC_CONTROL_1
    register. This causes problems for the "Speaker Switch" control, which
    will report an error if the CODEC is suspended because it relies on a
    volatile register.

    To resolve this issue mark CLASS_D_CONTROL_1 as non-volatile and
    manually keep the register cache in sync by updating both bits when
    changing the mute status.

    Reported-by: Shawn Guo <[email protected]>
    Signed-off-by: Charles Keepax <[email protected]>
    Tested-by: Shawn Guo <[email protected]>
    Signed-off-by: Mark Brown <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit d642daf637d02dacf216d7fd9da7532a4681cfd3
Author: Roger Pau Monne <[email protected]>
Date:   Tue Oct 29 18:31:14 2013 +0100

    xen-blkfront: restore the non-persistent data path

    commit bfe11d6de1c416cea4f3f0f35f864162063ce3fa upstream.

    When persistent grants were added they were always used, even if the
    backend doesn't have this feature (there's no harm in always using the
    same set of pages). This restores the old data path when the backend
    doesn't have persistent grants, removing the burden of doing a memcpy
    when it is not actually needed.

    Signed-off-by: Roger Pau Monné <[email protected]>
    Reported-by: Felipe Franciosi <[email protected]>
    Cc: Felipe Franciosi <[email protected]>
    Cc: Konrad Rzeszutek Wilk <[email protected]>
    Cc: David Vrabel <[email protected]>
    Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
    [v2: Fix up whitespace issues]
    Tested-by: Felipe Franciosi <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit ba4abe2e7f32f6d7fe3d92eeb4b1748c10b5f601
Author: Roger Pau Monne <[email protected]>
Date:   Mon Aug 12 12:53:44 2013 +0200

    xen-blkfront: revoke foreign access for grants not mapped by the backend

    commit fbe363c476afe8ec992d3baf682670a4bd1b6ce6 upstream.

    There's no need to keep the foreign access in a grant if it is not
    persistently mapped by the backend. This allows us to free grants that
    are not mapped by the backend, thus preventing blkfront from hoarding
    all grants.

    The main effect of this is that blkfront will only persistently map
    the same grants as the backend, and it will always try to use grants
    that are already mapped by the backend. Also the number of persistent
    grants in blkfront is the same as in blkback (and is controlled by the
    value in blkback).

    Signed-off-by: Roger Pau Monné <[email protected]>
    Reviewed-by: David Vrabel <[email protected]>
    Acked-by: Matt Wilson <[email protected]>
    Cc: Konrad Rzeszutek Wilk <[email protected]>
    Cc: David Vrabel <[email protected]>
    Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
    Signed-off-by: Jens Axboe <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 46c0326164c98e556c35c3eb240273595d43425d
Author: Jianyu Zhan <[email protected]>
Date:   Mon Apr 14 13:47:40 2014 +0800

    percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()

    commit 5a838c3b60e3a36ade764cf7751b8f17d7c9c2da upstream.

    pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
            BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long)

    It could hardly ever be bigger than PAGE_SIZE, even on a large-scale machine,
    but for consistency with its counterpart pcpu_mem_zalloc(),
    use pcpu_mem_free() instead.

    Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use
    pcpu_mem_free() instead of kfree()") addressed this problem, but
    missed this one.

    tj: commit message updated

    Signed-off-by: Jianyu Zhan <[email protected]>
    Signed-off-by: Tejun Heo <[email protected]>
    Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online")
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
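
The consistency argument in the percpu commit above can be illustrated with a paired allocate/free helper. This sketch assumes, as the PAGE_SIZE remark suggests, that the allocator is chosen by size; `mem_zalloc()`, `mem_free()`, and the counters are hypothetical names, and both paths use `calloc()`/`free()` so the example runs in user space.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL        /* assumed page size for the sketch */

/* Counters standing in for two distinct allocators (small vs. large). */
static int small_live, large_live;

static void *mem_zalloc(size_t size)
{
        if (size <= PAGE_SIZE)
                small_live++;
        else
                large_live++;
        return calloc(1, size);
}

/* Takes the same size so it selects the matching release path. */
static void mem_free(void *ptr, size_t size)
{
        if (size <= PAGE_SIZE)
                small_live--;
        else
                large_live--;
        free(ptr);
}
```

Freeing through the paired helper keeps the bookkeeping balanced for any size, which is the property a bare free of the wrong kind would break.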

commit 19d65166742f901cc14290494560f6b224cd2d2b
Author: Thomas Petazzoni <[email protected]>
Date:   Fri Apr 18 14:19:52 2014 +0200

    bus: mvebu-mbus: allow several windows with the same target/attribute

    commit b566e782be32145664d96ada3e389f17d32742e5 upstream.

    Having multiple windows with the same target and attribute is actually
    legal, and can be useful for PCIe windows, when PCIe BARs have a size
    that isn't a power of two, and we therefore need to create several
    MBus windows to cover the PCIe BAR for a given PCIe interface.

    Fixes: fddddb52a6c4 ('bus: introduce an Marvell EBU MBus driver')
    Signed-off-by: Thomas Petazzoni <[email protected]>
    Link: https://lkml.kernel.org/r/1397823593-1932-7-git-send-email-thomas.petazzoni@free-electrons.com
    Tested-by: Neil Greatorex <[email protected]>
    Signed-off-by: Jason Cooper <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit f56fb0d42b47b87b12c4936a77429d9dd1c7c4c6
Author: Lai Jiangshan <[email protected]>
Date:   Fri Apr 18 11:04:16 2014 -0400

    workqueue: make rescuer_thread() empty wq->maydays list before exiting

    commit 4d595b866d2c653dc90a492b9973a834eabfa354 upstream.

    After a @pwq is scheduled for emergency execution, other workers may
    consume the affected work items before the rescuer gets to them.  This
    means that a workqueue may have pwqs queued on @wq->maydays list
    while not having any work item pending or in-flight.  If
    destroy_workqueue() executes in such condition, the rescuer may exit
    without emptying @wq->maydays.

    This currently doesn't cause any actual harm.  destroy_workqueue() can
    safely destroy all the involved data structures whether @wq->maydays
    is populated or not as nobody accesses the list once the rescuer exits.

    However, this is nasty and makes future development difficult.  Let's
    update rescuer_thread() so that it empties @wq->maydays after seeing
    should_stop to guarantee that the list is empty on rescuer exit.

    tj: Updated comment and patch description.

    Signed-off-by: Lai Jiangshan <[email protected]>
    Signed-off-by: Tejun Heo <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit aac8b37ffaa2bacc0430aa7b45c7d3aad22209fc
Author: Lai Jiangshan <[email protected]>
Date:   Fri Apr 18 11:04:16 2014 -0400

    workqueue: fix a possible race condition between rescuer and pwq-release

    commit 77668c8b559e4fe2acf2a0749c7c83cde49a5025 upstream.

    There is a race condition between rescuer_thread() and
    pwq_unbound_release_workfn().

    Even after a pwq is scheduled for rescue, the associated work items
    may be consumed by any worker.  If all of them are consumed before the
    rescuer gets to them and the pwq's base ref was put due to attribute
    change, the pwq may be released while still being linked on
    @wq->maydays list making the rescuer dereference already freed pwq
    later.

    Make send_mayday() pin the target pwq until the rescuer is done with
    it.

    tj: Updated comment and patch description.

    Signed-off-by: Lai Jiangshan <[email protected]>
    Signed-off-by: Tejun Heo <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
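
The pinning described in the workqueue race commit above can be modelled with a plain reference count. This is a single-threaded sketch with no locking; `struct pwq` is a simplified stand-in and `rescuer_done()` is a hypothetical name for the point where the rescuer finishes with the entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, single-threaded model of a pool_workqueue reference count. */
struct pwq {
        int refcnt;
        bool on_maydays;
        bool released;
};

static void get_pwq(struct pwq *p)
{
        p->refcnt++;
}

static void put_pwq(struct pwq *p)
{
        assert(p->refcnt > 0);
        if (--p->refcnt == 0)
                p->released = true;     /* real code would free the pwq here */
}

/* Queue for rescue and pin, so the pwq outlives its stay on the list. */
static void send_mayday(struct pwq *p)
{
        if (!p->on_maydays) {
                get_pwq(p);
                p->on_maydays = true;
        }
}

/* Rescuer side: unlink first, then drop the pin taken by send_mayday(). */
static void rescuer_done(struct pwq *p)
{
        p->on_maydays = false;
        put_pwq(p);
}
```

Even if the base reference is dropped while the pwq sits on the maydays list, it is not released until the rescuer drops the pin.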

commit 55a3dfcc84ab3dc82708d93cd0bca4a0aad7715c
Author: Daeseok Youn <[email protected]>
Date:   Wed Apr 16 14:32:29 2014 +0900

    workqueue: fix bugs in wq_update_unbound_numa() failure path

    commit 77f300b198f93328c26191b52655ce1b62e202cf upstream.

    wq_update_unbound_numa() failure path has the following two bugs.

    - alloc_unbound_pwq() is called without holding wq->mutex; however, if
      the allocation fails, it jumps to out_unlock which tries to unlock
      wq->mutex.

    - The function should switch to dfl_pwq on failure but didn't do so
      after alloc_unbound_pwq() failure.

    Fix it by regrabbing wq->mutex and jumping to use_dfl_pwq on
    alloc_unbound_pwq() failure.

    Signed-off-by: Daeseok Youn <[email protected]>
    Acked-by: Lai Jiangshan <[email protected]>
    Signed-off-by: Tejun Heo <[email protected]>
    Fixes: 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues")
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 04931ac044a638b79ea3c4b48c448b66cae0c2b5
Author: J. Bruce Fields <[email protected]>
Date:   Tue May 20 15:55:21 2014 -0400

    nfsd4: remove lockowner when removing lock stateid

    commit a1b8ff4c97b4375d21b6d6c45d75877303f61b3b upstream.

    The nfsv4 state code has always assumed a one-to-one correspondence
    between lock stateid's and lockowners even if it appears not to in some
    places.

    We may actually change that, but for now when FREE_STATEID releases a
    lock stateid it also needs to release the parent lockowner.

    Symptoms were a subsequent LOCK crashing in find_lockowner_str when it
    calls same_lockowner_ino on a lockowner that unexpectedly has an empty
    so_stateids list.

    Signed-off-by: J. Bruce Fields <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 02016987ba67614366a3d7cbd58b401ca956f816
Author: J. Bruce Fields <[email protected]>
Date:   Thu May 8 11:19:41 2014 -0400

    nfsd4: warn on finding lockowner without stateid's

    commit 27b11428b7de097c42f205beabb1764f4365443b upstream.

    The current code assumes a one-to-one lockowner<->lock stateid
    correspondence.

    Signed-off-by: J. Bruce Fields <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 53a3b8bea5827a9f647f411d9230e563e745c58c
Author: Kinglong Mee <[email protected]>
Date:   Fri Apr 18 20:49:04 2014 +0800

    NFSD: Call ->set_acl with a NULL ACL structure if no entries

    commit aa07c713ecfc0522916f3cd57ac628ea6127c0ec upstream.

    After setting an ACL on a directory, I hit two problems caused
    by the cached zero-length default POSIX ACL.

    This patch makes sure nfsd4_set_nfs4_acl calls ->set_acl
    with a NULL ACL structure if there are no entries.

    Thanks to Christoph Hellwig for the advice.

    First problem:
    ............ hang ...........

    Second problem:
    [ 1610.167668] ------------[ cut here ]------------
    [ 1610.168320] kernel BUG at /root/nfs/linux/fs/nfsd/nfs4acl.c:239!
    [ 1610.168320] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    [ 1610.168320] Modules linked in: nfsv4(OE) nfs(OE) nfsd(OE)
    rpcsec_gss_krb5 fscache ip6t_rpfilter ip6t_REJECT cfg80211 xt_conntrack
    rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
    ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
    ip6table_mangle ip6table_security ip6table_raw ip6table_filter
    ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
    nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
    auth_rpcgss nfs_acl snd_intel8x0 ppdev lockd snd_ac97_codec ac97_bus
    snd_pcm snd_timer e1000 pcspkr parport_pc snd parport serio_raw joydev
    i2c_piix4 sunrpc(OE) microcode soundcore i2c_core ata_generic pata_acpi
    [last unloaded: nfsd]
    [ 1610.168320] CPU: 0 PID: 27397 Comm: nfsd Tainted: G           OE
    3.15.0-rc1+ #15
    [ 1610.168320] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
    VirtualBox 12/01/2006
    [ 1610.168320] task: ffff88005ab653d0 ti: ffff88005a944000 task.ti:
    ffff88005a944000
    [ 1610.168320] RIP: 0010:[<ffffffffa034d5ed>]  [<ffffffffa034d5ed>]
    _posix_to_nfsv4_one+0x3cd/0x3d0 [nfsd]
    [ 1610.168320] RSP: 0018:ffff88005a945b00  EFLAGS: 00010293
    [ 1610.168320] RAX: 0000000000000001 RBX: ffff88006700bac0 RCX:
    0000000000000000
    [ 1610.168320] RDX: 0000000000000000 RSI: ffff880067c83f00 RDI:
    ffff880068233300
    [ 1610.168320] RBP: ffff88005a945b48 R08: ffffffff81c64830 R09:
    0000000000000000
    [ 1610.168320] R10: ffff88004ea85be0 R11: 000000000000f475 R12:
    ffff880068233300
    [ 1610.168320] R13: 0000000000000003 R14: 0000000000000002 R15:
    ffff880068233300
    [ 1610.168320] FS:  0000000000000000(0000) GS:ffff880077800000(0000)
    knlGS:0000000000000000
    [ 1610.168320] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 1610.168320] CR2: 00007f5bcbd3b0b9 CR3: 0000000001c0f000 CR4:
    00000000000006f0
    [ 1610.168320] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
    0000000000000000
    [ 1610.168320] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
    0000000000000400
    [ 1610.168320] Stack:
    [ 1610.168320]  ffffffff00000000 0000000b67c83500 000000076700bac0
    0000000000000000
    [ 1610.168320]  ffff88006700bac0 ffff880068233300 ffff88005a945c08
    0000000000000002
    [ 1610.168320]  0000000000000000 ffff88005a945b88 ffffffffa034e2d5
    000000065a945b68
    [ 1610.168320] Call Trace:
    [ 1610.168320]  [<ffffffffa034e2d5>] nfsd4_get_nfs4_acl+0x95/0x150 [nfsd]
    [ 1610.168320]  [<ffffffffa03400d6>] nfsd4_encode_fattr+0x646/0x1e70 [nfsd]
    [ 1610.168320]  [<ffffffff816a6e6e>] ? kmemleak_alloc+0x4e/0xb0
    [ 1610.168320]  [<ffffffffa0327962>] ?
    nfsd_setuser_and_check_port+0x52/0x80 [nfsd]
    [ 1610.168320]  [<ffffffff812cd4bb>] ? selinux_cred_prepare+0x1b/0x30
    [ 1610.168320]  [<ffffffffa0341caa>] nfsd4_encode_getattr+0x5a/0x60 [nfsd]
    [ 1610.168320]  [<ffffffffa0341e07>] nfsd4_encode_operation+0x67/0x110
    [nfsd]
    [ 1610.168320]  [<ffffffffa033844d>] nfsd4_proc_compound+0x21d/0x810 [nfsd]
    [ 1610.168320]  [<ffffffffa0324d9b>] nfsd_dispatch+0xbb/0x200 [nfsd]
    [ 1610.168320]  [<ffffffffa00850cd>] svc_process_common+0x46d/0x6d0 [sunrpc]
    [ 1610.168320]  [<ffffffffa0085433>] svc_process+0x103/0x170 [sunrpc]
    [ 1610.168320]  [<ffffffffa032472f>] nfsd+0xbf/0x130 [nfsd]
    [ 1610.168320]  [<ffffffffa0324670>] ? nfsd_destroy+0x80/0x80 [nfsd]
    [ 1610.168320]  [<ffffffff810a5202>] kthread+0xd2/0xf0
    [ 1610.168320]  [<ffffffff810a5130>] ? insert_kthread_work+0x40/0x40
    [ 1610.168320]  [<ffffffff816c1ebc>] ret_from_fork+0x7c/0xb0
    [ 1610.168320]  [<ffffffff810a5130>] ? insert_kthread_work+0x40/0x40
    [ 1610.168320] Code: 78 02 e9 e7 fc ff ff 31 c0 31 d2 31 c9 66 89 45 ce
    41 8b 04 24 66 89 55 d0 66 89 4d d2 48 8d 04 80 49 8d 5c 84 04 e9 37 fd
    ff ff <0f> 0b 90 0f 1f 44 00 00 55 8b 56 08 c7 07 00 00 00 00 8b 46 0c
    [ 1610.168320] RIP  [<ffffffffa034d5ed>] _posix_to_nfsv4_one+0x3cd/0x3d0
    [nfsd]
    [ 1610.168320]  RSP <ffff88005a945b00>
    [ 1610.257313] ---[ end trace 838254e3e352285b ]---

    Signed-off-by: Kinglong Mee <[email protected]>
    Signed-off-by: J. Bruce Fields <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
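
The rule stated in the NFSD ACL commit above (pass NULL rather than an ACL with no entries) can be sketched as a small normalising helper. `struct posix_acl_model` and `acl_or_null()` are hypothetical names for illustration, not the nfsd code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down ACL: only the entry count matters here. */
struct posix_acl_model {
        unsigned int a_count;
};

/* Return NULL for an empty ACL so no zero-length ACL is ever stored. */
static const struct posix_acl_model *
acl_or_null(const struct posix_acl_model *acl)
{
        if (acl != NULL && acl->a_count == 0)
                return NULL;
        return acl;
}
```

Normalising at the point of the call means later readers of the cached ACL never see the zero-length case.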

commit d6a18aea9577844da0cdc0a595cbedde46b512d8
Author: Trond Myklebust <[email protected]>
Date:   Fri Apr 18 14:43:57 2014 -0400

    NFSd: call rpc_destroy_wait_queue() from free_client()

    commit 4cb57e3032d4e4bf5e97780e9907da7282b02b0c upstream.

    Mainly to ensure that we don't leave any hanging timers.

    Signed-off-by: Trond Myklebust <[email protected]>
    Signed-off-by: J. Bruce Fields <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit ed6ad7a5caac4bc865280a2946b54f348a3bb2f4
Author: Trond Myklebust <[email protected]>
Date:   Fri Apr 18 14:43:56 2014 -0400

    NFSd: Move default initialisers from create_client() to alloc_client()

    commit 5694c93e6c4954fa9424c215f75eeb919bddad64 upstream.

    Aside from making it clearer what is non-trivial in create_client(), it
    also fixes a bug whereby we can call free_client() before idr_init()
    has been called.

    Signed-off-by: Trond Myklebust <[email protected]>
    Signed-off-by: J. Bruce Fields <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 21ec04003007ce13a632b1d53816e27f63e4dc3f
Author: Takashi Iwai <[email protected]>
Date:   Fri May 23 09:02:44 2014 +0200

    ALSA: hda - Fix onboard audio on Intel H97/Z97 chipsets

    commit 77f07800cb456bed6e5c345e6e4e83e8eda62437 upstream.

    The recent Intel H97/Z97 chipsets need the same setup as other
    Intel chipsets for snooping, etc.  Without snooping in particular,
    audio playback stutters or gets corrupted.  This patch just adds
    the corresponding PCI ID entry with the proper flags.

    Reported-and-tested-by: Arthur Borsboom <[email protected]>
    Signed-off-by: Takashi Iwai <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit e60b0a2765dca37d50133463e639492b7e46a06a
Author: Hans de Goede <[email protected]>
Date:   Mon May 19 22:52:30 2014 -0700

    Input: synaptics - T540p - unify with other LEN0034 models

    commit 6d396ede224dc596d92d7cab433713536e68916c upstream.

    The T540p has a touchpad with pnp-id LEN0034, all the models with this
    pnp-id have the same min/max values, except the T540p where the values are
    slightly off. Fix them to be identical.

    This is a preparation patch for simplifying the quirk table.

    Signed-off-by: Hans de Goede <[email protected]>
    Signed-off-by: Dmitry Torokhov <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 2d597d4480e99eaf426eded8e9e9fcb6feb40673
Author: Hans de Goede <[email protected]>
Date:   Wed May 14 11:10:40 2014 -0700

    Input: synaptics - add min/max quirk for the ThinkPad W540

    commit 0b5fe736fe923f1f5e05413878d5990e92ffbdf5 upstream.

    https://bugzilla.redhat.com/show_bug.cgi?id=1096436

    Tested-and-reported-by: [email protected]
    Signed-off-by: Hans de Goede <[email protected]>
    Signed-off-by: Dmitry Torokhov <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 5e02a153d1a0d5131a9943762f01179b42f074e2
Author: Hans de Goede <[email protected]>
Date:   Mon May 5 09:36:43 2014 -0700

    Input: elantech - fix touchpad initialization on Gigabyte U2442

    commit 36189cc3cd57ab0f1cd75241f93fe01de928ac06 upstream.

    The hw_version 3 Elantech touchpad on the Gigabyte U2442 does not accept
    0x0b as initialization value for r10. This stand-alone version of the
    driver: http://planet76.com/drivers/elantech/psmouse-elantech-v6.tar.bz2

    uses 0x03, which does work. That means not setting bit 3 of r10, which
    sets: "Enable Real H/W Resolution In Absolute mode"

    Leaving that bit clear halves the x and y resolution we get with it set,
    so simply not setting it everywhere is not a solution. We've been unable to
    find a way to identify touchpads where setting the bit will fail, so this
    patch uses a dmi based blacklist for this.

    https://bugzilla.kernel.org/show_bug.cgi?id=61151

    Reported-by: Philipp Wolfer <[email protected]>
    Tested-by: Philipp Wolfer <[email protected]>
    Signed-off-by: Hans de Goede <[email protected]>
    Signed-off-by: Dmitry Torokhov <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 7f63e60bd6e1843fd3510d4cbc036e3d388944d1
Author: Sheng-Liang Song <[email protected]>
Date:   Thu Apr 24 16:28:29 2014 -0700

    Input: atkbd - fix keyboard not working on some LG laptops

    commit 3d725caa9dcc78c3dc9e7ea0c04f626468edd9c9 upstream.

    After issuing ATKBD_CMD_RESET_DIS, keyboard on some LG laptops stops
    working. The workaround is to stop issuing ATKBD_CMD_RESET_DIS commands.

    In order to keep changes in atkbd driver to the minimum we check DMI
    signature and only skip ATKBD_CMD_RESET_DIS if we are running on LG
    LW25-B7HV or P1-J273B.

    Signed-off-by: Sheng-Liang Song <[email protected]>
    Signed-off-by: Dmitry Torokhov <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
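
The DMI check described in the atkbd commit above boils down to a table lookup. A minimal sketch, assuming an exact string comparison on the product name (the kernel's DMI matching is more involved); only the two model names come from the commit text, and the function and table names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Model names taken from the commit message. */
static const char *const skip_reset_dis_products[] = {
        "LW25-B7HV",
        "P1-J273B",
};

/* True if the reset-and-disable command should be skipped on this machine. */
static bool should_skip_reset_dis(const char *product_name)
{
        size_t i;
        size_t n = sizeof(skip_reset_dis_products) /
                   sizeof(skip_reset_dis_products[0]);

        assert(product_name != NULL);
        for (i = 0; i < n; i++)
                if (strcmp(product_name, skip_reset_dis_products[i]) == 0)
                        return true;
        return false;
}
```

Keeping the quirk in a table limits the behaviour change to the listed machines, which matches the stated goal of minimal changes to the driver.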

commit f6de6225ca40427023398d1b22a6af810792741a
Author: Romain Izard <[email protected]>
Date:   Tue Mar 4 10:09:39 2014 +0100

    trace: module: Maintain a valid user count

    commit 098507ae3ec2331476fb52e85d4040c1cc6d0ef4 upstream.

    The replacement of the 'count' variable by two variables 'incs' and
    'decs' to resolve some race conditions during module unloading was done
    in parallel with some cleanup in the trace subsystem, and was integrated
    as a merge.

    Unfortunately, the formula for this replacement was wrong in the tracing
    code, and the refcount in the traces was not usable as a result.

    Use 'count = incs - decs' to compute the user count.

    Link: http://lkml.kernel.org/p/[email protected]

    Acked-by: Ingo Molnar <[email protected]>
    Cc: Rusty Russell <[email protected]>
    Cc: Frederic Weisbecker <[email protected]>
    Fixes: c1ab9cab7509 "merge conflict resolution"
    Signed-off-by: Romain Izard <[email protected]>
    Signed-off-by: Steven Rostedt <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
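
The formula from the trace commit above, `count = incs - decs`, written out as a tiny helper. `struct module_ref_model` is a hypothetical stand-in that ignores any per-CPU layout of the real counters.

```c
#include <assert.h>

/* Hypothetical model: total increments and decrements of a module refcount. */
struct module_ref_model {
        unsigned long incs;
        unsigned long decs;
};

/* User count is the number of references taken minus those dropped. */
static unsigned long module_user_count(const struct module_ref_model *r)
{
        assert(r->incs >= r->decs);
        return r->incs - r->decs;
}
```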

commit 863a921283fca22d63865314182e7c9e5fba0ad3
Author: K. Y. Srinivasan <[email protected]>
Date:   Thu Apr 3 18:02:45 2014 -0700

    Drivers: hv: vmbus: Negotiate version 3.0 when running on ws2012r2 hosts

    commit 03367ef5ea811475187a0732aada068919e14d61 upstream.

    Only ws2012r2 hosts support the ability to reconnect to the host on VMBUS. This functionality
    is needed by kexec in Linux. To use this functionality we need to negotiate version 3.0 of the
    VMBUS protocol.

    Signed-off-by: K. Y. Srinivasan <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 8567f561ed67f4a444b554c8463d129c2ad0e8ad
Author: Chew, Kean ho <[email protected]>
Date:   Sat Mar 1 00:03:56 2014 +0800

    i2c: i801: enable Intel BayTrail SMBUS

    commit 1b31e9b76ef8c62291e698dfdb973499986a7f68 upstream.

    Add Device ID of Intel BayTrail SMBus Controller.

    Signed-off-by: Chew, Kean ho <[email protected]>
    Signed-off-by: Chew, Chiau Ee <[email protected]>
    Reviewed-by: Jean Delvare <[email protected]>
    Signed-off-by: Wolfram Sang <[email protected]>
    Cc: "Chang, Rebecca Swee Fun" <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 1b796c0acb32f18938041159d2d9538b26b75893
Author: James Ralston <[email protected]>
Date:   Mon Nov 4 09:29:48 2013 -0800

    i2c: i801: Add Device IDs for Intel Wildcat Point-LP PCH

    commit afc659241258b40b683998ec801d25d276529f43 upstream.

    This patch adds the SMBus Device IDs for the Intel Wildcat Point-LP PCH.

    Signed-off-by: James Ralston <[email protected]>
    Signed-off-by: Wolfram Sang <[email protected]>
    Cc: "Chang, Rebecca Swee Fun" <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 4e32a7c66fae40bde0fbff8cbc893eabe8575135
Author: Salva Peiró <[email protected]>
Date:   Wed Apr 30 19:48:02 2014 +0200

    media: media-device: fix infoleak in ioctl media_enum_entities()

    commit e6a623460e5fc960ac3ee9f946d3106233fd28d8 upstream.

    This fixes CVE-2014-1739.

    Signed-off-by: Salva Peiró <[email protected]>
    Acked-by: Laurent Pinchart <[email protected]>
    Signed-off-by: Mauro Carvalho Chehab <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit c4f3c998c17e31c73c1ab223469435a12358d25e
Author: Dan Carpenter <[email protected]>
Date:   Thu Nov 7 08:08:44 2013 +0000

    clk: vexpress: NULL dereference on error path

    commit 6b4ed8b00e93bd31f24a25f59ed8d1b808d0cc00 upstream.

    If the allocation fails then we dereference the NULL in the error path.
    Just return directly.

    Fixes: ed27ff1db869 ('clk: Versatile Express clock generators ("osc") driver')
    Signed-off-by: Dan Carpenter <[email protected]>
    Signed-off-by: Pawel Moll <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit d2157c29092990941b82488a3653d95fc9e2cb7a
Author: Tim Chen <[email protected]>
Date:   Mon Mar 17 16:52:26 2014 -0700

    crypto: crypto_wq - Fix late crypto work queue initialization

    commit 130fa5bc81b44b6cc1fbdea3abf6db0da22964e0 upstream.

    The crypto algorithm modules utilizing the crypto daemon could
    be used early when the system starts up.  Using module_init
    does not guarantee that the daemon's work queue is initialized
    when the crypto algorithm depending on crypto_wq starts.  It is necessary
    to initialize the crypto work queue earlier, at subsystem
    init time, to make sure that it is initialized
    when used.

    Signed-off-by: Tim Chen <[email protected]>
    Signed-off-by: Herbert Xu <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit b944e0e0b84a1f8871b09723a1d035d70368e298
Author: Geert Uytterhoeven <[email protected]>
Date:   Mon Apr 14 18:52:14 2014 +0200

    Documentation: Update stable address in Chinese and Japanese translations

    commit 98b0f811aade1b7c6e7806c86aa0befd5919d65f upstream.

    The English and Korean translations were updated, the Chinese and Japanese
    weren't.

    Signed-off-by: Geert Uytterhoeven <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit b020ee793f714e7d293078a1e0ea8a545c33b16f
Author: Emil Goode <[email protected]>
Date:   Sun Mar 9 21:06:51 2014 +0100

    brcmsmac: fix deadlock on missing firmware

    commit 8fc1e8c240aab968db658b2d8d079b4391207a36 upstream.

    When brcm80211 firmware is not installed networking hangs.
    A deadlock happens because we call ieee80211_unregister_hw()
    from the .start callback of struct ieee80211_ops. When .start
    is called we are under rtnl lock and ieee80211_unregister_hw()
    tries to take it again.

    Function call stack:

    dev_change_flags()
            __dev_change_flags()
                    __dev_open()
                            ASSERT_RTNL() <-- Assert rtnl lock
                            ops->ndo_open()

    .ndo_open = ieee80211_open,

    ieee80211_open()
            ieee80211_do_open()
                    drv_start()
                            local->ops->start()

    .start = brcms_ops_start,

    brcms_ops_start()
            brcms_remove()
                    ieee80211_unregister_hw()
                            rtnl_lock() <-- Here we deadlock

    Introduced by:
    commit 25b5632fb35ca61b8ae3eee235edcdc2883f7a5e
    ("brcmsmac: request firmware in .start() callback")

    This patch fixes the bug by removing the call to brcms_remove()
    and moves the brcms_request_fw() call to the top of the .start
    callback to not initiate anything unless firmware is installed.

    Signed-off-by: Emil Goode <[email protected]>
    Signed-off-by: John W. Linville <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>
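
The shape of the brcmsmac fix above (check for firmware first, and return an error instead of tearing the device down from inside the callback) can be sketched as follows. `struct dev_state`, `request_fw()`, and `ops_start()` are hypothetical names, and the `-ENOENT` return value is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state for the sketch. */
struct dev_state {
        bool fw_present;
        bool hw_started;
};

/* Stand-in for the firmware request: fails when nothing is installed. */
static int request_fw(const struct dev_state *d)
{
        return d->fw_present ? 0 : -ENOENT;
}

/* Start callback: runs with the caller's lock held, so it must not
 * call anything that takes that lock again. */
static int ops_start(struct dev_state *d)
{
        int err = request_fw(d);

        if (err)
                return err;     /* bail out before touching the hardware */

        d->hw_started = true;
        return 0;
}
```

Because the failure path only returns, nothing in the callback tries to re-take the lock that the call chain in the commit message already holds.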

commit 5d33fff5ca9aab5ae7d28fc06179a91eb529758e
Author: Russell King <[email protected]>
Date:   Sun Apr 6 15:20:03 2014 -0700

    leds: leds-pwm: properly clean up after probe failure

    commit 392369019eb96e914234ea21eda806cb51a1073e upstream.

    When probing with DT, we add each LED one at a time.  If we find a LED
    without a PWM device (because it is not available yet) we fail the
    initialisation, unregister previous LEDs, and then by way of managed
    resources, we free the structure.

    The problem with this is we may have a scheduled and active work_struct
    in this structure, and this results in a nasty kernel oops.

    We need to cancel this work_struct properly upon cleanup - and the
    cleanup we require is the same cleanup as we do when the LED platform
    device is removed.  Rather than writing this same code three times,
    move it into a separate function and use it in all three places.

    Fixes: c971ff185f64 ("leds: leds-pwm: Defer led_pwm_set() if PWM can sleep")
    Signed-off-by: Russell King <[email protected]>
    Signed-off-by: Bryan Wu <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit d3f09691b3583edfabbd0ea04ef3e015df23a708
Author: Martin Peres <[email protected]>
Date:   Fri Mar 14 00:26:52 2014 +0100

    drm/nouveau/pm/fan: drop the fan lock in fan_update() before rescheduling

    commit 61679fe153b2b9ea5b5e2ab93305419e85e99a9d upstream.

    This should fix a deadlock that has been reported to us where fan_update()
    would hold the fan lock and try to grab the alarm_program_lock to reschedule
    an update. On another CPU, the alarm_program_lock would have been taken
    before calling fan_update(), leading to a deadlock.

    We should Cc: <[email protected]> # 3.9+

    Reported-by: Marcin Slusarz <[email protected]>
    Tested-by: Timothée Ravier <[email protected]>
    Tested-by: Boris Fersing (IRC nick fersingb, no public email address)
    Signed-off-by: Martin Peres <[email protected]>
    Signed-off-by: Ben Skeggs <[email protected]>
    Signed-off-by: Greg Kroah-…
Thunderoar pushed a commit to Thunderoar/latest_goyave_kernel that referenced this issue Mar 10, 2018
commit 178eda29ca721842f2146378e73d43e0044c4166 upstream.

It has been reported that using ZFSonLinux on rbd will result in memory
corruption. The bug report can be found here:

openzfs/spl#241
http://tracker.ceph.com/issues/7790

The reason is that ZFS will send pages with page_count 0 into rbd, which in
turn sends them to tcp_sendpage. However, tcp_sendpage cannot deal with
page_count 0, as it will do get_page and put_page, and erroneously free the
page.

This type of issue has been noted before, and handled in iscsi, drbd,
etc. So, rbd should also handle this. This fix addresses the issue by falling
back to the slower sendmsg when page_count 0 is detected.

Cc: Sage Weil <[email protected]>
Cc: Yehuda Sadeh <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
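
The mechanism described in the commit message above can be illustrated with a small userspace model. This is only a sketch, not the kernel patch: `toy_page`, `toy_sendpage`, `toy_sendmsg` and `toy_send` are hypothetical names invented for illustration, and the "free" is modelled by a flag rather than a real allocator. It shows why a get/put pair on a page whose count starts at 0 ends up releasing it, and how a guard that falls back to a copying path avoids touching the count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a kernel page: a reference count plus a payload.
 * All names here are hypothetical and exist only for this sketch. */
struct toy_page {
    int  refcount;   /* models page_count() */
    bool freed;      /* set when a put drops the count to zero */
    char data[64];
};

void toy_get_page(struct toy_page *p) { p->refcount++; }

void toy_put_page(struct toy_page *p)
{
    if (--p->refcount == 0)
        p->freed = true;   /* a real allocator would reclaim the page here */
}

/* Zero-copy style path: takes and drops a reference around the send,
 * as the commit message says tcp_sendpage does. */
size_t toy_sendpage(struct toy_page *p, char *wire, size_t len)
{
    assert(len <= sizeof p->data);
    toy_get_page(p);
    memcpy(wire, p->data, len);
    toy_put_page(p);       /* 0 -> 1 -> 0 frees a page nobody "owned" */
    return len;
}

/* Copying path: never touches the reference count (the slower fallback). */
size_t toy_sendmsg(const struct toy_page *p, char *wire, size_t len)
{
    assert(len <= sizeof p->data);
    memcpy(wire, p->data, len);
    return len;
}

/* The guard: use the zero-copy path only when a reference is already held. */
size_t toy_send(struct toy_page *p, char *wire, size_t len)
{
    if (p->refcount >= 1)
        return toy_sendpage(p, wire, len);
    return toy_sendmsg(p, wire, len);
}
```

Calling `toy_sendpage` directly on a page with `refcount == 0` leaves `freed` set even though the sender still uses the buffer, which is the corruption pattern reported in this issue. Calling `toy_send` on the same page copies the data and leaves the count untouched, at the cost of the extra copy.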
Thunderoar pushed a commit to Thunderoar/latest_goyave_kernel that referenced this issue May 10, 2018
commit 178eda29ca721842f2146378e73d43e0044c4166 upstream.

Thunderoar pushed a commit to Thunderoar/latest_goyave_kernel that referenced this issue May 26, 2018
commit 178eda29ca721842f2146378e73d43e0044c4166 upstream.

MSe1969 pushed a commit to lin14-mGoms/android_kernel_samsung_gts2 that referenced this issue Feb 28, 2019
commit 178eda29ca721842f2146378e73d43e0044c4166 upstream.

timocapa pushed a commit to timocapa/kernel_lenok that referenced this issue May 26, 2019
commit 178eda2 upstream.

Thunderoar pushed a commit to Thunderoar/latest_goyave_kernel that referenced this issue Jun 8, 2019
commit 178eda29ca721842f2146378e73d43e0044c4166 upstream.
