
Large KMem Allocation When Using ZVOL on RAIDZ2 atop an LSI 1078 (MegaRAID 8888ELP) #3684

Closed
sempervictus opened this issue Aug 17, 2015 · 7 comments

Comments

@sempervictus
Contributor

I'm doing a bit of testing on a new storage system, and I seem to be able to consistently reproduce a large allocation notice in dmesg:

[Mon Aug 17 02:11:40 2015] https://github.com/zfsonlinux/zfs/issues/new
[Mon Aug 17 02:11:40 2015] CPU: 1 PID: 17756 Comm: zvol Tainted: P           OE   4.0.9-sv-i7 #sv
[Mon Aug 17 02:11:40 2015] Hardware name: SGI S5000PSL/S5000PSL, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
[Mon Aug 17 02:11:40 2015]  000000000000c210 ffff8807f288fbe8 ffffffff827d77c5 0000000000000001
[Mon Aug 17 02:11:40 2015]  0000000000000000 ffff8807f288fc28 ffffffffc04f8b93 ffff8807a95c06f0
[Mon Aug 17 02:11:40 2015]  ffff8807a95c06f0 000000006b7e0000 0000000000002000 ffffffffc0b83b43
[Mon Aug 17 02:11:40 2015] Call Trace:
[Mon Aug 17 02:11:40 2015]  [<ffffffff827d77c5>] dump_stack+0x45/0x57
[Mon Aug 17 02:11:40 2015]  [<ffffffffc04f8b93>] spl_kmem_zalloc+0x113/0x180 [spl]
[Mon Aug 17 02:11:40 2015]  [<ffffffffc0ad051c>] dmu_buf_hold_array_by_dnode+0x9c/0x4d0 [zfs]
[Mon Aug 17 02:11:40 2015]  [<ffffffffc0ad0a2d>] dmu_buf_hold_array+0x5d/0x80 [zfs]
[Mon Aug 17 02:11:40 2015]  [<ffffffffc0ad209a>] dmu_write_req+0x6a/0x1e0 [zfs]
[Mon Aug 17 02:11:40 2015]  [<ffffffffc0b799c9>] zvol_write+0x109/0x440 [zfs]
[Mon Aug 17 02:11:40 2015]  [<ffffffffc04fbfb5>] taskq_thread+0x205/0x450 [spl]
[Mon Aug 17 02:11:40 2015]  [<ffffffff82094160>] ? wake_up_process+0x50/0x50
[Mon Aug 17 02:11:40 2015]  [<ffffffffc04fbdb0>] ? taskq_cancel_id+0x120/0x120 [spl]
[Mon Aug 17 02:11:40 2015]  [<ffffffff8208ba49>] kthread+0xc9/0xe0
[Mon Aug 17 02:11:40 2015]  [<ffffffff8208b980>] ? kthread_create_on_node+0x180/0x180
[Mon Aug 17 02:11:40 2015]  [<ffffffff827dfd18>] ret_from_fork+0x58/0x90
[Mon Aug 17 02:11:40 2015]  [<ffffffff8208b980>] ? kthread_create_on_node+0x180/0x180

The patch stack in question starts with DeHackEd's bleedingedge2 and consists of:

  * origin/pr/3663
  ** zvol state locking: refinements for scalability
  * origin/pr/2668
  ** Allow for "zfs receive" to skip existing snapshots
  * origin/pr/3169
  ** Add dfree_zfs for changing how Samba reports space
  * origin/pr/3344
  ** Linux 3.18 compat: Snapshot automounting
  * origin/pr/3526
  ** Change default cachefile property to 'none'.
  * origin/pr/3574
  ** 5745 zfs set allows only one dataset property to be set at a time
  * origin/pr/3643
  ** Remove fastwrite mutex
  * origin/pr/3672
  ** configure bdi_setup_and_register: don't blow config stack
  * origin/pr/2012
  ** Add option to zpool status to print guids
  * rdolbeau/abd2
  ** AVX-512F & AVX-512BW implementations for RAID-Z1/Z2/Z3. This is only for the sake of completeness; it was not tested in the kernel since no hardware is yet available for AVX512F, let alone AVX512BW.
  * dehacked/dehacked-bleedingedge2 @ 6bec4351f5877f3f20dc9d7730aba7b1df983ecd

Figured this may be of interest to anyone running a similar stack or maintaining the included changes. The system seems to run just fine, though I have observed the ARC dip to almost zero since it's on a test pool with no data (it rebounds fine once data is added).

@sempervictus
Contributor Author

This is starting to look pretty grave. It happens on master, on DeHackEd's bleeding edge 2 branch, and on all of our builds since the start of July. Any chance this could be caused by the pool consisting of a set of single-disk RAID0 volumes comprising the RAIDZ2? The MegaRAID 8888ELP backing this setup doesn't actually do JBOD as far as I can tell (and LSI's download sites being gone isn't helping matters; at least SMC has some firmware). I've run this with and without @rdolbeau's SSE calculation patch in the stack, with the same result.

The pool layout is a RAIDZ2 at ashift=9 of ten 2T Constellation ES drives on the aforementioned controller, with each disk presented as a single-disk RAID0 volume. The ZVOLs in use have been destroyed and recreated at every ZFS rebuild (patch-stack/version deployment).

I'm going to try an older build, but I'm somewhat limited in the number of ZFS deployments I can do here, since the system runs off a thumb drive which will eventually die from the IO abuse of building DKMS modules (we've seen them run for >1y when doing quarterly deployments, and much less under these conditions).

The system in question is backing an OpenStack Glance storage and Horizon node atop a VM. It kills the Fuel deployment in an ugly way: the kernel stack trace is registered in dmesg on the physical host, and the VM actually being provisioned for services stalls (after the OS image is pushed) without killing the deployment process. The disks are all new and have been run through smartctl long tests in Xyratex chassis (our primary storage systems, with direct SAS/SATA access in JBOD mode), so I don't think they're the culprit.

Anyone want to weigh in? I'd love to hear someone tell me this controller is garbage and that I should have my head examined for even trying this through RAID0 abstractions, but we've done this on other systems where we couldn't get clients to purchase a real HBA off the bat, and it's never been this bad before.

sempervictus changed the title from "Large Allocation When Abusing ZVOL" to "Large KMem Allocation When Using ZVOL on RAIDZ2 atop an LSI 1078 (MegaRAID 8888ELP)" Aug 18, 2015
@sempervictus
Contributor Author

It gets more interesting by the minute: the pool now refuses to import altogether, throwing the following in dmesg and hanging zpool import:

[Tue Aug 18 03:33:26 2015] BUG: unable to handle kernel paging request at ffffff00c0db3b20
[Tue Aug 18 03:33:26 2015] IP: [<ffffffffc0d96a5f>] zio_vdev_io_assess+0x4f/0x200 [zfs]
[Tue Aug 18 03:33:26 2015] PGD 2e92067 PUD 0 
[Tue Aug 18 03:33:26 2015] Oops: 0000 [#1] SMP 
[Tue Aug 18 03:33:26 2015] Modules linked in: ecryptfs pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) mptctl mptbase zfs(POE) nfsd auth_rpcgss nfs_acl nfs lockd grace ipmi_ssif coretemp sunrpc kvm_intel i5000_edac edac_core kvm i5k_amb microcode serio_raw ioatdma lpc_ich dca shpchp 8250_fintek fscache ipmi_si ipmi_msghandler zunicode(POE) zcommon(POE) znvpair(POE) zavl(POE) spl(OE) lp parport xts gf128mul dm_crypt ses enclosure amdkfd amd_iommu_v2 radeon uas i2c_algo_bit ttm hid_generic drm_kms_helper usbhid ahci psmouse drm usb_storage hid libahci e1000e megaraid_sas
[Tue Aug 18 03:33:26 2015] CPU: 5 PID: 3685 Comm: z_rd_int_3 Tainted: P        W  OE   4.0.9-sv-i7 #sv
[Tue Aug 18 03:33:26 2015] Hardware name: SGI S5000PSL/S5000PSL, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
[Tue Aug 18 03:33:26 2015] task: ffff8807e9974500 ti: ffff8807e9a14000 task.ti: ffff8807e9a14000
[Tue Aug 18 03:33:26 2015] RIP: 0010:[<ffffffffc0d96a5f>]  [<ffffffffc0d96a5f>] zio_vdev_io_assess+0x4f/0x200 [zfs]
[Tue Aug 18 03:33:26 2015] RSP: 0018:ffff8807e9a17c58  EFLAGS: 00010286
[Tue Aug 18 03:33:26 2015] RAX: ffffff00c0db3b20 RBX: ffff8807e94e1568 RCX: 00000000000016e9
[Tue Aug 18 03:33:26 2015] RDX: 0000000000000030 RSI: 0000000000000000 RDI: ffff8807e94e1568
[Tue Aug 18 03:33:26 2015] RBP: ffff8807e9a17c78 R08: ffff8807fbe35800 R09: ffff8807ffd577e0
[Tue Aug 18 03:33:26 2015] R10: ffffffffc04fbd20 R11: ffffea00278e6f80 R12: ffff8807eada8000
[Tue Aug 18 03:33:26 2015] R13: 0000000000000101 R14: 0000000000080000 R15: 0000000000000000
[Tue Aug 18 03:33:26 2015] FS:  0000000000000000(0000) GS:ffff8807ffd40000(0000) knlGS:0000000000000000
[Tue Aug 18 03:33:26 2015] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[Tue Aug 18 03:33:26 2015] CR2: ffffff00c0db3b20 CR3: 00000007eb11b000 CR4: 00000000000407e0
[Tue Aug 18 03:33:26 2015] Stack:
[Tue Aug 18 03:33:26 2015]  ffff8807e8ccf0f0 ffff8807e94e1568 ffff8807e9974500 0000000000000000
[Tue Aug 18 03:33:26 2015]  ffff8807e9a17cf8 ffffffffc0d9ccfa ffff8807e9a17cc8 ffffffffc04fbd20
[Tue Aug 18 03:33:26 2015]  ffff8807e9974500 ffff8807e9974500 ffff8807e8d61028 ffff8807e8d61000
[Tue Aug 18 03:33:26 2015] Call Trace:
[Tue Aug 18 03:33:26 2015]  [<ffffffffc0d9ccfa>] zio_done.part.12+0xa5a/0xf20 [zfs]
[Tue Aug 18 03:33:26 2015]  [<ffffffffc04fbd20>] ? spl_kmem_cache_free+0x170/0x1f0 [spl]
[Tue Aug 18 03:33:26 2015]  [<ffffffffc0d9d23a>] zio_done+0x7a/0x80 [zfs]
[Tue Aug 18 03:33:26 2015]  [<ffffffffc0d9ccfa>] zio_done.part.12+0xa5a/0xf20 [zfs]
[Tue Aug 18 03:33:26 2015]  [<ffffffffc0d9d23a>] zio_done+0x7a/0x80 [zfs]
[Tue Aug 18 03:33:26 2015]  [<ffffffffc0d98048>] zio_execute+0xc8/0x180 [zfs]
[Tue Aug 18 03:33:26 2015]  [<ffffffffc04fcfb5>] taskq_thread+0x205/0x450 [spl]
[Tue Aug 18 03:33:26 2015]  [<ffffffff82094160>] ? wake_up_process+0x50/0x50
[Tue Aug 18 03:33:26 2015]  [<ffffffffc04fcdb0>] ? taskq_cancel_id+0x120/0x120 [spl]
[Tue Aug 18 03:33:26 2015]  [<ffffffff8208ba49>] kthread+0xc9/0xe0
[Tue Aug 18 03:33:26 2015]  [<ffffffff8208b980>] ? kthread_create_on_node+0x180/0x180
[Tue Aug 18 03:33:26 2015]  [<ffffffff827dfd18>] ret_from_fork+0x58/0x90
[Tue Aug 18 03:33:26 2015]  [<ffffffff8208b980>] ? kthread_create_on_node+0x180/0x180
[Tue Aug 18 03:33:26 2015] Code: a7 10 02 00 00 e8 82 f9 ff ff 85 c0 75 6e 4d 85 e4 0f 84 15 01 00 00 48 83 bb 18 02 00 00 00 74 17 48 8b 83 20 02 00 00 48 89 df <ff> 10 48 c7 83 18 02 00 00 00 00 00 00 8b 3d ee b0 1b 00 85 ff 
[Tue Aug 18 03:33:26 2015] RIP  [<ffffffffc0d96a5f>] zio_vdev_io_assess+0x4f/0x200 [zfs]
[Tue Aug 18 03:33:26 2015]  RSP <ffff8807e9a17c58>
[Tue Aug 18 03:33:26 2015] CR2: ffffff00c0db3b20
[Tue Aug 18 03:33:26 2015] ---[ end trace 0ffa09d91958e2d3 ]---

The "unable to handle kernel paging request" bit seems interesting...

@behlendorf
Contributor

@sempervictus did you determine this was in fact related to #3651?

@sempervictus
Contributor Author

Testing that host is a bit of a problem right now; I've had to switch the system back to running the native RAID6 for the time being, and it's booting directly from PXE through Fuel to test OpenStack.

I've built out an identical patch stack sans #3651 and am presently testing it on a system which only has mirrored vdevs (a 5-mirror span). However, even under significant IO load, it has shown none of the symptoms described above.

@prometheanfire
Contributor

I think I can reproduce this as well: 4.0.9 kernel with zfs/spl master as of 03/09/2015 (d/m/y).

Oddly, I'm also running OpenStack... RAIDZ3 on individual LUKS disks.

https://gist.github.com/prometheanfire/c752422c35bf070b5f0c

behlendorf added this to the 0.6.5 milestone Aug 28, 2015
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 28, 2015
When support for large blocks was added DMU_MAX_ACCESS was increased
to allow for blocks of up to 16M to fit in a transaction handle.
This had the side effect of increasing the max_hw_sectors_kb for
volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M.

This is an issue for volumes which by default use an 8K block size
because it results in dmu_buf_hold_array_by_dnode() allocating a
large array for the dbufs.  The solution is to restore the maximum
size to ~10M.  This patch specifically changes it to 16M, which is
close enough.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3684
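
To make the arithmetic in the commit message concrete: dmu_buf_hold_array_by_dnode() needs one dbuf pointer per block covered by a request, so the array it allocates scales as (request size / block size) * sizeof(pointer). Below is a minimal C sketch of that calculation; it is not the actual ZFS code, the function and variable names are illustrative, and the 8-byte pointer size assumes a 64-bit kernel.

/*
 * Minimal sketch (not ZFS source) of the sizing described in the commit
 * message above.  With the default 8K zvol block size, raising the
 * maximum request from ~10M to 64M grows the dbuf pointer array from
 * roughly 10K to 64K, which is the large allocation reported in dmesg.
 */
#include <stdio.h>

static unsigned long dbuf_array_bytes(unsigned long request, unsigned long blocksize)
{
        unsigned long nblks = request / blocksize;  /* dbufs needed to span the I/O */
        return nblks * sizeof(void *);              /* size of the pointer array */
}

int main(void)
{
        const unsigned long volblocksize = 8UL << 10;   /* default zvol block size: 8K */
        const unsigned long old_max = 10UL << 20;       /* ~10M request cap before the change */
        const unsigned long new_max = 64UL << 20;       /* 64M cap after DMU_MAX_ACCESS grew */

        printf("~10M request: %lu-byte dbuf array\n", dbuf_array_bytes(old_max, volblocksize));
        printf(" 64M request: %lu-byte dbuf array\n", dbuf_array_bytes(new_max, volblocksize));
        return 0;       /* prints 10240 and 65536 bytes respectively */
}

For comparison, the 16M value the patch settles on gives a 16K array at an 8K block size by the same math, which is why the commit message calls it close enough.
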
@behlendorf
Contributor

@sempervictus @prometheanfire could you please verify that the patch in #3710 resolves the issue?

behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 28, 2015
When support for large blocks was added DMU_MAX_ACCESS was increased
to allow for blocks of up to 16M to fit in a transaction handle.
This had the side effect of increasing the max_hw_sectors_kb for
volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M.

This is an issue for volumes which by default use an 8K block size
because it results in dmu_buf_hold_array_by_dnode() allocating a
64K array for the dbufs.  The solution is to restore the maximum
size to ~10M.  This patch specifically changes it to 16M which is
close enough.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3684
@prometheanfire
Contributor

done

tomgarcia pushed a commit to tomgarcia/zfs that referenced this issue Sep 11, 2015
When support for large blocks was added DMU_MAX_ACCESS was increased
to allow for blocks of up to 16M to fit in a transaction handle.
This had the side effect of increasing the max_hw_sectors_kb for
volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M.

This is an issue for volumes which by default use an 8K block size
because it results in dmu_buf_hold_array_by_dnode() allocating a
64K array for the dbufs.  The solution is to restore the maximum
size to ~10M.  This patch specifically changes it to 16M which is
close enough.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#3684