Crash during zfs activity (scrub, rsync, etc.) #404

Closed
mrobbetts opened this issue Sep 21, 2011 · 5 comments

@mrobbetts

Hi all,

I'm getting crashes when ZFS is under load. It takes a random amount of time to happen, ranging from almost immediately after loading the zfs module to over a day.

I'm running zfs-9999 from Pendor's overlay (built since rc5 was tagged, but I'm not sure of the revision) on Gentoo Hardened 2.6.39 (amd64) on a system with 1.5 GB of RAM, and have a single-disk zpool on a 1 TB Samsung disk. I cannot swear it's related, but I think this crash only started happening when I moved to the Hardened kernel/userland.

I load the zfs module with the ARC restricted to 512 MB; while the system is running, I see ZFS memory usage vary a lot, but it seems stable overall.
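
For reference, the ARC cap is set via the zfs_arc_max module option, e.g. in /etc/modprobe.d/zfs.conf (512 MB expressed in bytes):

# limit the ZFS ARC to 512 MB (512 * 1024 * 1024 bytes)
options zfs zfs_arc_max=536870912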

I managed to get the following via netconsole:

[  515.695406] BUG: unable to handle kernel paging request at ffff88000156a008
[  515.695864] IP: [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  515.696009] PGD 8000000001569063 PUD 1570063 PMD 14001e1 
[  515.696009] Oops: 0003 [#1] 
[  515.696009] last sysfs file: /sys/devices/virtual/bdi/zfs-10/uevent
[  515.696009] CPU 0 
[  515.696009] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx snd_hdsp snd_rawmidi snd_hwdep
[  515.696009] 
[  515.696009] Pid: 3379, comm: rsync Tainted: P            2.6.39-hardened-r8 #6    /MS-7030
[  515.696009] RIP: 0010:[<ffffffff810aff42>]  [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  515.696009] RSP: 0018:ffff88004eb29638  EFLAGS: 00010202
[  515.696009] RAX: 0000000009b32067 RBX: ffff880009b32000 RCX: 0000000000000000
[  515.696009] RDX: ffff880000000000 RSI: 0000000000000001 RDI: ffff880009b32000
[  515.696009] RBP: ffff88004eb29648 R08: 0000000000000000 R09: 00000000ffffffff
[  515.696009] R10: 00000000ffffff02 R11: 00000000ffffff01 R12: ffff88000156a008
[  515.696009] R13: ffffc90040002000 R14: ffffffff81567c90 R15: ffffc90040006000
[  515.696009] FS:  00000278fa3d8700(0000) GS:ffffffff81811000(0000) knlGS:0000000000000000
[  515.696009] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  515.696009] CR2: ffff88000156a008 CR3: 0000000001568000 CR4: 00000000000006f0
[  515.696009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  515.696009] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  515.696009] Process rsync (pid: 3379, threadinfo ffff88003cb73910, task ffff88003cb73480)
[  515.696009] Stack:
[  515.696009]  ffff88000156a008 ffffc90040006000 ffff88004eb296d8 ffffffff810bb053
[  515.696009]  ffffffff8186c8e8 ffff880009a93a00 0000000109a93a98 ffffc90040005fff
[  515.696009]  ffffffff007fffff ffff880009af4760 8000000000000163 ffffc90040006000
[  515.696009] Call Trace:
[  515.696009]  [<ffffffff810bb053>] vmap_page_range_noflush+0x10a/0x2da
[  515.696009]  [<ffffffff810bb250>] map_vm_area+0x2d/0x40
[  515.696009]  [<ffffffff810bbef9>] __vmalloc_node_range+0x16d/0x191
[  515.696009]  [<ffffffffa009cd80>] ? spl_kmem_cache_free+0xc0/0xde [spl]
[  515.696009]  [<ffffffff810bbf4d>] __vmalloc_node+0x30/0x32
[  515.696009]  [<ffffffffa009cd80>] ? spl_kmem_cache_free+0xc0/0xde [spl]
[  515.696009]  [<ffffffff810bc104>] __vmalloc+0x1b/0x1d
[  515.696009]  [<ffffffffa009cd80>] spl_kmem_cache_free+0xc0/0xde [spl]
[  515.696009]  [<ffffffffa009ce8d>] spl_kmem_cache_alloc+0xef/0x553 [spl]
[  515.696009]  [<ffffffff810c599f>] ? kfree+0xa8/0xcc
[  515.696009]  [<ffffffffa015923f>] dnode_setbonuslen+0xd5/0x21c [zfs]
[  515.696009]  [<ffffffffa015980e>] dnode_hold_impl+0x29d/0x42c [zfs]
[  515.696009]  [<ffffffffa01599b1>] dnode_hold+0x14/0xaf [zfs]
[  515.696009]  [<ffffffffa014c5b2>] dmu_bonus_hold+0x20/0x1e1 [zfs]
[  515.696009]  [<ffffffffa0174052>] sa_buf_hold+0x9/0xb [zfs]
[  515.696009]  [<ffffffffa01b5468>] zfs_zget+0xa9/0x299 [zfs]
[  515.696009]  [<ffffffffa0197d0d>] ? zap_lookup_norm+0x13d/0x14f [zfs]
[  515.696009]  [<ffffffffa019d733>] zfs_dirent_lock+0x447/0x48c [zfs]
[  515.696009]  [<ffffffffa019da1c>] zfs_dirlook+0x142/0x47a [zfs]
[  515.696009]  [<ffffffffa01999d9>] ? zfs_zaccess+0x136/0x1d5 [zfs]
[  515.696009]  [<ffffffffa01b256e>] zfs_lookup+0x269/0x2e0 [zfs]
[  515.696009]  [<ffffffffa01c0e46>] zpl_read_common+0x218/0x1028 [zfs]
[  515.696009]  [<ffffffff810d1d01>] d_alloc_and_lookup+0x49/0x68
[  515.696009]  [<ffffffff810d3423>] walk_component+0x160/0x30b
[  515.696009]  [<ffffffff810d43f0>] path_lookupat+0xaa/0x372
[  515.696009]  [<ffffffff810e082f>] ? mntput+0x21/0x23
[  515.696009]  [<ffffffff810d1c00>] ? path_put+0x1d/0x22
[  515.696009]  [<ffffffff810c5dc3>] ? kmem_cache_alloc+0x56/0xce
[  515.696009]  [<ffffffff810d46dd>] do_path_lookup+0x25/0x9d
[  515.696009]  [<ffffffff810d5302>] user_path_at+0x53/0x97
[  515.696009]  [<ffffffffa01b017a>] ? zfs_getattr_fast+0xa8/0xb3 [zfs]
[  515.696009]  [<ffffffff810cc2bb>] ? cp_new_stat+0x19e/0x1b6
[  515.696009]  [<ffffffff810cc07a>] vfs_fstatat+0x47/0x72
[  515.696009]  [<ffffffff810cc0be>] vfs_lstat+0x19/0x1b
[  515.696009]  [<ffffffff810cc325>] sys_newlstat+0x1a/0x38
[  515.696009]  [<ffffffff81559d86>] system_call_fastpath+0x16/0x1b
[  515.696009]  [<ffffffff81559d25>] ? system_call_after_swapgs+0x15/0x60
[  515.696009] Code: c3 ff 0f 00 00 74 04 0f 0b eb fe 31 f6 48 89 df e8 3a ef fe ff eb 18 48 89 df e8 b2 79 f7 ff 48 89 df e8 aa 79 f7 ff 48 83 c8 67 
[  515.696009]  89 04 24 31 c0 5b 41 5c c9 c3 55 48 83 3e 00 48 
[  515.696009] RIP  [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  515.696009]  RSP <ffff88004eb29638>
[  515.696009] CR2: ffff88000156a008
[  515.696009] ---[ end trace 25a0461acbecc0fc ]---
[  515.751098] rsync used greatest stack depth: 2848 bytes left
[  516.187362] BUG: unable to handle kernel paging request at ffff88000156a008
[  516.188011] IP: [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  516.188011] PGD 8000000001569063 PUD 1570063 PMD 14001e1 
[  516.188011] Oops: 0003 [#2] 
[  516.188011] last sysfs file: /sys/devices/virtual/bdi/zfs-10/uevent
[  516.188011] CPU 0 
[  516.188011] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx snd_hdsp snd_rawmidi snd_hwdep
[  516.188011] 
[  516.188011] Pid: 3477, comm: rsync Tainted: P      D     2.6.39-hardened-r8 #6    /MS-7030
[  516.188011] RIP: 0010:[<ffffffff810aff42>]  [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  516.188011] RSP: 0018:ffff880016b25638  EFLAGS: 00010202
[  516.188011] RAX: 0000000018de1067 RBX: ffff880018de1000 RCX: 0000000000000000
[  516.188011] RDX: ffff880000000000 RSI: 0000000000000001 RDI: ffff880018de1000
[  516.188011] RBP: ffff880016b25648 R08: 0000000000000000 R09: 000000000000a4d8
[  516.188011] R10: 00000000ffffff02 R11: 00000000ffffff01 R12: ffff88000156a008
[  516.188011] R13: ffffc90040008000 R14: ffffffff81567c90 R15: ffffc9004000c000
[  516.188011] FS:  000002bf5701b700(0000) GS:ffffffff81811000(0000) knlGS:0000000000000000
[  516.188011] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  516.188011] CR2: ffff88000156a008 CR3: 0000000001568000 CR4: 00000000000006f0
[  516.188011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  516.188011] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  516.188011] Process rsync (pid: 3477, threadinfo ffff88005d0d5b10, task ffff88005d0d5680)
[  516.188011] Stack:
[  516.188011]  ffff88000156a008 ffffc9004000c000 ffff880016b256d8 ffffffff810bb053
[  516.188011]  ffffffff8186c8e8 ffff880009a93a80 0000000109a93d18 ffffc9004000bfff
[  516.188011]  ffffffff007fffff ffff880018de6180 8000000000000163 ffffc9004000c000
[  516.188011] Call Trace:
[  516.188011]  [<ffffffff810bb053>] vmap_page_range_noflush+0x10a/0x2da
[  516.188011]  [<ffffffff810bb250>] map_vm_area+0x2d/0x40
[  516.188011]  [<ffffffff810bbef9>] __vmalloc_node_range+0x16d/0x191
[  516.188011]  [<ffffffffa009cd80>] ? spl_kmem_cache_free+0xc0/0xde [spl]
[  516.188011]  [<ffffffff810bbf4d>] __vmalloc_node+0x30/0x32
[  516.188011]  [<ffffffffa009cd80>] ? spl_kmem_cache_free+0xc0/0xde [spl]
[  516.188011]  [<ffffffff810bc104>] __vmalloc+0x1b/0x1d
[  516.188011]  [<ffffffffa009cd80>] spl_kmem_cache_free+0xc0/0xde [spl]
[  516.188011]  [<ffffffffa009ce8d>] spl_kmem_cache_alloc+0xef/0x553 [spl]
[  516.188011]  [<ffffffff810c6336>] ? __kmalloc+0x118/0x126
[  516.188011]  [<ffffffffa009d5e8>] ? kmem_vasprintf+0x6e/0x80 [spl]
[  516.188011]  [<ffffffffa015923f>] dnode_setbonuslen+0xd5/0x21c [zfs]
[  516.188011]  [<ffffffffa015980e>] dnode_hold_impl+0x29d/0x42c [zfs]
[  516.188011]  [<ffffffffa01599b1>] dnode_hold+0x14/0xaf [zfs]
[  516.188011]  [<ffffffffa014c5b2>] dmu_bonus_hold+0x20/0x1e1 [zfs]
[  516.188011]  [<ffffffffa0174052>] sa_buf_hold+0x9/0xb [zfs]
[  516.188011]  [<ffffffffa01b5468>] zfs_zget+0xa9/0x299 [zfs]
[  516.188011]  [<ffffffffa0197d0d>] ? zap_lookup_norm+0x13d/0x14f [zfs]
[  516.188011]  [<ffffffffa019d733>] zfs_dirent_lock+0x447/0x48c [zfs]
[  516.188011]  [<ffffffffa019da1c>] zfs_dirlook+0x142/0x47a [zfs]
[  516.188011]  [<ffffffffa01999d9>] ? zfs_zaccess+0x136/0x1d5 [zfs]
[  516.188011]  [<ffffffffa01b256e>] zfs_lookup+0x269/0x2e0 [zfs]
[  516.188011]  [<ffffffffa01c0e46>] zpl_read_common+0x218/0x1028 [zfs]
[  516.188011]  [<ffffffff810d1d01>] d_alloc_and_lookup+0x49/0x68
[  516.188011]  [<ffffffff810d3423>] walk_component+0x160/0x30b
[  516.188011]  [<ffffffff810d43f0>] path_lookupat+0xaa/0x372
[  516.188011]  [<ffffffff810c5dc3>] ? kmem_cache_alloc+0x56/0xce
[  516.188011]  [<ffffffff810d46dd>] do_path_lookup+0x25/0x9d
[  516.188011]  [<ffffffff810d5302>] user_path_at+0x53/0x97
[  516.188011]  [<ffffffffa01b017a>] ? zfs_getattr_fast+0xa8/0xb3 [zfs]
[  516.188011]  [<ffffffff810cc2bb>] ? cp_new_stat+0x19e/0x1b6
[  516.188011]  [<ffffffff810cc07a>] vfs_fstatat+0x47/0x72
[  516.188011]  [<ffffffff810cc0be>] vfs_lstat+0x19/0x1b
[  516.188011]  [<ffffffff810cc325>] sys_newlstat+0x1a/0x38
[  516.188011]  [<ffffffff81559d86>] system_call_fastpath+0x16/0x1b
[  516.188011]  [<ffffffff81559d25>] ? system_call_after_swapgs+0x15/0x60
[  516.188011] Code: c3 ff 0f 00 00 74 04 0f 0b eb fe 31 f6 48 89 df e8 3a ef fe ff eb 18 48 89 df e8 b2 79 f7 ff 48 89 df e8 aa 79 f7 ff 48 83 c8 67 
[  516.188011]  89 04 24 31 c0 5b 41 5c c9 c3 55 48 83 3e 00 48 89 e5 41 54 
[  516.188011] RIP  [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  516.188011]  RSP <ffff880016b25638>
[  516.188011] CR2: ffff88000156a008
[  516.188011] ---[ end trace 25a0461acbecc0fd ]---

After this appears, the system grinds to a halt over a minute or two and needs a hard reset.

I apologise if this is a duplicate of an existing issue - I was unable to tell. Please let me know if you need any more output.

@behlendorf
Contributor

Thanks for the bug report and going through the effort to get the console stacks. They'll make it possible to run this issue to ground.

@mrobbetts
Author

Quick update:

I can confirm that using a gentoo-sources (as opposed to hardened-sources) kernel does not fix the problem. While I have not fully switched over to the non-hardened toolchain (this takes ages), I set gcc's profile to 'vanilla' with gcc-config and rebuilt both the kernel and the spl/zfs packages with it; I'm presuming this is sufficient to remove the effect of the hardened system. If anyone knows otherwise, please chime in.
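
For completeness, the switch looked roughly like this (the exact profile and package names here are illustrative):

gcc-config -l                                  # list available compiler profiles
gcc-config x86_64-pc-linux-gnu-4.4.5-vanilla   # select the vanilla (non-hardened) profile
env-update && source /etc/profile
# then rebuild the kernel, followed by the modules:
emerge --oneshot sys-kernel/spl sys-fs/zfs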

Is it worth trying with vanilla-sources/git-sources? I'm under the impression that gentoo's patches are minimal, but don't know any detail.

Unfortunately, despite having crashed it three times since changing kernels, netconsole did not send any output (the system does remain "up" for a while, and continues other duties like being a wireless AP and returning pings, but all SSH connections freeze), so I have nothing to paste here yet. I'll keep trying.
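
For reference, netconsole here is loaded along these lines (addresses, interface, and MAC are placeholders; syntax per the kernel's Documentation/networking/netconsole.txt):

# make sure all kernel messages go to the console
dmesg -n 8
# src-port@src-ip/interface -> dst-port@dst-ip/dst-mac
modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55
# on the collecting machine, listen for the UDP log packets (netcat syntax varies):
nc -u -l 6666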

Any hints with regard to capturing crash traces with netconsole are welcome :)

@behlendorf
Contributor

There is a decent chance this was fixed by issue #279. Can you try the 0.6.0-rc6 sources (or master), which contain the fix?
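
If it helps, building from git looks roughly like this (an illustrative sketch; spl is built first so zfs can be configured against it, and the installs run as root):

git clone https://github.com/zfsonlinux/spl.git
cd spl && sh autogen.sh && ./configure && make && make install && cd ..
git clone https://github.com/zfsonlinux/zfs.git
cd zfs && sh autogen.sh && ./configure --with-spl=../spl && make && make install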

@mrobbetts
Author

Thanks for the heads-up. Rebuilding from master appears to have fixed the issue! (I've been running daily rsyncs and snapshot sets since the weekend and haven't observed a crash.) However, I do still get crashes with the new code running on the hardened kernel, so I believe I was actually suffering from two issues at once. I'm perfectly happy to use the vanilla kernel, though.
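
(The daily workload is essentially the following; paths and dataset names are placeholders:)

rsync -a --delete /data/ /tank/backup/
zfs snapshot tank/backup@daily-$(date +%F)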

Thanks a lot for your help.


Heh, I seem to have a machine that really brings out the worst in ZFS! Now that the basic stuff seems to be working, running zfs send (either redirecting to a file or piping to ssh) quite repeatably locks the machine up solid in about two seconds. No debug output yet. I'll try to get some somehow and open a new issue.
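
For the record, the invocations that trigger it are along these lines (pool/dataset names and destinations are placeholders):

zfs snapshot tank/data@now
zfs send tank/data@now > /mnt/scratch/data.zsend              # redirecting to a file
zfs send tank/data@now | ssh otherhost 'cat > data.zsend'     # or piping to ssh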

@behlendorf
Contributor

OK, well I'm glad this fixed your first issue. Please go ahead and open new bugs for the other issues you've seen and we'll get them fixed as well. A stack from the console during the crash is always helpful, as are simple reproducers.

ahrens pushed a commit to ahrens/zfs that referenced this issue Sep 23, 2021
Follow up from openzfs#404 and openzfs#422 where we used an iterator fold() where
we really wanted to use reduce().