Crash during zfs activity (scrub, rsync, etc.) #404

Closed
mrobbetts opened this issue Sep 21, 2011 · 5 comments

@mrobbetts

Hi all,

I'm getting crashes when ZFS is under load. It takes a random amount of time to happen, ranging from almost immediately after loading the zfs module to over a day.

I'm running zfs-9999 from Pendor's overlay (built since rc5 was tagged, but I'm not sure of the revision) on Gentoo Hardened 2.6.39 (amd64) on a system with 1.5 GB of RAM, and have a single-disk zpool on a 1 TB Samsung disk. I cannot swear it's related, but I think this crash only started happening when I moved to the Hardened kernel/userland.

I load the zfs module with the ARC restricted to 512 MB; while the system is running, I see ZFS memory usage vary a lot, but it seems stable overall.
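
For reference, the ARC cap is set via the zfs_arc_max module option, e.g. in /etc/modprobe.d/zfs.conf (512 MB expressed in bytes):

# limit the ZFS ARC to 512 MB (512 * 1024 * 1024 bytes)
options zfs zfs_arc_max=536870912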

I managed to get the following via netconsole:

[  515.695406] BUG: unable to handle kernel paging request at ffff88000156a008
[  515.695864] IP: [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  515.696009] PGD 8000000001569063 PUD 1570063 PMD 14001e1 
[  515.696009] Oops: 0003 [#1] 
[  515.696009] last sysfs file: /sys/devices/virtual/bdi/zfs-10/uevent
[  515.696009] CPU 0 
[  515.696009] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx snd_hdsp snd_rawmidi snd_hwdep
[  515.696009] 
[  515.696009] Pid: 3379, comm: rsync Tainted: P            2.6.39-hardened-r8 #6    /MS-7030
[  515.696009] RIP: 0010:[<ffffffff810aff42>]  [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  515.696009] RSP: 0018:ffff88004eb29638  EFLAGS: 00010202
[  515.696009] RAX: 0000000009b32067 RBX: ffff880009b32000 RCX: 0000000000000000
[  515.696009] RDX: ffff880000000000 RSI: 0000000000000001 RDI: ffff880009b32000
[  515.696009] RBP: ffff88004eb29648 R08: 0000000000000000 R09: 00000000ffffffff
[  515.696009] R10: 00000000ffffff02 R11: 00000000ffffff01 R12: ffff88000156a008
[  515.696009] R13: ffffc90040002000 R14: ffffffff81567c90 R15: ffffc90040006000
[  515.696009] FS:  00000278fa3d8700(0000) GS:ffffffff81811000(0000) knlGS:0000000000000000
[  515.696009] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  515.696009] CR2: ffff88000156a008 CR3: 0000000001568000 CR4: 00000000000006f0
[  515.696009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  515.696009] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  515.696009] Process rsync (pid: 3379, threadinfo ffff88003cb73910, task ffff88003cb73480)
[  515.696009] Stack:
[  515.696009]  ffff88000156a008 ffffc90040006000 ffff88004eb296d8 ffffffff810bb053
[  515.696009]  ffffffff8186c8e8 ffff880009a93a00 0000000109a93a98 ffffc90040005fff
[  515.696009]  ffffffff007fffff ffff880009af4760 8000000000000163 ffffc90040006000
[  515.696009] Call Trace:
[  515.696009]  [<ffffffff810bb053>] vmap_page_range_noflush+0x10a/0x2da
[  515.696009]  [<ffffffff810bb250>] map_vm_area+0x2d/0x40
[  515.696009]  [<ffffffff810bbef9>] __vmalloc_node_range+0x16d/0x191
[  515.696009]  [<ffffffffa009cd80>] ? spl_kmem_cache_free+0xc0/0xde [spl]
[  515.696009]  [<ffffffff810bbf4d>] __vmalloc_node+0x30/0x32
[  515.696009]  [<ffffffffa009cd80>] ? spl_kmem_cache_free+0xc0/0xde [spl]
[  515.696009]  [<ffffffff810bc104>] __vmalloc+0x1b/0x1d
[  515.696009]  [<ffffffffa009cd80>] spl_kmem_cache_free+0xc0/0xde [spl]
[  515.696009]  [<ffffffffa009ce8d>] spl_kmem_cache_alloc+0xef/0x553 [spl]
[  515.696009]  [<ffffffff810c599f>] ? kfree+0xa8/0xcc
[  515.696009]  [<ffffffffa015923f>] dnode_setbonuslen+0xd5/0x21c [zfs]
[  515.696009]  [<ffffffffa015980e>] dnode_hold_impl+0x29d/0x42c [zfs]
[  515.696009]  [<ffffffffa01599b1>] dnode_hold+0x14/0xaf [zfs]
[  515.696009]  [<ffffffffa014c5b2>] dmu_bonus_hold+0x20/0x1e1 [zfs]
[  515.696009]  [<ffffffffa0174052>] sa_buf_hold+0x9/0xb [zfs]
[  515.696009]  [<ffffffffa01b5468>] zfs_zget+0xa9/0x299 [zfs]
[  515.696009]  [<ffffffffa0197d0d>] ? zap_lookup_norm+0x13d/0x14f [zfs]
[  515.696009]  [<ffffffffa019d733>] zfs_dirent_lock+0x447/0x48c [zfs]
[  515.696009]  [<ffffffffa019da1c>] zfs_dirlook+0x142/0x47a [zfs]
[  515.696009]  [<ffffffffa01999d9>] ? zfs_zaccess+0x136/0x1d5 [zfs]
[  515.696009]  [<ffffffffa01b256e>] zfs_lookup+0x269/0x2e0 [zfs]
[  515.696009]  [<ffffffffa01c0e46>] zpl_read_common+0x218/0x1028 [zfs]
[  515.696009]  [<ffffffff810d1d01>] d_alloc_and_lookup+0x49/0x68
[  515.696009]  [<ffffffff810d3423>] walk_component+0x160/0x30b
[  515.696009]  [<ffffffff810d43f0>] path_lookupat+0xaa/0x372
[  515.696009]  [<ffffffff810e082f>] ? mntput+0x21/0x23
[  515.696009]  [<ffffffff810d1c00>] ? path_put+0x1d/0x22
[  515.696009]  [<ffffffff810c5dc3>] ? kmem_cache_alloc+0x56/0xce
[  515.696009]  [<ffffffff810d46dd>] do_path_lookup+0x25/0x9d
[  515.696009]  [<ffffffff810d5302>] user_path_at+0x53/0x97
[  515.696009]  [<ffffffffa01b017a>] ? zfs_getattr_fast+0xa8/0xb3 [zfs]
[  515.696009]  [<ffffffff810cc2bb>] ? cp_new_stat+0x19e/0x1b6
[  515.696009]  [<ffffffff810cc07a>] vfs_fstatat+0x47/0x72
[  515.696009]  [<ffffffff810cc0be>] vfs_lstat+0x19/0x1b
[  515.696009]  [<ffffffff810cc325>] sys_newlstat+0x1a/0x38
[  515.696009]  [<ffffffff81559d86>] system_call_fastpath+0x16/0x1b
[  515.696009]  [<ffffffff81559d25>] ? system_call_after_swapgs+0x15/0x60
[  515.696009] Code: c3 ff 0f 00 00 74 04 0f 0b eb fe 31 f6 48 89 df e8 3a ef fe ff eb 18 48 89 df e8 b2 79 f7 ff 48 89 df e8 aa 79 f7 ff 48 83 c8 67 
[  515.696009]  89 04 24 31 c0 5b 41 5c c9 c3 55 48 83 3e 00 48 
[  515.696009] RIP  [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  515.696009]  RSP <ffff88004eb29638>
[  515.696009] CR2: ffff88000156a008
[  515.696009] ---[ end trace 25a0461acbecc0fc ]---
[  515.751098] rsync used greatest stack depth: 2848 bytes left
[  516.187362] BUG: unable to handle kernel paging request at ffff88000156a008
[  516.188011] IP: [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  516.188011] PGD 8000000001569063 PUD 1570063 PMD 14001e1 
[  516.188011] Oops: 0003 [#2] 
[  516.188011] last sysfs file: /sys/devices/virtual/bdi/zfs-10/uevent
[  516.188011] CPU 0 
[  516.188011] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx snd_hdsp snd_rawmidi snd_hwdep
[  516.188011] 
[  516.188011] Pid: 3477, comm: rsync Tainted: P      D     2.6.39-hardened-r8 #6    /MS-7030
[  516.188011] RIP: 0010:[<ffffffff810aff42>]  [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  516.188011] RSP: 0018:ffff880016b25638  EFLAGS: 00010202
[  516.188011] RAX: 0000000018de1067 RBX: ffff880018de1000 RCX: 0000000000000000
[  516.188011] RDX: ffff880000000000 RSI: 0000000000000001 RDI: ffff880018de1000
[  516.188011] RBP: ffff880016b25648 R08: 0000000000000000 R09: 000000000000a4d8
[  516.188011] R10: 00000000ffffff02 R11: 00000000ffffff01 R12: ffff88000156a008
[  516.188011] R13: ffffc90040008000 R14: ffffffff81567c90 R15: ffffc9004000c000
[  516.188011] FS:  000002bf5701b700(0000) GS:ffffffff81811000(0000) knlGS:0000000000000000
[  516.188011] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  516.188011] CR2: ffff88000156a008 CR3: 0000000001568000 CR4: 00000000000006f0
[  516.188011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  516.188011] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  516.188011] Process rsync (pid: 3477, threadinfo ffff88005d0d5b10, task ffff88005d0d5680)
[  516.188011] Stack:
[  516.188011]  ffff88000156a008 ffffc9004000c000 ffff880016b256d8 ffffffff810bb053
[  516.188011]  ffffffff8186c8e8 ffff880009a93a80 0000000109a93d18 ffffc9004000bfff
[  516.188011]  ffffffff007fffff ffff880018de6180 8000000000000163 ffffc9004000c000
[  516.188011] Call Trace:
[  516.188011]  [<ffffffff810bb053>] vmap_page_range_noflush+0x10a/0x2da
[  516.188011]  [<ffffffff810bb250>] map_vm_area+0x2d/0x40
[  516.188011]  [<ffffffff810bbef9>] __vmalloc_node_range+0x16d/0x191
[  516.188011]  [<ffffffffa009cd80>] ? spl_kmem_cache_free+0xc0/0xde [spl]
[  516.188011]  [<ffffffff810bbf4d>] __vmalloc_node+0x30/0x32
[  516.188011]  [<ffffffffa009cd80>] ? spl_kmem_cache_free+0xc0/0xde [spl]
[  516.188011]  [<ffffffff810bc104>] __vmalloc+0x1b/0x1d
[  516.188011]  [<ffffffffa009cd80>] spl_kmem_cache_free+0xc0/0xde [spl]
[  516.188011]  [<ffffffffa009ce8d>] spl_kmem_cache_alloc+0xef/0x553 [spl]
[  516.188011]  [<ffffffff810c6336>] ? __kmalloc+0x118/0x126
[  516.188011]  [<ffffffffa009d5e8>] ? kmem_vasprintf+0x6e/0x80 [spl]
[  516.188011]  [<ffffffffa015923f>] dnode_setbonuslen+0xd5/0x21c [zfs]
[  516.188011]  [<ffffffffa015980e>] dnode_hold_impl+0x29d/0x42c [zfs]
[  516.188011]  [<ffffffffa01599b1>] dnode_hold+0x14/0xaf [zfs]
[  516.188011]  [<ffffffffa014c5b2>] dmu_bonus_hold+0x20/0x1e1 [zfs]
[  516.188011]  [<ffffffffa0174052>] sa_buf_hold+0x9/0xb [zfs]
[  516.188011]  [<ffffffffa01b5468>] zfs_zget+0xa9/0x299 [zfs]
[  516.188011]  [<ffffffffa0197d0d>] ? zap_lookup_norm+0x13d/0x14f [zfs]
[  516.188011]  [<ffffffffa019d733>] zfs_dirent_lock+0x447/0x48c [zfs]
[  516.188011]  [<ffffffffa019da1c>] zfs_dirlook+0x142/0x47a [zfs]
[  516.188011]  [<ffffffffa01999d9>] ? zfs_zaccess+0x136/0x1d5 [zfs]
[  516.188011]  [<ffffffffa01b256e>] zfs_lookup+0x269/0x2e0 [zfs]
[  516.188011]  [<ffffffffa01c0e46>] zpl_read_common+0x218/0x1028 [zfs]
[  516.188011]  [<ffffffff810d1d01>] d_alloc_and_lookup+0x49/0x68
[  516.188011]  [<ffffffff810d3423>] walk_component+0x160/0x30b
[  516.188011]  [<ffffffff810d43f0>] path_lookupat+0xaa/0x372
[  516.188011]  [<ffffffff810c5dc3>] ? kmem_cache_alloc+0x56/0xce
[  516.188011]  [<ffffffff810d46dd>] do_path_lookup+0x25/0x9d
[  516.188011]  [<ffffffff810d5302>] user_path_at+0x53/0x97
[  516.188011]  [<ffffffffa01b017a>] ? zfs_getattr_fast+0xa8/0xb3 [zfs]
[  516.188011]  [<ffffffff810cc2bb>] ? cp_new_stat+0x19e/0x1b6
[  516.188011]  [<ffffffff810cc07a>] vfs_fstatat+0x47/0x72
[  516.188011]  [<ffffffff810cc0be>] vfs_lstat+0x19/0x1b
[  516.188011]  [<ffffffff810cc325>] sys_newlstat+0x1a/0x38
[  516.188011]  [<ffffffff81559d86>] system_call_fastpath+0x16/0x1b
[  516.188011]  [<ffffffff81559d25>] ? system_call_after_swapgs+0x15/0x60
[  516.188011] Code: c3 ff 0f 00 00 74 04 0f 0b eb fe 31 f6 48 89 df e8 3a ef fe ff eb 18 48 89 df e8 b2 79 f7 ff 48 89 df e8 aa 79 f7 ff 48 83 c8 67 
[  516.188011]  89 04 24 31 c0 5b 41 5c c9 c3 55 48 83 3e 00 48 89 e5 41 54 
[  516.188011] RIP  [<ffffffff810aff42>] __pmd_alloc+0x54/0x5f
[  516.188011]  RSP <ffff880016b25638>
[  516.188011] CR2: ffff88000156a008
[  516.188011] ---[ end trace 25a0461acbecc0fd ]---

After this appears, the system grinds to a halt over a minute or two and needs a hard reset.

I apologise if this is a duplicate of an existing issue - I was unable to tell. Please let me know if you need any more output.

@behlendorf
Contributor

Thanks for the bug report and going through the effort to get the console stacks. They'll make it possible to run this issue to ground.

@mrobbetts
Author

Quick update:

I can confirm that using a gentoo-sources (as opposed to hardened-sources) kernel does not fix the problem. While I have not fully switched over to the non-hardened toolchain (this takes ages), I set gcc's profile to 'vanilla' with gcc-config and rebuilt both the kernel and the spl/zfs packages with it; I'm presuming this is sufficient to remove the effect of the hardened system. If anyone knows otherwise, please chime in.
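
For completeness, the switch looked roughly like this (the exact profile and package names here are illustrative):

gcc-config -l                                  # list available compiler profiles
gcc-config x86_64-pc-linux-gnu-4.4.5-vanilla   # select the vanilla (non-hardened) profile
env-update && source /etc/profile
# then rebuild the kernel, followed by the modules:
emerge --oneshot sys-kernel/spl sys-fs/zfs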

Is it worth trying with vanilla-sources/git-sources? I'm under the impression that gentoo's patches are minimal, but don't know any detail.

Unfortunately, despite having crashed it three times since changing kernels, netconsole did not send any output (the system does remain "up" for a while, and continues other duties like being a wireless AP and returning pings, but all SSH connections freeze), so I have nothing to paste here yet. I'll keep trying.
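
For reference, netconsole here is loaded along these lines (addresses, interface, and MAC are placeholders; syntax per the kernel's Documentation/networking/netconsole.txt):

# make sure all kernel messages go to the console
dmesg -n 8
# src-port@src-ip/interface -> dst-port@dst-ip/dst-mac
modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55
# on the collecting machine, listen for the UDP log packets (netcat syntax varies):
nc -u -l 6666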

Any hints with regard to capturing crash traces with netconsole are welcome :)

@behlendorf
Contributor

There is a decent chance this was fixed by issue #279. Can you try the 0.6.0-rc6 sources (or master), which contain the fix?
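
If it helps, building from git looks roughly like this (an illustrative sketch; spl is built first so zfs can be configured against it, and the installs run as root):

git clone https://github.com/zfsonlinux/spl.git
cd spl && sh autogen.sh && ./configure && make && make install && cd ..
git clone https://github.com/zfsonlinux/zfs.git
cd zfs && sh autogen.sh && ./configure --with-spl=../spl && make && make install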

@mrobbetts
Author

Thanks for the heads-up. Rebuilding from master appears to have fixed the issue! (I've been running daily rsyncs and snapshot sets since the weekend and haven't observed a crash.) However, I do still get crashes with the new code running on the hardened kernel, so I believe I was actually suffering from two issues at once. I'm perfectly happy to use the vanilla kernel, though.
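
(The daily workload is essentially the following; paths and dataset names are placeholders:)

rsync -a --delete /data/ /tank/backup/
zfs snapshot tank/backup@daily-$(date +%F)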

Thanks a lot for your help.


Heh, I seem to have a machine that really brings out the worst in ZFS! Now that the basic stuff seems to be working, running zfs send (either redirecting to a file or piping to ssh) quite repeatably locks the machine up solid in about two seconds. No debug output yet. I'll try to get some somehow and open a new issue.
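
For the record, the invocations that trigger it are along these lines (pool/dataset names and destinations are placeholders):

zfs snapshot tank/data@now
zfs send tank/data@now > /mnt/scratch/data.zsend              # redirecting to a file
zfs send tank/data@now | ssh otherhost 'cat > data.zsend'     # or piping to ssh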

@behlendorf
Contributor

OK, well I'm glad this fixed your first issue. Please go ahead and open new bugs for the other issues you've seen and we'll get them fixed as well. A stack from the console during the crash is always helpful, as are simple reproducers.

ahrens pushed a commit to ahrens/zfs that referenced this issue Sep 23, 2021
Follow up from openzfs#404 and openzfs#422 where we used an iterator fold() where
we really wanted to use reduce().