CPU soft lock 67s on rm operation #1263

Closed
a-frank-exchange-of-views opened this issue Feb 4, 2013 · 18 comments

@a-frank-exchange-of-views

rc14.

Linux version 2.6.32-279.19.1.el6.x86_64 ([email protected]) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Wed Dec 19 07:05:20 UTC 2012

I'm attempting to rm a folder. After a bit of time the rm locks (100% CPU usage) and I start to get soft locks:

Message from syslogd@khrushchev at Feb 4 16:54:12 ...
kernel:BUG: soft lockup - CPU#2 stuck for 67s! [rm:2934]

It's always the 67s figure.

This happens only on one particular folder. It may be relevant that many of the files contained therein have been created with cp -l (backup strategy involves hard linking to preserve space... Around 40 nested links in this instance)

config:

NAME        STATE     READ WRITE CKSUM
pool1       ONLINE       0     0     0
  raidz2-0  ONLINE       0     0     0
    sdb     ONLINE       0     0     0
    sdc     ONLINE       0     0     0
    sdd     ONLINE       0     0     0
    sde     ONLINE       0     0     0

Here's what I get out of /var/log/messages:

Feb 4 17:21:37 khrushchev kernel: BUG: soft lockup - CPU#2 stuck for 67s! [rm:3976]
Feb 4 17:21:37 khrushchev kernel: Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate r8169 mii e1000e microcode serio_raw i2c_i801 sg iTCO_wdt iTCO_vendor_support shpchp snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc xhci_hcd ext4 mbcache jbd2 sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Feb 4 17:21:37 khrushchev kernel: CPU 2
Feb 4 17:21:37 khrushchev kernel: Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate r8169 mii e1000e microcode serio_raw i2c_i801 sg iTCO_wdt iTCO_vendor_support shpchp snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc xhci_hcd ext4 mbcache jbd2 sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Feb 4 17:21:37 khrushchev kernel:
Feb 4 17:21:37 khrushchev kernel: Pid: 3976, comm: rm Tainted: P --------------- 2.6.32-279.19.1.el6.x86_64 #1 Gigabyte Technology Co., Ltd. H55M-S2H/H55M-S2H
Feb 4 17:21:37 khrushchev kernel: RIP: 0010:[] [] zap_leaf_lookup_closest+0xbf/0x180 [zfs]
Feb 4 17:21:37 khrushchev kernel: RSP: 0018:ffff88019b199b48 EFLAGS: 00000206
Feb 4 17:21:37 khrushchev kernel: RAX: 0000000000000000 RBX: ffff88019b199b78 RCX: ffffc90054d05430
Feb 4 17:21:37 khrushchev kernel: RDX: ffff88019b199ba8 RSI: 4000000000000000 RDI: ffff8801acca7540
Feb 4 17:21:37 khrushchev kernel: RBP: ffffffff8100bb8e R08: 0002000000000000 R09: 0000000000000000
Feb 4 17:21:37 khrushchev kernel: R10: 0000000000000001 R11: ffffffffffffffff R12: ffffffff814eb836
Feb 4 17:21:37 khrushchev kernel: R13: ffff88019b199ac8 R14: 0000000000000002 R15: 0000000000000003
Feb 4 17:21:37 khrushchev kernel: FS: 00007f41e46e2700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
Feb 4 17:21:37 khrushchev kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 4 17:21:37 khrushchev kernel: CR2: 0000000001c86568 CR3: 00000001a11d8000 CR4: 00000000000006e0
Feb 4 17:21:37 khrushchev kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 4 17:21:37 khrushchev kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 4 17:21:37 khrushchev kernel: Process rm (pid: 3976, threadinfo ffff88019b198000, task ffff8802192d4aa0)
Feb 4 17:21:37 khrushchev kernel: Stack:
Feb 4 17:21:37 khrushchev kernel: 000000009b199b78 ffff88019b199d08 ffff8801acca7540 0000000000000000
Feb 4 17:21:37 khrushchev kernel: ffff88019b199ba8 ffff88021899e900 ffff88019b199c08 ffffffffa0456476
Feb 4 17:21:37 khrushchev kernel: ffffffff00000001 0000000000000000 ffff88019b199d18 ffff88019b199d48
Feb 4 17:21:37 khrushchev kernel: Call Trace:
Feb 4 17:21:37 khrushchev kernel: [] ? fzap_cursor_retrieve+0x86/0x2a0 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? zap_cursor_retrieve+0x17b/0x2f0 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? dmu_prefetch+0x169/0x220 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? zfs_readdir+0x17a/0x510 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? filldir+0x0/0xe0
Feb 4 17:21:37 khrushchev kernel: [] ? __dentry_open+0x23f/0x360
Feb 4 17:21:37 khrushchev kernel: [] ? bstat+0x79e/0x970
Feb 4 17:21:37 khrushchev kernel: [] ? tsd_hash_search+0x99/0x200 [spl]

Message from syslogd@khrushchev at Feb 4 17:21:37 ...
kernel:BUG: soft lockup - CPU#2 stuck for 67s! [rm:3976]
Feb 4 17:21:37 khrushchev kernel: [] ? tsd_exit+0x5f/0x2b0 [spl]
Feb 4 17:21:37 khrushchev kernel: [] ? mutex_lock+0x1e/0x50
Feb 4 17:21:37 khrushchev kernel: [] ? rcu_start_gp+0x1be/0x230
Feb 4 17:21:37 khrushchev kernel: [] ? filldir+0x0/0xe0
Feb 4 17:21:37 khrushchev kernel: [] ? zpl_readdir+0x5c/0x90 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? filldir+0x0/0xe0
Feb 4 17:21:37 khrushchev kernel: [] ? vfs_readdir+0xc0/0xe0
Feb 4 17:21:37 khrushchev kernel: [] ? sys_getdents+0x89/0xf0
Feb 4 17:21:37 khrushchev kernel: [] ? system_call_fastpath+0x16/0x1b
Feb 4 17:21:37 khrushchev kernel: Code: f8 ff 74 75 4c 8b 47 40 0f b7 c8 45 89 d7 4c 8d 0c 49 8b 4f 30 83 e9 05 41 d3 e7 49 63 cf 4a 8d 0c 89 49 8d 4c 48 30 4c 8b 41 10 <4c> 39 c6 73 c4 4d 39 c3 76 57 44 8b 49 0c 44 0f b7 59 0a 41 89
Feb 4 17:21:37 khrushchev kernel: Call Trace:
Feb 4 17:21:37 khrushchev kernel: [] ? zap_deref_leaf+0x6a/0x80 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? fzap_cursor_retrieve+0x86/0x2a0 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? zap_cursor_retrieve+0x17b/0x2f0 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? dmu_prefetch+0x169/0x220 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? zfs_readdir+0x17a/0x510 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? filldir+0x0/0xe0
Feb 4 17:21:37 khrushchev kernel: [] ? __dentry_open+0x23f/0x360
Feb 4 17:21:37 khrushchev kernel: [] ? bstat+0x79e/0x970
Feb 4 17:21:37 khrushchev kernel: [] ? tsd_hash_search+0x99/0x200 [spl]
Feb 4 17:21:37 khrushchev kernel: [] ? tsd_exit+0x5f/0x2b0 [spl]
Feb 4 17:21:37 khrushchev kernel: [] ? mutex_lock+0x1e/0x50
Feb 4 17:21:37 khrushchev kernel: [] ? rcu_start_gp+0x1be/0x230
Feb 4 17:21:37 khrushchev kernel: [] ? filldir+0x0/0xe0
Feb 4 17:21:37 khrushchev kernel: [] ? zpl_readdir+0x5c/0x90 [zfs]
Feb 4 17:21:37 khrushchev kernel: [] ? filldir+0x0/0xe0
Feb 4 17:21:37 khrushchev kernel: [] ? vfs_readdir+0xc0/0xe0
Feb 4 17:21:37 khrushchev kernel: [] ? sys_getdents+0x89/0xf0
Feb 4 17:21:37 khrushchev kernel: [] ? system_call_fastpath+0x16/0x1b

@behlendorf
Contributor

@ElComradio Do you believe this is a regression with rc14 or was it observed with rc13 as well? The good news is that these are soft lockups, so probably nothing has failed; it's just moving slowly due to what appears to be some sort of contention.

If you're able to reproduce this issue on the directory, could you run 'perf top' on the system while it's occurring? That should implicate the hot spots in the code which need the most work.

@a-frank-exchange-of-views
Author

I saw this with rc13 also. I suppose it could be moving slowly but this will go on for hours and I've never seen it stop once it's started. kill -9 fails to kill the rm process.

This is probably what you are looking to see from perf top?

93.92%  [zfs]  [k] zap_leaf_lookup_closest

@behlendorf
Contributor

@ElComradio It certainly sounds like it is trapped spinning in the for() loop in zap_leaf_lookup_closest(). If that's the case, it would explain why it's unkillable and consumes so much processor. Here's what I'd try first.

  • Use zdb -vvvv <pool>/<dataset> <object id> to dump detailed debugging information about the directory in question. The data on disk could be incorrect but still have a good checksum, so let's see what zdb has to say. You can get the object number by stat'ing the directory name; the inode number will be the object id (stat <bad directory>, look for 'Inode: xxx' in the output).
  • Rebuild the spl and zfs code with ./configure --enable-debug. That will enable the various ASSERTs in the code base, including the three in this function. I'd expect one of them to trip and halt the thread, printing out some useful debugging about what exactly is wrong.

@a-frank-exchange-of-views
Author

Thanks again. Not exactly sure what <dataset> is... when I run:

[root@khrushchev ssl]# zdb -vvvv pool1 153951 > ~/zdb.txt
zdb: dmu_bonus_hold(153951) failed, errno 2

I get a huge amount of information in zdb.txt.

Will try the --enable-debug build now and report back.

@behlendorf
Contributor

Make sure you include the dataset name in the zdb command. For example:

zdb -vvvv pool1/the-broken-fs 153951

@a-frank-exchange-of-views
Author

Forgive my ignorance but I'm still unclear as to what "the-broken-fs" entails. I've tried various things but no dice.

I've only ever addressed my FS by 'pool1'; e.g. "zfs mount pool1" mounts pool1 on /pool1. However, since rebuilding with debug and rebooting, I'm now unable to mount the FS at all. zpool status still shows OK.

[root@khrushchev comrade]# zfs mount pool1
cannot mount 'pool1': Invalid argument

@a-frank-exchange-of-views
Author

Never mind the last comment - that was a compile or module install failure.

@a-frank-exchange-of-views
Author

zdb -vvvv pool1 153951 generates a bunch of output (far too much to paste), ending with:

Dataset mos [META], ID 0, cr_txg 4, 10.4M, 58 objects

Object  lvl   iblk   dblk  dsize  lsize   %full  type

zdb: dmu_bonus_hold(153951) failed, errno 2

There are 115 metaslab sections above this.

We now get this in dmesg:

SPLError: 4043:0:(zap_leaf.c:462:zap_leaf_lookup_closest()) VERIFY3(chunk < (((1<<(l)->l_bs) - 2*(1 << ((l)->l_bs - 5))) / 24 - 2)) failed (57343 < 638)
SPLError: 4043:0:(zap_leaf.c:462:zap_leaf_lookup_closest()) SPL PANIC
SPL: Showing stack for process 4043
Pid: 4043, comm: rm Tainted: P --------------- 2.6.32-279.19.1.el6.x86_64 #1
Call Trace:
[] ? spl_debug_dumpstack+0x27/0x40 [spl]
[] ? spl_debug_bug+0x81/0xd0 [spl]
[] ? zap_leaf_lookup_closest+0x2dd/0x390 [zfs]
[] ? zap_deref_leaf+0x106/0x1d0 [zfs]
[] ? fzap_cursor_retrieve+0x86/0x410 [zfs]
[] ? zap_cursor_retrieve+0x6b/0x450 [zfs]
[] ? avc_has_perm+0x71/0x90
[] ? dmu_prefetch+0x205/0x330 [zfs]
[] ? filldir+0x7d/0xe0
[] ? zfs_readdir+0x17a/0x510 [zfs]
[] ? avc_has_perm+0x71/0x90
[] ? filldir+0x0/0xe0
[] ? tsd_exit+0x68/0x380 [spl]
[] ? rrw_exit+0xbf/0x320 [zfs]
[] ? dmu_object_size_from_db+0x70/0xa0 [zfs]
[] ? zfs_getattr_fast+0x100/0x1f0 [zfs]
[] ? filldir+0x0/0xe0
[] ? zpl_readdir+0x5c/0xe0 [zfs]
[] ? selinux_file_permission+0xfb/0x150
[] ? filldir+0x0/0xe0
[] ? vfs_readdir+0xc0/0xe0
[] ? sys_getdents+0x89/0xf0
[] ? system_call_fastpath+0x16/0x1b
SPL: Dumping log to /tmp/spl-log.1360025953.4043

@behlendorf
Contributor

@ElComradio The VERIFY3() message in dmesg is exactly what I was looking for. If I were to make an educated guess, I'd say you were the victim of memory corruption. Here's why:

The second for() loop in zap_leaf_lookup_closest is designed to terminate when it reaches the CHAIN_END terminator. That's never happening here because it looks like that terminator has been damaged. The value 57343 (0xDFFF in hex) above in the VERIFY is likely supposed to be the terminator, which is 0xFFFF. A single bit flip in memory could have resulted in the bogus terminator getting written with a valid checksum.
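
To make that concrete, here's a tiny user-space model of the failure mode (just a sketch I put together for this comment, not the actual ZFS code; CHAIN_END, the chunk layout, and the 638 limit are simplified stand-ins for the real leaf structures):

    #include <stdint.h>
    #include <stdio.h>

    #define CHAIN_END  0xffff   /* normal end-of-chain marker */
    #define NUMCHUNKS  638      /* the upper bound reported by the failed VERIFY3 */

    /* Grossly simplified leaf chunk: just the 16-bit "next" index that
       chains entries together. */
    struct chunk { uint16_t next; };

    int main(void)
    {
        struct chunk chunks[NUMCHUNKS] = {{0}};

        chunks[0].next = 1;
        chunks[1].next = 0xdfff;   /* should be CHAIN_END (0xffff); one cleared
                                      bit gives 0xdfff = 57343, as in the report */

        int steps = 0;
        for (uint16_t c = 0; c != CHAIN_END; c = chunks[c].next) {
            /* Without a range check (the VERIFY3 in a debug build, or the
               guard suggested below) the walk reads past the array and never
               finds CHAIN_END, which is the spin you saw. */
            if (c >= NUMCHUNKS) {
                printf("bad chunk index %u after %d steps\n", c, steps);
                return 1;
            }
            steps++;
        }
        printf("chain ended cleanly after %d steps\n", steps);
        return 0;
    }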

Are you using ECC memory on this system? I'd check for any logged memory errors.

@csiden This sort of thing could also explain the following Illumos panics. In this case a low bit was cleared so we spin in the for() loop, but if a high bit got set we'd expect to see a NULL dereference or bad paging request just like these:

https://www.illumos.org/issues/1305
https://www.illumos.org/issues/2233
https://www.illumos.org/issues/3320

A little extra paranoia in this function wouldn't be a bad idea, since it blindly indexes into memory based on an on-disk value. At a minimum we could terminate the loop if 'chunk > ZAP_LEAF_NUMCHUNKS(l)', which would prevent the panic and, in this case, the spinning. The damage would still exist on disk, but at least it would be gracefully handled at run time. Something like this:

behlendorf/zfs@faf8d51
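
In rough user-space terms, the shape of that check is something like this (an illustrative sketch only, not the actual patch above):

    #include <errno.h>
    #include <stdint.h>

    #define CHAIN_END 0xffff

    /* Walk a chain of 16-bit "next" indices, treating the on-disk value as
       untrusted.  A damaged chain returns EFAULT to the caller instead of
       panicking or spinning; the damage itself is left on disk. */
    static int follow_chain(const uint16_t *next, uint16_t numchunks,
                            uint16_t head, unsigned *visited)
    {
        *visited = 0;
        for (uint16_t c = head; c != CHAIN_END; c = next[c]) {
            if (c >= numchunks)
                return (EFAULT);
            (*visited)++;
        }
        return (0);
    }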

@a-frank-exchange-of-views
Author

@behlendorf thanks very much. I am not using ECC mem. I have scrubbed, with no problems. Is there some way I can recover from this without scrapping the whole array?

@behlendorf
Contributor

@ElComradio Perhaps with a little care. You could grab the patch above and apply it to your tree, behlendorf/zfs@faf8d51. That should prevent the spinning, and the crash, and instead give you an error when you try and list the directory. You then might be able to safely remove just that directory and recreate it from a snapshot or backup.

@a-frank-exchange-of-views
Author

I will try that, but (and I'm no RAID expert) it seems odd to me that a "robust file system" can contain uncorrectable errors caused by a single bit flip?

Is the general advice "use ECC or you are at risk of permanently borking your FS"?


@cwedgwood
Contributor

This looks a bit like #1242

I most definitely do have ECC

@a-frank-exchange-of-views
Author

Patch fixed the CPU locking. I am unable to rm or move or ls the bad directory.

[root@khrushchev certs]# ls -la
ls: reading directory .: Bad address


@behlendorf
Contributor

@ElComradio Yes, ZFS has always depended on ECC memory. If the data can get damaged in memory, and that bad data is then written to disk with a good checksum, there's not much ZFS can do about it. Other file systems are similarly vulnerable; it's just that you're far less likely to notice. Ignorance is bliss.

To try to extract the contents of the directory you could replace the return (EFAULT); in the patch with break;. That should allow you to access the contents of the directory, but you may end up crashing your system if you try to remove the directory. It's worth a try, though; the damage clearly looks contained to this directory object.
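
In terms of the little sketch from my earlier comment (again illustrative only, not the actual patch), the break variant simply stops at the damaged link and hands back whatever entries were collected before it, instead of failing the whole readdir:

    #include <stdint.h>

    #define CHAIN_END 0xffff

    /* Same walk as before, but a damaged link quietly ends the walk so the
       surviving entries are still returned.  The damage stays on disk, which
       is why actually removing the directory may still crash. */
    static int follow_chain_partial(const uint16_t *next, uint16_t numchunks,
                                    uint16_t head, unsigned *visited)
    {
        *visited = 0;
        for (uint16_t c = head; c != CHAIN_END; c = next[c]) {
            if (c >= numchunks)
                break;              /* was: return (EFAULT); */
            (*visited)++;
        }
        return (0);
    }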

@a-frank-exchange-of-views
Author

Thanks again Brian. Shall we close this out?


@behlendorf
Contributor

Sure, I'll open a new issue to track the larger question of how to make ZFS more resilient to minor damage like this. Mandating ECC memory is fine on the enterprise side of the house, and it certainly makes developers' lives easier. But the reality now is that ZFS will be running on many, many more non-ECC machines, and it should do a better job of handling this sort of thing.

@dswartz
Contributor

dswartz commented Feb 6, 2013

Boy no kidding. I had this lesson reemphasized last week. My main server had a stick of ECC memory go bad. First thing I found out about it was when the weekly scrub kicked back 2 checksum errors on the 2nd mirrored vdev (a 3x2 pool). I was going 'how did both sides go bad at once???' and finally rebooted with memtest86+ and it indicted that one stick. Even if the data went bad on the way to the disk, at least I was warned that file had gone bad - it was the vmdk for a win7 virtual machine. Without ZFS I would not have known until win7 started failing in whatever oddball way and would likely have grumbled and reinstalled. Thank you checksums :)

