
ZFS kernel panic during scrub and following imports - CentOS #2678

Closed
cointer opened this issue Sep 7, 2014 · 15 comments


cointer commented Sep 7, 2014

I've had this issue on both CentOS 6 and 7: the same thing happened on 6, and I rebuilt the pool from scratch on 7 using the same hardware. The pool is roughly 28TB in size, with about 6-7TB used. It's made up of two 6-drive raidz2 vdevs, mirrored SLOG partitions (eMLC drives), and a 180GB L2ARC, on a machine with 64GB of ECC RAM. Dedup is disabled, lz4 compression is enabled, and the only other settings I've changed relate to xattr, nfsshare, and acltype.

All drives report "available" via zpool and all pass SMART status. I started a scrub last night and it was going fine, at roughly 300MB/s with an estimated 3-5 hrs to complete, so I left it and went to bed. I woke this morning to find there had been a kernel panic, a reboot, and another panic when the pool tried to import at boot.

I went into single-user mode and moved the zpool.cache file, and the system booted up normally. Trying to import via "zpool import -f pool" caused a kernel panic again. I tried "zpool import -F pool", but this would not run without the "-f" switch, which always induces the panic.

I was able to import the pool in read-only mode and all data seems intact (I am currently rsyncing data off of this pool to another, non-ZFS share while this is sorted out).

The pool is still in the middle of the scrub, so I suspect the scrub itself is hitting a certain point and causing the panic. Here are some relevant bits from zpool status and dmesg:

pool: pool
state: ONLINE
scan: scrub in progress since Sat Sep 6 17:43:08 2014
923G scanned out of 6.34T at 1/s, (scan is slow, no estimated time)
0 repaired, 14.21% done
config:

    NAME                                                      STATE     READ WRITE CKSUM
    pool                                                      ONLINE       0     0     0
      raidz2-0                                                ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
      raidz2-1                                                ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
        ata-WDC_WD4000F9YZ-09N20L0_WD-[redacted]            ONLINE       0     0     0
    logs
      mirror-2                                                ONLINE       0     0     0
        ata-Edge_Boost_Server_PF_MLC_[redacted]-part3  ONLINE       0     0     0
        ata-Edge_Boost_Server_PF_MLC_[redacted]-part3  ONLINE       0     0     0
    cache
      ata-EDGE_Boost_Express_SSD_[redacted]             ONLINE       0     0     0

errors: No known data errors

[1200156.671391] general protection fault: 0000 [#1] SMP
[1200156.671426] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache bonding bridge stp llc sg iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core i2c_i801 lpc_ich mfd_core mei_me mei igb ixgbe ptp pps_core ioatdma mdio dca ses
[1200156.671838] enclosure ipmi_si ipmi_msghandler wmi mperf shpchp nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ahci libahci ttm drm mpt2sas libata i2c_core raid_class scsi_transport_sas zfs(POF) zunicode(POF) zavl(POF) zcommon(POF) znvpair(POF) spl(OF) zlib_deflate [last unloaded: ip_tables]
[1200156.672036] CPU: 2 PID: 3910 Comm: txg_sync Tainted: PF O-------------- 3.10.0-123.6.3.el7.x86_64 #1
[1200156.672079] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
[1200156.672119] task: ffff8810333c38e0 ti: ffff88103224c000 task.ti: ffff88103224c000
[1200156.672162] RIP: 0010:[] [] spl_kmem_cache_alloc+0x32/0x270 [spl]
[1200156.672227] RSP: 0018:ffff88103224d208 EFLAGS: 00010246
[1200156.672251] RAX: 00000007d74f6800 RBX: 0000000000c6cc00 RCX: ffffffffa01be4e0
[1200156.672282] RDX: ffff88103224d418 RSI: 0000000000000230 RDI: 657a5f73667a0065
[1200156.672313] RBP: ffff88103224d260 R08: 7e8e756e80cd7cfd R09: 7e8e756e80cd7cfd
[1200156.672344] R10: ffff88085f803900 R11: 69663a725f746365 R12: 0000000000000001
[1200156.672386] R13: ffff880c826e5310 R14: 0000000000000230 R15: 657a5f73667a0065
[1200156.672416] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
[1200156.672462] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1200156.672487] CR2: 00007effa9bf5880 CR3: 00000000018d0000 CR4: 00000000001407e0
[1200156.672518] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1200156.672549] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1200156.672579] Stack:
[1200156.672591] ffff88103224d240 ffffffff81090b04 ffff881047c78300 ffffc908fbe43870
[1200156.672629] ffffc908fbe434e0 0000000000000246 0000000000c6cc00 0000000000000001
[1200156.672666] ffff880c826e5310 ffff880c826e5348 0000000000006365 ffff88103224d270
[1200156.672702] Call Trace:
[1200156.672720] [] ? __wake_up+0x44/0x50
[1200156.672792] [] zio_buf_alloc+0x23/0x30 [zfs]
[1200156.672832] [] arc_get_data_buf.isra.19+0x345/0x4b0 [zfs]
[1200156.672878] [] arc_buf_alloc+0xdc/0x110 [zfs]
[1200156.672918] [] arc_read+0x392/0x920 [zfs]
[1200156.672946] [] ? ktime_get_ts+0x48/0xe0
[1200156.672985] [] ? arc_buf_remove_ref+0x100/0x100 [zfs]
[1200156.673036] [] dsl_scan_visitbp.isra.5+0x5f1/0xc40 [zfs]
[1200156.673086] [] dsl_scan_visitbp.isra.5+0x56a/0xc40 [zfs]
[1200156.673135] [] dsl_scan_visitbp.isra.5+0x73b/0xc40 [zfs]
[1200156.673184] [] dsl_scan_visitbp.isra.5+0x73b/0xc40 [zfs]
[1200156.673232] [] dsl_scan_visitbp.isra.5+0x73b/0xc40 [zfs]
[1200156.674379] [] dsl_scan_visitbp.isra.5+0x73b/0xc40 [zfs]
[1200156.675512] [] dsl_scan_visitbp.isra.5+0x73b/0xc40 [zfs]
[1200156.676631] [] dsl_scan_visitbp.isra.5+0x73b/0xc40 [zfs]
[1200156.677743] [] dsl_scan_visitbp.isra.5+0x88f/0xc40 [zfs]
[1200156.678784] [] dsl_scan_visitds+0xd8/0x570 [zfs]
[1200156.679819] [] dsl_scan_sync+0x16d/0xb60 [zfs]
[1200156.680862] [] spa_sync+0x492/0xb20 [zfs]
[1200156.681850] [] ? ktime_get_ts+0x48/0xe0
[1200156.682831] [] txg_sync_thread+0x37e/0x5c0 [zfs]
[1200156.683788] [] ? txg_fini+0x290/0x290 [zfs]
[1200156.684704] [] thread_generic_wrapper+0x7a/0x90 [spl]
[1200156.685604] [] ? __thread_exit+0xa0/0xa0 [spl]
[1200156.686457] [] kthread+0xcf/0xe0
[1200156.687279] [] ? kthread_create_on_node+0x140/0x140
[1200156.688082] [] ret_from_fork+0x7c/0xb0
[1200156.688852] [] ? kthread_create_on_node+0x140/0x140
[1200156.689606] Code: 89 e5 41 57 49 89 ff 41 56 41 89 f6 41 55 41 54 53 48 83 ec 30 f6 05 3d 45 01 00 01 74 0d f6 05 3d 45 01 00 08 0f 85 2e 01 00 00 41 ff 87 68 a0 00 00 41 f6 87 48 a0 00 00 80 0f 84 90 00 00
[1200156.691217] RIP [] spl_kmem_cache_alloc+0x32/0x270 [spl]
[1200156.691987] RSP

@behlendorf
Contributor

@cointer I suspect you're right. The stack you've posted shows the txg_sync thread in the middle of a scrub. It also shows the scan recursed quite deeply, which makes me suspect a stack overflow is causing the crash. We're limited to 8k stacks in the Linux kernel.

Importing the pool read-only as you've done will effectively disable the scrub and avoid this issue.
You could also import the pool using FreeBSD or Illumos to stop the scrub; both of those platforms have much larger default kernel stack sizes. Once stopped, you should be able to import the pool under Linux again.
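A read-only import of the kind described above uses standard zpool options; a sketch, with the pool name taken from the status output:

```shell
# Import read-only: nothing is written, the scrub does not resume,
# and the damaged block is never re-read by the txg_sync thread.
zpool import -o readonly=on -f pool

# Copy the data off, then export cleanly before trying a normal import.
zpool export pool
```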

What version of ZoL are you using?

@behlendorf behlendorf added the Bug label Sep 9, 2014

cointer commented Sep 9, 2014

@behlendorf Thanks for the possible temporary workaround, I may look into this today.

I currently have the latest rpm installed from the CentOS 7 ZoL repository:

zfs-0.6.3-1.el7.centos.x86_64.rpm


cointer commented Sep 10, 2014

@behlendorf So here's some interesting news. I booted a live FreeBSD 10, tried to import the pool and it still crashed in the same manner. What do you think?

@behlendorf
Contributor

@cointer then it's not a stack overflow. Can you provide the stack trace from FreeBSD?

Based on the stack, the next most likely reason would be that somehow a bogus size was passed to arc_read(). If you rebuild the spl and zfs code with the --enable-debug option, we can check for this by enabling all the assertions in the code. If you're using the dkms packages, set ZFS_DKMS_ENABLE_DEBUG=y in /etc/sysconfig/zfs and rebuild the packages.
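For a source (non-DKMS) build, the debug rebuild suggested above looks roughly like this; the directory layout is illustrative, and --with-spl was how 0.6.x-era zfs located the spl tree:

```shell
# Rebuild SPL and ZFS with all assertions compiled in.
cd spl && ./configure --enable-debug && make -j"$(nproc)" && sudo make install
cd ../zfs && ./configure --enable-debug --with-spl=../spl && make -j"$(nproc)" && sudo make install

# For the DKMS packages instead, enable debug in the sysconfig file,
# then rebuild the dkms modules as the comment above describes:
# echo 'ZFS_DKMS_ENABLE_DEBUG=y' >> /etc/sysconfig/zfs
```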


cointer commented Sep 10, 2014

I will try to get the stack trace from FreeBSD. I'm not too familiar with FreeBSD. Can you give advice on the best way to get the dump/stack trace? The live system instantly reboots when I import the pool so I see no output.

@behlendorf
Contributor

@cointer I'm not familiar with debugging on FreeBSD either; I mainly just wanted to verify it was the same failure. We can certainly debug this under Linux; the first step would be to build with debugging enabled.


cointer commented Sep 10, 2014

Here is what I get with debugging enabled:

Message from syslogd ...
kernel:SPLError: 29891:0:(zio.c:254:zio_buf_alloc()) ASSERTION(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed

Message from syslogd ...
kernel:SPLError: 29891:0:(zio.c:254:zio_buf_alloc()) SPL PANIC

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 10, 2014
The general strategy used by ZFS to verify that blocks are valid is
to checksum everything.  This has the advantage of being extremely
robust and generically applicable regardless of the contents of
the block.  If a block's checksum is valid then its contents are
trusted by the higher layers.

This system works exceptionally well as long as bad data is never
written with a valid checksum.  However, if this does somehow
occur due to a software bug or a memory bit-flip on a non-ECC
system it may result in a kernel panic.

One such place where this could occur is if somehow the logical
size stored in a block pointer exceeds the maximum block size.
This will result in an attempt to allocate a buffer greater than
the maximum block size, causing a system panic.

To prevent this from happening the arc_read() function has been
updated to detect this specific case.  If a block pointer with an
invalid logical size is passed it will treat the block as if it
contained a checksum error.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#2678
@behlendorf
Contributor

OK, that's what I was expecting. It shows that somehow the block pointer on disk contains an incorrect logical size despite having a valid checksum. This is what's resulting in the crash on Linux, FreeBSD, and almost certainly any other ZFS platform.

I've proposed a fix for this specific case in pull request #2685. Could you apply that patch to your source tree, rebuild, and import the pool again? It will allow the scrub to detect the on-disk damage and, if possible, fix it. In this case it may be fixable because multiple copies of the block pointer will be stored on disk.
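One way to pull in the pull-request patch uses GitHub's standard "<pr>.patch" URL form against a local source checkout (the repository path and build steps here are illustrative):

```shell
# Apply PR #2685 to a local ZFS source tree, then rebuild and reinstall.
cd zfs
curl -L https://github.com/zfsonlinux/zfs/pull/2685.patch | git am
./configure && make -j"$(nproc)"
sudo make install
```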

@behlendorf behlendorf added this to the 0.6.4 milestone Sep 10, 2014

cointer commented Sep 10, 2014

I applied the patch and it worked! The scrub is now continuing from that point and everything is back to read-write. Thanks for the fix!

@behlendorf
Contributor

@cointer Great news! Could you check the output of zpool status and see whether it logged the error and fixed it? I'm glad we could turn around a fix for you.


cointer commented Sep 11, 2014

@behlendorf According to zpool status after the scrub completed:

scrub repaired 0 in 99h29m with 10 errors
errors: No known data errors

Should I look into anything further?

@behlendorf
Contributor

@cointer Looks good. Apparently 10 metadata blocks were impacted and no data blocks, so things look to be in pretty good shape. The only additional thing I'd suggest is clearing the errors (zpool clear) and scrubbing the pool again. That will tell us whether those blocks were corrected.
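The clear-and-rescrub sequence suggested above, as plain commands:

```shell
zpool clear pool      # reset the error counters and the logged error list
zpool scrub pool      # re-verify every block; a clean pass means the damaged
                      # metadata was repaired from its redundant copies
zpool status -v pool  # watch progress and any newly logged errors
```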


cointer commented Sep 11, 2014

OK, I scrubbed the pool again and it found a couple of UREs and repaired them. I ran one more scrub after that to be sure; no more UREs, but it came back with 2 metadata block errors this time. zpool hasn't marked any disks as bad yet, so I'm going to assume things are OK at the moment unless you have any other suggestions.

@behlendorf
Contributor

@cointer it's a little concerning that you're still detecting errors after the first scrub; it should have corrected everything. It makes me wonder if there's some flaky hardware on your system (memory, cables, controller, drives, etc.) causing errors.


cointer commented Sep 11, 2014

@behlendorf I'm leaning toward that thought myself. All the UREs originated from a single drive, so that drive is most likely the problem child. I'll run a memtest first, but otherwise I'll keep an eye on that drive. I noticed zpool reported the drive the UREs were discovered on, but gave no such clues about the metadata errors. When a metadata error is discovered, is there a way to tell which drives it originated from?

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 12, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Sep 14, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Sep 16, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Sep 17, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Sep 18, 2014
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 18, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Sep 22, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Sep 22, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Sep 27, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Sep 30, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Oct 3, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Oct 11, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Oct 11, 2014
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Oct 19, 2014