
PANIC at btree.c:1780:zfs_btree_remove() #10989

Closed

hicotton02 opened this issue Sep 28, 2020 · 11 comments
Labels: Status: Triage Needed (new issue which needs to be triaged) · Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@hicotton02

System information

Type Version/Name
Distribution Name UnRaid
Distribution Version 6.8.3
Linux Kernel 4.19.107
Architecture x86_64
ZFS Version 2.0.0-rc2
SPL Version 2.0.0-rc2

Describe the problem you're observing

ZFS pools becoming unresponsive

Describe how to reproduce the problem

letting the system run for a couple of days

Include any warning/errors/backtraces from the system logs

Sep 27 13:14:40 Tower kernel: VERIFY3(zfs_btree_find(tree, value, &where) != NULL) failed (0000000000000000 != 0000000000000000)
Sep 27 13:14:40 Tower kernel: PANIC at btree.c:1780:zfs_btree_remove()
Sep 27 13:14:40 Tower kernel: Showing stack for process 8689
Sep 27 13:14:40 Tower kernel: CPU: 54 PID: 8689 Comm: txg_sync Tainted: P           O      4.19.107-Unraid #1
Sep 27 13:14:40 Tower kernel: Hardware name: System manufacturer System Product Name/ROG ZENITH II EXTREME ALPHA, BIOS 1101 06/05/2020
Sep 27 13:14:40 Tower kernel: Call Trace:
Sep 27 13:14:40 Tower kernel: dump_stack+0x67/0x83
Sep 27 13:14:40 Tower kernel: spl_panic+0xcf/0xf7 [spl]
Sep 27 13:14:40 Tower kernel: ? zfs_btree_find_in_buf+0x4a/0x99 [zfs]
Sep 27 13:14:40 Tower kernel: ? zfs_btree_find_in_buf+0x4a/0x99 [zfs]
Sep 27 13:14:40 Tower kernel: ? zfs_btree_find+0x148/0x182 [zfs]
Sep 27 13:14:40 Tower kernel: zfs_btree_remove+0x57/0x7d [zfs]
Sep 27 13:14:40 Tower kernel: range_tree_add_impl+0x4f3/0xa97 [zfs]
Sep 27 13:14:40 Tower kernel: ? _cond_resched+0x1b/0x1e
Sep 27 13:14:40 Tower kernel: ? __kmalloc_node+0x11e/0x12f
Sep 27 13:14:40 Tower kernel: ? range_tree_remove_impl+0xad5/0xad5 [zfs]
Sep 27 13:14:40 Tower kernel: range_tree_vacate+0x16a/0x1b3 [zfs]
Sep 27 13:14:40 Tower kernel: metaslab_sync_done+0x327/0x4c2 [zfs]
Sep 27 13:14:40 Tower kernel: ? _cond_resched+0x1b/0x1e
Sep 27 13:14:40 Tower kernel: vdev_sync_done+0x42/0x66 [zfs]
Sep 27 13:14:40 Tower kernel: spa_sync+0xbd1/0xd6a [zfs]
Sep 27 13:14:40 Tower kernel: txg_sync_thread+0x246/0x3f2 [zfs]
Sep 27 13:14:40 Tower kernel: ? txg_thread_exit.isra.0+0x50/0x50 [zfs]
Sep 27 13:14:40 Tower kernel: thread_generic_wrapper+0x67/0x6f [spl]
Sep 27 13:14:40 Tower kernel: ? __thread_exit+0xe/0xe [spl]
Sep 27 13:14:40 Tower kernel: kthread+0x10c/0x114
Sep 27 13:14:40 Tower kernel: ? kthread_park+0x89/0x89
Sep 27 13:14:40 Tower kernel: ret_from_fork+0x22/0x40
hicotton02 added the Status: Triage Needed and Type: Defect labels on Sep 28, 2020
@behlendorf (Contributor)

@pcd1193182 @sdimitro any thoughts on how this might be possible? It's the first report I've seen like it.

@PrivatePuffin (Contributor)

> letting the system run for a couple of days

@hicotton02 Could you make a short list of the services you are running and the load on the server? That would give us somewhat more of an idea, because the stack trace is quite... vague...

@hicotton02 (Author)

It is an Unraid server running a file share, Docker with about 20 containers, and KVM running 6 virtual machines. The CPU is 64 cores / 128 threads and is on average at 5-15% utilization overall. Here is the ZFS pool status:

root@Tower:~# zpool status
  pool: datastore
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        datastore   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdn     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdl     ONLINE       0     0     0
            sdm     ONLINE       0     0     0

errors: No known data errors

  pool: vmstorage
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        vmstorage    ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            nvme2n1  ONLINE       0     0     0
            nvme3n1  ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0

errors: No known data errors

tower-diagnostics-20200928-1415.zip

Attached is a diagnostic output of the system; feel free to use what you want from it. If there is some info that is still missing, let me know.

@pcd1193182 (Contributor)

It looks like the crash is happening as we try to move entries from the defer tree into the allocatable tree during sync_done. btree_remove should only be called directly if we're in a gap-supporting range tree, but it's possible there's another caller there that got omitted from the stack trace.
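
For reference, the sync_done step named in the stack trace boils down to roughly the following. This is a paraphrase of the OpenZFS 2.0 sources, not the verbatim code; the helper name defer_to_allocatable and the omission of locking and accounting are mine.

#include <sys/metaslab_impl.h>
#include <sys/range_tree.h>

/* Paraphrased sketch of the metaslab_sync_done() step in the stack trace. */
static void
defer_to_allocatable(metaslab_t *msp, uint64_t txg)
{
	/* Frees deferred TXG_DEFER_SIZE txgs ago become allocatable again. */
	range_tree_t *defer_tree = msp->ms_defer[txg % TXG_DEFER_SIZE];

	/*
	 * range_tree_vacate() walks every segment in defer_tree, calls
	 * range_tree_add(msp->ms_allocatable, start, size) for each one,
	 * and then empties defer_tree.  The panic fires inside that
	 * range_tree_add() call, via range_tree_add_impl().
	 */
	range_tree_vacate(defer_tree, range_tree_add, msp->ms_allocatable);
}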

@pcd1193182 (Contributor)

I disassembled the source, and it looks like that's the only caller. So for whatever reason, it seems to think we're processing a gap-supporting range tree, which none of the metaslab trees are. Not sure exactly how that happened.
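
For context, the assertion that fires is the find-before-remove check in zfs_btree_remove(). Its shape is approximately the following, paraphrased from module/zfs/btree.c; treat details such as the exact VERIFY macro as approximate.

/* Approximate shape of zfs_btree_remove(); not a verbatim copy. */
void
zfs_btree_remove(zfs_btree_t *tree, const void *value)
{
	zfs_btree_index_t where;

	/*
	 * btree.c:1780 -- this is what fails in the panic above: the value
	 * being removed cannot be found in the tree by its comparator.
	 */
	VERIFY3P(zfs_btree_find(tree, value, &where), !=, NULL);
	zfs_btree_remove_idx(tree, &where);
}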

@adamdmoss (Contributor)

@hicotton02 FYI 2.0.0-rc4 contains fix d8091c9 from @pcd1193182 for this issue.

Assuming this issue doesn't repro easily enough to verify the fix, I think this issue can probably be closed until proven otherwise.

@pcd1193182 (Contributor) commented Oct 21, 2020

@adamdmoss that fix, while in a similar area, shouldn't be a fix for this issue. The issue it addresses presents as a failure after the btree_remove call, not in the btree_remove call.

@adamdmoss (Contributor)

Really? Darn, looked so close. Sorry.

@h1z1 commented Jul 7, 2022

Heh... I think I hit this today. No idea why. Tried to do a zpool sync but it hangs, and only on one pool?

ZFS version: 2.1.99-1


[191881.155538] VERIFY3(zfs_btree_find(tree, value, &where) != NULL) failed (0000000000000000 != 0000000000000000)
[191881.155542] PANIC at btree.c:1780:zfs_btree_remove()
[191881.155543] Showing stack for process 2490
[191881.155545] CPU: 3 PID: 2490 Comm: z_wr_iss Tainted: P           O      5.4.70 #1
[191881.155546] Hardware name: Gigabyte Technology Co., Ltd. X399 AORUS Gaming 7/X399 AORUS Gaming 7, BIOS F12 12/11/2019
[191881.155547] Call Trace:
[191881.155554]  dump_stack+0x6d/0x98
[191881.155561]  spl_dumpstack+0x2b/0x30 [spl]
[191881.155564]  spl_panic+0xc9/0x110 [spl]
[191881.155601]  ? zfs_btree_insert_into_leaf+0x18a/0x200 [zfs]
[191881.155632]  ? metaslab_rangesize64_compare+0x50/0x50 [zfs]
[191881.155659]  ? zfs_btree_add_idx+0x136/0x220 [zfs]
[191881.155683]  zfs_btree_remove+0x8d/0xa0 [zfs]
[191881.155718]  metaslab_rt_remove+0x6f/0x130 [zfs]
[191881.155754]  range_tree_remove_impl+0x3d8/0xe60 [zfs]
[191881.155788]  ? metaslab_df_alloc+0x318/0x560 [zfs]
[191881.155822]  ? multilist_sublist_insert_tail+0x30/0x50 [zfs]
[191881.155856]  range_tree_remove+0x10/0x20 [zfs]
[191881.155890]  metaslab_group_alloc_normal+0x83b/0xae0 [zfs]
[191881.155929]  ? vdev_default_asize+0x5f/0x90 [zfs]
[191881.155963]  metaslab_alloc_dva+0x1ed/0x7e0 [zfs]
[191881.155996]  metaslab_alloc+0xcb/0x230 [zfs]
[191881.156038]  zio_dva_allocate+0xe9/0xa30 [zfs]
[191881.156040]  ? __switch_to_asm+0x40/0x70
[191881.156041]  ? __switch_to_asm+0x34/0x70
[191881.156042]  ? __switch_to_asm+0x40/0x70
[191881.156042]  ? __switch_to_asm+0x34/0x70
[191881.156043]  ? __switch_to_asm+0x40/0x70
# cat   /proc/$(pidof zpool)/stack 
[<0>] cv_wait_common+0xb2/0x140 [spl]
[<0>] __cv_wait_io+0x18/0x20 [spl]
[<0>] txg_wait_synced_impl+0xdb/0x130 [zfs]
[<0>] txg_wait_synced+0x10/0x40 [zfs]
[<0>] zfs_ioc_pool_sync+0xad/0xe0 [zfs]
[<0>] zfsdev_ioctl_common+0x51d/0x840 [zfs]
[<0>] zfsdev_ioctl+0x56/0xe0 [zfs]
[<0>] do_vfs_ioctl+0xaa/0x620
[<0>] ksys_ioctl+0x67/0x90
[<0>] __x64_sys_ioctl+0x1a/0x20
[<0>] do_syscall_64+0x60/0x1c0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hmm. Non-fatal panic, but I/O is still dead to that pool. The deadman kicked in eventually.

# cat taskq
taskq                       act  nthr  spwn  maxt   pri  mina         maxa  cura      flags
spl_delay_taskq/0             0     1     0    32   100    32   2147483647    32   80000005
	delay: spa_deadman [zfs](0xffff8888559cc000)
z_wr_iss/1                   24    24     0    24   101    50   2147483647     0   80000008
	active: [2490]zio_execute [zfs](0xffff88829c87c440) [2504]zio_execute [zfs](0xffff888561048ea0)
	        [2487]zio_execute [zfs](0xffff888561049860) [2501]zio_execute [zfs](0xffff888561049d40)
	        [2502]zio_execute [zfs](0xffff88856104ce00) [2508]zio_execute [zfs](0xffff88856104b5a0)
	        [2500]zio_execute [zfs](0xffff88856104f020) [2505]zio_execute [zfs](0xffff88856104eb40)
	        [2497]zio_execute [zfs](0xffff88856104bf60) [2498]zio_execute [zfs](0xffff88856104ba80)
	        [2499]zio_execute [zfs](0xffff8885610489c0) [2496]zio_execute [zfs](0xffff88856104c440)
	        [2492]zio_execute [zfs](0xffff88856104c920) [2503]zio_execute [zfs](0xffff88856104e180)
	        [2509]zio_execute [zfs](0xffff88856104dca0) [2493]zio_execute [zfs](0xffff8885610484e0)
	        [2506]zio_execute [zfs](0xffff88856104d2e0) [2494]zio_execute [zfs](0xffff88856104f500)
	        [2495]zio_execute [zfs](0xffff88856104d7c0) [2507]zio_execute [zfs](0xffff888561048000)
	        [2491]zio_execute [zfs](0xffff88856104e660) [2486]zio_execute [zfs](0xffff88856104f9e0)
	        [2488]zio_execute [zfs](0xffff88856104a700) [2489]zio_execute [zfs](0xffff88856104abe0)
	pend: zio_execute [zfs](0xffff8881e7119d40) zio_execute [zfs](0xffff888561049380)

<< 78 lines removed >>

	      zio_execute [zfs](0xffff8883dc45f020) zio_execute [zfs](0xffff8883dc45f9e0)
	      zio_execute [zfs](0xffff8883dc45bf60)
z_wr_int/13                   1     4     0    12   100    50   2147483647     0   80000004
	active: [2516]zio_execute [zfs](0xffff888667993f60)

# cat /proc/2516/stack
[<0>] metaslab_free_concrete+0xb9/0x250 [zfs]
[<0>] metaslab_free_impl+0xad/0xe0 [zfs]
[<0>] metaslab_free_dva+0x5a/0x80 [zfs]
[<0>] metaslab_free+0xea/0x180 [zfs]
[<0>] zio_free_sync+0x126/0x130 [zfs]
[<0>] zio_free+0xca/0x100 [zfs]
[<0>] dsl_free+0x11/0x20 [zfs]
[<0>] dsl_dataset_block_kill+0x2cd/0x500 [zfs]
[<0>] dbuf_write_done+0x1ea/0x230 [zfs]
[<0>] arc_write_done+0xa5/0x3e0 [zfs]
[<0>] zio_done+0x2c3/0xee0 [zfs]
[<0>] zio_execute+0x95/0xf0 [zfs]
[<0>] taskq_thread+0x31d/0x5b0 [spl]
[<0>] kthread+0x105/0x140
[<0>] ret_from_fork+0x22/0x40

Last edit: For the sake of posterity, I looked at the logs above from @hicotton02; most of those drives are throwing errors, no? I see errors even from the SSDs:

 69 Total_Bad_Blocks        -O--CK   100   100   ---    -    1147
 170 Grown_Bad_Blocks        -O--CK   100   100   ---    -    0

In my case, I half suspect there was some other environmental issue, given that the UPSes show a power dip around the same time. The cause of that, and why it would have affected this host, is unclear. I don't see errors from any devices, though, so it's not clear why ZFS wouldn't be able to recover.

@h1z1 commented Jul 7, 2022

Sigh. I couldn't unfreeze the pool; even zfs send hung. On reboot the host rolled back to some point I'm assuming is before the panic ... losing everything written to that pool in the process. FUCK.

The host appeared fine.

@ryao (Contributor) commented Sep 11, 2022

This might be fixed by #13861.

Edit: In fact, #13861 explains the VERIFY3 going off nicely. The node was destroyed; then another node was allocated at the same spot in memory, and the removal then targeted that new node (rather than the original node at that location), causing the VERIFY3 to go off when trying to remove it.
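
To make that failure mode concrete, here is a generic, self-contained illustration. It is not ZFS code; the one-slot allocator and the node type are made up. It shows how a stale pointer that aliases a reallocated object leads straight to a failed find-before-remove check.

#include <assert.h>
#include <stdio.h>

/* Toy node keyed by its contents, as btree elements are compared by value. */
typedef struct node { long key; } node_t;

/* One-slot "allocator" standing in for a kmem cache that reuses memory. */
static node_t slot;
static node_t *node_alloc(long key) { slot.key = key; return (&slot); }
static void    node_free(node_t *n)  { (void) n; /* slot becomes reusable */ }

int
main(void)
{
	node_t *stale = node_alloc(42); /* conceptually inserted into a tree */
	node_free(stale);               /* node destroyed ...                */
	node_t *fresh = node_alloc(7);  /* ... same address, new contents    */

	/*
	 * The stale pointer now reads the new node's value, so a
	 * remove-by-value through it would search the tree for key 7,
	 * which was never inserted -- exactly what the VERIFY3 in
	 * zfs_btree_remove() catches.
	 */
	assert(stale == fresh);
	printf("stale pointer now sees key %ld instead of 42\n", stale->key);
	return (0);
}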

ryao mentioned this issue on Sep 13, 2022
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 13, 2022
Coverty static analysis found these.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Neal Gompa <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#10989
Closes openzfs#13861
The same commit was subsequently referenced from beren12/zfs (Sep 19, 2022), vpsfreecz/zfs (Oct 22-23, 2022), and BlueArchive/storage-zfs-wasabi (Jul 5, 2023, cherry picked from commit 13f2b8f).