Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pool hangs when deleting snapshots #9575

Closed
pkramme opened this issue Nov 12, 2019 · 7 comments
Closed

Pool hangs when deleting snapshots #9575

pkramme opened this issue Nov 12, 2019 · 7 comments
Labels
Status: Stale No recent activity for issue Type: Performance Performance improvement or performance problem

Comments

@pkramme
Copy link

pkramme commented Nov 12, 2019

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 18.04
Linux Kernel 5.0.0-31-generic
Architecture amd64
ZFS Version 0.7.12-1ubuntu5
SPL Version 0.7.12-1ubuntu3

Describe the problem you're observing

When deleting many snapshots (>9000), the ZFS pool hangs during the deletion. After the deletion, the pool is available again.

Describe how to reproduce the problem

  • Create many snapshots
  • Delete them

Include any warning/errors/backtraces from the system logs

root@nitrogen:~# journalctl --dmesg -b --since 13:22:00 --no-pager 
-- Logs begin at Sun 2019-08-18 15:00:39 CEST, end at Tue 2019-11-12 16:04:01 CET. --
Nov 12 13:22:35 nitrogen kernel: INFO: task txg_sync:1906 blocked for more than 120 seconds.
Nov 12 13:22:35 nitrogen kernel:       Tainted: P           OE     5.0.0-31-generic #33~18.04.1-Ubuntu
Nov 12 13:22:35 nitrogen kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 13:22:35 nitrogen kernel: txg_sync        D    0  1906      2 0x80000000
Nov 12 13:22:35 nitrogen kernel: Call Trace:
Nov 12 13:22:35 nitrogen kernel:  __schedule+0x2c0/0x870
Nov 12 13:22:35 nitrogen kernel:  schedule+0x2c/0x70
Nov 12 13:22:35 nitrogen kernel:  cv_wait_common+0x11e/0x140 [spl]
Nov 12 13:22:35 nitrogen kernel:  ? wait_woken+0x80/0x80
Nov 12 13:22:35 nitrogen kernel:  __cv_wait+0x15/0x20 [spl]
Nov 12 13:22:35 nitrogen kernel:  rrw_enter_write+0x3c/0xa0 [zfs]
Nov 12 13:22:35 nitrogen kernel:  rrw_enter+0x13/0x20 [zfs]
Nov 12 13:22:35 nitrogen kernel:  spa_sync+0x7ce/0xd30 [zfs]
Nov 12 13:22:35 nitrogen kernel:  txg_sync_thread+0x2cd/0x4a0 [zfs]
Nov 12 13:22:35 nitrogen kernel:  ? __switch_to_asm+0x35/0x70
Nov 12 13:22:35 nitrogen kernel:  ? txg_quiesce_thread+0x3d0/0x3d0 [zfs]
Nov 12 13:22:35 nitrogen kernel:  thread_generic_wrapper+0x74/0x90 [spl]
Nov 12 13:22:35 nitrogen kernel:  kthread+0x121/0x140
Nov 12 13:22:35 nitrogen kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 12 13:22:35 nitrogen kernel:  ? kthread_park+0xb0/0xb0
Nov 12 13:22:35 nitrogen kernel:  ret_from_fork+0x35/0x40
Nov 12 13:24:36 nitrogen kernel: INFO: task txg_sync:1906 blocked for more than 120 seconds.
Nov 12 13:24:36 nitrogen kernel:       Tainted: P           OE     5.0.0-31-generic #33~18.04.1-Ubuntu
Nov 12 13:24:36 nitrogen kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 13:24:36 nitrogen kernel: txg_sync        D    0  1906      2 0x80000000
Nov 12 13:24:36 nitrogen kernel: Call Trace:
Nov 12 13:24:36 nitrogen kernel:  __schedule+0x2c0/0x870
Nov 12 13:24:36 nitrogen kernel:  schedule+0x2c/0x70
Nov 12 13:24:36 nitrogen kernel:  cv_wait_common+0x11e/0x140 [spl]
Nov 12 13:24:36 nitrogen kernel:  ? wait_woken+0x80/0x80
Nov 12 13:24:36 nitrogen kernel:  __cv_wait+0x15/0x20 [spl]
Nov 12 13:24:36 nitrogen kernel:  rrw_enter_write+0x3c/0xa0 [zfs]
Nov 12 13:24:36 nitrogen kernel:  rrw_enter+0x13/0x20 [zfs]
Nov 12 13:24:36 nitrogen kernel:  spa_sync+0x7ce/0xd30 [zfs]
Nov 12 13:24:36 nitrogen kernel:  txg_sync_thread+0x2cd/0x4a0 [zfs]
Nov 12 13:24:36 nitrogen kernel:  ? __switch_to_asm+0x35/0x70
Nov 12 13:24:36 nitrogen kernel:  ? txg_quiesce_thread+0x3d0/0x3d0 [zfs]
Nov 12 13:24:36 nitrogen kernel:  thread_generic_wrapper+0x74/0x90 [spl]
Nov 12 13:24:36 nitrogen kernel:  kthread+0x121/0x140
Nov 12 13:24:36 nitrogen kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 12 13:24:36 nitrogen kernel:  ? kthread_park+0xb0/0xb0
Nov 12 13:24:36 nitrogen kernel:  ret_from_fork+0x35/0x40
Nov 12 13:26:37 nitrogen kernel: INFO: task txg_sync:1906 blocked for more than 120 seconds.
Nov 12 13:26:37 nitrogen kernel:       Tainted: P           OE     5.0.0-31-generic #33~18.04.1-Ubuntu
Nov 12 13:26:37 nitrogen kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 13:26:37 nitrogen kernel: txg_sync        D    0  1906      2 0x80000000
Nov 12 13:26:37 nitrogen kernel: Call Trace:
Nov 12 13:26:37 nitrogen kernel:  __schedule+0x2c0/0x870
Nov 12 13:26:37 nitrogen kernel:  schedule+0x2c/0x70
Nov 12 13:26:37 nitrogen kernel:  cv_wait_common+0x11e/0x140 [spl]
Nov 12 13:26:37 nitrogen kernel:  ? wait_woken+0x80/0x80
Nov 12 13:26:37 nitrogen kernel:  __cv_wait+0x15/0x20 [spl]
Nov 12 13:26:37 nitrogen kernel:  rrw_enter_write+0x3c/0xa0 [zfs]
Nov 12 13:26:37 nitrogen kernel:  rrw_enter+0x13/0x20 [zfs]
Nov 12 13:26:37 nitrogen kernel:  spa_sync+0x7ce/0xd30 [zfs]
Nov 12 13:26:37 nitrogen kernel:  txg_sync_thread+0x2cd/0x4a0 [zfs]
Nov 12 13:26:37 nitrogen kernel:  ? __switch_to_asm+0x35/0x70
Nov 12 13:26:37 nitrogen kernel:  ? txg_quiesce_thread+0x3d0/0x3d0 [zfs]
Nov 12 13:26:37 nitrogen kernel:  thread_generic_wrapper+0x74/0x90 [spl]
Nov 12 13:26:37 nitrogen kernel:  kthread+0x121/0x140
Nov 12 13:26:37 nitrogen kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 12 13:26:37 nitrogen kernel:  ? kthread_park+0xb0/0xb0
Nov 12 13:26:37 nitrogen kernel:  ret_from_fork+0x35/0x40
Nov 12 13:28:37 nitrogen kernel: INFO: task txg_sync:1906 blocked for more than 120 seconds.
Nov 12 13:28:37 nitrogen kernel:       Tainted: P           OE     5.0.0-31-generic #33~18.04.1-Ubuntu
Nov 12 13:28:37 nitrogen kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 13:28:37 nitrogen kernel: txg_sync        D    0  1906      2 0x80000000
Nov 12 13:28:37 nitrogen kernel: Call Trace:
Nov 12 13:28:37 nitrogen kernel:  __schedule+0x2c0/0x870
Nov 12 13:28:37 nitrogen kernel:  schedule+0x2c/0x70
Nov 12 13:28:37 nitrogen kernel:  cv_wait_common+0x11e/0x140 [spl]
Nov 12 13:28:37 nitrogen kernel:  ? wait_woken+0x80/0x80
Nov 12 13:28:37 nitrogen kernel:  __cv_wait+0x15/0x20 [spl]
Nov 12 13:28:37 nitrogen kernel:  rrw_enter_write+0x3c/0xa0 [zfs]
Nov 12 13:28:37 nitrogen kernel:  rrw_enter+0x13/0x20 [zfs]
Nov 12 13:28:37 nitrogen kernel:  spa_sync+0x7ce/0xd30 [zfs]
Nov 12 13:28:37 nitrogen kernel:  txg_sync_thread+0x2cd/0x4a0 [zfs]
Nov 12 13:28:37 nitrogen kernel:  ? __switch_to_asm+0x35/0x70
Nov 12 13:28:37 nitrogen kernel:  ? txg_quiesce_thread+0x3d0/0x3d0 [zfs]
Nov 12 13:28:37 nitrogen kernel:  thread_generic_wrapper+0x74/0x90 [spl]
Nov 12 13:28:37 nitrogen kernel:  kthread+0x121/0x140
Nov 12 13:28:37 nitrogen kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 12 13:28:37 nitrogen kernel:  ? kthread_park+0xb0/0xb0
Nov 12 13:28:37 nitrogen kernel:  ret_from_fork+0x35/0x40
Nov 12 13:30:38 nitrogen kernel: INFO: task txg_sync:1906 blocked for more than 120 seconds.
Nov 12 13:30:38 nitrogen kernel:       Tainted: P           OE     5.0.0-31-generic #33~18.04.1-Ubuntu
Nov 12 13:30:38 nitrogen kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 13:30:38 nitrogen kernel: txg_sync        D    0  1906      2 0x80000000
Nov 12 13:30:38 nitrogen kernel: Call Trace:
Nov 12 13:30:38 nitrogen kernel:  __schedule+0x2c0/0x870
Nov 12 13:30:38 nitrogen kernel:  schedule+0x2c/0x70
Nov 12 13:30:38 nitrogen kernel:  cv_wait_common+0x11e/0x140 [spl]
Nov 12 13:30:38 nitrogen kernel:  ? wait_woken+0x80/0x80
Nov 12 13:30:38 nitrogen kernel:  __cv_wait+0x15/0x20 [spl]
Nov 12 13:30:38 nitrogen kernel:  rrw_enter_write+0x3c/0xa0 [zfs]
Nov 12 13:30:38 nitrogen kernel:  rrw_enter+0x13/0x20 [zfs]
Nov 12 13:30:38 nitrogen kernel:  spa_sync+0x7ce/0xd30 [zfs]
Nov 12 13:30:38 nitrogen kernel:  txg_sync_thread+0x2cd/0x4a0 [zfs]
Nov 12 13:30:38 nitrogen kernel:  ? __switch_to_asm+0x35/0x70
Nov 12 13:30:38 nitrogen kernel:  ? txg_quiesce_thread+0x3d0/0x3d0 [zfs]
Nov 12 13:30:38 nitrogen kernel:  thread_generic_wrapper+0x74/0x90 [spl]
Nov 12 13:30:38 nitrogen kernel:  kthread+0x121/0x140
Nov 12 13:30:38 nitrogen kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 12 13:30:38 nitrogen kernel:  ? kthread_park+0xb0/0xb0
Nov 12 13:30:38 nitrogen kernel:  ret_from_fork+0x35/0x40
Nov 12 13:34:40 nitrogen kernel: INFO: task zfs:24530 blocked for more than 120 seconds.
Nov 12 13:34:40 nitrogen kernel:       Tainted: P           OE     5.0.0-31-generic #33~18.04.1-Ubuntu
Nov 12 13:34:40 nitrogen kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 13:34:40 nitrogen kernel: zfs             D    0 24530  18116 0x00000002
Nov 12 13:34:40 nitrogen kernel: Call Trace:
Nov 12 13:34:40 nitrogen kernel:  __schedule+0x2c0/0x870
Nov 12 13:34:40 nitrogen kernel:  ? __wake_up_common+0x73/0x130
Nov 12 13:34:40 nitrogen kernel:  schedule+0x2c/0x70
Nov 12 13:34:40 nitrogen kernel:  cv_wait_common+0x11e/0x140 [spl]
Nov 12 13:34:40 nitrogen kernel:  ? wait_woken+0x80/0x80
Nov 12 13:34:40 nitrogen kernel:  __cv_wait+0x15/0x20 [spl]
Nov 12 13:34:40 nitrogen kernel:  txg_wait_synced+0xdd/0x130 [zfs]
Nov 12 13:34:40 nitrogen kernel:  dsl_sync_task+0x178/0x260 [zfs]
Nov 12 13:34:40 nitrogen kernel:  ? dsl_destroy_snapshot_check_impl+0x140/0x140 [zfs]
Nov 12 13:34:40 nitrogen kernel:  ? dsl_destroy_snapshot_sync_impl+0xa30/0xa30 [zfs]
Nov 12 13:34:40 nitrogen kernel:  ? dsl_destroy_snapshot_check_impl+0x140/0x140 [zfs]
Nov 12 13:34:40 nitrogen kernel:  ? dsl_destroy_snapshot_sync_impl+0xa30/0xa30 [zfs]
Nov 12 13:34:40 nitrogen kernel:  dsl_destroy_snapshots_nvl+0x90/0xf0 [zfs]
Nov 12 13:34:40 nitrogen kernel:  zfs_ioc_destroy_snaps+0x87/0xd0 [zfs]
Nov 12 13:34:40 nitrogen kernel:  zfsdev_ioctl+0x451/0x610 [zfs]
Nov 12 13:34:40 nitrogen kernel:  ? __switch_to_asm+0x35/0x70
Nov 12 13:34:40 nitrogen kernel:  do_vfs_ioctl+0xa9/0x640
Nov 12 13:34:40 nitrogen kernel:  ? __schedule+0x2c8/0x870
Nov 12 13:34:40 nitrogen kernel:  ksys_ioctl+0x75/0x80
Nov 12 13:34:40 nitrogen kernel:  __x64_sys_ioctl+0x1a/0x20
Nov 12 13:34:40 nitrogen kernel:  do_syscall_64+0x5a/0x120
Nov 12 13:34:40 nitrogen kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 12 13:34:40 nitrogen kernel: RIP: 0033:0x7f58bdab05d7
Nov 12 13:34:40 nitrogen kernel: Code: Bad RIP value.
Nov 12 13:34:40 nitrogen kernel: RSP: 002b:00007ffe3f4f8678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 12 13:34:40 nitrogen kernel: RAX: ffffffffffffffda RBX: 0000000000005a3b RCX: 00007f58bdab05d7
Nov 12 13:34:40 nitrogen kernel: RDX: 00007ffe3f4f8690 RSI: 0000000000005a3b RDI: 0000000000000005
Nov 12 13:34:40 nitrogen kernel: RBP: 00007ffe3f4f8690 R08: 0000000000000000 R09: 0000000000000000
Nov 12 13:34:40 nitrogen kernel: R10: 000055c0aa9c7010 R11: 0000000000000246 R12: 00007ffe3f4fbdc8
Nov 12 13:34:40 nitrogen kernel: R13: 0000000000000005 R14: 000055c0bb179540 R15: 000000000000000c
Nov 12 13:36:41 nitrogen kernel: INFO: task zfs:24530 blocked for more than 120 seconds.
Nov 12 13:36:41 nitrogen kernel:       Tainted: P           OE     5.0.0-31-generic #33~18.04.1-Ubuntu
Nov 12 13:36:41 nitrogen kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 13:36:41 nitrogen kernel: zfs             D    0 24530  18116 0x00000002
Nov 12 13:36:41 nitrogen kernel: Call Trace:
Nov 12 13:36:41 nitrogen kernel:  __schedule+0x2c0/0x870
Nov 12 13:36:41 nitrogen kernel:  ? __wake_up_common+0x73/0x130
Nov 12 13:36:41 nitrogen kernel:  schedule+0x2c/0x70
Nov 12 13:36:41 nitrogen kernel:  cv_wait_common+0x11e/0x140 [spl]
Nov 12 13:36:41 nitrogen kernel:  ? wait_woken+0x80/0x80
Nov 12 13:36:41 nitrogen kernel:  __cv_wait+0x15/0x20 [spl]
Nov 12 13:36:41 nitrogen kernel:  txg_wait_synced+0xdd/0x130 [zfs]
Nov 12 13:36:41 nitrogen kernel:  dsl_sync_task+0x178/0x260 [zfs]
Nov 12 13:36:41 nitrogen kernel:  ? dsl_destroy_snapshot_check_impl+0x140/0x140 [zfs]
Nov 12 13:36:41 nitrogen kernel:  ? dsl_destroy_snapshot_sync_impl+0xa30/0xa30 [zfs]
Nov 12 13:36:41 nitrogen kernel:  ? dsl_destroy_snapshot_check_impl+0x140/0x140 [zfs]
Nov 12 13:36:41 nitrogen kernel:  ? dsl_destroy_snapshot_sync_impl+0xa30/0xa30 [zfs]
Nov 12 13:36:41 nitrogen kernel:  dsl_destroy_snapshots_nvl+0x90/0xf0 [zfs]
Nov 12 13:36:41 nitrogen kernel:  zfs_ioc_destroy_snaps+0x87/0xd0 [zfs]
Nov 12 13:36:41 nitrogen kernel:  zfsdev_ioctl+0x451/0x610 [zfs]
Nov 12 13:36:41 nitrogen kernel:  ? __switch_to_asm+0x35/0x70
Nov 12 13:36:41 nitrogen kernel:  do_vfs_ioctl+0xa9/0x640
Nov 12 13:36:41 nitrogen kernel:  ? __schedule+0x2c8/0x870
Nov 12 13:36:41 nitrogen kernel:  ksys_ioctl+0x75/0x80
Nov 12 13:36:41 nitrogen kernel:  __x64_sys_ioctl+0x1a/0x20
Nov 12 13:36:41 nitrogen kernel:  do_syscall_64+0x5a/0x120
Nov 12 13:36:41 nitrogen kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 12 13:36:41 nitrogen kernel: RIP: 0033:0x7f58bdab05d7
Nov 12 13:36:41 nitrogen kernel: Code: Bad RIP value.
Nov 12 13:36:41 nitrogen kernel: RSP: 002b:00007ffe3f4f8678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 12 13:36:41 nitrogen kernel: RAX: ffffffffffffffda RBX: 0000000000005a3b RCX: 00007f58bdab05d7
Nov 12 13:36:41 nitrogen kernel: RDX: 00007ffe3f4f8690 RSI: 0000000000005a3b RDI: 0000000000000005
Nov 12 13:36:41 nitrogen kernel: RBP: 00007ffe3f4f8690 R08: 0000000000000000 R09: 0000000000000000
Nov 12 13:36:41 nitrogen kernel: R10: 000055c0aa9c7010 R11: 0000000000000246 R12: 00007ffe3f4fbdc8
Nov 12 13:36:41 nitrogen kernel: R13: 0000000000000005 R14: 000055c0bb179540 R15: 000000000000000c
@pkramme
Copy link
Author

pkramme commented Nov 12, 2019

I asked in #9130 wether the problem there is the same I am experiencing. As I am not sure, I am opening the issue here.

@pkramme pkramme changed the title Pool hangs when trying to delete many snapshots Pool hangs when deleting snapshots Nov 12, 2019
@ahrens ahrens added the Type: Performance Performance improvement or performance problem label Nov 12, 2019
@PrivatePuffin
Copy link
Contributor

I'm going to make a somewhat educated guess:
For some reason there is not enough memory or IO, thats my experience with low memory proxmox VM's and these errors.

When you talk about "pool hanging", isn't just the pool saturated with IO? If so, could it be that deleting snapshots (for some reason) floods it with high priority io?

@pkramme
Copy link
Author

pkramme commented Nov 13, 2019

The server has 192GB RAM, so i guess that is not the problem. High IO might be a hit, I started the destruction of the snapshots, and captured the disk IO with zpool iostat -lq 1. This is not the complete output.
zpool_iostat_during_snapshot_destruction.log

Hardwarewise this is a RAIDZ1 with 6 disks, so this might actually be the maximum.

@PrivatePuffin
Copy link
Contributor

Might this be related to #9583 ?

@pkramme
Copy link
Author

pkramme commented Nov 13, 2019

I attached more logs, maybe there is helpful information in them, but it seems to be the same stacktrace.
dmesg_zfs_hung_tasks_after_deleting_many_snapshots2.log

@Ornias1993 In the linked PR it is suggested that this only happens on pools without any cache devices or SSDs. Would it be helpful to attach an SSD cache to the pool and see if the "hang" still occurs?

Aside from that, what can I do to help finding the issue?

@PrivatePuffin
Copy link
Contributor

Your explaination is quite clear to that issue...
Adding a cache... You can try it.
If it solves your problem that's even more evidence #9583 is the culprid.
if it doesn't solve your problem it might still be #9583 (its not 100% certain about the ssd thing), but it could be other things.

Worth the try if you asked me.

@stale
Copy link

stale bot commented Nov 12, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Nov 12, 2020
@stale stale bot closed this as completed Feb 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Stale No recent activity for issue Type: Performance Performance improvement or performance problem
Projects
None yet
Development

No branches or pull requests

3 participants