Snapshot Delete Hang #1283

Closed
iamacarpet opened this issue Feb 10, 2013 · 17 comments

@iamacarpet

Hello,

I have an issue with my box running ZFS: it hangs when the snapshot jobs run at night.

I have three 3TB WD Red drives in RAID-Z, with the pool split into several datasets:

zpool status

pool: tank
state: ONLINE
scan: scrub repaired 0 in 23h46m with 0 errors on Thu Feb 7 10:49:02 2013
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  raidz1-0  ONLINE       0     0     0
    sdb     ONLINE       0     0     0
    sdc     ONLINE       0     0     0
    sdd     ONLINE       0     0     0

errors: No known data errors

zpool list

NAME   SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
tank   8.12T  2.15T  5.98T  26%  1.02x  ONLINE  -

zfs list

NAME           USED   AVAIL  REFER  MOUNTPOINT
tank           1.46T  3.90T  261K   /tank
tank/backups   4.41G  3.90T  1.89G  /tank/backups
tank/home      30.4G  3.90T  19.1G  /tank/home
tank/media     1.30T  3.90T  1.30T  /tank/media
tank/secret    120G   3.90T  120G   -

I am using the snapshot.py script from https://github.com/mmehnert/zfs-snapshot-rotation-and-backup in the following configuration:

m h dom mon dow user command

15 * * * * root /usr/bin/python /tank/scripts/snapshot.py -k 48 tank/backups hourly
20 * * * * root /usr/bin/python /tank/scripts/snapshot.py -k 48 tank/home hourly
10 23 * * * root /usr/bin/python /tank/scripts/snapshot.py -k 30 tank/backups daily
30 23 * * * root /usr/bin/python /tank/scripts/snapshot.py -k 3 tank/media daily
40 23 * * * root /usr/bin/python /tank/scripts/snapshot.py -k 30 tank/home daily
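For reference, each of those entries boils down to roughly the following (a sketch of what I understand snapshot.py to be doing; the exact snapshot naming scheme is my assumption based on the names I see in zfs list):

# take a new snapshot with the given prefix, then prune the oldest
# ones so that only the newest 30 remain (the "-k 30" case)
zfs snapshot "tank/home@daily-$(date +%Y-%m-%d--%H-%M-%S)"
zfs list -H -t snapshot -o name -s creation -r tank/home \
    | grep '@daily-' \
    | head -n -30 \
    | xargs -r -n 1 zfs destroy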

Every morning, after having to force-reboot the box, I check the logs and see the following:

Feb 6 23:21:12 santos kernel: [44521.368218] INFO: task txg_sync:522 blocked for more than 120 seconds.
Feb 6 23:21:12 santos kernel: [44521.368279] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 6 23:21:12 santos kernel: [44521.368333] txg_sync D 0000000000000001 0 522 2 0x00000000
Feb 6 23:21:12 santos kernel: [44521.368344] ffff880075155b40 0000000000000046 ffff880075155af0 ffff88007943af98
Feb 6 23:21:12 santos kernel: [44521.368356] ffff880075155fd8 ffff880075155fd8 ffff880075155fd8 00000000000137c0
Feb 6 23:21:12 santos kernel: [44521.368366] ffff880076ee5c00 ffff880079660000 ffff880075155b30 ffff88007dc94080
Feb 6 23:21:12 santos kernel: [44521.368375] Call Trace:
Feb 6 23:21:12 santos kernel: [44521.368390] [] schedule+0x3f/0x60
Feb 6 23:21:12 santos kernel: [44521.368398] [] io_schedule+0x8f/0xd0
Feb 6 23:21:12 santos kernel: [44521.368441] [] cv_wait_common+0xa8/0x1b0 [spl]
Feb 6 23:21:12 santos kernel: [44521.368452] [] ? add_wait_queue+0x60/0x60
Feb 6 23:21:12 santos kernel: [44521.368477] [] __cv_wait_io+0x18/0x20 [spl]
Feb 6 23:21:12 santos kernel: [44521.368567] [] zio_wait+0x103/0x1a0 [zfs]
Feb 6 23:21:12 santos kernel: [44521.368641] [] dsl_scan_sync+0x488/0xaa0 [zfs]
Feb 6 23:21:12 santos kernel: [44521.368719] [] spa_sync+0x40b/0xa50 [zfs]
Feb 6 23:21:12 santos kernel: [44521.368798] [] txg_sync_thread+0x2df/0x540 [zfs]
Feb 6 23:21:12 santos kernel: [44521.368879] [] ? txg_init+0x250/0x250 [zfs]
Feb 6 23:21:12 santos kernel: [44521.368902] [] thread_generic_wrapper+0x78/0x90 [spl]
Feb 6 23:21:12 santos kernel: [44521.368924] [] ? __thread_create+0x310/0x310 [spl]
Feb 6 23:21:12 santos kernel: [44521.368932] [] kthread+0x8c/0xa0
Feb 6 23:21:12 santos kernel: [44521.368942] [] kernel_thread_helper+0x4/0x10
Feb 6 23:21:12 santos kernel: [44521.368950] [] ? flush_kthread_worker+0xa0/0xa0
Feb 6 23:21:12 santos kernel: [44521.368958] [] ? gs_change+0x13/0x13
Feb 6 23:21:12 santos kernel: [44521.369004] INFO: task zfs:10515 blocked for more than 120 seconds.
Feb 6 23:21:12 santos kernel: [44521.369047] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 6 23:21:12 santos kernel: [44521.369098] zfs D 0000000000000000 0 10515 10514 0x00000000
Feb 6 23:21:12 santos kernel: [44521.369108] ffff8800044dfc88 0000000000000082 ffff8800044dfd28 ffffffffa00b5973
Feb 6 23:21:12 santos kernel: [44521.369118] ffff8800044dffd8 ffff8800044dffd8 ffff8800044dffd8 00000000000137c0
Feb 6 23:21:12 santos kernel: [44521.369127] ffff880079661700 ffff880079ffc500 ffff8800044dfc98 ffff880078fa2a90
Feb 6 23:21:12 santos kernel: [44521.369136] Call Trace:
Feb 6 23:21:12 santos kernel: [44521.369157] [] ? spl_debug_msg+0x483/0x8e0 [spl]
Feb 6 23:21:12 santos kernel: [44521.369165] [] schedule+0x3f/0x60
Feb 6 23:21:12 santos kernel: [44521.369188] [] cv_wait_common+0xfd/0x1b0 [spl]
Feb 6 23:21:12 santos kernel: [44521.369197] [] ? add_wait_queue+0x60/0x60
Feb 6 23:21:12 santos kernel: [44521.369219] [] __cv_wait+0x15/0x20 [spl]
Feb 6 23:21:12 santos kernel: [44521.369299] [] txg_wait_synced+0xb3/0x190 [zfs]
Feb 6 23:21:12 santos kernel: [44521.369375] [] dsl_sync_task_group_wait+0x17d/0x2a0 [zfs]
Feb 6 23:21:12 santos kernel: [44521.369446] [] dmu_snapshots_destroy_nvl+0x128/0x1f0 [zfs]
Feb 6 23:21:12 santos kernel: [44521.369523] [] ? zfs_unmount_snap+0x92/0x130 [zfs]
Feb 6 23:21:12 santos kernel: [44521.369600] [] zfs_ioc_destroy_snaps_nvl+0xfa/0x130 [zfs]
Feb 6 23:21:12 santos kernel: [44521.369677] [] ? spa_open+0x13/0x20 [zfs]
Feb 6 23:21:12 santos kernel: [44521.369755] [] zfsdev_ioctl+0xdc/0x1b0 [zfs]
Feb 6 23:21:12 santos kernel: [44521.369765] [] do_vfs_ioctl+0x8a/0x340
Feb 6 23:21:12 santos kernel: [44521.369774] [] ? __schedule+0x3cc/0x6f0
Feb 6 23:21:12 santos kernel: [44521.369783] [] ? do_munmap+0x1f3/0x2f0
Feb 6 23:21:12 santos kernel: [44521.369791] [] sys_ioctl+0x91/0xa0
Feb 6 23:21:12 santos kernel: [44521.369799] [] system_call_fastpath+0x16/0x1b

It appears to be related to the daily snapshots taken at night, so I've adjusted the schedule so they run 10 minutes apart, but the same thing still happens.

zpool events just shows the following, although the event log appears to be non-persistent and I haven't yet been able to get back onto the box after a hang to investigate.

zpool events

TIME CLASS
Feb 9 2013 12:37:57.658878000 resource.fs.zfs.statechange
Feb 9 2013 12:37:57.658882000 resource.fs.zfs.statechange
Feb 9 2013 12:37:57.659034000 resource.fs.zfs.statechange
Feb 9 2013 12:37:57.935939000 resource.fs.zfs.statechange
Feb 9 2013 12:37:57.935947000 resource.fs.zfs.statechange
Feb 9 2013 12:37:57.936016000 resource.fs.zfs.statechange

Does anyone have any ideas what it might be?

@iamacarpet
Author

And this is the latest night's worth of logs:

Feb 9 23:43:46 santos kernel: [39961.324190] INFO: task txg_sync:520 blocked for more than 120 seconds.
Feb 9 23:43:46 santos kernel: [39961.324250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:43:46 santos kernel: [39961.324303] txg_sync D ffffffff81806240 0 520 2 0x00000000
Feb 9 23:43:46 santos kernel: [39961.324315] ffff880069095b40 0000000000000046 ffff880069095af0 ffff880079986298
Feb 9 23:43:46 santos kernel: [39961.324327] ffff880069095fd8 ffff880069095fd8 ffff880069095fd8 00000000000137c0
Feb 9 23:43:46 santos kernel: [39961.324337] ffff880079811700 ffff88007736dc00 ffff880069095b30 ffff88007dc94080
Feb 9 23:43:46 santos kernel: [39961.324346] Call Trace:
Feb 9 23:43:46 santos kernel: [39961.324361] [] schedule+0x3f/0x60
Feb 9 23:43:46 santos kernel: [39961.324369] [] io_schedule+0x8f/0xd0
Feb 9 23:43:46 santos kernel: [39961.324410] [] cv_wait_common+0xa8/0x1b0 [spl]
Feb 9 23:43:46 santos kernel: [39961.324420] [] ? add_wait_queue+0x60/0x60
Feb 9 23:43:46 santos kernel: [39961.324446] [] __cv_wait_io+0x18/0x20 [spl]
Feb 9 23:43:46 santos kernel: [39961.324536] [] zio_wait+0x103/0x1a0 [zfs]
Feb 9 23:43:46 santos kernel: [39961.324610] [] dsl_scan_sync+0x488/0xaa0 [zfs]
Feb 9 23:43:46 santos kernel: [39961.324688] [] spa_sync+0x40b/0xa50 [zfs]
Feb 9 23:43:46 santos kernel: [39961.324766] [] txg_sync_thread+0x2df/0x540 [zfs]
Feb 9 23:43:46 santos kernel: [39961.324846] [] ? txg_init+0x250/0x250 [zfs]
Feb 9 23:43:46 santos kernel: [39961.324869] [] thread_generic_wrapper+0x78/0x90 [spl]
Feb 9 23:43:46 santos kernel: [39961.324891] [] ? __thread_create+0x310/0x310 [spl]
Feb 9 23:43:46 santos kernel: [39961.324899] [] kthread+0x8c/0xa0
Feb 9 23:43:46 santos kernel: [39961.324909] [] kernel_thread_helper+0x4/0x10
Feb 9 23:43:46 santos kernel: [39961.324917] [] ? flush_kthread_worker+0xa0/0xa0
Feb 9 23:43:46 santos kernel: [39961.324925] [] ? gs_change+0x13/0x13
Feb 9 23:43:46 santos kernel: [39961.324968] INFO: task zfs:4376 blocked for more than 120 seconds.
Feb 9 23:43:46 santos kernel: [39961.325012] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:43:46 santos kernel: [39961.325063] zfs D ffffffff81806240 0 4376 4375 0x00000000
Feb 9 23:43:46 santos kernel: [39961.325073] ffff88004f849c88 0000000000000086 0000000000000001 ffff8800799ada58
Feb 9 23:43:46 santos kernel: [39961.325083] ffff88004f849fd8 ffff88004f849fd8 ffff88004f849fd8 00000000000137c0
Feb 9 23:43:46 santos kernel: [39961.325092] ffff880079811700 ffff8800643c4500 ffff88004f849c98 ffff8800799ada90
Feb 9 23:43:46 santos kernel: [39961.325101] Call Trace:
Feb 9 23:43:46 santos kernel: [39961.325108] [] schedule+0x3f/0x60
Feb 9 23:43:46 santos kernel: [39961.325132] [] cv_wait_common+0xfd/0x1b0 [spl]
Feb 9 23:43:46 santos kernel: [39961.325141] [] ? add_wait_queue+0x60/0x60
Feb 9 23:43:46 santos kernel: [39961.325163] [] __cv_wait+0x15/0x20 [spl]
Feb 9 23:43:46 santos kernel: [39961.325241] [] txg_wait_synced+0xb3/0x190 [zfs]
Feb 9 23:43:46 santos kernel: [39961.325317] [] dsl_sync_task_group_wait+0x17d/0x2a0 [zfs]
Feb 9 23:43:46 santos kernel: [39961.325387] [] dmu_snapshots_destroy_nvl+0x128/0x1f0 [zfs]
Feb 9 23:43:46 santos kernel: [39961.325463] [] ? zfs_unmount_snap+0x92/0x130 [zfs]
Feb 9 23:43:46 santos kernel: [39961.325540] [] zfs_ioc_destroy_snaps_nvl+0xfa/0x130 [zfs]
Feb 9 23:43:46 santos kernel: [39961.325617] [] ? spa_open+0x13/0x20 [zfs]
Feb 9 23:43:46 santos kernel: [39961.325694] [] zfsdev_ioctl+0xdc/0x1b0 [zfs]
Feb 9 23:43:46 santos kernel: [39961.325704] [] do_vfs_ioctl+0x8a/0x340
Feb 9 23:43:46 santos kernel: [39961.325713] [] ? do_munmap+0x1f3/0x2f0
Feb 9 23:43:46 santos kernel: [39961.325721] [] sys_ioctl+0x91/0xa0
Feb 9 23:43:46 santos kernel: [39961.325729] [] system_call_fastpath+0x16/0x1b
Feb 9 23:45:46 santos kernel: [40081.324162] INFO: task txg_sync:520 blocked for more than 120 seconds.
Feb 9 23:45:46 santos kernel: [40081.324221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:45:46 santos kernel: [40081.324274] txg_sync D ffffffff81806240 0 520 2 0x00000000
Feb 9 23:45:46 santos kernel: [40081.324285] ffff880069095b40 0000000000000046 ffff880069095af0 ffff880079986298
Feb 9 23:45:46 santos kernel: [40081.324297] ffff880069095fd8 ffff880069095fd8 ffff880069095fd8 00000000000137c0
Feb 9 23:45:46 santos kernel: [40081.324307] ffff880079811700 ffff88007736dc00 ffff880069095b30 ffff88007dc94080
Feb 9 23:45:46 santos kernel: [40081.324320] Call Trace:
Feb 9 23:45:46 santos kernel: [40081.324336] [] schedule+0x3f/0x60
Feb 9 23:45:46 santos kernel: [40081.324344] [] io_schedule+0x8f/0xd0
Feb 9 23:45:46 santos kernel: [40081.324385] [] cv_wait_common+0xa8/0x1b0 [spl]
Feb 9 23:45:46 santos kernel: [40081.324396] [] ? add_wait_queue+0x60/0x60
Feb 9 23:45:46 santos kernel: [40081.324421] [] __cv_wait_io+0x18/0x20 [spl]
Feb 9 23:45:46 santos kernel: [40081.324512] [] zio_wait+0x103/0x1a0 [zfs]
Feb 9 23:45:46 santos kernel: [40081.324586] [] dsl_scan_sync+0x488/0xaa0 [zfs]
Feb 9 23:45:46 santos kernel: [40081.324663] [] spa_sync+0x40b/0xa50 [zfs]
Feb 9 23:45:46 santos kernel: [40081.324742] [] txg_sync_thread+0x2df/0x540 [zfs]
Feb 9 23:45:46 santos kernel: [40081.324822] [] ? txg_init+0x250/0x250 [zfs]
Feb 9 23:45:46 santos kernel: [40081.324845] [] thread_generic_wrapper+0x78/0x90 [spl]
Feb 9 23:45:46 santos kernel: [40081.324866] [] ? __thread_create+0x310/0x310 [spl]
Feb 9 23:45:46 santos kernel: [40081.324875] [] kthread+0x8c/0xa0
Feb 9 23:45:46 santos kernel: [40081.324884] [] kernel_thread_helper+0x4/0x10
Feb 9 23:45:46 santos kernel: [40081.324893] [] ? flush_kthread_worker+0xa0/0xa0
Feb 9 23:45:46 santos kernel: [40081.324901] [] ? gs_change+0x13/0x13
Feb 9 23:45:46 santos kernel: [40081.324947] INFO: task zfs:4376 blocked for more than 120 seconds.
Feb 9 23:45:46 santos kernel: [40081.324991] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:45:46 santos kernel: [40081.325043] zfs D ffffffff81806240 0 4376 4375 0x00000000
Feb 9 23:45:46 santos kernel: [40081.325052] ffff88004f849c88 0000000000000086 0000000000000001 ffff8800799ada58
Feb 9 23:45:46 santos kernel: [40081.325062] ffff88004f849fd8 ffff88004f849fd8 ffff88004f849fd8 00000000000137c0
Feb 9 23:45:46 santos kernel: [40081.325072] ffff880079811700 ffff8800643c4500 ffff88004f849c98 ffff8800799ada90
Feb 9 23:45:46 santos kernel: [40081.325081] Call Trace:
Feb 9 23:45:46 santos kernel: [40081.325088] [] schedule+0x3f/0x60
Feb 9 23:45:46 santos kernel: [40081.325112] [] cv_wait_common+0xfd/0x1b0 [spl]
Feb 9 23:45:46 santos kernel: [40081.325120] [] ? add_wait_queue+0x60/0x60
Feb 9 23:45:46 santos kernel: [40081.325143] [] __cv_wait+0x15/0x20 [spl]
Feb 9 23:45:46 santos kernel: [40081.325221] [] txg_wait_synced+0xb3/0x190 [zfs]
Feb 9 23:45:46 santos kernel: [40081.325296] [] dsl_sync_task_group_wait+0x17d/0x2a0 [zfs]
Feb 9 23:45:46 santos kernel: [40081.325366] [] dmu_snapshots_destroy_nvl+0x128/0x1f0 [zfs]
Feb 9 23:45:46 santos kernel: [40081.325443] [] ? zfs_unmount_snap+0x92/0x130 [zfs]
Feb 9 23:45:46 santos kernel: [40081.325519] [] zfs_ioc_destroy_snaps_nvl+0xfa/0x130 [zfs]
Feb 9 23:45:46 santos kernel: [40081.325596] [] ? spa_open+0x13/0x20 [zfs]
Feb 9 23:45:46 santos kernel: [40081.325673] [] zfsdev_ioctl+0xdc/0x1b0 [zfs]
Feb 9 23:45:46 santos kernel: [40081.325683] [] do_vfs_ioctl+0x8a/0x340
Feb 9 23:45:46 santos kernel: [40081.325692] [] ? do_munmap+0x1f3/0x2f0
Feb 9 23:45:46 santos kernel: [40081.325700] [] sys_ioctl+0x91/0xa0
Feb 9 23:45:46 santos kernel: [40081.325708] [] system_call_fastpath+0x16/0x1b
Feb 9 23:47:46 santos kernel: [40201.324187] INFO: task txg_sync:520 blocked for more than 120 seconds.
Feb 9 23:47:46 santos kernel: [40201.324246] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:47:46 santos kernel: [40201.324299] txg_sync D ffffffff81806240 0 520 2 0x00000000
Feb 9 23:47:46 santos kernel: [40201.324311] ffff880069095b40 0000000000000046 ffff880069095af0 ffff880079986298
Feb 9 23:47:46 santos kernel: [40201.324323] ffff880069095fd8 ffff880069095fd8 ffff880069095fd8 00000000000137c0
Feb 9 23:47:46 santos kernel: [40201.324333] ffff880079811700 ffff88007736dc00 ffff880069095b30 ffff88007dc94080
Feb 9 23:47:46 santos kernel: [40201.324342] Call Trace:
Feb 9 23:47:46 santos kernel: [40201.324357] [] schedule+0x3f/0x60
Feb 9 23:47:46 santos kernel: [40201.324365] [] io_schedule+0x8f/0xd0
Feb 9 23:47:46 santos kernel: [40201.324406] [] cv_wait_common+0xa8/0x1b0 [spl]
Feb 9 23:47:46 santos kernel: [40201.324416] [] ? add_wait_queue+0x60/0x60
Feb 9 23:47:46 santos kernel: [40201.324442] [] __cv_wait_io+0x18/0x20 [spl]
Feb 9 23:47:46 santos kernel: [40201.324532] [] zio_wait+0x103/0x1a0 [zfs]
Feb 9 23:47:46 santos kernel: [40201.324606] [] dsl_scan_sync+0x488/0xaa0 [zfs]
Feb 9 23:47:46 santos kernel: [40201.324683] [] spa_sync+0x40b/0xa50 [zfs]
Feb 9 23:47:46 santos kernel: [40201.324762] [] txg_sync_thread+0x2df/0x540 [zfs]
Feb 9 23:47:46 santos kernel: [40201.324841] [] ? txg_init+0x250/0x250 [zfs]
Feb 9 23:47:46 santos kernel: [40201.324865] [] thread_generic_wrapper+0x78/0x90 [spl]
Feb 9 23:47:46 santos kernel: [40201.324886] [] ? __thread_create+0x310/0x310 [spl]
Feb 9 23:47:46 santos kernel: [40201.324894] [] kthread+0x8c/0xa0
Feb 9 23:47:46 santos kernel: [40201.324904] [] kernel_thread_helper+0x4/0x10
Feb 9 23:47:46 santos kernel: [40201.324912] [] ? flush_kthread_worker+0xa0/0xa0
Feb 9 23:47:46 santos kernel: [40201.324920] [] ? gs_change+0x13/0x13
Feb 9 23:47:46 santos kernel: [40201.324964] INFO: task zfs:4376 blocked for more than 120 seconds.
Feb 9 23:47:46 santos kernel: [40201.325007] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:47:46 santos kernel: [40201.325059] zfs D ffffffff81806240 0 4376 4375 0x00000000
Feb 9 23:47:46 santos kernel: [40201.325068] ffff88004f849c88 0000000000000086 0000000000000001 ffff8800799ada58
Feb 9 23:47:46 santos kernel: [40201.325078] ffff88004f849fd8 ffff88004f849fd8 ffff88004f849fd8 00000000000137c0
Feb 9 23:47:46 santos kernel: [40201.325088] ffff880079811700 ffff8800643c4500 ffff88004f849c98 ffff8800799ada90
Feb 9 23:47:46 santos kernel: [40201.325097] Call Trace:
Feb 9 23:47:46 santos kernel: [40201.325104] [] schedule+0x3f/0x60
Feb 9 23:47:46 santos kernel: [40201.325128] [] cv_wait_common+0xfd/0x1b0 [spl]
Feb 9 23:47:46 santos kernel: [40201.325137] [] ? add_wait_queue+0x60/0x60
Feb 9 23:47:46 santos kernel: [40201.325159] [] __cv_wait+0x15/0x20 [spl]
Feb 9 23:47:46 santos kernel: [40201.325237] [] txg_wait_synced+0xb3/0x190 [zfs]
Feb 9 23:47:46 santos kernel: [40201.325313] [] dsl_sync_task_group_wait+0x17d/0x2a0 [zfs]
Feb 9 23:47:46 santos kernel: [40201.325383] [] dmu_snapshots_destroy_nvl+0x128/0x1f0 [zfs]
Feb 9 23:47:46 santos kernel: [40201.325460] [] ? zfs_unmount_snap+0x92/0x130 [zfs]
Feb 9 23:47:46 santos kernel: [40201.325536] [] zfs_ioc_destroy_snaps_nvl+0xfa/0x130 [zfs]
Feb 9 23:47:46 santos kernel: [40201.325613] [] ? spa_open+0x13/0x20 [zfs]
Feb 9 23:47:46 santos kernel: [40201.325690] [] zfsdev_ioctl+0xdc/0x1b0 [zfs]
Feb 9 23:47:46 santos kernel: [40201.325700] [] do_vfs_ioctl+0x8a/0x340
Feb 9 23:47:46 santos kernel: [40201.325709] [] ? do_munmap+0x1f3/0x2f0
Feb 9 23:47:46 santos kernel: [40201.325717] [] sys_ioctl+0x91/0xa0
Feb 9 23:47:46 santos kernel: [40201.325725] [] system_call_fastpath+0x16/0x1b
Feb 9 23:49:46 santos kernel: [40321.324167] INFO: task txg_sync:520 blocked for more than 120 seconds.
Feb 9 23:49:46 santos kernel: [40321.324226] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:49:46 santos kernel: [40321.324279] txg_sync D ffffffff81806240 0 520 2 0x00000000
Feb 9 23:49:46 santos kernel: [40321.324291] ffff880069095b40 0000000000000046 ffff880069095af0 ffff880079986298
Feb 9 23:49:46 santos kernel: [40321.324302] ffff880069095fd8 ffff880069095fd8 ffff880069095fd8 00000000000137c0
Feb 9 23:49:46 santos kernel: [40321.324312] ffff880079811700 ffff88007736dc00 ffff880069095b30 ffff88007dc94080
Feb 9 23:49:46 santos kernel: [40321.324322] Call Trace:
Feb 9 23:49:46 santos kernel: [40321.324336] [] schedule+0x3f/0x60
Feb 9 23:49:46 santos kernel: [40321.324344] [] io_schedule+0x8f/0xd0
Feb 9 23:49:46 santos kernel: [40321.324385] [] cv_wait_common+0xa8/0x1b0 [spl]
Feb 9 23:49:46 santos kernel: [40321.324395] [] ? add_wait_queue+0x60/0x60
Feb 9 23:49:46 santos kernel: [40321.324421] [] __cv_wait_io+0x18/0x20 [spl]
Feb 9 23:49:46 santos kernel: [40321.324510] [] zio_wait+0x103/0x1a0 [zfs]
Feb 9 23:49:46 santos kernel: [40321.324584] [] dsl_scan_sync+0x488/0xaa0 [zfs]
Feb 9 23:49:46 santos kernel: [40321.324661] [] spa_sync+0x40b/0xa50 [zfs]
Feb 9 23:49:46 santos kernel: [40321.324740] [] txg_sync_thread+0x2df/0x540 [zfs]
Feb 9 23:49:46 santos kernel: [40321.324819] [] ? txg_init+0x250/0x250 [zfs]
Feb 9 23:49:46 santos kernel: [40321.324843] [] thread_generic_wrapper+0x78/0x90 [spl]
Feb 9 23:49:46 santos kernel: [40321.324864] [] ? __thread_create+0x310/0x310 [spl]
Feb 9 23:49:46 santos kernel: [40321.324872] [] kthread+0x8c/0xa0
Feb 9 23:49:46 santos kernel: [40321.324882] [] kernel_thread_helper+0x4/0x10
Feb 9 23:49:46 santos kernel: [40321.324890] [] ? flush_kthread_worker+0xa0/0xa0
Feb 9 23:49:46 santos kernel: [40321.324898] [] ? gs_change+0x13/0x13
Feb 9 23:49:46 santos kernel: [40321.324943] INFO: task zfs:4376 blocked for more than 120 seconds.
Feb 9 23:49:46 santos kernel: [40321.324987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:49:46 santos kernel: [40321.325038] zfs D ffffffff81806240 0 4376 4375 0x00000000
Feb 9 23:49:46 santos kernel: [40321.325048] ffff88004f849c88 0000000000000086 0000000000000001 ffff8800799ada58
Feb 9 23:49:46 santos kernel: [40321.325057] ffff88004f849fd8 ffff88004f849fd8 ffff88004f849fd8 00000000000137c0
Feb 9 23:49:46 santos kernel: [40321.325067] ffff880079811700 ffff8800643c4500 ffff88004f849c98 ffff8800799ada90
Feb 9 23:49:46 santos kernel: [40321.325076] Call Trace:
Feb 9 23:49:46 santos kernel: [40321.325083] [] schedule+0x3f/0x60
Feb 9 23:49:46 santos kernel: [40321.325107] [] cv_wait_common+0xfd/0x1b0 [spl]
Feb 9 23:49:46 santos kernel: [40321.325116] [] ? add_wait_queue+0x60/0x60
Feb 9 23:49:46 santos kernel: [40321.325138] [] __cv_wait+0x15/0x20 [spl]
Feb 9 23:49:46 santos kernel: [40321.325216] [] txg_wait_synced+0xb3/0x190 [zfs]
Feb 9 23:49:46 santos kernel: [40321.325291] [] dsl_sync_task_group_wait+0x17d/0x2a0 [zfs]
Feb 9 23:49:46 santos kernel: [40321.325361] [] dmu_snapshots_destroy_nvl+0x128/0x1f0 [zfs]
Feb 9 23:49:46 santos kernel: [40321.325438] [] ? zfs_unmount_snap+0x92/0x130 [zfs]
Feb 9 23:49:46 santos kernel: [40321.325515] [] zfs_ioc_destroy_snaps_nvl+0xfa/0x130 [zfs]
Feb 9 23:49:46 santos kernel: [40321.325592] [] ? spa_open+0x13/0x20 [zfs]
Feb 9 23:49:46 santos kernel: [40321.325669] [] zfsdev_ioctl+0xdc/0x1b0 [zfs]
Feb 9 23:49:46 santos kernel: [40321.325679] [] do_vfs_ioctl+0x8a/0x340
Feb 9 23:49:46 santos kernel: [40321.325688] [] ? do_munmap+0x1f3/0x2f0
Feb 9 23:49:46 santos kernel: [40321.325696] [] sys_ioctl+0x91/0xa0
Feb 9 23:49:46 santos kernel: [40321.325704] [] system_call_fastpath+0x16/0x1b
Feb 9 23:51:46 santos kernel: [40441.324187] INFO: task txg_sync:520 blocked for more than 120 seconds.
Feb 9 23:51:46 santos kernel: [40441.324246] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:51:46 santos kernel: [40441.324299] txg_sync D ffffffff81806240 0 520 2 0x00000000
Feb 9 23:51:46 santos kernel: [40441.324311] ffff880069095b40 0000000000000046 ffff880069095af0 ffff880079986298
Feb 9 23:51:46 santos kernel: [40441.324323] ffff880069095fd8 ffff880069095fd8 ffff880069095fd8 00000000000137c0
Feb 9 23:51:46 santos kernel: [40441.324333] ffff880079811700 ffff88007736dc00 ffff880069095b30 ffff88007dc94080
Feb 9 23:51:46 santos kernel: [40441.324343] Call Trace:
Feb 9 23:51:46 santos kernel: [40441.324357] [] schedule+0x3f/0x60
Feb 9 23:51:46 santos kernel: [40441.324365] [] io_schedule+0x8f/0xd0
Feb 9 23:51:46 santos kernel: [40441.324406] [] cv_wait_common+0xa8/0x1b0 [spl]
Feb 9 23:51:46 santos kernel: [40441.324422] [] ? add_wait_queue+0x60/0x60
Feb 9 23:51:46 santos kernel: [40441.324447] [] __cv_wait_io+0x18/0x20 [spl]
Feb 9 23:51:46 santos kernel: [40441.324539] [] zio_wait+0x103/0x1a0 [zfs]
Feb 9 23:51:46 santos kernel: [40441.324621] [] dsl_scan_sync+0x488/0xaa0 [zfs]
Feb 9 23:51:46 santos kernel: [40441.324698] [] spa_sync+0x40b/0xa50 [zfs]
Feb 9 23:51:46 santos kernel: [40441.324777] [] txg_sync_thread+0x2df/0x540 [zfs]
Feb 9 23:51:46 santos kernel: [40441.324857] [] ? txg_init+0x250/0x250 [zfs]
Feb 9 23:51:46 santos kernel: [40441.324880] [] thread_generic_wrapper+0x78/0x90 [spl]
Feb 9 23:51:46 santos kernel: [40441.324901] [] ? __thread_create+0x310/0x310 [spl]
Feb 9 23:51:46 santos kernel: [40441.324910] [] kthread+0x8c/0xa0
Feb 9 23:51:46 santos kernel: [40441.324919] [] kernel_thread_helper+0x4/0x10
Feb 9 23:51:46 santos kernel: [40441.324928] [] ? flush_kthread_worker+0xa0/0xa0
Feb 9 23:51:46 santos kernel: [40441.324936] [] ? gs_change+0x13/0x13
Feb 9 23:51:46 santos kernel: [40441.324978] INFO: task zfs:4376 blocked for more than 120 seconds.
Feb 9 23:51:46 santos kernel: [40441.325022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 23:51:46 santos kernel: [40441.325074] zfs D ffffffff81806240 0 4376 4375 0x00000000
Feb 9 23:51:46 santos kernel: [40441.325084] ffff88004f849c88 0000000000000086 0000000000000001 ffff8800799ada58
Feb 9 23:51:46 santos kernel: [40441.325094] ffff88004f849fd8 ffff88004f849fd8 ffff88004f849fd8 00000000000137c0
Feb 9 23:51:46 santos kernel: [40441.325103] ffff880079811700 ffff8800643c4500 ffff88004f849c98 ffff8800799ada90
Feb 9 23:51:46 santos kernel: [40441.325112] Call Trace:
Feb 9 23:51:46 santos kernel: [40441.325119] [] schedule+0x3f/0x60
Feb 9 23:51:46 santos kernel: [40441.325143] [] cv_wait_common+0xfd/0x1b0 [spl]
Feb 9 23:51:46 santos kernel: [40441.325152] [] ? add_wait_queue+0x60/0x60
Feb 9 23:51:46 santos kernel: [40441.325174] [] __cv_wait+0x15/0x20 [spl]
Feb 9 23:51:46 santos kernel: [40441.325252] [] txg_wait_synced+0xb3/0x190 [zfs]
Feb 9 23:51:46 santos kernel: [40441.325327] [] dsl_sync_task_group_wait+0x17d/0x2a0 [zfs]
Feb 9 23:51:46 santos kernel: [40441.325397] [] dmu_snapshots_destroy_nvl+0x128/0x1f0 [zfs]
Feb 9 23:51:46 santos kernel: [40441.325474] [] ? zfs_unmount_snap+0x92/0x130 [zfs]
Feb 9 23:51:46 santos kernel: [40441.325551] [] zfs_ioc_destroy_snaps_nvl+0xfa/0x130 [zfs]
Feb 9 23:51:46 santos kernel: [40441.325628] [] ? spa_open+0x13/0x20 [zfs]
Feb 9 23:51:46 santos kernel: [40441.325705] [] zfsdev_ioctl+0xdc/0x1b0 [zfs]
Feb 9 23:51:46 santos kernel: [40441.325715] [] do_vfs_ioctl+0x8a/0x340
Feb 9 23:51:46 santos kernel: [40441.325724] [] ? do_munmap+0x1f3/0x2f0
Feb 9 23:51:46 santos kernel: [40441.325732] [] sys_ioctl+0x91/0xa0
Feb 9 23:51:46 santos kernel: [40441.325740] [] system_call_fastpath+0x16/0x1b

@iamacarpet
Author

I've been doing some testing today, and it's this command:

root@santos:~# /usr/bin/python /tank/scripts/snapshot.py -k 30 tank/home daily
Namespace(dry_run=False, fs='tank/home', k=30, prefix='daily', r=False, remote='', verbose=False)

That's causing the issue. I ran it manually and hit the same lock-up: directory listings and the like still work, but actually trying to access a file's data stored on the zpool just hangs the accessing thread.
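If it helps next time it wedges, I'm planning to grab the blocked kernel stacks directly rather than waiting for the 120-second hung-task watchdog - something like this (assuming /proc/<pid>/stack is available on this kernel):

# kernel stack of the stuck zfs destroy and of the txg_sync thread
cat /proc/$(pgrep -xo zfs)/stack
cat /proc/$(pgrep -xo txg_sync)/stack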

I did, however, manage to get a dump of zpool events this time:

root@santos:~# zpool events
TIME CLASS
Feb 10 2013 12:40:19.470212000 resource.fs.zfs.statechange
Feb 10 2013 12:40:19.470216000 resource.fs.zfs.statechange
Feb 10 2013 12:40:19.470370000 resource.fs.zfs.statechange
Feb 10 2013 12:40:19.794208000 resource.fs.zfs.statechange
Feb 10 2013 12:40:19.794216000 resource.fs.zfs.statechange
Feb 10 2013 12:40:19.794283000 resource.fs.zfs.statechange

and zpool status:

root@santos:~# zpool status
pool: tank
state: ONLINE
scan: scrub repaired 0 in 23h46m with 0 errors on Thu Feb 7 10:49:02 2013
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  raidz1-0  ONLINE       0     0     0
    sdb     ONLINE       0     0     0
    sdc     ONLINE       0     0     0
    sdd     ONLINE       0     0     0

errors: No known data errors

@iamacarpet
Author

I've found out a bit more about what is causing it:

zfs destroy "tank/home@daily-2013-01-05--23-18-01"

That snapshot is 11GB in size because that much data was deleted after it was taken, and it has now disappeared from the snapshot list (we'll see if that's still the case after a reboot).
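For anyone else trying to spot which snapshot is the culprit, this is roughly how I found the 11GB one - sorting snapshots by how much space they hold (column output may differ slightly between versions):

zfs list -t snapshot -o name,used,refer,creation -s used -r tank/home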

@behlendorf
Contributor

@iamacarpet Strange. It's clear everything is blocked waiting for an I/O to complete. However, it's not at all clear why that I/O isn't completing.

A few additional comments:

Feb 10 2013 12:40:19.470212000 resource.fs.zfs.statechange

These are normal. I was hoping to see an I/O error/delay event.

zfs destroy "tank/home@daily-2013-01-05--23-18-01"

That's interesting. The fact that you can provoke this with a specific snapshot makes me suspect it's damaged in some way. It looks like the scrub claims everything is correct on disk as of a few days ago, so if there is damage here it has a good checksum. By any chance have you enabled the 'async destroy' feature flag, or is this still a v28 pool?
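If you're not sure, something along these lines should show it (exact output varies a bit by version):

# a v28 pool reports "28" here; a feature-flags pool reports "-"
zpool get version tank

# disabled / enabled / active for the async destroy feature
zpool get feature@async_destroy tank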

Now that you have a simple reproducer, it would be useful to see if you can cause this issue with an Illumos kernel; that would narrow down whether this is a ZoL-specific issue or something more generic. I'd suggest booting a SmartOS USB key, which should use a recent ZFS version, then importing the pool and trying to destroy the snapshot. Straightforward directions for Linux can be found here:

http://wiki.smartos.org/display/DOC/Getting+Started+with+SmartOS

@iamacarpet
Author

I recently upgraded to the newer pool format as part of the new RC release, and I didn't realise it had automatically enabled the feature flags.

Yes, I have the following:

NAME  PROPERTY                VALUE                SOURCE
tank  size                    8.12T                -
tank  capacity                26%                  -
tank  altroot                 -                    default
tank  health                  ONLINE               -
tank  guid                    9137038174884864491  default
tank  version                 -                    default
tank  bootfs                  -                    default
tank  delegation              on                   default
tank  autoreplace             off                  default
tank  cachefile               -                    default
tank  failmode                wait                 default
tank  listsnapshots           off                  default
tank  autoexpand              off                  default
tank  dedupditto              0                    default
tank  dedupratio              1.02x                -
tank  free                    5.98T                -
tank  allocated               2.15T                -
tank  readonly                off                  -
tank  ashift                  0                    default
tank  comment                 -                    default
tank  expandsize              0                    -
tank  freeing                 0                    default
tank  feature@async_destroy   enabled              local
tank  feature@empty_bpobj     active               local
tank  feature@lz4_compress    enabled              local

I'd rather not have to boot SmartOS, as it's a headless box hidden away in a cupboard as a media server. Do you believe disabling async_destroy may solve the problem, or is SmartOS the first choice? From a debugging perspective, I don't have the resources to make a copy of the disks and try it both on Linux with async_destroy disabled and on SmartOS with it enabled if the first one happens to solve it - which would be more useful to you?

Thanks.

@behlendorf
Contributor

@iamacarpet I was mainly hoping to determine whether it's a ZoL-specific issue or something which appears upstream. SmartOS on a USB key was just an easy way to determine that. If it does appear in the upstream tree I'm sure they would want to know about it so it can be fixed in all the implementations. So booting SmartOS and at least trying the destroy would be best if you can manage it.

I'd suggest disabling the async_destroy feature, but once enabled the current code doesn't permit it to be disabled. That's actually somewhat annoying, because since the feature isn't currently active it would be safe to disable. I should open an issue for that and see what @csiden thinks about allowing enabled (but not active) features to be disabled.

@mailinglists35

@iamacarpet If you can't boot the box with SmartOS, maybe you can export the whole drives with iscsitarget and access them with an iSCSI initiator from another machine booted with SmartOS? (Run "zpool export" before that, of course.)
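Something along these lines in /etc/iet/ietd.conf should do it, one target per member disk (a rough sketch from memory - the IQNs are made up, and you may also need to whitelist the initiator):

Target iqn.2013-02.local.santos:tank-sdb
    Lun 0 Path=/dev/sdb,Type=blockio
Target iqn.2013-02.local.santos:tank-sdc
    Lun 0 Path=/dev/sdc,Type=blockio
Target iqn.2013-02.local.santos:tank-sdd
    Lun 0 Path=/dev/sdd,Type=blockio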

@iamacarpet
Author

Thanks for the comments. I finally had to perform some maintenance on the box today, so I thought that while I had it out I'd have a go with SmartOS and see what happened.

I rebooted into SmartOS and, once logged in, the "tank" pool was already imported and zpool status showed it correctly. I listed the snapshots to find the one that was causing the problems on ZoL and tried to destroy it; after about 2-3 minutes, without any warnings, it came back as destroyed successfully.

Once booted back into Linux with ZoL, the pool wouldn't mount, as SmartOS had changed the labels to Solaris-style device IDs (I'm not sure how it picked those up, but ZoL couldn't find the drives, so it must have been reading the labels from there). After a bit of digging around, I managed to get it to work with:

zpool export tank
zpool import -d /dev/disk/by-id/ tank

(which then gave an "in use by another system" message, so to get it to actually work:)

zpool import -f -d /dev/disk/by-id/ tank

(which took about half an hour, with lots of hung zfs module messages, after which point the filesystem was accessible).

So it looks like it's not an upstream issue. Is there anything else you think we can do to pin this down, now that the offending snapshot has gone?

@behlendorf
Contributor

@iamacarpet Glad to hear you got it sorted out. It sure does sound like there's an issue lingering here for Linux. Let's leave the issue open for now in case someone else encounters something similar. However, without a test case I'm not sure we can make any progress on this.

@ryao
Contributor

ryao commented Mar 29, 2013

@iamacarpet Which version of ZFSOnLinux were you using?

@iamacarpet
Author

@ryao I'm not sure exactly, as I used the "stable" Ubuntu PPA and didn't really keep track of version numbers. I know I updated it whenever I saw a mailing list post saying a new release was out, and I'm pretty sure it was the version where the BSD-specific extensions were added, so I think that would make it spl/zfs-0.6.0-rc14.

Now that I think about it, I'm pretty sure the problem snapshot was created with rc13, before I enabled the empty_bpobj option - could that be it?

Thanks.

@AndyA

AndyA commented Nov 16, 2013

I think I have the same problem - a particular snapshot that I can't destroy on Debian.

# dpkg -l | grep '\<spl\>'
ii  spl                                   0.6.2-2                            amd64        Solaris Porting Layer user-space utilities for Linux
ii  spl-dkms                              0.6.2-2                            all          Solaris Porting Layer kernel modules for Linux
# dpkg -l | grep 'zfs'
ii  debian-zfs                            7~wheezy                           amd64        Native ZFS filesystem metapackage for Debian.
ii  dkms                                  2.2.0.3-1.2+zfs6                   all          Dynamic Kernel Module Support Framework
ii  libzfs1                               0.6.2-2                            amd64        Native ZFS filesystem library for Linux
rc  zfs                                   0.6.1-2                            amd64        Commands to control the kernel modules and libraries
ii  zfs-dkms                              0.6.2-2                            all          Native ZFS filesystem kernel modules for Linux
ii  zfsonlinux                            2~wheezy                           all          archive.zfsonlinux.org trust package
ii  zfsutils                              0.6.2-2                            amd64        command-line tools to manage ZFS filesystems
# uname -a
Linux igloo 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux

I'm going to boot into SmartOS and attempt to delete the troublesome snapshot - but before I do, are there any useful forensics I can gather that might help better characterise the issue?
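For instance, while the destroy is wedged I was thinking of grabbing something like the following (assuming sysrq is enabled; suggestions welcome if there's anything more useful):

# dump every blocked (D-state) task to the kernel log
echo w > /proc/sysrq-trigger

# verbose ZFS event log and per-vdev I/O activity
zpool events -v
zpool iostat -v 5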

@AndyA

AndyA commented Nov 17, 2013

Correction - after a reboot I can now delete the snapshot. I guess I'm not seeing the same issue. Apologies for the noise.

@behlendorf
Contributor

Is this still an open issue for anybody?

@ryao
Contributor

ryao commented Jan 17, 2014

I just hit something similar to this. Sadly, it took down my main pool. I had created a 30GB sparse zvol with deduplication on, started writing a 30GB image that I extracted from a SoftNAS vmdk, killed it midway because I had set volblocksize=4K on an ashift=13 pool, destroyed the pool, and then had a lockup. All subsequent attempts to import the pool start with 'freeing' at roughly 12.6GB; it then drops to 8.35GB and hangs. I don't have more information on this, but I would like to reproduce this in a controlled environment after I repair the pool.
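A controlled reproduction would presumably look something like this on a scratch pool (all names here are placeholders, and whether a dataset destroy alone is enough to trigger it is exactly what needs testing):

# scratch pool with the same ashift/volblocksize mismatch
zpool create -o ashift=13 testpool /dev/sdx
zfs create -s -V 30G -o volblocksize=4K -o dedup=on testpool/vol

# write an image into the zvol, kill it midway, then destroy
dd if=image.raw of=/dev/zvol/testpool/vol bs=1M &
sleep 60; kill $!
zfs destroy testpool/vol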

@ryao
Contributor

ryao commented Jan 17, 2014

Whatever went wrong is fixed in OmniOS. I was able to correct the problem by doing an import with OmniOS r151008j. However, OpenIndiana 151a8 suffered from the same problem.

@behlendorf
Contributor

I'm closing this issue until we have something concrete to go on.

@behlendorf behlendorf removed this from the 0.6.7 milestone Oct 7, 2014