yay more zfs deadlock #1415
I must admit I feel a bit the same. I do not use rsync a lot on my zfs system, but I use tgtd to export zvols to run KVM machines, and I do have a lot of performance issues, since RC14 at least (maybe it's a misconfiguration or a hardware problem, I don't know yet, but it's really painful).

David
@spacelama I suspect that you are affected by the bug in issue #1342. It is a regression introduced in Linux 3.6. The current solutions are to patch the kernel or use an older kernel. If you decide to patch the kernel, there is a patch available from the Gentoo bug tracker: https://bugs.gentoo.org/show_bug.cgi?id=462066

People on recent kernels have reported plenty of issues that appear to be unrelated to ZFS, so this patch alone might not be sufficient to have a good experience. My recommendation is to run only LTS kernels, such as Linux 2.6.32.y, Linux 3.2.y or Linux 3.4.y, on production systems. This applies regardless of whether or not a system is using ZFS.
Nope, this or a very similar bug affects Debian wheezy 3.2 kernels. I will try to get a stacktrace next time it happens while I'm home and before the hardware watchdog kicks in.
Commit 294f680, which was recently merged into master, may address this deadlock.
I see shrink_slab in the backtraces. #1627 could help with that.
@spacelama Any update? Have recent changes, such as those mentioned above, improved things?
Closing this issue; the patches referenced above should have improved things, and I haven't heard otherwise.
I'm getting regular lockups trying to rsync to a filesystem with the current nightly beta.
Ok. I've reopened the issue.
@jwiegley -- We experienced lockups during rsyncs that we eventually avoided by dialing back zfs_arc_max (originally dialed up because the box's only usage is storage; see #1365 (comment)). Fragmentation in the cache leads to higher total memory usage. Are you adjusting any of the ZFS tunables? What does your memory situation look like pre-lockup?
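In case it helps anyone trying the same adjustment, here is a minimal sketch of the two usual ways to cap the ARC, assuming a stock ZoL build that exposes the module parameter under /sys; the 4 GiB value is purely illustrative:

echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max            # runtime change (a lowered value may only take effect as the ARC next adjusts itself)
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf   # persistent across reboots, picked up when the zfs module loads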
Let me describe my scenario in a bit more detail: I've got a home-built file server here, with 16GB of ECC RAM and 12 drives in a RAID-10 array. The OS is Ubuntu 12.04.3, installed on a USB flash drive. I use git-annex heavily for maintaining my archives, and one thing that it does is create a LOT of files (it's the biggest inode consumer I've ever seen). These all get populated within the

I have seen it happen, maybe 4-5 times in the past, that trying to run

I worked around this by destroying the filesystem (which had no snapshots, btw) and recreating it from a backup copy. The reason I don't have the tarpit filesystem anymore, but I wonder if this could be reproduced on your side by just artificially manufacturing maybe a million files in, say, 2000 directories at maximum depth 4, and then trying to delete them all at once.

Lastly, I have not modified any ZFS variables in the kernel; I don't have an L2ARC or a ZIL device. Pre-lockup I don't know what the memory situation was.
Regarding the soft lockup problem: I noticed frequent occurrences with zfs on linux 0.6.2 on
On Sat, 16 Nov 2013, wphilips wrote:
As the original reporter of this copy of this bug, I note that I've had no

What I still get is some slowdowns from when linux's cache is reduced down

module options current:

Tim Connors
This has happened to me again. While trying to rsync from my Mac Pro to my ZOL mount, it started giving me a bunch of errors like:
Then the filesystem completely locked up in the kernel; nothing can be done with it. After a hard reset, any attempt to mount this filesystem results in a deadlock, necessitating another reset. Is there a way to "clear" this filesystem so that I can mount it again? I don't think the rsync had even made any changes yet; it was just trying to match permissions between files at this point.
I've been following this thread since it was originally reported and have been trying to make some sense of it. At this point, it seems likely that the problems being reported recently are not the same as @spacelama's original issue. His original problem report was seen while using a 0.6.0.97-0ubuntu (very old) build which appears to me to correspond to 0.6.0-rc14 (a997dfb) in the main repository. At this point, with hopefully most of the major 0.6.2 stuff in the tree, it's not clear to me how relevant his original report is, nor is it clear whether he's still seeing the problem with 0.6.1 or with a recent daily build.

This issue was closed but was then re-opened following @jwiegley's report, but that report didn't seem to include enough details to relate it to the original issue. It's also not clear that @eborisch's or @wphilips' problems are the same thing (as was originally reported in this issue).

@spacelama Are you still seeing your original problem with recent code? If not, then I would suggest that this issue be closed.

As to the others who reported various lockups, it seems a common thread is massive metadata-walking in the form of an rsync receive to ZFS. There's good evidence that under some circumstances, the metadata can overwhelm the ARC and can far exceed the arc_meta_limit. I'd urge anyone who is having lockups or other problems that are triggered when performing an rsync receive to include the contents of their arcstats kstat file, their kernel version and their ZoL version in a new trouble report. I've personally seen this problem but it seems to be difficult to reproduce. It may be dependent on kernel version or on other factors.

I'd also suggest that anyone having a related problem try recent ZoL code and, if they're having a problem with their metadata size, refer to #1871 or one of the other more relevant issues. If the lockups being reported by others are not metadata-related, and are on a substantially newer version than 0.6.0-rc14 (I would hope so), I'd suggest opening a new issue.
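For anyone putting together such a report, a minimal sketch of gathering the three pieces of information being asked for (the kstat path is the standard ZoL location; modinfo is just one way to get the module version):

cat /proc/spl/kstat/zfs/arcstats    # ARC statistics, including arc_meta_used and arc_meta_limit
uname -r                            # kernel version
modinfo zfs | grep -i '^version'    # ZoL module version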
I did #1901. I'll be sure to grab kstat next time. I seem to be able to reproduce a hang (but it takes a few hours of letting it run). I'll try again tomorrow.
And yes, this is an rsync destination. The lock-ups I'm currently getting are during scrubs or zfs sends. Failures during rsync haven't happened since I stopped bumping the zfs_meta_max tunable.
Last night we encountered, I think, the same problem. Running Ubuntu 12.04.3 on a USB stick with kernel 3.5.0-42-generic and ZFS 0.6.2-1 (Ubuntu launchpad ZOL repo) on a dual quad-core Intel box, 12G mem and a zpool with two 6-disk raidz1's (total ~20T). The machine is used for backups with heavy rsync and samba load during the night. The smb traffic is high io (VMWare VeeAM backup target).

Almost all mem is in use, not by processes; the cache is exhausted: rsyslog was killed by the oom-killer during the backup window.

Following is the arcstats from kstat:

4 1 0x01 84 4032 13734614807 878327691108390

The system is up for 10 days. During the day since startup a lot of zfs receives were done via netcat to migrate data from another (older) zfs box to this one. I hope this helps you guys find the source of the problem. I will reboot the box in a couple of hours before the upcoming backup window.
Any suggestion on how to unbork a filesystem that will no longer mount after experiencing this problem? I'd like to get some of the data back out of it.
@jwiegley How long have you left it trying to mount? Especially if your removed files had xattrs, and your zfs isn't set to xattr=sa, you could find you're simply not giving it enough time (due to #457) to work through its 'deleted' queue (which is processed on mount: zfs_domount->zfs_sb_setup->zfs_unlinked_drain). E.g. I've had systems in this state for 10h+ (!!!) before finally mounting.
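To check whether the xattr=sa suggestion applies to your datasets, a quick sketch (tank/fs is a placeholder for the affected file system):

zfs get xattr tank/fs     # 'on' means directory-based xattrs, the slow path described above
zfs set xattr=sa tank/fs  # note: only affects xattrs written after the change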
@jwiegley What happens when you try to mount the file system? Does the pool import? Does the problem only occur when trying to mount the file system? You certainly could be facing a large delete queue. Actually, these days, the files-with-xattr-doesn't-free-space problem is likely #1548.

You can confirm whether your file systems have a large delete queue with zdb prior to importing. The delete queue is typically object 3 but you can figure out which object it is in a particular file system with
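A sketch of the zdb inspection being described; tank/fs is a placeholder and object 3 is only the typical location of the delete queue:

zdb -dddd tank/fs 1       # dump the master node; its DELETE_QUEUE entry names the actual object
zdb -dddd tank/fs 3       # dump the (typical) delete queue object; a huge ZAP here means a long drain at mount time
zdb -e -dddd tank/fs 3    # same, but for a pool that is not currently imported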
The pool seems to import fine, and if I
OK. A huge delete queue does sound like it may be your problem if it imports OK but sits for a very long time when trying to mount it. Again, see #1548 for details. Filesystems in which most (all) files have xattrs can be deadly unless you're using xattr=sa.
@smitb Quick summary of your ARC: Your arc_max is about 6GiB but its size is only about 3GiB. The metadata, however, is taking up about 1.5GiB of it and that's also the metadata limit. You might want to increase zfs_arc_meta_limit.

I don't think either of you, @jwiegley or @smitb, are having problems that are related to the originally reported issue #1415.
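For reference, a minimal sketch of bumping the limit at runtime, assuming your build exposes the parameter as writable under /sys; 3 GiB is only an illustrative value:

echo 3221225472 > /sys/module/zfs/parameters/zfs_arc_meta_limit
grep arc_meta /proc/spl/kstat/zfs/arcstats    # watch arc_meta_used against the new arc_meta_limit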
@dweeezil There is more! :-) We also experience lockups, some short (a few secs), some longer that get logged to syslog. It started to happen when we were at ~80% of a migration to this box (from another box with zfs send/receive). The first event happened on November 22nd. No zfs send/receives were running; a couple of rsyncs and sftp transfers were:

Nov 22 02:07:42 [] kernel: [488337.747821] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

We've rebooted the box yesterday afternoon. No hung messages in syslog last night, but the box is experiencing short hangs of 3 - 8 seconds. During these short periods the machine is unresponsive.

zpool status:
errors: No known data errors

zpool list:

We use Seagate 2T enterprise disks connected to a 3ware 9750 controller. The disks are configured as 12 separate units with a single disk each. This configuration has the benefit that the on-card BBU-backed RAID cache is still used.

I will tune zfs_arc_meta_limit and will reboot the box in about 4 hours. (But this won't be the culprit, as you said.) If I can assist by supplying more information, please ask.

Regards, Bas
@dweeezil Here is the zdb dump of my delete queue on the affected filesystem:
@jwiegley Your delete queue is empty. My next suggestion would be to try mounting the file systems individually rather than using
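A sketch of the one-at-a-time approach, with tank as a placeholder pool name:

zpool import -N tank           # import the pool without mounting any of its file systems
zfs list -H -o name -r tank    # list the datasets
zfs mount tank/some/dataset    # mount them individually, noting which one stalls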
@dweeezil I've tried both of the

Also, this issue of filesystems becoming unmountable is far too easy to trigger. I just had it happen again by running "chown btsync .*" in a nearly unpopulated filesystem. Now I have two filesystems that I cannot access anymore. Is there a way to "reset" the filesystem so that I can mount it and recover the data?
@behlendorf I believe you're correct, no objection on my part.
No objection.
Er, there's been a heck of a lot of scope creep since my original submission (why? Someone made a huge mental leap up there somewhere). I can assure you this bug as originally reported on my systems has absolutely nothing to do with mounting, so issue 1415 shouldn't be closed, since it still exists and has nothing to do with issue 1911.

I have actively tried not to trigger my bug for quite a few months (I have deliberately changed my workload to deal with more data and less metadata), but I know it still exists, because 0.6.2 doesn't have any of the metadata fixes that have been talked about on the mailing list. Once 0.6.3 or whatever with the fixes comes out (I couldn't convince zfs/spl HEAD to build dkms deb images, otherwise I'd test sooner), I will try to reproduce the original bug again.
I forgot to mention, I tried 0.6.2 with my old workload, and yes, it was still crashing every week or so. It is my changed workload that renders it more stable. I'll try with the old workload when I have the metadata fixes that are apparently in HEAD.
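In case it helps with getting HEAD built, a sketch of the non-dkms route that I believe works for building Debian packages from a git checkout (the make deb target converts RPMs with alien, so alien and the kernel headers need to be installed; treat this as an untested outline rather than the supported procedure):

git clone https://github.com/zfsonlinux/spl.git && cd spl
./autogen.sh && ./configure && make deb && sudo dpkg -i *.deb && cd ..
git clone https://github.com/zfsonlinux/zfs.git && cd zfs
./autogen.sh && ./configure && make deb && sudo dpkg -i *.deb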
@spacelama sorry about your thread getting hijacked. Let me circle back to the original issue you reported with rsync. On Friday I merged some significant improvements for the ARC which are designed to improve metadata heavy workloads like yours. Now they don't directly address the lockup you described but they may make it far less likely. We've seen a dramatic reduction in ARC misses for workloads similar to yours. If it's not too inconvenient it would be very useful to see what impact they have on your systems. It would be ideal if you could gather some timing data from before and after you update so we can measure the impact. Another easy metric to grab would be from the
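A sketch of the kind of before/after measurement being asked for, assuming the workload is an rsync pass and that the arcstats hit/miss counters are the metric of interest (the paths are placeholders):

grep -E '^(hits|misses|arc_meta_used) ' /proc/spl/kstat/zfs/arcstats > arcstats.before
time rsync -aH /source/ /tank/backup/    # your usual backup job
grep -E '^(hits|misses|arc_meta_used) ' /proc/spl/kstat/zfs/arcstats > arcstats.after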
I'll have to report back after my 2 weeks holiday - right now, a

Tim Connors
Uptime 25 days, and I have been using the old backup scheme for that entire time, which up until 0.6.2 always caused deadlocks after a week or two. I don't interact much with the server on the commandline, but I've never noticed it become debilitatingly slow since the upgrade to HEAD. It never seems to have overly used RAM, and linux's cache is allowed to be larger than I've seen it in ages (150MB cached, which is probably pretty much the entire working set of my ext4 root filesystem). Looking good!

arcstats, if it helps:
Er, and since git versions aren't useful by themselves when merging private branches:

zfs> git log --pretty=oneline | head
Thanks for the update. That's encouraging.
44 days uptime, and back to a load of 30 and minute-long stalls for all local and NFS serving traffic. The kernel's cache is dropping down to 8MB(!) usage out of 4GB in the machine, which is what stalls everything else on the machine - it's got to serve / and my Maildirs out of that tiny cache. After a round of drop_caches and leaving it to continue with backups for a few minutes, this is arcstats:
780MB free in the machine; I was able to do a dummy malloc/zero of 512MB, dropping free memory down to 390MB, without stressing the machine in the process, so I don't know why the kernel would be deciding it can't cache anything. The machine hasn't locked up yet, but I'll probably reboot within a few days.
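For reference, the drop_caches round mentioned above is presumably the usual procedure; a sketch, run as root:

sync                                 # flush dirty pages first
echo 3 > /proc/sys/vm/drop_caches    # 1 = page cache, 2 = dentries/inodes, 3 = both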
Run
On Sun, 13 Apr 2014, Richard Yao wrote:
Machine is currently back to coping with the load - the load of 6 is

sync hasn't fixed it in the past (was doing it with the drop_cache

But a few days ago when it started having troubles and continuing

Tim Connors
These changes have been merged and we have some additional improvements planned. I'm closing this issue since it sounds like things have improved. We can definitely open a new issue if there is still an issue.
(updated with kernel and zfs version info - dangers of composing something late at night)
About once a week for the past year, on average, ZFS locks up on me - usually rsync related (only ever a maximum of 2 running, talking potentially from a single disk to a 3 disk raidz), on a filesystem that has 15M inodes (mostly hardlinks with perhaps an average link count of 16, most not touched in the working set), and always running something near git HEAD. I finally got a trace because my home directory was still writable, dmesg was still working (I can't log in as root though, since /var is on ZFS and su/sudo/ssh all need to access stuff in /var) and a watchdog had not reset the machine yet.
Yes, I've had all the symptoms everyone else has mentioned over the years - arc climbing well in excess of arc_max etc etc.
I'm currently using the following module options on a box with 4G of RAM and 4 Atom cores:
options zfs zfs_arc_max=1310720000 zfs_arc_meta_limit=1048576000
but of course I've had lockups on any sensible variation (and default) of those params.
Currently running a Debian kernel and zfs-dkms 0.6.0.97-0ubuntu1~oneiric1 from the zfs-native-stable Ubuntu PPA:
but of course these lockups have happened on every kernel I've ever tried since 2.6 days and from zfs-rc8 or so.
[300120.936061] INFO: task kswapd0:39 blocked for more than 120 seconds.
[300120.936070] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300120.936076] kswapd0 D ffff88013fd93b80 0 39 2 0x00000000
[300120.936085] ffff88013afd71c0 0000000000000046 0000000000013b80 ffff88013aea41c0
[300120.936096] 0000000000013b80 ffff880137a9ffd8 ffff880137a9ffd8 ffff88013afd71c0
[300120.936105] ffff88013afd71c0 ffff8801248c8fb8 ffff88013afd71c0 ffff8801248c8fbc
[300120.936114] Call Trace:
[300120.936133] [] ? __mutex_lock_common.isra.6+0x149/0x165
[300120.936142] [] ? mutex_lock+0x1a/0x2d
[300120.936150] [] ? mutex_lock+0xd/0x2d
[300120.936256] [] ? zfs_zinactive+0x4a/0xc6 [zfs]
[300120.936341] [] ? zfs_inactive+0x103/0x13c [zfs]
[300120.936352] [] ? evict+0xa3/0x153
[300120.936359] [] ? dispose_list+0x27/0x31
[300120.936367] [] ? prune_icache_sb+0x2aa/0x2c5
[300120.936376] [] ? prune_super+0xc3/0x12d
[300120.936385] [] ? shrink_slab+0x19d/0x289
[300120.936393] [] ? mem_cgroup_iter+0x24/0x181
[300120.936402] [] ? balance_pgdat+0x2e1/0x552
[300120.936410] [] ? _raw_spin_lock_irqsave+0x14/0x35
[300120.936418] [] ? kswapd+0x2b4/0x325
[300120.936428] [] ? abort_exclusive_wait+0x79/0x79
[300120.936436] [] ? balance_pgdat+0x552/0x552
[300120.936443] [] ? balance_pgdat+0x552/0x552
[300120.936451] [] ? kthread+0x81/0x89
[300120.936459] [] ? __kthread_parkme+0x5c/0x5c
[300120.936468] [] ? ret_from_fork+0x7c/0xb0
[300120.936476] [] ? __kthread_parkme+0x5c/0x5c
[300120.936498] INFO: task arc_adapt:1128 blocked for more than 120 seconds.
[300120.936502] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300120.936506] arc_adapt D ffff88013fd93b80 0 1128 2 0x00000000
[300120.936513] ffff880138547800 0000000000000046 ffff880139c950c8 ffff88013aea41c0
[300120.936522] 0000000000013b80 ffff8801366c7fd8 ffff8801366c7fd8 ffff880138547800
[300120.936531] ffff88013aea41c0 ffffffffa0564768 ffff880138547800 ffffffffa056476c
[300120.936540] Call Trace:
[300120.936554] [] ? __mutex_lock_common.isra.6+0x149/0x165
[300120.936578] [] ? __thread_create+0x2d2/0x2d2 [spl]
[300120.936586] [] ? mutex_lock+0x1a/0x2d
[300120.936638] [] ? arc_evict+0x6c/0x49b [zfs]
[300120.936647] [] ? _raw_spin_lock_irqsave+0x14/0x35
[300120.936672] [] ? spl_kmem_cache_free+0x298/0x30a [spl]
[300120.936691] [] ? __thread_create+0x2d2/0x2d2 [spl]
[300120.936740] [] ? arc_adjust+0x66/0x172 [zfs]
[300120.936791] [] ? arc_adjust_meta+0x176/0x176 [zfs]
[300120.936842] [] ? arc_adapt_thread+0x57/0xb6 [zfs]
[300120.936861] [] ? thread_generic_wrapper+0x6a/0x75 [spl]
[300120.936869] [] ? kthread+0x81/0x89
[300120.936877] [] ? __kthread_parkme+0x5c/0x5c
[300120.936885] [] ? ret_from_fork+0x7c/0xb0
[300120.936893] [] ? __kthread_parkme+0x5c/0x5c
[300120.936971] INFO: task txg_quiesce:1604 blocked for more than 120 seconds.
[300120.936975] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300120.936978] txg_quiesce D ffff88013fc13b80 0 1604 2 0x00000000
[300120.936986] ffff88012aa1a840 0000000000000046 ffffffffa0541c00 ffff880137025840
[300120.936995] 0000000000013b80 ffff880124fcbfd8 ffff880124fcbfd8 ffff88012aa1a840
[300120.937004] 0000000000000202 ffffc900110fe068 ffffc900110fe000 0000000000000000
[300120.937012] Call Trace:
[300120.937035] [] ? cv_wait_common+0xdc/0x171 [spl]
[300120.937044] [] ? should_resched+0x5/0x23
[300120.937053] [] ? _cond_resched+0x6/0x1b
[300120.937061] [] ? abort_exclusive_wait+0x79/0x79
[300120.937068] [] ? should_resched+0x5/0x23
[300120.937144] [] ? txg_quiesce_thread+0x201/0x2cc [zfs]
[300120.937221] [] ? txg_init+0x180/0x180 [zfs]
[300120.937240] [] ? __thread_create+0x2d2/0x2d2 [spl]
[300120.937258] [] ? thread_generic_wrapper+0x6a/0x75 [spl]
[300120.937266] [] ? kthread+0x81/0x89
[300120.937274] [] ? __kthread_parkme+0x5c/0x5c
[300120.937282] [] ? ret_from_fork+0x7c/0xb0
[300120.937290] [] ? __kthread_parkme+0x5c/0x5c
[300120.937334] INFO: task ntpd:4382 blocked for more than 120 seconds.
[300120.937338] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300120.937342] ntpd D ffff88013fd93b80 0 4382 1 0x00000004
[300120.937349] ffff88009c963140 0000000000000082 0000000000000000 ffff88013aea41c0
[300120.937358] 0000000000013b80 ffff88006e2b7fd8 ffff88006e2b7fd8 ffff88009c963140
[300120.937367] 0000000000000001 ffffffffa0564768 ffff88009c963140 ffffffffa056476c
[300120.937375] Call Trace:
[300120.937390] [] ? __mutex_lock_common.isra.6+0x149/0x165
[300120.937403] [] ? mutex_lock+0x1a/0x2d
[300120.937453] [] ? add_reference.isra.10+0x62/0xd1 [zfs]
[300120.937506] [] ? arc_buf_add_ref+0x69/0xce [zfs]
[300120.937560] [] ? __dbuf_hold_impl+0x27f/0x39b [zfs]
[300120.937614] [] ? dbuf_hold_impl+0x74/0x98 [zfs]
[300120.937668] [] ? dbuf_hold_level+0x19/0x2b [zfs]
[300120.937727] [] ? dmu_tx_check_ioerr+0x31/0x93 [zfs]
[300120.937779] [] ? RW_WRITE_HELD+0x4c/0x7e [zfs]
[300120.937839] [] ? dmu_tx_count_write+0x88/0x540 [zfs]
[300120.937891] [] ? __dbuf_hold_impl+0x371/0x39b [zfs]
[300120.937900] [] ? __kmalloc+0xde/0xf0
[300120.937908] [] ? should_resched+0x5/0x23
[300120.937928] [] ? kmem_alloc_debug+0x14b/0x2d7 [spl]
[300120.937989] [] ? dnode_hold_impl+0x362/0x37c [zfs]
[300120.938050] [] ? dmu_tx_hold_write+0x2c/0x3d [zfs]
[300120.938133] [] ? zfs_write+0x4c7/0x9cf [zfs]
[300120.938141] [] ? should_resched+0x5/0x23
[300120.938149] [] ? _cond_resched+0x6/0x1b
[300120.938158] [] ? page_add_file_rmap+0x60/0x79
[300120.938166] [] ? handle_pte_fault+0x2c6/0x7dd
[300120.938253] [] ? zpl_write_common+0x52/0x6a [zfs]
[300120.938338] [] ? zpl_write+0x5a/0x7e [zfs]
[300120.938347] [] ? vfs_write+0x96/0xf8
[300120.938354] [] ? sys_write+0x51/0x80
[300120.938362] [] ? system_call_fastpath+0x16/0x1b
[300120.938371] INFO: task squid:4509 blocked for more than 120 seconds.
[300120.938375] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300120.938379] squid D ffff88013fc93b80 0 4509 4502 0x00000000
[300120.938387] ffff880139c79840 0000000000000082 0000000000000000 ffff88013aea2180
[300120.938395] 0000000000013b80 ffff880068a43fd8 ffff880068a43fd8 ffff880139c79840
[300120.938404] 0000000000000400 ffffffffa0564768 ffff880139c79840 ffffffffa056476c
[300120.938413] Call Trace:
[300120.938427] [] ? __mutex_lock_common.isra.6+0x149/0x165
[300120.938440] [] ? mutex_lock+0x1a/0x2d
[300120.938489] [] ? arc_change_state.isra.7+0x121/0x21c [zfs]
[300120.938541] [] ? arc_access.isra.8+0xe7/0x17e [zfs]
[300120.938593] [] ? arc_read_nolock+0x374/0x695 [zfs]
[300120.938601] [] ? __cache_free.isra.39+0x18e/0x19d
[300120.938609] [] ? should_resched+0x5/0x23
[300120.938616] [] ? _cond_resched+0x6/0x1b
[300120.938669] [] ? arc_read+0xbe/0x122 [zfs]
[300120.938723] [] ? dbuf_prefetch+0x20c/0x235 [zfs]
[300120.938784] [] ? dmu_zfetch_dofetch.isra.4+0xcf/0x104 [zfs]
[300120.938792] [] ? should_resched+0x5/0x23
[300120.938852] [] ? dmu_zfetch+0x42b/0x937 [zfs]
[300120.938861] [] ? should_resched+0x5/0x23
[300120.938913] [] ? dbuf_read+0xd6/0x4b6 [zfs]
[300120.938966] [] ? __dbuf_hold_impl+0x371/0x39b [zfs]
[300120.938973] [] ? _raw_spin_lock_irqsave+0x14/0x35
[300120.938984] [] ? _raw_spin_lock_irqsave+0x14/0x35
[300120.939044] [] ? dnode_hold_impl+0x138/0x37c [zfs]
[300120.939094] [] ? remove_reference.isra.9+0x7b/0x87 [zfs]
[300120.939102] [] ? should_resched+0x5/0x23
[300120.939110] [] ? _cond_resched+0x6/0x1b
[300120.939166] [] ? dmu_bonus_hold+0x1c/0x177 [zfs]
[300120.939174] [] ? mutex_lock+0xd/0x2d
[300120.939258] [] ? zfs_zget+0x5b/0x191 [zfs]
[300120.939337] [] ? zfs_dirent_lock+0x3f4/0x425 [zfs]
[300120.939417] [] ? zfs_zaccess_aces_check+0x1ca/0x1f5 [zfs]
[300120.939496] [] ? zfs_dirlook+0x1bc/0x215 [zfs]
[300120.939575] [] ? zfs_zaccess+0x1cf/0x255 [zfs]
[300120.939583] [] ? should_resched+0x5/0x23
[300120.939666] [] ? zfs_lookup+0x263/0x2b0 [zfs]
[300120.939751] [] ? zpl_lookup+0x47/0x95 [zfs]
[300120.939760] [] ? d_alloc+0x61/0x69
[300120.939769] [] ? lookup_real+0x24/0x3f
[300120.939776] [] ? do_last+0x505/0xa0f
[300120.939783] [] ? __inode_permission+0x57/0x95
[300120.939791] [] ? path_openat+0xc1/0x33b
[300120.939799] [] ? do_filp_open+0x2a/0x6e
[300120.939807] [] ? __alloc_fd+0xdc/0xed
[300120.939816] [] ? do_sys_open+0x5c/0xe0
[300120.939825] [] ? system_call_fastpath+0x16/0x1b
[300120.939913] INFO: task BackupPC_dump:31833 blocked for more than 120 seconds.
[300120.939917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300120.939920] BackupPC_dump D ffff88013fd93b80 0 31833 4692 0x00000000
[300120.939929] ffff880056161140 0000000000000086 ffff88008ca23c58 ffff88013765f0c0
[300120.939937] 0000000000013b80 ffff88008ca23fd8 ffff88008ca23fd8 ffff880056161140
[300120.939946] ffffc90049402640 ffff88013fd93b80 ffff880056161140 0000000000000001
[300120.939955] Call Trace:
[300120.939964] [] ? io_schedule+0x56/0x6d
[300120.939987] [] ? cv_wait_common+0xd5/0x171 [spl]
[300120.939995] [] ? abort_exclusive_wait+0x79/0x79
[300120.940053] [] ? should_resched+0x5/0x23
[300120.940176] [] ? zio_wait+0x107/0x139 [zfs]
[300120.940260] [] ? dmu_buf_hold_array_by_dnode+0x242/0x329 [zfs]
[300120.940344] [] ? dmu_buf_hold_array+0x59/0x7b [zfs]
[300120.940429] [] ? dmu_read_uio+0x3d/0xbd [zfs]
[300120.940543] [] ? zfs_read+0x344/0x3ec [zfs]
[300120.940660] [] ? zpl_read_common+0x52/0x6a [zfs]
[300120.940776] [] ? zpl_read+0x5a/0x7e [zfs]
[300120.940802] [] ? vfs_read+0x93/0xf5
[300120.940815] [] ? sys_read+0x51/0x80
[300120.940828] [] ? system_call_fastpath+0x16/0x1b
[300120.940841] INFO: task BackupPC_dump:2324 blocked for more than 120 seconds.
[300120.940847] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300120.940853] BackupPC_dump D ffff88013fd13b80 0 2324 31833 0x00000000
[300120.940862] ffff880119f5b800 0000000000000082 ffff880086071c58 ffff8801376ec780
[300120.940874] 0000000000013b80 ffff880086071fd8 ffff880086071fd8 ffff880119f5b800
[300120.940887] ffffc90053cd4440 ffff88013fd13b80 ffff880119f5b800 0000000000000001
[300120.940900] Call Trace:
[300120.940913] [] ? io_schedule+0x56/0x6d
[300120.940944] [] ? cv_wait_common+0xd5/0x171 [spl]
[300120.940957] [] ? abort_exclusive_wait+0x79/0x79
[300120.940968] [] ? should_resched+0x5/0x23
[300120.941060] [] ? zio_wait+0x107/0x139 [zfs]
[300120.941117] [] ? dmu_buf_hold_array_by_dnode+0x242/0x329 [zfs]
[300120.941215] [] ? dmu_buf_hold_array+0x59/0x7b [zfs]
[300120.941274] [] ? dmu_read_uio+0x3d/0xbd [zfs]
[300120.941358] [] ? zfs_read+0x344/0x3ec [zfs]
[300120.941444] [] ? zpl_read_common+0x52/0x6a [zfs]
[300120.941529] [] ? zpl_read+0x5a/0x7e [zfs]
[300120.941538] [] ? vfs_read+0x93/0xf5
[300120.941545] [] ? sys_read+0x51/0x80
[300120.941554] [] ? system_call_fastpath+0x16/0x1b
[300120.941564] INFO: task BackupPC_dump:31174 blocked for more than 120 seconds.
[300120.941568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300120.941572] BackupPC_dump D ffff88013fc13b80 0 31174 31167 0x00000000
[300120.941580] ffff88012491b8c0 0000000000000086 ffff88008d8051a8 ffff880137025840
[300120.941589] 0000000000013b80 ffff88001a5f7fd8 ffff88001a5f7fd8 ffff88012491b8c0
[300120.941598] 0000000000000202 ffff88013fc13b80 ffff88012491b8c0 0000000000000001
[300120.941607] Call Trace:
[300120.941616] [] ? io_schedule+0x56/0x6d
[300120.941637] [] ? cv_wait_common+0xd5/0x171 [spl]
[300120.941646] [] ? abort_exclusive_wait+0x79/0x79
[300120.941653] [] ? should_resched+0x5/0x23
[300120.941738] [] ? zio_wait+0x107/0x139 [zfs]
[300120.941791] [] ? dbuf_read+0x3fe/0x4b6 [zfs]
[300120.941802] [] ? _raw_spin_lock_irqsave+0x14/0x35
[300120.941864] [] ? dnode_hold_impl+0x138/0x37c [zfs]
[300120.941913] [] ? remove_reference.isra.9+0x7b/0x87 [zfs]
[300120.941921] [] ? should_resched+0x5/0x23
[300120.941929] [] ? _cond_resched+0x6/0x1b
[300120.941985] [] ? dmu_bonus_hold+0x1c/0x177 [zfs]
[300120.941993] [] ? mutex_lock+0xd/0x2d
[300120.942076] [] ? zfs_zget+0x5b/0x191 [zfs]
[300120.942157] [] ? zfs_dirent_lock+0x3f4/0x425 [zfs]
[300120.942236] [] ? zfs_zaccess_aces_check+0x1ca/0x1f5 [zfs]
[300120.942315] [] ? zfs_dirlook+0x1bc/0x215 [zfs]
[300120.942397] [] ? zfs_zaccess+0x1cf/0x255 [zfs]
[300120.942408] [] ? should_resched+0x5/0x23
[300120.942506] [] ? zfs_lookup+0x263/0x2b0 [zfs]
[300120.942595] [] ? zpl_lookup+0x47/0x95 [zfs]
[300120.942604] [] ? d_alloc+0x61/0x69
[300120.942613] [] ? lookup_real+0x24/0x3f
[300120.942620] [] ? do_last+0x505/0xa0f
[300120.942628] [] ? __inode_permission+0x57/0x95
[300120.942636] [] ? path_openat+0xc1/0x33b
[300120.942644] [] ? do_filp_open+0x2a/0x6e
[300120.942652] [] ? __alloc_fd+0xdc/0xed
[300120.942661] [] ? do_sys_open+0x5c/0xe0
[300120.942670] [] ? system_call_fastpath+0x16/0x1b
[300240.940064] INFO: task kswapd0:39 blocked for more than 120 seconds.
[300240.940075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300240.940082] kswapd0 D ffff88013fd93b80 0 39 2 0x00000000
[300240.940096] ffff88013afd71c0 0000000000000046 0000000000013b80 ffff88013aea41c0
[300240.940110] 0000000000013b80 ffff880137a9ffd8 ffff880137a9ffd8 ffff88013afd71c0
[300240.940124] ffff88013afd71c0 ffff8801248c8fb8 ffff88013afd71c0 ffff8801248c8fbc
[300240.940138] Call Trace:
[300240.940160] [] ? __mutex_lock_common.isra.6+0x149/0x165
[300240.940169] [] ? mutex_lock+0x1a/0x2d
[300240.940177] [] ? mutex_lock+0xd/0x2d
[300240.940281] [] ? zfs_zinactive+0x4a/0xc6 [zfs]
[300240.940366] [] ? zfs_inactive+0x103/0x13c [zfs]
[300240.940377] [] ? evict+0xa3/0x153
[300240.940384] [] ? dispose_list+0x27/0x31
[300240.940391] [] ? prune_icache_sb+0x2aa/0x2c5
[300240.940400] [] ? prune_super+0xc3/0x12d
[300240.940410] [] ? shrink_slab+0x19d/0x289
[300240.940418] [] ? mem_cgroup_iter+0x24/0x181
[300240.940427] [] ? balance_pgdat+0x2e1/0x552
[300240.940436] [] ? _raw_spin_lock_irqsave+0x14/0x35
[300240.940443] [] ? kswapd+0x2b4/0x325
[300240.940453] [] ? abort_exclusive_wait+0x79/0x79
[300240.940461] [] ? balance_pgdat+0x552/0x552
[300240.940468] [] ? balance_pgdat+0x552/0x552
[300240.940476] [] ? kthread+0x81/0x89
[300240.940484] [] ? __kthread_parkme+0x5c/0x5c
[300240.940493] [] ? ret_from_fork+0x7c/0xb0
[300240.940501] [] ? __kthread_parkme+0x5c/0x5c
[300240.940524] INFO: task arc_adapt:1128 blocked for more than 120 seconds.
[300240.940528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300240.940532] arc_adapt D ffff88013fd93b80 0 1128 2 0x00000000
[300240.940539] ffff880138547800 0000000000000046 ffff880139c950c8 ffff88013aea41c0
[300240.940548] 0000000000013b80 ffff8801366c7fd8 ffff8801366c7fd8 ffff880138547800
[300240.940557] ffff88013aea41c0 ffffffffa0564768 ffff880138547800 ffffffffa056476c
[300240.940565] Call Trace:
[300240.940580] [] ? __mutex_lock_common.isra.6+0x149/0x165
[300240.940604] [] ? __thread_create+0x2d2/0x2d2 [spl]
[300240.940612] [] ? mutex_lock+0x1a/0x2d
[300240.940663] [] ? arc_evict+0x6c/0x49b [zfs]
[300240.940674] [] ? _raw_spin_lock_irqsave+0x14/0x35
[300240.940698] [] ? spl_kmem_cache_free+0x298/0x30a [spl]
[300240.940717] [] ? __thread_create+0x2d2/0x2d2 [spl]
[300240.940766] [] ? arc_adjust+0x66/0x172 [zfs]
[300240.940817] [] ? arc_adjust_meta+0x176/0x176 [zfs]
[300240.940868] [] ? arc_adapt_thread+0x57/0xb6 [zfs]
[300240.940887] [] ? thread_generic_wrapper+0x6a/0x75 [spl]
[300240.940895] [] ? kthread+0x81/0x89
[300240.940903] [] ? __kthread_parkme+0x5c/0x5c
[300240.940911] [] ? ret_from_fork+0x7c/0xb0
[300240.940919] [] ? __kthread_parkme+0x5c/0x5c
The pools currently look like this, but the lockups happened when the pools were less full too:
(I can't believe the mailing list is full of people saying ZFS is production ready. Not if you use rsync. It's been no end of trouble for me, but unfortunately it's a 3 week job to migrate back to something sensible like mdadm+ext3)