
txg_sync blocked for more than 120 seconds #3835

Closed
tjikkun opened this issue Sep 25, 2015 · 21 comments

tjikkun commented Sep 25, 2015

Possibly related to #3834

We are seeing regular hangs. Sometimes the system recovers; sometimes we need to reboot.
The ZFS version is zfs-0.6.5-1.el6.x86_64.
I am 90% sure that this only happens when a zfs send is in progress (but in our case, more often than not, a send is in progress).

Example 1:

[795637.811766] INFO: task txg_sync:4416 blocked for more than 120 seconds.
[795637.811800]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.811828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.811852] txg_sync      D ffff881078e74680     0  4416      2    0 0x00000000
[795637.811857]  ffff881075a7f930 0000000000000046 0000000000000000 ffff88107fc33838
[795637.811860]  0000000000000000 0000000000000000 0002d37cc126a207 ffffffff81015019
[795637.811863]  ffff881075a7f8e0 0000000000000000 000000012f5be0bd ffff881078e74c48
[795637.811866] Call Trace:
[795637.811879]  [<ffffffff81015019>] ? read_tsc+0x9/0x20
[795637.811888]  [<ffffffff81537893>] io_schedule+0x73/0xc0
[795637.811906]  [<ffffffffa0174fdf>] cv_wait_common+0xaf/0x130 [spl]
[795637.811913]  [<ffffffff810a9e10>] ? autoremove_wake_function+0x0/0x40
[795637.811920]  [<ffffffffa0175078>] __cv_wait_io+0x18/0x20 [spl]
[795637.811973]  [<ffffffffa039a27b>] zio_wait+0x10b/0x1e0 [zfs]
[795637.811999]  [<ffffffffa039a4eb>] zio_free+0xdb/0x130 [zfs]
[795637.812024]  [<ffffffffa039a54e>] zio_free_zil+0xe/0x10 [zfs]
[795637.812049]  [<ffffffffa0391f59>] zil_sync+0x189/0x470 [zfs]
[795637.812074]  [<ffffffffa033ea5e>] ? spa_taskq_dispatch_ent+0x7e/0xb0 [zfs]
[795637.812099]  [<ffffffffa0396583>] ? zio_taskq_dispatch+0x73/0xa0 [zfs]
[795637.812123]  [<ffffffffa039a64f>] ? zio_nowait+0xff/0x180 [zfs]
[795637.812140]  [<ffffffffa02fe4ae>] dmu_objset_sync+0x23e/0x2e0 [zfs]
[795637.812156]  [<ffffffffa02fc9d0>] ? dmu_objset_write_ready+0x0/0x70 [zfs]
[795637.812173]  [<ffffffffa02fe550>] ? dmu_objset_write_done+0x0/0x70 [zfs]
[795637.812195]  [<ffffffffa0311f29>] dsl_dataset_sync+0x59/0xb0 [zfs]
[795637.812216]  [<ffffffffa0321a6a>] dsl_pool_sync+0x9a/0x440 [zfs]
[795637.812241]  [<ffffffffa0338cae>] spa_sync+0x35e/0xb10 [zfs]
[795637.812246]  [<ffffffff8105b5b9>] ? __wake_up_common+0x59/0x90
[795637.812251]  [<ffffffff8105ff73>] ? __wake_up+0x53/0x70
[795637.812254]  [<ffffffff81015019>] ? read_tsc+0x9/0x20
[795637.812280]  [<ffffffffa034f368>] txg_sync_thread+0x3d8/0x670 [zfs]
[795637.812307]  [<ffffffffa034ef90>] ? txg_sync_thread+0x0/0x670 [zfs]
[795637.812331]  [<ffffffffa034ef90>] ? txg_sync_thread+0x0/0x670 [zfs]
[795637.812337]  [<ffffffffa0170378>] thread_generic_wrapper+0x68/0x80 [spl]
[795637.812342]  [<ffffffffa0170310>] ? thread_generic_wrapper+0x0/0x80 [spl]
[795637.812346]  [<ffffffff810a9a1e>] kthread+0x9e/0xc0
[795637.812351]  [<ffffffff8100c34a>] child_rip+0xa/0x20
[795637.812356]  [<ffffffff810a9980>] ? kthread+0x0/0xc0
[795637.812358]  [<ffffffff8100c340>] ? child_rip+0x0/0x20
[795637.812598] INFO: task zfs:174562 blocked for more than 120 seconds.
[795637.812620]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.812943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.813552] zfs           D ffff880201f70900     0 174562 174550    0 0x00000000
[795637.813556]  ffff8804fb8632a8 0000000000000082 0000000000000000 ffff881078cd8d20
[795637.813559]  0000000000000001 ffff881078cd8d38 0002d37c551fca47 ffff880049e0f7b0
[795637.813562]  ffff8804fb863228 ffff880049e0f7a0 000000012f5bd7fd ffff880201f70ec8
[795637.813565] Call Trace:
[795637.813575]  [<ffffffffa0170935>] taskq_wait_id+0x65/0xa0 [spl]
[795637.813579]  [<ffffffff810a9e10>] ? autoremove_wake_function+0x0/0x40
[795637.813598]  [<ffffffffa03051a0>] ? dump_bytes_strategy+0x0/0x100 [zfs]
[795637.813615]  [<ffffffffa03051a0>] ? dump_bytes_strategy+0x0/0x100 [zfs]
[795637.813640]  [<ffffffffa033e9ad>] spa_taskq_dispatch_sync+0x8d/0xc0 [zfs]
[795637.813658]  [<ffffffffa03052e2>] dump_bytes+0x42/0x50 [zfs]
[795637.813675]  [<ffffffffa030571a>] dump_write+0x20a/0x240 [zfs]
[795637.813700]  [<ffffffffa0306a1c>] backup_cb+0x71c/0x760 [zfs]
[795637.813711]  [<ffffffff8105ff73>] ? __wake_up+0x53/0x70
[795637.813734]  [<ffffffffa0307376>] traverse_visitbp+0x476/0x7b0 [zfs]
[795637.813746]  [<ffffffffa016dd46>] ? spl_kmem_alloc+0x96/0x1a0 [spl]
[795637.813770]  [<ffffffffa03072eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[795637.813795]  [<ffffffffa0307d91>] traverse_dnode+0x71/0xd0 [zfs]
[795637.813819]  [<ffffffffa03075f7>] traverse_visitbp+0x6f7/0x7b0 [zfs]
[795637.813844]  [<ffffffffa03072eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[795637.813869]  [<ffffffffa03072eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[795637.813895]  [<ffffffffa03072eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[795637.813920]  [<ffffffffa03072eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[795637.813945]  [<ffffffffa03072eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[795637.813970]  [<ffffffffa03072eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[795637.813994]  [<ffffffffa0307d91>] traverse_dnode+0x71/0xd0 [zfs]
[795637.814017]  [<ffffffffa0307451>] traverse_visitbp+0x551/0x7b0 [zfs]
[795637.814041]  [<ffffffffa0307df0>] ? traverse_prefetch_thread+0x0/0x170 [zfs]
[795637.814065]  [<ffffffffa0307834>] traverse_impl+0x184/0x400 [zfs]
[795637.814087]  [<ffffffffa03051a0>] ? dump_bytes_strategy+0x0/0x100 [zfs]
[795637.814112]  [<ffffffffa0307b46>] traverse_dataset+0x56/0x60 [zfs]
[795637.814135]  [<ffffffffa0306300>] ? backup_cb+0x0/0x760 [zfs]
[795637.814158]  [<ffffffffa0305abb>] dmu_send_impl+0x36b/0x4f0 [zfs]
[795637.814181]  [<ffffffffa03060d7>] dmu_send_obj+0x157/0x210 [zfs]
[795637.814214]  [<ffffffffa0375297>] zfs_ioc_send+0xa7/0x280 [zfs]
[795637.814246]  [<ffffffffa0378cb5>] zfsdev_ioctl+0x495/0x4d0 [zfs]
[795637.814256]  [<ffffffff811cc5d2>] vfs_ioctl+0x22/0xa0
[795637.814264]  [<ffffffff811cc774>] do_vfs_ioctl+0x84/0x5b0
[795637.814274]  [<ffffffff811cccef>] sys_ioctl+0x4f/0x80
[795637.814283]  [<ffffffff8153b30e>] ? do_device_not_available+0xe/0x10
[795637.814293]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[795637.814401] INFO: task imap:752236 blocked for more than 120 seconds.
[795637.814718]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.815320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.815924] imap          D ffff8803ba93e6c0     0 752236   8619    0 0x00000000
[795637.815927]  ffff88025dfc5d28 0000000000000086 0000000000000000 0000000000000002
[795637.815930]  ffff8803ba93e6c0 0000000000000000 0002d37fdc1d7bb2 ffff880477ea7838
[795637.815933]  0000000000007805 0000000000000000 000000012f5c131d ffff8803ba93ec88
[795637.815936] Call Trace:
[795637.815945]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[795637.815956]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[795637.815989]  [<ffffffffa039360c>] zil_commit+0x3c/0x830 [zfs]
[795637.815999]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[795637.816028]  [<ffffffffa0333332>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[795637.816060]  [<ffffffffa03875f2>] zfs_fsync+0x92/0x120 [zfs]
[795637.816090]  [<ffffffffa039f990>] zpl_fsync+0x60/0xa0 [zfs]
[795637.816103]  [<ffffffff811ecb3d>] vfs_fsync_range+0x10d/0x260
[795637.816110]  [<ffffffff811eccfd>] vfs_fsync+0x1d/0x20
[795637.816118]  [<ffffffff811ecd83>] do_fsync+0x83/0xe0
[795637.816125]  [<ffffffff811ecdf3>] sys_fdatasync+0x13/0x20
[795637.816133]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[795637.816168] INFO: task imap:766154 blocked for more than 120 seconds.
[795637.816478]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.817086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.817697] imap          D ffff8809366c8240     0 766154   8619    0 0x00000000
[795637.817700]  ffff880b91f83d28 0000000000000086 0000000000000000 ffffea003ffa6140
[795637.817718]  ffff8809366c8240 0000000000000000 0002d383b1fbcf14 ffff8801238c5b70
[795637.817734]  ffff8803a4d06e00 0000000000000000 000000012f5c5367 ffff8809366c8808
[795637.817751] Call Trace:
[795637.817763]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[795637.817773]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[795637.817805]  [<ffffffffa039360c>] zil_commit+0x3c/0x830 [zfs]
[795637.817815]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[795637.817843]  [<ffffffffa0333332>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[795637.817875]  [<ffffffffa03875f2>] zfs_fsync+0x92/0x120 [zfs]
[795637.817905]  [<ffffffffa039f990>] zpl_fsync+0x60/0xa0 [zfs]
[795637.817915]  [<ffffffff811ecb3d>] vfs_fsync_range+0x10d/0x260
[795637.817924]  [<ffffffff811eccfd>] vfs_fsync+0x1d/0x20
[795637.817931]  [<ffffffff811ecd83>] do_fsync+0x83/0xe0
[795637.817940]  [<ffffffff8153afe5>] ? page_fault+0x25/0x30
[795637.817948]  [<ffffffff811ece10>] sys_fsync+0x10/0x20
[795637.817956]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[795637.817965] INFO: task imap:766524 blocked for more than 120 seconds.
[795637.818279]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.818885] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.819486] imap          D ffff880a60c26ee0     0 766524   8619    0 0x00000000
[795637.819489]  ffff880a835e1d28 0000000000000082 0000000000000000 ffffffff81158cd1
[795637.819493]  ffff880a60c26ee0 0000000000000000 0002d37f05275e98 ffff8807ea772f38
[795637.819495]  0000000000001a50 0000000000000000 000000012f5c06f6 ffff880a60c274a8
[795637.819498] Call Trace:
[795637.819506]  [<ffffffff81158cd1>] ? lru_cache_add_lru+0x21/0x40
[795637.819512]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[795637.819516]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[795637.819541]  [<ffffffffa039360c>] zil_commit+0x3c/0x830 [zfs]
[795637.819545]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[795637.819568]  [<ffffffffa0333332>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[795637.819593]  [<ffffffffa03875f2>] zfs_fsync+0x92/0x120 [zfs]
[795637.819618]  [<ffffffffa039f990>] zpl_fsync+0x60/0xa0 [zfs]
[795637.819622]  [<ffffffff811ecb3d>] vfs_fsync_range+0x10d/0x260
[795637.819625]  [<ffffffff811eccfd>] vfs_fsync+0x1d/0x20
[795637.819628]  [<ffffffff811ecd83>] do_fsync+0x83/0xe0
[795637.819631]  [<ffffffff8117b831>] ? sys_munmap+0x61/0x80
[795637.819634]  [<ffffffff811ece10>] sys_fsync+0x10/0x20
[795637.819636]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[795637.819645] INFO: task imap:768041 blocked for more than 120 seconds.
[795637.819962]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.820563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.821167] imap          D ffff8801bf9e2140     0 768041   8619    0 0x00000000
[795637.821170]  ffff8803858c3d28 0000000000000082 0000000000000000 ffffffff81158cd1
[795637.821173]  ffff8801bf9e2140 0000000000000000 0002d37d56af91d6 ffff88105c712ab8
[795637.821176]  000000000000000d 0000000000000000 000000012f5be8e1 ffff8801bf9e2708
[795637.821178] Call Trace:
[795637.821182]  [<ffffffff81158cd1>] ? lru_cache_add_lru+0x21/0x40
[795637.821194]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[795637.821205]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[795637.821236]  [<ffffffffa039360c>] zil_commit+0x3c/0x830 [zfs]
[795637.821243]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[795637.821271]  [<ffffffffa0333332>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[795637.821304]  [<ffffffffa03875f2>] zfs_fsync+0x92/0x120 [zfs]
[795637.821334]  [<ffffffffa039f990>] zpl_fsync+0x60/0xa0 [zfs]
[795637.821343]  [<ffffffff811ecb3d>] vfs_fsync_range+0x10d/0x260
[795637.821353]  [<ffffffff811eccfd>] vfs_fsync+0x1d/0x20
[795637.821361]  [<ffffffff811ecd83>] do_fsync+0x83/0xe0
[795637.821369]  [<ffffffff811ece10>] sys_fsync+0x10/0x20
[795637.821376]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[795637.821385] INFO: task exim:768076 blocked for more than 120 seconds.
[795637.821703]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.822305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.822909] exim          D ffff8809a7d9e500     0 768076 768069    0 0x00000000
[795637.822913]  ffff880a22efbd28 0000000000000086 0000000000000000 ffffea00228172c0
[795637.822916]  ffff8809a7d9e500 0000000000000000 0002d37d7e00c246 ffff88107c65d010
[795637.822919]  ffff880c09960b80 ffff88107c65d000 000000012f5beb63 ffff8809a7d9eac8
[795637.822921] Call Trace:
[795637.822929]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[795637.822939]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[795637.822971]  [<ffffffffa039360c>] zil_commit+0x3c/0x830 [zfs]
[795637.822979]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[795637.823008]  [<ffffffffa0333332>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[795637.823039]  [<ffffffffa03875f2>] zfs_fsync+0x92/0x120 [zfs]
[795637.823069]  [<ffffffffa039f990>] zpl_fsync+0x60/0xa0 [zfs]
[795637.823079]  [<ffffffff811ecb3d>] vfs_fsync_range+0x10d/0x260
[795637.823089]  [<ffffffff811eccfd>] vfs_fsync+0x1d/0x20
[795637.823097]  [<ffffffff811ecd83>] do_fsync+0x83/0xe0
[795637.823106]  [<ffffffff8153afe5>] ? page_fault+0x25/0x30
[795637.823115]  [<ffffffff811ece10>] sys_fsync+0x10/0x20
[795637.823124]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[795637.823135] INFO: task imap:768309 blocked for more than 120 seconds.
[795637.823448]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.824071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.824670] imap          D ffff880b2f54d1a0     0 768309   8619    0 0x00000000
[795637.824673]  ffff880bf5c57d28 0000000000000082 ffff880bf5c57cf0 ffffea003ffa7400
[795637.824676]  ffff880bf5c57d88 000000000009bd3b 0002d37e30fddf98 ffff88107c65d010
[795637.824683]  ffff880800000000 ffff88107c65d000 000000012f5bf725 ffff880b2f54d768
[795637.824699] Call Trace:
[795637.824710]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[795637.824721]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[795637.824753]  [<ffffffffa039360c>] zil_commit+0x3c/0x830 [zfs]
[795637.824763]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[795637.824792]  [<ffffffffa0333332>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[795637.824824]  [<ffffffffa03875f2>] zfs_fsync+0x92/0x120 [zfs]
[795637.824854]  [<ffffffffa039f990>] zpl_fsync+0x60/0xa0 [zfs]
[795637.824865]  [<ffffffff811ecb3d>] vfs_fsync_range+0x10d/0x260
[795637.824876]  [<ffffffff811eccfd>] vfs_fsync+0x1d/0x20
[795637.824885]  [<ffffffff811ecd83>] do_fsync+0x83/0xe0
[795637.824893]  [<ffffffff8117b831>] ? sys_munmap+0x61/0x80
[795637.824902]  [<ffffffff811ece10>] sys_fsync+0x10/0x20
[795637.824910]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[795637.824921] INFO: task exim:768494 blocked for more than 120 seconds.
[795637.825234]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.825839] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.826437] exim          D ffff88092464c080     0 768494 768486    0 0x00000000
[795637.826441]  ffff880bfc201d28 0000000000000082 0000000000000000 ffffea00228172c0
[795637.826444]  ffff88092464c080 0000000000000000 0002d37f5429ccb8 ffff88107c65d010
[795637.826447]  ffff880ce6911580 ffff88107c65d000 000000012f5c0a36 ffff88092464c648
[795637.826449] Call Trace:
[795637.826455]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[795637.826460]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[795637.826485]  [<ffffffffa039360c>] zil_commit+0x3c/0x830 [zfs]
[795637.826488]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[795637.826511]  [<ffffffffa0333332>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[795637.826537]  [<ffffffffa03875f2>] zfs_fsync+0x92/0x120 [zfs]
[795637.826561]  [<ffffffffa039f990>] zpl_fsync+0x60/0xa0 [zfs]
[795637.826565]  [<ffffffff811ecb3d>] vfs_fsync_range+0x10d/0x260
[795637.826569]  [<ffffffff811eccfd>] vfs_fsync+0x1d/0x20
[795637.826571]  [<ffffffff811ecd83>] do_fsync+0x83/0xe0
[795637.826574]  [<ffffffff8153afe5>] ? page_fault+0x25/0x30
[795637.826577]  [<ffffffff811ece10>] sys_fsync+0x10/0x20
[795637.826579]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[795637.826583] INFO: task exim:768554 blocked for more than 120 seconds.
[795637.826896]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[795637.827496] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[795637.828097] exim          D ffff880d616da980     0 768554 768541    0 0x00000000
[795637.828100]  ffff880ba4993d28 0000000000000082 0000000000000000 ffffea003feb1e00
[795637.828103]  ffff880d616da980 0000000000000000 0002d37fbc0dc8aa ffff880c3ea66318
[795637.828106]  ffff88095a7ed3f0 0000000000000000 000000012f5c10fb ffff880d616daf48
[795637.828108] Call Trace:
[795637.828116]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[795637.828127]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[795637.828159]  [<ffffffffa039360c>] zil_commit+0x3c/0x830 [zfs]
[795637.828169]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[795637.828198]  [<ffffffffa0333332>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[795637.828230]  [<ffffffffa03875f2>] zfs_fsync+0x92/0x120 [zfs]
[795637.828260]  [<ffffffffa039f990>] zpl_fsync+0x60/0xa0 [zfs]
[795637.828267]  [<ffffffff811ecb3d>] vfs_fsync_range+0x10d/0x260
[795637.828277]  [<ffffffff811eccfd>] vfs_fsync+0x1d/0x20
[795637.828285]  [<ffffffff811ecd83>] do_fsync+0x83/0xe0
[795637.828294]  [<ffffffff8153afe5>] ? page_fault+0x25/0x30
[795637.828302]  [<ffffffff811ece10>] sys_fsync+0x10/0x20
[795637.828311]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b

Example 2:

[708681.946261] INFO: task txg_sync:4341 blocked for more than 120 seconds.
[708681.946584]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[708681.947186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[708681.947776] txg_sync      D ffff88087a624480     0  4341      2    0 0x00000000
[708681.947782]  ffff88087230b930 0000000000000046 ffff88087230b8f8 ffff880873101638
[708681.947786]  0000000000000000 0000000002434348 0002844adc9271d7 ffff88087c56cc10
[708681.947789]  ffff880800000000 ffff88087c56cc00 000000012a2c59e9 ffff88087a624a48
[708681.947792] Call Trace:
[708681.947807]  [<ffffffff81537893>] io_schedule+0x73/0xc0
[708681.947823]  [<ffffffffa02aefdf>] cv_wait_common+0xaf/0x130 [spl]
[708681.947831]  [<ffffffff810a9e10>] ? autoremove_wake_function+0x0/0x40
[708681.947838]  [<ffffffffa02af078>] __cv_wait_io+0x18/0x20 [spl]
[708681.947873]  [<ffffffffa03a627b>] zio_wait+0x10b/0x1e0 [zfs]
[708681.947899]  [<ffffffffa03a64eb>] zio_free+0xdb/0x130 [zfs]
[708681.947923]  [<ffffffffa03a654e>] zio_free_zil+0xe/0x10 [zfs]
[708681.947947]  [<ffffffffa039df59>] zil_sync+0x189/0x470 [zfs]
[708681.947972]  [<ffffffffa034aa5e>] ? spa_taskq_dispatch_ent+0x7e/0xb0 [zfs]
[708681.947997]  [<ffffffffa03a2583>] ? zio_taskq_dispatch+0x73/0xa0 [zfs]
[708681.948021]  [<ffffffffa03a664f>] ? zio_nowait+0xff/0x180 [zfs]
[708681.948038]  [<ffffffffa030a4ae>] dmu_objset_sync+0x23e/0x2e0 [zfs]
[708681.948053]  [<ffffffffa03089d0>] ? dmu_objset_write_ready+0x0/0x70 [zfs]
[708681.948069]  [<ffffffffa030a550>] ? dmu_objset_write_done+0x0/0x70 [zfs]
[708681.948091]  [<ffffffffa031df29>] dsl_dataset_sync+0x59/0xb0 [zfs]
[708681.948111]  [<ffffffffa032da6a>] dsl_pool_sync+0x9a/0x440 [zfs]
[708681.948136]  [<ffffffffa0344cae>] spa_sync+0x35e/0xb10 [zfs]
[708681.948140]  [<ffffffff8105b5b9>] ? __wake_up_common+0x59/0x90
[708681.948145]  [<ffffffff8105ff73>] ? __wake_up+0x53/0x70
[708681.948150]  [<ffffffff81015019>] ? read_tsc+0x9/0x20
[708681.948174]  [<ffffffffa035b368>] txg_sync_thread+0x3d8/0x670 [zfs]
[708681.948208]  [<ffffffffa035af90>] ? txg_sync_thread+0x0/0x670 [zfs]
[708681.948239]  [<ffffffffa035af90>] ? txg_sync_thread+0x0/0x670 [zfs]
[708681.948252]  [<ffffffffa02aa378>] thread_generic_wrapper+0x68/0x80 [spl]
[708681.948264]  [<ffffffffa02aa310>] ? thread_generic_wrapper+0x0/0x80 [spl]
[708681.948274]  [<ffffffff810a9a1e>] kthread+0x9e/0xc0
[708681.948283]  [<ffffffff8100c34a>] child_rip+0xa/0x20
[708681.948293]  [<ffffffff810a9980>] ? kthread+0x0/0xc0
[708681.948303]  [<ffffffff8100c340>] ? child_rip+0x0/0x20
[708681.948464] INFO: task zfs:542741 blocked for more than 120 seconds.
[708681.948768]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[708681.949362] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[708681.949955] zfs           D ffff880d50766500     0 542741 542730    0 0x00000000
[708681.949959]  ffff88010e43d348 0000000000000082 0000000000000000 ffff880873101520
[708681.949962]  0000000000000001 ffff880873101538 000284493a167cce 0000000000000000
[708681.949965]  ffff88010e43d2c8 0000000000000000 000000012a2c3e91 ffff880d50766ac8
[708681.949968] Call Trace:
[708681.949977]  [<ffffffffa02aa935>] taskq_wait_id+0x65/0xa0 [spl]
[708681.949981]  [<ffffffff810a9e10>] ? autoremove_wake_function+0x0/0x40
[708681.949999]  [<ffffffffa03111a0>] ? dump_bytes_strategy+0x0/0x100 [zfs]
[708681.950016]  [<ffffffffa03111a0>] ? dump_bytes_strategy+0x0/0x100 [zfs]
[708681.950039]  [<ffffffffa034a9ad>] spa_taskq_dispatch_sync+0x8d/0xc0 [zfs]
[708681.950057]  [<ffffffffa03112e2>] dump_bytes+0x42/0x50 [zfs]
[708681.950074]  [<ffffffffa031171a>] dump_write+0x20a/0x240 [zfs]
[708681.950091]  [<ffffffffa0312a1c>] backup_cb+0x71c/0x760 [zfs]
[708681.950095]  [<ffffffff8105ff73>] ? __wake_up+0x53/0x70
[708681.950113]  [<ffffffffa0313376>] traverse_visitbp+0x476/0x7b0 [zfs]
[708681.950125]  [<ffffffffa02f7811>] ? arc_read+0x3e1/0xa70 [zfs]
[708681.950142]  [<ffffffffa03125da>] ? backup_cb+0x2da/0x760 [zfs]
[708681.950159]  [<ffffffffa0313d91>] traverse_dnode+0x71/0xd0 [zfs]
[708681.950176]  [<ffffffffa03135f7>] traverse_visitbp+0x6f7/0x7b0 [zfs]
[708681.950202]  [<ffffffffa03132eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[708681.950227]  [<ffffffffa03132eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[708681.950254]  [<ffffffffa03132eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[708681.950279]  [<ffffffffa03132eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[708681.950303]  [<ffffffffa03132eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[708681.950328]  [<ffffffffa03132eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[708681.950352]  [<ffffffffa0313d91>] traverse_dnode+0x71/0xd0 [zfs]
[708681.950375]  [<ffffffffa0313451>] traverse_visitbp+0x551/0x7b0 [zfs]
[708681.950401]  [<ffffffffa0313df0>] ? traverse_prefetch_thread+0x0/0x170 [zfs]
[708681.950428]  [<ffffffffa0313834>] traverse_impl+0x184/0x400 [zfs]
[708681.950451]  [<ffffffffa03111a0>] ? dump_bytes_strategy+0x0/0x100 [zfs]
[708681.950476]  [<ffffffffa0313b46>] traverse_dataset+0x56/0x60 [zfs]
[708681.950499]  [<ffffffffa0312300>] ? backup_cb+0x0/0x760 [zfs]
[708681.950521]  [<ffffffffa0311abb>] dmu_send_impl+0x36b/0x4f0 [zfs]
[708681.950546]  [<ffffffffa03120d7>] dmu_send_obj+0x157/0x210 [zfs]
[708681.950579]  [<ffffffffa0381297>] zfs_ioc_send+0xa7/0x280 [zfs]
[708681.950612]  [<ffffffffa0384cb5>] zfsdev_ioctl+0x495/0x4d0 [zfs]
[708681.950624]  [<ffffffff811cc5d2>] vfs_ioctl+0x22/0xa0
[708681.950633]  [<ffffffff811cc774>] do_vfs_ioctl+0x84/0x5b0
[708681.950642]  [<ffffffff811cccef>] sys_ioctl+0x4f/0x80
[708681.950651]  [<ffffffff8153b30e>] ? do_device_not_available+0xe/0x10
[708681.950662]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b

If you need anything more, please let me know!


tjikkun commented Sep 29, 2015

Any ideas regarding this issue? Do you think it is related to #3834? Is there a possible workaround?

dweeezil (Contributor) commented:

@tjikkun I suspect this would be fixed by ef5b2e1, which, if you want to try it, will apply cleanly to 0.6.5. Alternatively, you might be able to get things working better by setting spl_taskq_thread_dynamic=0.
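
For reference, applying that commit to a source checkout would look roughly like this (a sketch only: the repository URL and tag name are assumptions, and the commit may live in the spl tree rather than zfs, in which case adjust the clone URL):

git clone https://github.com/zfsonlinux/zfs.git   # assumed upstream repository
cd zfs
git checkout zfs-0.6.5                            # assumed tag for the 0.6.5 release
git cherry-pick ef5b2e1                           # the commit referenced above

Then rebuild and reinstall the modules as usual.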


tjikkun commented Sep 29, 2015

@dweeezil Is it possible to set spl_taskq_thread_dynamic=0 on a running system, or is it a boot option only?

dweeezil (Contributor) commented:

@tjikkun It should be set when the spl module is loaded (modprobe spl spl_taskq_thread_dynamic=0) or, at the very least, before the pool(s) are imported. Setting it on a running system won't have much effect.
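
To make the setting persist across reboots, the usual route is a modprobe options file (a sketch; the file name is arbitrary):

# /etc/modprobe.d/spl.conf -- read when the spl module is loaded
options spl spl_taskq_thread_dynamic=0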


tjikkun commented Oct 1, 2015

Just got this on 0.6.5.2:

[15253.812996] INFO: task txg_sync:3134 blocked for more than 120 seconds.
[15253.813029]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[15253.813062] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[15253.813082] txg_sync      D ffff880879e60040     0  3134      2    0 0x00000000
[15253.813089]  ffff880869e29930 0000000000000046 0000000000000000 ffff88107bb98138
[15253.813093]  0000000000000000 0000000000000000 00000daf4b770c2c ffffffff81015019
[15253.813096]  ffff880869e298e0 0000000000000000 0000000100e0d30d ffff880879e60608
[15253.813099] Call Trace:
[15253.813114]  [<ffffffff81015019>] ? read_tsc+0x9/0x20
[15253.813125]  [<ffffffff81537893>] io_schedule+0x73/0xc0
[15253.813144]  [<ffffffffa029bfdf>] cv_wait_common+0xaf/0x130 [spl]
[15253.813151]  [<ffffffff810a9e10>] ? autoremove_wake_function+0x0/0x40
[15253.813160]  [<ffffffffa029c078>] __cv_wait_io+0x18/0x20 [spl]
[15253.813216]  [<ffffffffa03a523b>] zio_wait+0x10b/0x1e0 [zfs]
[15253.813249]  [<ffffffffa03a54ab>] zio_free+0xdb/0x130 [zfs]
[15253.813281]  [<ffffffffa03a550e>] zio_free_zil+0xe/0x10 [zfs]
[15253.813313]  [<ffffffffa039cf19>] zil_sync+0x189/0x470 [zfs]
[15253.813344]  [<ffffffffa034995e>] ? spa_taskq_dispatch_ent+0x7e/0xb0 [zfs]
[15253.813376]  [<ffffffffa03a1543>] ? zio_taskq_dispatch+0x73/0xa0 [zfs]
[15253.813408]  [<ffffffffa03a560f>] ? zio_nowait+0xff/0x180 [zfs]
[15253.813432]  [<ffffffffa030990e>] dmu_objset_sync+0x23e/0x2e0 [zfs]
[15253.813455]  [<ffffffffa03079c0>] ? dmu_objset_write_ready+0x0/0x70 [zfs]
[15253.813479]  [<ffffffffa03099b0>] ? dmu_objset_write_done+0x0/0x70 [zfs]
[15253.813509]  [<ffffffffa031ce29>] dsl_dataset_sync+0x59/0xb0 [zfs]
[15253.813536]  [<ffffffffa032c96a>] dsl_pool_sync+0x9a/0x440 [zfs]
[15253.813568]  [<ffffffffa0343bae>] spa_sync+0x35e/0xb10 [zfs]
[15253.813573]  [<ffffffff8105b5b9>] ? __wake_up_common+0x59/0x90
[15253.813580]  [<ffffffff8105ff73>] ? __wake_up+0x53/0x70
[15253.813583]  [<ffffffff81015019>] ? read_tsc+0x9/0x20
[15253.813616]  [<ffffffffa035a268>] txg_sync_thread+0x3d8/0x670 [zfs]
[15253.813651]  [<ffffffffa0359e90>] ? txg_sync_thread+0x0/0x670 [zfs]
[15253.813682]  [<ffffffffa0359e90>] ? txg_sync_thread+0x0/0x670 [zfs]
[15253.813690]  [<ffffffffa0297378>] thread_generic_wrapper+0x68/0x80 [spl]
[15253.813696]  [<ffffffffa0297310>] ? thread_generic_wrapper+0x0/0x80 [spl]
[15253.813700]  [<ffffffff810a9a1e>] kthread+0x9e/0xc0
[15253.813706]  [<ffffffff8100c34a>] child_rip+0xa/0x20
[15253.813712]  [<ffffffff810a9980>] ? kthread+0x0/0xc0
[15253.813715]  [<ffffffff8100c340>] ? child_rip+0x0/0x20
[15253.813726] INFO: task mysqld:791509 blocked for more than 120 seconds.
[15253.813746]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[15253.813769] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[15253.813799] mysqld        D ffff8802bbaec280     0 791509   6107    0 0x00000000
[15253.813803]  ffff8803c23a7d28 0000000000000086 0000000000000000 ffff8802bbaec280
[15253.813806]  ffff8802bbaec280 0000000000500000 00000daf55ec9180 ffff881059a25cb8
[15253.813809]  000000000033b200 0000000000000000 0000000100e0d3ba ffff8802bbaec848
[15253.813813] Call Trace:
[15253.813821]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[15253.813827]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[15253.813859]  [<ffffffffa039e5cc>] zil_commit+0x3c/0x830 [zfs]
[15253.813863]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[15253.813893]  [<ffffffffa033e232>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[15253.813937]  [<ffffffffa03925b2>] zfs_fsync+0x92/0x120 [zfs]
[15253.813978]  [<ffffffffa03aa950>] zpl_fsync+0x60/0xa0 [zfs]
[15253.813995]  [<ffffffff811ecb3d>] vfs_fsync_range+0x10d/0x260
[15253.814008]  [<ffffffff811eccfd>] vfs_fsync+0x1d/0x20
[15253.814020]  [<ffffffff811ecd83>] do_fsync+0x83/0xe0
[15253.814031]  [<ffffffff811ecdf3>] sys_fdatasync+0x13/0x20
[15253.814043]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[15253.814400] INFO: task zfs:785245 blocked for more than 120 seconds.
[15253.814419]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[15253.814441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[15253.814460] zfs           D ffff880d7e7944c0     0 785245 785242    0 0x00000000
[15253.814464]  ffff880a137ff2a8 0000000000000086 0000000000000000 ffff88107bb98720
[15253.814468]  0000000000000001 ffff88107bb98738 00000dae6d8c49c7 0000000000000000
[15253.814471]  ffff880a137ff228 0000000000000000 0000000100e0c521 ffff880d7e794a88
[15253.814474] Call Trace:
[15253.814486]  [<ffffffffa0297935>] taskq_wait_id+0x65/0xa0 [spl]
[15253.814490]  [<ffffffff810a9e10>] ? autoremove_wake_function+0x0/0x40
[15253.814514]  [<ffffffffa03100a0>] ? dump_bytes_strategy+0x0/0x100 [zfs]
[15253.814538]  [<ffffffffa03100a0>] ? dump_bytes_strategy+0x0/0x100 [zfs]
[15253.814569]  [<ffffffffa03498ad>] spa_taskq_dispatch_sync+0x8d/0xc0 [zfs]
[15253.814594]  [<ffffffffa03101e2>] dump_bytes+0x42/0x50 [zfs]
[15253.814618]  [<ffffffffa031061a>] dump_write+0x20a/0x240 [zfs]
[15253.814643]  [<ffffffffa031191c>] backup_cb+0x71c/0x760 [zfs]
[15253.814648]  [<ffffffff8105ff73>] ? __wake_up+0x53/0x70
[15253.814673]  [<ffffffffa0312276>] traverse_visitbp+0x476/0x7b0 [zfs]
[15253.814680]  [<ffffffffa0294d46>] ? spl_kmem_alloc+0x96/0x1a0 [spl]
[15253.814704]  [<ffffffffa03121eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[15253.814731]  [<ffffffffa0312c91>] traverse_dnode+0x71/0xd0 [zfs]
[15253.814755]  [<ffffffffa03124f7>] traverse_visitbp+0x6f7/0x7b0 [zfs]
[15253.814782]  [<ffffffffa03121eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[15253.814808]  [<ffffffffa03121eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[15253.814834]  [<ffffffffa03121eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[15253.814860]  [<ffffffffa03121eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[15253.814886]  [<ffffffffa03121eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[15253.814923]  [<ffffffffa03121eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[15253.814957]  [<ffffffffa0312c91>] traverse_dnode+0x71/0xd0 [zfs]
[15253.814992]  [<ffffffffa0312351>] traverse_visitbp+0x551/0x7b0 [zfs]
[15253.815026]  [<ffffffffa0312cf0>] ? traverse_prefetch_thread+0x0/0x170 [zfs]
[15253.815060]  [<ffffffffa0312734>] traverse_impl+0x184/0x400 [zfs]
[15253.815090]  [<ffffffffa03100a0>] ? dump_bytes_strategy+0x0/0x100 [zfs]
[15253.815126]  [<ffffffffa0312a46>] traverse_dataset+0x56/0x60 [zfs]
[15253.815162]  [<ffffffffa0311200>] ? backup_cb+0x0/0x760 [zfs]
[15253.815194]  [<ffffffffa03109bb>] dmu_send_impl+0x36b/0x4f0 [zfs]
[15253.815228]  [<ffffffffa0310fd7>] dmu_send_obj+0x157/0x210 [zfs]
[15253.815271]  [<ffffffffa0380257>] zfs_ioc_send+0xa7/0x280 [zfs]
[15253.815312]  [<ffffffffa0383c75>] zfsdev_ioctl+0x495/0x4d0 [zfs]
[15253.815326]  [<ffffffff811cc5d2>] vfs_ioctl+0x22/0xa0
[15253.815335]  [<ffffffff811cc774>] do_vfs_ioctl+0x84/0x5b0
[15253.815348]  [<ffffffff811cccef>] sys_ioctl+0x4f/0x80
[15253.815359]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b


dweeezil commented Oct 1, 2015

@tjikkun Is this with (the default) spl_taskq_thread_dynamic=1? I had really expected ef5b2e1 (f4943dc in zfs-0.6.5.2) to fix this. Again, assuming you've got dynamic taskqs enabled, it might be interesting to get an inventory of the zio and spl threads with something like ps ax | egrep 'z_|spl'. This sure feels like a case where a new thread isn't being created on a dynamic taskq.
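
As a quick sanity check, the live value of the parameter can be read from sysfs (the standard path for module parameters):

cat /sys/module/spl/parameters/spl_taskq_thread_dynamic   # 1 = dynamic taskqs enabled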


tjikkun commented Oct 1, 2015

Yes, this is with the default setting for spl_taskq_thread_dynamic; I have rebooted this server with spl_taskq_thread_dynamic=0 after the crash.

Output of ps aufx | egrep 'z_|spl' from the time of the crash:

root        2074  0.1  0.0      0     0 ?        S<   09:58   0:26  \_ [spl_kmem_cache]
root        2075  0.0  0.0      0     0 ?        S<   09:58   0:05  \_ [spl_system_task]
root        2076  0.0  0.0      0     0 ?        S<   09:58   0:03  \_ [spl_dynamic_tas]
root        2082  0.0  0.0      0     0 ?        S    09:58   0:00  \_ [z_unmount]
root        2963  0.0  0.0      0     0 ?        S<   09:58   0:02  \_ [z_null_iss]
root        2964  0.0  0.0      0     0 ?        S<   09:58   0:00  \_ [z_null_int]
root        2965  0.0  0.0      0     0 ?        S<   09:58   0:00  \_ [z_rd_iss]
root        2966  0.0  0.0      0     0 ?        S<   09:58   0:13  \_ [z_rd_int_0]
root        2967  0.0  0.0      0     0 ?        S<   09:58   0:13  \_ [z_rd_int_1]
root        2968  0.0  0.0      0     0 ?        S<   09:58   0:13  \_ [z_rd_int_2]
root        2969  0.0  0.0      0     0 ?        S<   09:58   0:13  \_ [z_rd_int_3]
root        2970  0.0  0.0      0     0 ?        S<   09:58   0:13  \_ [z_rd_int_4]
root        2971  0.0  0.0      0     0 ?        S<   09:58   0:13  \_ [z_rd_int_5]
root        2972  0.0  0.0      0     0 ?        S<   09:58   0:13  \_ [z_rd_int_6]
root        2973  0.0  0.0      0     0 ?        S<   09:58   0:13  \_ [z_rd_int_7]
root        2974  0.2  0.0      0     0 ?        S<   09:58   0:38  \_ [z_wr_iss]
root        2975  0.0  0.0      0     0 ?        S<   09:58   0:00  \_ [z_wr_iss_h]
root        2976  0.1  0.0      0     0 ?        S<   09:58   0:21  \_ [z_wr_int_0]
root        2977  0.1  0.0      0     0 ?        S<   09:58   0:21  \_ [z_wr_int_1]
root        2978  0.1  0.0      0     0 ?        S<   09:58   0:21  \_ [z_wr_int_2]
root        2979  0.1  0.0      0     0 ?        S<   09:58   0:21  \_ [z_wr_int_3]
root        2980  0.1  0.0      0     0 ?        S<   09:58   0:21  \_ [z_wr_int_4]
root        2981  0.1  0.0      0     0 ?        S<   09:58   0:21  \_ [z_wr_int_5]
root        2982  0.1  0.0      0     0 ?        S<   09:58   0:21  \_ [z_wr_int_6]
root        2983  0.1  0.0      0     0 ?        S<   09:58   0:21  \_ [z_wr_int_7]
root        2984  0.0  0.0      0     0 ?        S<   09:58   0:00  \_ [z_wr_int_h]
root        2985  0.0  0.0      0     0 ?        S<   09:58   0:10  \_ [z_fr_iss_0]
root        2986  0.0  0.0      0     0 ?        S<   09:58   0:10  \_ [z_fr_iss_1]
root        2987  0.0  0.0      0     0 ?        S<   09:58   0:10  \_ [z_fr_iss_2]
root        2988  0.0  0.0      0     0 ?        S<   09:58   0:10  \_ [z_fr_iss_3]
root        2989  0.0  0.0      0     0 ?        S<   09:58   0:10  \_ [z_fr_iss_4]
root        2990  0.0  0.0      0     0 ?        S<   09:58   0:10  \_ [z_fr_iss_5]
root        2991  0.0  0.0      0     0 ?        S<   09:58   0:10  \_ [z_fr_iss_6]
root        2992  0.0  0.0      0     0 ?        S<   09:58   0:10  \_ [z_fr_iss_7]
root        2993  0.0  0.0      0     0 ?        S<   09:58   0:00  \_ [z_fr_int]
root        2994  0.0  0.0      0     0 ?        S<   09:58   0:00  \_ [z_cl_iss]
root        2995  0.0  0.0      0     0 ?        S<   09:58   0:00  \_ [z_cl_int]
root        2996  0.0  0.0      0     0 ?        S<   09:58   0:00  \_ [z_ioctl_iss]
root        2997  0.0  0.0      0     0 ?        S<   09:58   0:06  \_ [z_ioctl_int]
root        3017  0.0  0.0      0     0 ?        S    09:58   0:00  \_ [z_iput]

As additional info, at that time a zfs send was running, and it looks like the receiving end was rebooted at about the time of the hang.


dweeezil commented Oct 1, 2015

@tjikkun Nothing terribly interesting-looking there. What's the memory situation when this is happening (arcstats, /proc/vmstat, etc.)? I'll try to reproduce this locally on a memory-constrained system. In the meantime, does disabling dynamic taskqs avoid the problem?
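
For the record, both can be captured in one go the next time it wedges (paths as in a stock ZFS-on-Linux install):

cat /proc/spl/kstat/zfs/arcstats   # ARC statistics (arcstats)
cat /proc/vmstat                   # kernel VM counters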


tjikkun commented Oct 1, 2015

@dweeezil Next time it happens I'll try to get the contents of /proc/vmstat and the output of arcstat.py.
To recover I have to reboot, so I set the dynamic taskq parameter to 0 in modprobe.conf. Thus far I have had to reboot 3 servers; they haven't crashed again yet, but it is too early to tell if it really helps.


tjikkun commented Oct 2, 2015

OK, I have now had a crash with spl_taskq_thread_dynamic=0.
The trace looks a bit different, so I'm not sure if it is the same bug, but the impact is the same: a reboot is needed to restore functionality.

[39755.124593] INFO: task spl_system_task:3209 blocked for more than 120 seconds.
[39755.124630]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.124656] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.124680] spl_system_ta D ffff88087bb0ae60     0  3209      2    0 0x00000000
[39755.124686]  ffff880876fd5630 0000000000000046 0000000000000000 ffffffff8100971d
[39755.124692]  0000000000000000 0000000000000000 000023fc19b038c0 ffff88087c56cc10
[39755.124698]  00ff880876fd55f0 ffff88087c56cc00 000000010256a812 ffff88087bb0b428
[39755.124704] Call Trace:
[39755.124718]  [<ffffffff8100971d>] ? __switch_to+0x7d/0x320
[39755.124737]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.124742]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.124752]  [<ffffffffa0174fe7>] cv_wait_common+0xb7/0x130 [spl]
[39755.124759]  [<ffffffff810a9e10>] ? autoremove_wake_function+0x0/0x40
[39755.124784]  [<ffffffffa02f0a8f>] ? buf_hash_find+0x9f/0x180 [zfs]
[39755.124791]  [<ffffffffa01750b5>] __cv_wait+0x15/0x20 [spl]
[39755.124804]  [<ffffffffa02f84d5>] arc_read+0xb5/0xa70 [zfs]
[39755.124817]  [<ffffffffa02f5340>] ? arc_getbuf_func+0x0/0x80 [zfs]
[39755.124836]  [<ffffffffa031445f>] traverse_visitbp+0x65f/0x7b0 [zfs]
[39755.124855]  [<ffffffffa03141eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[39755.124873]  [<ffffffffa03141eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[39755.124892]  [<ffffffffa03141eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[39755.124911]  [<ffffffffa03141eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[39755.124930]  [<ffffffffa03141eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[39755.124948]  [<ffffffffa03141eb>] traverse_visitbp+0x3eb/0x7b0 [zfs]
[39755.124967]  [<ffffffffa0314c91>] traverse_dnode+0x71/0xd0 [zfs]
[39755.124984]  [<ffffffffa0314351>] traverse_visitbp+0x551/0x7b0 [zfs]
[39755.125003]  [<ffffffffa0314ddd>] traverse_prefetch_thread+0xed/0x170 [zfs]
[39755.125020]  [<ffffffffa0313ba0>] ? traverse_prefetcher+0x0/0x1b0 [zfs]
[39755.125027]  [<ffffffffa017103f>] taskq_thread+0x24f/0x520 [spl]
[39755.125033]  [<ffffffff81066c40>] ? default_wake_function+0x0/0x20
[39755.125038]  [<ffffffffa0170df0>] ? taskq_thread+0x0/0x520 [spl]
[39755.125042]  [<ffffffff810a9a1e>] kthread+0x9e/0xc0
[39755.125046]  [<ffffffff8100c34a>] child_rip+0xa/0x20
[39755.125051]  [<ffffffff810a9980>] ? kthread+0x0/0xc0
[39755.125054]  [<ffffffff8100c340>] ? child_rip+0x0/0x20
[39755.125133] INFO: task txg_sync:7205 blocked for more than 120 seconds.
[39755.125154]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.125179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.125201] txg_sync      D ffff88087788d060     0  7205      2    0 0x00000000
[39755.125204]  ffff880877b17470 0000000000000046 0000000000000000 ffff88087a2ec398
[39755.125207]  ffff880877b17410 ffffffffa03a5d8b 000023fef46516ad ffff88087788d060
[39755.125210]  ffff880374426958 0000000000000000 000000010256d8f9 ffff88087788d628
[39755.125213] Call Trace:
[39755.125244]  [<ffffffffa03a5d8b>] ? zio_add_child+0x14b/0x170 [zfs]
[39755.125251]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.125255]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.125269]  [<ffffffffa02f3324>] arc_release+0xc4/0x4a0 [zfs]
[39755.125280]  [<ffffffffa02f0320>] ? arc_write_physdone+0x0/0x30 [zfs]
[39755.125291]  [<ffffffffa02f23a0>] ? arc_write_done+0x0/0x400 [zfs]
[39755.125304]  [<ffffffffa02fc0e8>] dbuf_release_bp+0x18/0x20 [zfs]
[39755.125316]  [<ffffffffa02fc174>] dbuf_write+0x84/0x3c0 [zfs]
[39755.125329]  [<ffffffffa02fbe70>] ? dbuf_write_ready+0x0/0x220 [zfs]
[39755.125340]  [<ffffffffa02fbe10>] ? dbuf_write_physdone+0x0/0x60 [zfs]
[39755.125352]  [<ffffffffa02fd150>] ? dbuf_write_done+0x0/0x230 [zfs]
[39755.125368]  [<ffffffffa02ff67e>] dbuf_sync_leaf+0x13e/0x380 [zfs]
[39755.125382]  [<ffffffffa03009fb>] dbuf_sync_list+0xcb/0xd0 [zfs]
[39755.125394]  [<ffffffffa02fd150>] ? dbuf_write_done+0x0/0x230 [zfs]
[39755.125408]  [<ffffffffa0300894>] dbuf_sync_indirect+0xf4/0x190 [zfs]
[39755.125421]  [<ffffffffa03009d6>] dbuf_sync_list+0xa6/0xd0 [zfs]
[39755.125433]  [<ffffffffa02fd150>] ? dbuf_write_done+0x0/0x230 [zfs]
[39755.125446]  [<ffffffffa0300894>] dbuf_sync_indirect+0xf4/0x190 [zfs]
[39755.125460]  [<ffffffffa03009d6>] dbuf_sync_list+0xa6/0xd0 [zfs]
[39755.125472]  [<ffffffffa02fd150>] ? dbuf_write_done+0x0/0x230 [zfs]
[39755.125485]  [<ffffffffa0300894>] dbuf_sync_indirect+0xf4/0x190 [zfs]
[39755.125499]  [<ffffffffa03009d6>] dbuf_sync_list+0xa6/0xd0 [zfs]
[39755.125510]  [<ffffffffa02fd150>] ? dbuf_write_done+0x0/0x230 [zfs]
[39755.125532]  [<ffffffffa0300894>] dbuf_sync_indirect+0xf4/0x190 [zfs]
[39755.125552]  [<ffffffffa03009d6>] dbuf_sync_list+0xa6/0xd0 [zfs]
[39755.125571]  [<ffffffffa02fd150>] ? dbuf_write_done+0x0/0x230 [zfs]
[39755.125590]  [<ffffffffa0300894>] dbuf_sync_indirect+0xf4/0x190 [zfs]
[39755.125609]  [<ffffffffa03009d6>] dbuf_sync_list+0xa6/0xd0 [zfs]
[39755.125627]  [<ffffffffa0300894>] dbuf_sync_indirect+0xf4/0x190 [zfs]
[39755.125645]  [<ffffffffa03009d6>] dbuf_sync_list+0xa6/0xd0 [zfs]
[39755.125668]  [<ffffffffa031d05f>] dnode_sync+0x2ef/0x8c0 [zfs]
[39755.125685]  [<ffffffffa02f5b8d>] ? arc_write+0xdd/0x100 [zfs]
[39755.125701]  [<ffffffffa02f1b00>] ? arc_write_ready+0x0/0xd0 [zfs]
[39755.125717]  [<ffffffffa02f0320>] ? arc_write_physdone+0x0/0x30 [zfs]
[39755.125734]  [<ffffffffa02f23a0>] ? arc_write_done+0x0/0x400 [zfs]
[39755.125756]  [<ffffffffa030b7e7>] dmu_objset_sync+0x117/0x2e0 [zfs]
[39755.125777]  [<ffffffffa03099c0>] ? dmu_objset_write_ready+0x0/0x70 [zfs]
[39755.125802]  [<ffffffffa030b9b0>] ? dmu_objset_write_done+0x0/0x70 [zfs]
[39755.125833]  [<ffffffffa031ee29>] dsl_dataset_sync+0x59/0xb0 [zfs]
[39755.125861]  [<ffffffffa032e96a>] dsl_pool_sync+0x9a/0x440 [zfs]
[39755.125894]  [<ffffffffa0345bae>] spa_sync+0x35e/0xb10 [zfs]
[39755.125904]  [<ffffffff8105b5b9>] ? __wake_up_common+0x59/0x90
[39755.125913]  [<ffffffff8105ff73>] ? __wake_up+0x53/0x70
[39755.125923]  [<ffffffff81015019>] ? read_tsc+0x9/0x20
[39755.125956]  [<ffffffffa035c268>] txg_sync_thread+0x3d8/0x670 [zfs]
[39755.125992]  [<ffffffffa035be90>] ? txg_sync_thread+0x0/0x670 [zfs]
[39755.126025]  [<ffffffffa035be90>] ? txg_sync_thread+0x0/0x670 [zfs]
[39755.126037]  [<ffffffffa0170378>] thread_generic_wrapper+0x68/0x80 [spl]
[39755.126048]  [<ffffffffa0170310>] ? thread_generic_wrapper+0x0/0x80 [spl]
[39755.126058]  [<ffffffff810a9a1e>] kthread+0x9e/0xc0
[39755.126068]  [<ffffffff8100c34a>] child_rip+0xa/0x20
[39755.126077]  [<ffffffff810a9980>] ? kthread+0x0/0xc0
[39755.126085]  [<ffffffff8100c340>] ? child_rip+0x0/0x20
[39755.126101] INFO: task mysqld:443260 blocked for more than 120 seconds.
[39755.126126]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.126155] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.126178] mysqld        D ffff8803e749a700     0 443260  10197    0 0x00000000
[39755.126182]  ffff880343635ab8 0000000000000086 0000000000000000 ffffffffa016e306
[39755.126185]  ffff880343635ab8 ffffffffa02fcfb8 0000240bb6b66dfa ffff88087c56cc10
[39755.126188]  ffff8804bfab9068 ffff88087c56cc00 000000010257adc7 ffff8803e749acc8
[39755.126191] Call Trace:
[39755.126197]  [<ffffffffa016e306>] ? spl_kmem_cache_alloc+0x1b6/0x920 [spl]
[39755.126213]  [<ffffffffa02fcfb8>] ? dbuf_rele_and_unlock+0x268/0x400 [zfs]
[39755.126220]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.126225]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.126241]  [<ffffffffa02f147f>] arc_buf_add_ref+0xbf/0x200 [zfs]
[39755.126259]  [<ffffffffa02fedc6>] __dbuf_hold_impl+0x76/0x4f0 [zfs]
[39755.126264]  [<ffffffff8119fc85>] ? __kmalloc_node+0x65/0x70
[39755.126282]  [<ffffffffa02ff2bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
[39755.126312]  [<ffffffffa0300790>] dbuf_hold+0x20/0x30 [zfs]
[39755.126327]  [<ffffffffa0307b05>] dmu_buf_hold_array_by_dnode+0x105/0x490 [zfs]
[39755.126350]  [<ffffffffa0308953>] dmu_read_uio_dnode+0x43/0xd0 [zfs]
[39755.126361]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[39755.126380]  [<ffffffffa0308aa1>] dmu_read_uio_dbuf+0x41/0x60 [zfs]
[39755.126411]  [<ffffffffa03979f0>] zfs_read+0x140/0x420 [zfs]
[39755.126421]  [<ffffffff81175a1a>] ? handle_mm_fault+0x22a/0x300
[39755.126432]  [<ffffffff81099d06>] ? virtinfo_notifier_call+0x76/0xb0
[39755.126463]  [<ffffffffa03ad60d>] zpl_read_common_iovec.clone.1+0x7d/0x140 [zfs]
[39755.126495]  [<ffffffffa03ad83a>] zpl_read+0x7a/0xa0 [zfs]
[39755.126506]  [<ffffffff811b6405>] vfs_read+0xb5/0x1a0
[39755.126514]  [<ffffffff811b6732>] sys_pread64+0x82/0xa0
[39755.126525]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[39755.126700] INFO: task dataskq:69329 blocked for more than 120 seconds.
[39755.126721]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.126746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.127056] dataskq       D ffff880ff2670ce0     0 69329  69322    0 0x00000000
[39755.127060]  ffff880a933e5a08 0000000000000086 0000000000000000 ffff880ae5f684e8
[39755.127063]  ffffffffa03bcbc0 ffff88087505a000 00002408f2bebf68 ffff88107c65d010
[39755.127066]  ffff880a933e59e8 ffff88107c65d000 0000000102577f72 ffff880ff26712a8
[39755.127069] Call Trace:
[39755.127077]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.127082]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.127094]  [<ffffffffa02f0a8f>] buf_hash_find+0x9f/0x180 [zfs]
[39755.127107]  [<ffffffffa02f8520>] arc_read+0x100/0xa70 [zfs]
[39755.127123]  [<ffffffffa0300bf4>] dbuf_prefetch+0x1f4/0x2e0 [zfs]
[39755.127127]  [<ffffffffa002e190>] ? avl_find+0x60/0xb0 [zavl]
[39755.127143]  [<ffffffffa0307904>] dmu_prefetch+0x104/0x200 [zfs]
[39755.127169]  [<ffffffffa0394a03>] zfs_readdir+0x3c3/0x510 [zfs]
[39755.127176]  [<ffffffff811b2f64>] ? nameidata_to_filp+0x54/0x70
[39755.127181]  [<ffffffff811c9e40>] ? do_filp_open+0x630/0xd00
[39755.127203]  [<ffffffffa034000c>] ? rrm_exit+0x4c/0xa0 [zfs]
[39755.127228]  [<ffffffffa0393901>] ? zfs_getattr_fast+0x131/0x1e0 [zfs]
[39755.127253]  [<ffffffffa03ac8b2>] zpl_readdir+0x72/0xb0 [zfs]
[39755.127256]  [<ffffffff811cd810>] ? filldir+0x0/0xe0
[39755.127259]  [<ffffffff811cda78>] vfs_readdir+0xa8/0xd0
[39755.127262]  [<ffffffff811cd810>] ? filldir+0x0/0xe0
[39755.127265]  [<ffffffff811cdc09>] sys_getdents+0x89/0xf0
[39755.127271]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[39755.127333] INFO: task verifyDirectAdm:442077 blocked for more than 120 seconds.
[39755.127925]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.128521] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.129108] verifyDirectA D ffff880c5b67e5c0     0 442077 442076    0 0x00000000
[39755.129111]  ffff88091066ba08 0000000000000082 ffff88091066b9d0 ffff880301000001
[39755.129114]  0000000000000001 00000000000feffa 0000240317232455 ffff880049e4f7b0
[39755.129117]  ffff880800000000 ffff880049e4f7a0 0000000102571cec ffff880c5b67eb88
[39755.129120] Call Trace:
[39755.129128]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.129132]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.129145]  [<ffffffffa02f0a8f>] buf_hash_find+0x9f/0x180 [zfs]
[39755.129157]  [<ffffffffa02f8520>] arc_read+0x100/0xa70 [zfs]
[39755.129170]  [<ffffffffa02fc901>] ? dbuf_create+0x371/0x510 [zfs]
[39755.129183]  [<ffffffffa02fd680>] ? dbuf_read_done+0x0/0x130 [zfs]
[39755.129197]  [<ffffffffa02fe738>] dbuf_read+0x238/0x850 [zfs]
[39755.129214]  [<ffffffffa0307b27>] dmu_buf_hold_array_by_dnode+0x127/0x490 [zfs]
[39755.129231]  [<ffffffffa0308953>] dmu_read_uio_dnode+0x43/0xd0 [zfs]
[39755.129234]  [<ffffffff81538a9e>] ? mutex_lock+0x1e/0x50
[39755.129249]  [<ffffffffa0308aa1>] dmu_read_uio_dbuf+0x41/0x60 [zfs]
[39755.129275]  [<ffffffffa03979f0>] zfs_read+0x140/0x420 [zfs]
[39755.129302]  [<ffffffffa03ad60d>] zpl_read_common_iovec.clone.1+0x7d/0x140 [zfs]
[39755.129327]  [<ffffffffa03ad83a>] zpl_read+0x7a/0xa0 [zfs]
[39755.129331]  [<ffffffff811b6405>] vfs_read+0xb5/0x1a0
[39755.129334]  [<ffffffff811b6541>] sys_read+0x51/0x90
[39755.129337]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[39755.129353] INFO: task lsphp:444528 blocked for more than 120 seconds.
[39755.129666]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.130262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.130854] lsphp         D ffff88087b4b8200     0 444528 443879    0 0x00000000
[39755.130858]  ffff8802902855e8 0000000000000082 0000000000000000 ffff8802d894faa0
[39755.130861]  ffffffffa03bcbc0 ffff88087505a000 ffff880290285578 ffffffffa03a3d32
[39755.130864]  ffff8802902855c8 ffff8802d894faa0 0000000000422040 ffff88087b4b87c8
[39755.130867] Call Trace:
[39755.130895]  [<ffffffffa03a3d32>] ? zio_vdev_io_done+0x42/0x190 [zfs]
[39755.130908]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.130919]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.130935]  [<ffffffffa02f0a8f>] buf_hash_find+0x9f/0x180 [zfs]
[39755.130953]  [<ffffffffa02f8520>] arc_read+0x100/0xa70 [zfs]
[39755.130975]  [<ffffffffa0300bf4>] dbuf_prefetch+0x1f4/0x2e0 [zfs]
[39755.131000]  [<ffffffffa0317aae>] dmu_zfetch_dofetch+0xfe/0x170 [zfs]
[39755.131022]  [<ffffffffa0318605>] dmu_zfetch+0x815/0xd90 [zfs]
[39755.131044]  [<ffffffffa02feca9>] dbuf_read+0x7a9/0x850 [zfs]
[39755.131068]  [<ffffffffa031ae60>] dnode_hold_impl+0xe0/0x500 [zfs]
[39755.131087]  [<ffffffffa031b299>] dnode_hold+0x19/0x20 [zfs]
[39755.131102]  [<ffffffffa030707a>] dmu_buf_hold_noread+0x3a/0x140 [zfs]
[39755.131123]  [<ffffffffa03071bb>] dmu_buf_hold+0x3b/0x90 [zfs]
[39755.131153]  [<ffffffffa0370e4a>] zap_lockdir+0x5a/0x770 [zfs]
[39755.131173]  [<ffffffffa0300a98>] ? dbuf_prefetch+0x98/0x2e0 [zfs]
[39755.131203]  [<ffffffffa037218c>] zap_cursor_retrieve+0x19c/0x2d0 [zfs]
[39755.131224]  [<ffffffffa0307917>] ? dmu_prefetch+0x117/0x200 [zfs]
[39755.131256]  [<ffffffffa03947be>] zfs_readdir+0x17e/0x510 [zfs]
[39755.131268]  [<ffffffff811b2f64>] ? nameidata_to_filp+0x54/0x70
[39755.131278]  [<ffffffff811c9e40>] ? do_filp_open+0x630/0xd00
[39755.131289]  [<ffffffff8129cd41>] ? cpumask_any_but+0x31/0x50
[39755.131320]  [<ffffffffa03ac8b2>] zpl_readdir+0x72/0xb0 [zfs]
[39755.131329]  [<ffffffff811cd810>] ? filldir+0x0/0xe0
[39755.131337]  [<ffffffff811cda78>] vfs_readdir+0xa8/0xd0
[39755.131345]  [<ffffffff811cd810>] ? filldir+0x0/0xe0
[39755.131353]  [<ffffffff811cdc09>] sys_getdents+0x89/0xf0
[39755.131362]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[39755.131372] INFO: task verifyDirectAdm:445187 blocked for more than 120 seconds.
[39755.131966]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.132561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.133151] verifyDirectA D ffff880385178540     0 445187 445186    0 0x00000000
[39755.133155]  ffff8803f6def618 0000000000000086 0000000000000000 0000000000000286
[39755.133157]  ffff8803f6def588 ffffffffa0317bc9 00002404b4c17260 ffff88107c65d010
[39755.133160]  ffff8803f6def678 ffff88107c65d000 000000010257384d ffff880385178b08
[39755.133163] Call Trace:
[39755.133182]  [<ffffffffa0317bc9>] ? spl_rw_clear_owner+0x39/0x50 [zfs]
[39755.133188]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.133193]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.133204]  [<ffffffffa02f147f>] arc_buf_add_ref+0xbf/0x200 [zfs]
[39755.133217]  [<ffffffffa02fedc6>] __dbuf_hold_impl+0x76/0x4f0 [zfs]
[39755.133220]  [<ffffffff8119fc85>] ? __kmalloc_node+0x65/0x70
[39755.133234]  [<ffffffffa02ff2bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
[39755.133248]  [<ffffffffa0300790>] dbuf_hold+0x20/0x30 [zfs]
[39755.133262]  [<ffffffffa03070c7>] dmu_buf_hold_noread+0x87/0x140 [zfs]
[39755.133280]  [<ffffffffa03071bb>] dmu_buf_hold+0x3b/0x90 [zfs]
[39755.133305]  [<ffffffffa0370e4a>] zap_lockdir+0x5a/0x770 [zfs]
[39755.133332]  [<ffffffffa03729ba>] zap_lookup_norm+0x4a/0x190 [zfs]
[39755.133358]  [<ffffffffa0372b93>] zap_lookup+0x33/0x40 [zfs]
[39755.133383]  [<ffffffffa037ca87>] zfs_dirent_lock+0x457/0x580 [zfs]
[39755.133409]  [<ffffffffa037cdbb>] zfs_dirlook+0x8b/0x270 [zfs]
[39755.133433]  [<ffffffffa0375fb0>] ? zfs_zaccess+0xa0/0x2d0 [zfs]
[39755.133456]  [<ffffffffa0340232>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[39755.133481]  [<ffffffffa039609f>] zfs_lookup+0x2ef/0x340 [zfs]
[39755.133506]  [<ffffffffa03adc1b>] zpl_lookup+0x9b/0x120 [zfs]
[39755.133509]  [<ffffffff811d1a94>] ? __d_alloc+0x184/0x1e0
[39755.133513]  [<ffffffff811c69ad>] do_lookup+0x1dd/0x270
[39755.133517]  [<ffffffff811c75bc>] __link_path_walk+0x9ec/0x11c0
[39755.133530]  [<ffffffff81174ea4>] ? handle_pte_fault+0x664/0xfb0
[39755.133540]  [<ffffffff811c800a>] path_walk+0x6a/0xe0
[39755.133549]  [<ffffffff811c821b>] filename_lookup+0x6b/0xc0
[39755.133559]  [<ffffffff811c9469>] user_path_at+0x59/0xa0
[39755.133571]  [<ffffffff811dae30>] ? mntput_no_expire+0x30/0x110
[39755.133581]  [<ffffffff811bbe24>] ? cp_new_stat+0xe4/0x100
[39755.133590]  [<ffffffff811bbcb0>] vfs_fstatat+0x50/0xa0
[39755.133598]  [<ffffffff811bbd1e>] vfs_lstat+0x1e/0x20
[39755.133607]  [<ffffffff811bbeb4>] sys_newlstat+0x24/0x50
[39755.133617]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[39755.133627] INFO: task lsphp:445664 blocked for more than 120 seconds.
[39755.133935]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.134530] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.135122] lsphp         D ffff880e46c08840     0 445664 445662    0 0x00000000
[39755.135125]  ffff880dfa31f5b8 0000000000000082 0000000000000000 0000000000000000
[39755.135128]  0000000000000003 ffff880e46c08840 0000240bc24ff7a0 ffffffffa01751d7
[39755.135131]  ffff880dfa31f568 0000000000000000 000000010257ae92 ffff880e46c08e08
[39755.135134] Call Trace:
[39755.135141]  [<ffffffffa01751d7>] ? __cv_init+0x37/0x60 [spl]
[39755.135147]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.135152]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.135163]  [<ffffffffa02f0a8f>] buf_hash_find+0x9f/0x180 [zfs]
[39755.135176]  [<ffffffffa02f8520>] arc_read+0x100/0xa70 [zfs]
[39755.135190]  [<ffffffffa02fd680>] ? dbuf_read_done+0x0/0x130 [zfs]
[39755.135204]  [<ffffffffa02fe738>] dbuf_read+0x238/0x850 [zfs]
[39755.135221]  [<ffffffffa03071e8>] dmu_buf_hold+0x68/0x90 [zfs]
[39755.135246]  [<ffffffffa0370e4a>] zap_lockdir+0x5a/0x770 [zfs]
[39755.135252]  [<ffffffff81158c60>] ? __lru_cache_add+0x40/0x90
[39755.135278]  [<ffffffffa03729ba>] zap_lookup_norm+0x4a/0x190 [zfs]
[39755.135304]  [<ffffffffa0372b93>] zap_lookup+0x33/0x40 [zfs]
[39755.135329]  [<ffffffffa037ca87>] zfs_dirent_lock+0x457/0x580 [zfs]
[39755.135355]  [<ffffffffa037cdbb>] zfs_dirlook+0x8b/0x270 [zfs]
[39755.135379]  [<ffffffffa0375fb0>] ? zfs_zaccess+0xa0/0x2d0 [zfs]
[39755.135402]  [<ffffffffa0340232>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[39755.135427]  [<ffffffffa039609f>] zfs_lookup+0x2ef/0x340 [zfs]
[39755.135452]  [<ffffffffa03adc1b>] zpl_lookup+0x9b/0x120 [zfs]
[39755.135455]  [<ffffffff811d1a94>] ? __d_alloc+0x184/0x1e0
[39755.135459]  [<ffffffff811c69ad>] do_lookup+0x1dd/0x270
[39755.135463]  [<ffffffff811c75bc>] __link_path_walk+0x9ec/0x11c0
[39755.135468]  [<ffffffff81174ea4>] ? handle_pte_fault+0x664/0xfb0
[39755.135473]  [<ffffffff811c800a>] path_walk+0x6a/0xe0
[39755.135476]  [<ffffffff811c821b>] filename_lookup+0x6b/0xc0
[39755.135480]  [<ffffffff811c9469>] user_path_at+0x59/0xa0
[39755.135503]  [<ffffffffa033fea5>] ? rrw_exit+0x65/0x180 [zfs]
[39755.135526]  [<ffffffffa034000c>] ? rrm_exit+0x4c/0xa0 [zfs]
[39755.135535]  [<ffffffff8129ced5>] ? _atomic_dec_and_lock+0x55/0x80
[39755.135545]  [<ffffffff811bbcb0>] vfs_fstatat+0x50/0xa0
[39755.135554]  [<ffffffff811dae30>] ? mntput_no_expire+0x30/0x110
[39755.135562]  [<ffffffff811bbd3b>] vfs_stat+0x1b/0x20
[39755.135570]  [<ffffffff811bbf04>] sys_newstat+0x24/0x50
[39755.135578]  [<ffffffff811b7be5>] ? fput+0x25/0x30
[39755.135587]  [<ffffffff811b247d>] ? filp_close+0x5d/0x90
[39755.135596]  [<ffffffff811b2555>] ? sys_close+0xa5/0x100
[39755.135604]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[39755.135613] INFO: task getcontrolpanel:445779 blocked for more than 120 seconds.
[39755.136220]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.136813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.137401] getcontrolpan D ffff88042376ee60     0 445779 445727    0 0x00000000
[39755.137404]  ffff880425839b28 0000000000000086 0000000000000000 ffff8810798d8b40
[39755.137407]  ffff880425839b88 ffffffffa016e1c1 0000240a440e6905 ffff880049e0f7b0
[39755.137410]  ffff880425839ab8 ffff880049e0f7a0 0000000102579558 ffff88042376f428
[39755.137413] Call Trace:
[39755.137418]  [<ffffffffa016e1c1>] ? spl_kmem_cache_alloc+0x71/0x920 [spl]
[39755.137424]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.137428]  [<ffffffff811d1757>] ? __d_lookup+0xa7/0x150
[39755.137431]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.137434]  [<ffffffff811c6923>] do_lookup+0x153/0x270
[39755.137438]  [<ffffffff811c75bc>] __link_path_walk+0x9ec/0x11c0
[39755.137462]  [<ffffffffa03a727f>] ? zio_wait+0x14f/0x1e0 [zfs]
[39755.137469]  [<ffffffff811c800a>] path_walk+0x6a/0xe0
[39755.137472]  [<ffffffff811c821b>] filename_lookup+0x6b/0xc0
[39755.137476]  [<ffffffff811c9469>] user_path_at+0x59/0xa0
[39755.137501]  [<ffffffffa03ad67c>] ? zpl_read_common_iovec.clone.1+0xec/0x140 [zfs]
[39755.137505]  [<ffffffff811bbcb0>] vfs_fstatat+0x50/0xa0
[39755.137529]  [<ffffffffa03ad845>] ? zpl_read+0x85/0xa0 [zfs]
[39755.137538]  [<ffffffff811bbd3b>] vfs_stat+0x1b/0x20
[39755.137547]  [<ffffffff811bbf04>] sys_newstat+0x24/0x50
[39755.137554]  [<ffffffff811b6541>] ? sys_read+0x51/0x90
[39755.137563]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b
[39755.137573] INFO: task lsphp:446018 blocked for more than 120 seconds.
[39755.137880]       Tainted: P           ---------------    2.6.32-604.16.2.lve1.3.54.el6.x86_64 #1
[39755.138471] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[39755.139062] lsphp         D ffff8809b9ba33a0     0 446018 445985    0 0x00000000
[39755.139065]  ffff880fd8c19618 0000000000000082 0000000000000000 ffff8807fe602000
[39755.139068]  ffff880fd8c19588 ffffffff81539da6 0000240ab5d04704 ffff88107c65d010
[39755.139071]  ffff880fd8c19678 ffff88107c65d000 0000000102579cf0 ffff8809b9ba3968
[39755.139074] Call Trace:
[39755.139079]  [<ffffffff81539da6>] ? down_read+0x16/0x30
[39755.139091]  [<ffffffff81538f86>] __mutex_lock_slowpath+0x96/0x210
[39755.139101]  [<ffffffff81538aab>] mutex_lock+0x2b/0x50
[39755.139119]  [<ffffffffa02f147f>] arc_buf_add_ref+0xbf/0x200 [zfs]
[39755.139138]  [<ffffffffa02fedc6>] __dbuf_hold_impl+0x76/0x4f0 [zfs]
[39755.139148]  [<ffffffff8119fc85>] ? __kmalloc_node+0x65/0x70
[39755.139168]  [<ffffffffa02ff2bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
[39755.139187]  [<ffffffffa0300790>] dbuf_hold+0x20/0x30 [zfs]
[39755.139208]  [<ffffffffa03070c7>] dmu_buf_hold_noread+0x87/0x140 [zfs]
[39755.139229]  [<ffffffffa03071bb>] dmu_buf_hold+0x3b/0x90 [zfs]
[39755.139260]  [<ffffffffa0370e4a>] zap_lockdir+0x5a/0x770 [zfs]
[39755.139291]  [<ffffffffa0343293>] ? sa_lookup_impl+0x13/0x20 [zfs]
[39755.139318]  [<ffffffffa03433d9>] ? sa_lookup+0x89/0xb0 [zfs]
[39755.139349]  [<ffffffffa03757ef>] ? zfs_acl_node_read.clone.1+0x2cf/0x340 [zfs]
[39755.139382]  [<ffffffffa03729ba>] zap_lookup_norm+0x4a/0x190 [zfs]
[39755.139411]  [<ffffffffa0372b93>] zap_lookup+0x33/0x40 [zfs]
[39755.139441]  [<ffffffffa037ca87>] zfs_dirent_lock+0x457/0x580 [zfs]
[39755.139474]  [<ffffffffa037cdbb>] zfs_dirlook+0x8b/0x270 [zfs]
[39755.139504]  [<ffffffffa0375fb0>] ? zfs_zaccess+0xa0/0x2d0 [zfs]
[39755.139531]  [<ffffffffa0340232>] ? rrw_enter_read_impl+0xd2/0x180 [zfs]
[39755.139564]  [<ffffffffa039609f>] zfs_lookup+0x2ef/0x340 [zfs]
[39755.139595]  [<ffffffffa03adc1b>] zpl_lookup+0x9b/0x120 [zfs]
[39755.139603]  [<ffffffff811d1a94>] ? __d_alloc+0x184/0x1e0
[39755.139612]  [<ffffffff811c69ad>] do_lookup+0x1dd/0x270
[39755.139622]  [<ffffffff811c75bc>] __link_path_walk+0x9ec/0x11c0
[39755.139633]  [<ffffffff81174ea4>] ? handle_pte_fault+0x664/0xfb0
[39755.139644]  [<ffffffff811c800a>] path_walk+0x6a/0xe0
[39755.139653]  [<ffffffff811c821b>] filename_lookup+0x6b/0xc0
[39755.139663]  [<ffffffff811c9469>] user_path_at+0x59/0xa0
[39755.139672]  [<ffffffff81068e93>] ? perf_event_task_sched_out+0x33/0x70
[39755.139684]  [<ffffffff81009912>] ? __switch_to+0x272/0x320
[39755.139693]  [<ffffffff811bbcb0>] vfs_fstatat+0x50/0xa0
[39755.139702]  [<ffffffff81536f86>] ? thread_return+0x4e/0x8e8
[39755.139711]  [<ffffffff811bbd3b>] vfs_stat+0x1b/0x20
[39755.139720]  [<ffffffff811bbf04>] sys_newstat+0x24/0x50
[39755.139729]  [<ffffffff811b6ab1>] ? sys_writev+0x51/0xb0
[39755.139737]  [<ffffffff8100b122>] system_call_fastpath+0x16/0x1b

[root@server ~]# cat /proc/vmstat
nr_free_pages 2968571
nr_inactive_anon 234256
nr_active_anon 4978573
nr_inactive_file 307283
nr_active_file 1417029
nr_unevictable 1
nr_mlock 1
nr_anon_pages 5112518
nr_mapped 105650
nr_file_pages 1824573
nr_dirty 22
nr_writeback 5
nr_slab_reclaimable 252949
nr_slab_unreclaimable 5039374
nr_page_table_pages 214594
nr_kernel_stack 3599
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 100263
nr_vswap 0
numa_hit 3016917215
numa_miss 0
numa_foreign 0
numa_interleave 47918
numa_local 3016825669
numa_other 91546
nr_anon_transparent_hugepages 0
pgpgin 111526583
pgpgout 250163423
pswpin 0
pswpout 0
vswpin 0
vswpout 0
pgalloc_dma 0
pgalloc_dma32 6
pgalloc_normal 3039992292
pgalloc_movable 0
pgfree 3042964733
pgactivate 15089198
pgdeactivate 4295765
pgfault 4052619110
pgmajfault 1329132
pgrefill_dma 0
pgrefill_dma32 0
pgrefill_normal 4908612
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 0
pgsteal_normal 0
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 0
pgscan_kswapd_normal 0
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 0
pgscan_direct_normal 0
pgscan_direct_movable 0
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 0
kswapd_steal 0
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 0
kswapd_high_wmark_hit_quickly 0
kswapd_skip_congestion_wait 0
pageoutrun 1482
allocstall 0
pgrotated 596
compact_blocks_moved 0
compact_pages_moved 0
compact_pagemigrate_failed 0
compact_stall 0
compact_fail 0
compact_success 0
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 22425
unevictable_pgs_scanned 0
unevictable_pgs_rescued 27532
unevictable_pgs_mlocked 27533
unevictable_pgs_munlocked 27532
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0
thp_fault_alloc 0
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 0

[root@server ~]# cat /proc/spl/kstat/zfs/arcstats
6 1 0x01 91 4368 17036603638 41333504523375
name                            type data
hits                            4    700763344
misses                          4    12174637
demand_data_hits                4    529697513
demand_data_misses              4    4090224
demand_metadata_hits            4    165031944
demand_metadata_misses          4    4227373
prefetch_data_hits              4    103649
prefetch_data_misses            4    1103943
prefetch_metadata_hits          4    5930238
prefetch_metadata_misses        4    2753097
mru_hits                        4    22729150
mru_ghost_hits                  4    1141021
mfu_hits                        4    672114353
mfu_ghost_hits                  4    445226
deleted                         4    6590100
mutex_miss                      4    280
evict_skip                      4    45755954
evict_not_enough                4    346703
evict_l2_cached                 4    0
evict_l2_eligible               4    187584292864
evict_l2_ineligible             4    112870275072
evict_l2_skip                   4    0
hash_elements                   4    712867
hash_elements_max               4    1847449
hash_collisions                 4    1548908
hash_chains                     4    28701
hash_chain_max                  4    6
p                               4    20132659200
c                               4    21474836480
c_min                           4    33554432
c_max                           4    21474836480
size                            4    16307729528
hdr_size                        4    293406304
data_size                       4    2835162624
metadata_size                   4    8990077440
other_size                      4    4189083160
anon_size                       4    260597760
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    2523607552
mru_evictable_data              4    1386842624
mru_evictable_metadata          4    133142016
mru_ghost_size                  4    15372288
mru_ghost_evictable_data        4    15306240
mru_ghost_evictable_metadata    4    66048
mfu_size                        4    9041034752
mfu_evictable_data              4    1229933056
mfu_evictable_metadata          4    7186359296
mfu_ghost_size                  4    1478144
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    1478144
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    8798208
arc_loaned_bytes                4    0
arc_prune                       4    27421
arc_meta_used                   4    13472566904
arc_meta_limit                  4    12884901888
arc_meta_max                    4    13472567312
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    1051623424

@dweeezil
Contributor

dweeezil commented Oct 2, 2015

@tjikkun There's plenty of free memory, but the ARC is overshooting its meta limit (arc_meta_used is above arc_meta_limit in the arcstats above). Are any of the system tasks such as arc_prune or arc_reclaim consuming lots of CPU when the system is in this condition?
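
One way to check, as a minimal sketch (the egrep pattern simply matches the kernel thread names):

ps -eo pid,stat,time,comm | egrep 'arc_(prune|reclaim)'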

@tjikkun
Author

tjikkun commented Oct 2, 2015

@dweeezil
No, not really. There were some other CPU- and I/O-intensive loads running, but the arc processes don't seem to be using much CPU:

root        3216  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3217  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3218  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3219  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3220  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3221  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3222  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3223  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3224  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3225  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3226  0.1  0.0      0     0 ?        S    Oct01   0:43  \_ [arc_prune]
root        3227  0.1  0.0      0     0 ?        S    Oct01   0:43  \_ [arc_prune]
root        3228  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3229  0.1  0.0      0     0 ?        S    Oct01   0:43  \_ [arc_prune]
root        3230  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3231  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3232  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3233  0.1  0.0      0     0 ?        S    Oct01   0:45  \_ [arc_prune]
root        3234  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3235  0.1  0.0      0     0 ?        S    Oct01   0:45  \_ [arc_prune]
root        3236  0.1  0.0      0     0 ?        S    Oct01   0:43  \_ [arc_prune]
root        3237  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3238  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3239  0.1  0.0      0     0 ?        S    Oct01   0:44  \_ [arc_prune]
root        3240  0.1  0.0      0     0 ?        D    Oct01   1:01  \_ [arc_reclaim]
root        3241  0.0  0.0      0     0 ?        S    Oct01   0:27  \_ [arc_user_evicts]
root        3242  0.0  0.0      0     0 ?        S    Oct01   0:00  \_ [l2arc_feed]

@dweeezil
Contributor

dweeezil commented Oct 2, 2015

@tjikkun If you weren't having problems in 0.6.4, I suspect this is related to the SB shrinker changes. For the 2.6 EL kernels, 0.6.4 used shrink_dcache_parent() whereas 0.6.5 uses d_prune_aliases(). The latter clearly isn't effective in your case; the metadata has overshot the limit. Also, BTW, the 40+ seconds of CPU time accumulated by each of your arc_prune tasks is actually pretty high for a system which undoubtedly hasn't been booted for very long. You'll probably see the arc_prune kstat (in arcstats) climbing while this is going on.
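
To watch for that, a minimal sketch (arc_prune is the counter visible in the arcstats output above):

watch -n 5 'grep arc_prune /proc/spl/kstat/zfs/arcstats'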

@dweeezil
Contributor

dweeezil commented Oct 2, 2015

@tjikkun It might be interesting to echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable and then grep /proc/spl/kstat/dbgmsg for the "pruning, nr_to_scan..." messages. You'll be able to see how much progress it's making.
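
Spelled out, the sequence would look something like this (a sketch; the dbgmsg path is as given above, though on some releases it lives at /proc/spl/kstat/zfs/dbgmsg):

echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
grep 'pruning, nr_to_scan' /proc/spl/kstat/dbgmsg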

@behlendorf added this to the 0.7.0 milestone Oct 2, 2015
@tjikkun
Author

tjikkun commented Oct 6, 2015

@dweeezil In our production environment we had to roll back to 0.6.4 because of this issue. Thus far I haven't been able to reproduce it in a test setup.
What we did notice:

  • With 0.6.5* releases, operations involving large directories seem to be much slower (see the timing sketch below).
  • The crashes increased over time. Could they in some way be related to the miscounted user space?
  • When we turned off snapshot send/receive, the crashes seemed to disappear.
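
For anyone trying to reproduce the large-directory slowdown, a minimal timing sketch (the path is hypothetical; ls -f skips sorting, so the time is dominated by directory enumeration):

time ls -f /tank/large-directory > /dev/null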

@dweeezil
Contributor

dweeezil commented Oct 6, 2015

@tjikkun The large directory issue should be fixed by 5592404, which will be in the next tag. The space accounting bug should not cause any crashes; however, a3000f9 has been reverted in the current master code, which restores the proper space accounting. #3874 has been proposed to fix the underlying problem which a3000f9 attempted to solve, without breaking the space accounting. The hang condition you've described in this issue appears to be something completely separate. Were you ever able to get the "pruning, nr_to_scan" debugging messages while this problem was happening?

@tjikkun
Author

tjikkun commented Oct 6, 2015

@dweeezil
OK, we tried 0.6.5.2 as well, so the problem is probably not related to the slow large directories.
Unfortunately we had to roll back before I had a chance to capture the "pruning, nr_to_scan" debugging messages.

@muzyk10

muzyk10 commented Oct 15, 2015

I also had send | receive hangs; setting spl_taskq_thread_dynamic=0 did work for me. /proc/spl/kstat/dbgmsg hasn't shown anything.
This was with version 0.6.5.2.
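
For anyone wanting to try the same workaround, a minimal sketch (this parameter is consulted when taskqs are created, so a reboot or module reload may be needed for it to take full effect):

# apply at runtime
echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic
# persist across reboots
echo "options spl spl_taskq_thread_dynamic=0" >> /etc/modprobe.d/spl.conf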

@behlendorf
Contributor

@muzyk10 this issue should have been resolved in the more recent 0.6.5.x tags. I'm closing this for now but I'm happy to reopen it if you're still having issues.

@fractalram

Just wanted to add that we are seeing a similar issue in 0.6.5.7-1 as well, with the following trace:
Apr 6 03:28:48 ITRDBKP02 kernel: INFO: task txg_sync:1749 blocked for more than 120 seconds.
Apr 6 03:28:48 ITRDBKP02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 6 03:28:48 ITRDBKP02 kernel: txg_sync D 0000000000000001 0 1749 2 0x00000000
Apr 6 03:28:48 ITRDBKP02 kernel: ffff8800879d7b20 0000000000000046 ffff8804579db980 ffff8800879d7fd8
Apr 6 03:28:48 ITRDBKP02 kernel: ffff8800879d7fd8 ffff8800879d7fd8 ffff8804579db980 ffff88046fc14780
Apr 6 03:28:48 ITRDBKP02 kernel: 0000000000000000 7fffffffffffffff ffff8803ab3ac4f0 0000000000000001
Apr 6 03:28:48 ITRDBKP02 kernel: Call Trace:
Apr 6 03:28:48 ITRDBKP02 kernel: [] schedule+0x29/0x70
Apr 6 03:28:48 ITRDBKP02 kernel: [] schedule_timeout+0x209/0x2d0
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? default_wake_function+0x12/0x20
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? __wake_up_common+0x58/0x90
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? read_tsc+0x9/0x10
Apr 6 03:28:48 ITRDBKP02 kernel: [] io_schedule_timeout+0xae/0x130
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? prepare_to_wait_exclusive+0x56/0x90
Apr 6 03:28:48 ITRDBKP02 kernel: [] io_schedule+0x18/0x20
Apr 6 03:28:48 ITRDBKP02 kernel: [] cv_wait_common+0xb0/0x150 [spl]
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? wake_up_atomic_t+0x30/0x30
Apr 6 03:28:48 ITRDBKP02 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Apr 6 03:28:48 ITRDBKP02 kernel: [] zio_wait+0x123/0x210 [zfs]
Apr 6 03:28:48 ITRDBKP02 kernel: [] spa_sync+0x3c5/0xb70 [zfs]
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? autoremove_wake_function+0x2b/0x40
Apr 6 03:28:48 ITRDBKP02 kernel: [] txg_sync_thread+0x3cc/0x640 [zfs]
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? txg_fini+0x2a0/0x2a0 [zfs]
Apr 6 03:28:48 ITRDBKP02 kernel: [] thread_generic_wrapper+0x71/0x80 [spl]
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? __thread_exit+0x20/0x20 [spl]
Apr 6 03:28:48 ITRDBKP02 kernel: [] kthread+0xcf/0xe0
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? kthread_create_on_node+0x140/0x140
Apr 6 03:28:48 ITRDBKP02 kernel: [] ret_from_fork+0x58/0x90
Apr 6 03:28:48 ITRDBKP02 kernel: [] ? kthread_create_on_node+0x140/0x140

@yuqingdeng

I got this too with kernel 4.9.15-040915-generic and ZFS 0.6.5.9-1:

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2296.766479] txg_sync D 0 3259 2 0x00000000
[ 2296.766483] ffff8cec73d587c0 0000000000000000 ffff8cecb12b8000 ffff8cecff459240
[ 2296.766502] ffff8cecf27fc440 ffffb8c0b8e3fac8 ffffffffa2491cb3 ffff8cecb4782080
[ 2296.766506] 00ffffffc2029310 ffff8cecff459240 0000000000000000 ffff8cecb12b8000
[ 2296.766510] Call Trace:
[ 2296.766519] [] ? __schedule+0x233/0x6e0
[ 2296.766522] [] schedule+0x36/0x80
[ 2296.766526] [] schedule_timeout+0x22a/0x3f0
[ 2296.766607] [] ? zio_taskq_dispatch+0x91/0xa0 [zfs]
[ 2296.766660] [] ? zio_issue_async+0x12/0x20 [zfs]
[ 2296.766711] [] ? zio_nowait+0x79/0xf0 [zfs]
[ 2296.766715] [] ? ktime_get+0x41/0xb0
[ 2296.766718] [] io_schedule_timeout+0xa4/0x110
[ 2296.766728] [] cv_wait_common+0xb2/0x130 [spl]
[ 2296.766733] [] ? wake_atomic_t_function+0x60/0x60
[ 2296.766740] [] __cv_wait_io+0x18/0x20 [spl]
[ 2296.766792] [] zio_wait+0xac/0x130 [zfs]
[ 2296.766839] [] dsl_pool_sync+0xb8/0x480 [zfs]
[ 2296.766891] [] spa_sync+0x37f/0xb30 [zfs]
[ 2296.766897] [] ? default_wake_function+0x12/0x20
[ 2296.766952] [] txg_sync_thread+0x3a5/0x610 [zfs]
[ 2296.767005] [] ? txg_quiesce_thread+0x370/0x370 [zfs]
[ 2296.767012] [] thread_generic_wrapper+0x72/0x80 [spl]
[ 2296.767018] [] ? __thread_exit+0x20/0x20 [spl]
[ 2296.767022] [] kthread+0xd9/0xf0
[ 2296.767025] [] ? kthread_park+0x60/0x60
[ 2296.767029] [] ret_from_fork+0x25/0x30
