Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Soft-deadlock observed on on swap operations #1526

Closed
ryao opened this issue Jun 17, 2013 · 5 comments
Closed

Soft-deadlock observed on on swap operations #1526

ryao opened this issue Jun 17, 2013 · 5 comments

Comments

@ryao
Copy link
Contributor

ryao commented Jun 17, 2013

My system hung momentarily a few times before finally hanging for good when doing operations on swap backed by a zvol. I was able to get dmesg output before it hung for good:

[26160.412029] INFO: task kswapd0:384 blocked for more than 120 seconds.
[26160.412038] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.412040] kswapd0         D ffff88022fc12c40     0   384      2 0x00000000
[26160.412049]  ffff880220657890 0000000000000046 ffff88022616a350 ffff880220657fd8
[26160.412058]  ffff880220657fd8 ffff880220657fd8 ffff88022616a350 ffff88022fc134a0
[26160.412061]  ffff88022616a350 7fffffffffffffff ffff88022616a350 000000000039e008
[26160.412070] Call Trace:
[26160.412083]  [<ffffffff81436914>] schedule+0x24/0x70
[26160.412091]  [<ffffffff81433d89>] schedule_timeout+0x239/0x340
[26160.412094]  [<ffffffff81268116>] ? queue_unplugged+0x36/0xe0
[26160.412102]  [<ffffffff814357b6>] io_schedule_timeout+0x96/0xf0
[26160.412105]  [<ffffffff814359ec>] wait_for_completion_io+0xac/0x110
[26160.412113]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.412116]  [<ffffffff8126f249>] blkdev_issue_discard+0x289/0x2b0
[26160.412125]  [<ffffffff8111fd37>] scan_swap_map+0x447/0x610
[26160.412128]  [<ffffffff8111ffe9>] get_swap_page+0xe9/0x1e0
[26160.412134]  [<ffffffff8111d49f>] add_to_swap+0xf/0x70
[26160.412138]  [<ffffffff810f70a8>] shrink_page_list+0x438/0x900
[26160.412146]  [<ffffffff810f7b05>] shrink_inactive_list+0x1e5/0x420
[26160.412148]  [<ffffffff810f8571>] shrink_zone+0x4e1/0x600
[26160.412156]  [<ffffffff810eadb3>] ? zone_watermark_ok_safe+0x53/0xd0
[26160.412159]  [<ffffffff810f922c>] kswapd+0x4fc/0x990
[26160.412167]  [<ffffffff810f8d30>] ? shrink_all_memory+0xb0/0xb0
[26160.412175]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.412178]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412187]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.412189]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412201] INFO: task zvol/0:2178 blocked for more than 120 seconds.
[26160.412203] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.412204] zvol/0          D ffff88022fc12c40     0  2178      2 0x00000000
[26160.412210]  ffff88021f2a1c98 0000000000000046 ffff880226ac46a0 ffff88021f2a1fd8
[26160.412218]  ffff88021f2a1fd8 ffff88021f2a1fd8 ffff880226ac46a0 ffff8801f18da240
[26160.412222]  ffff8801f18da000 ffff8801f18da248 0000000000000000 ffff88004a9bbb80
[26160.412225] Call Trace:
[26160.412231]  [<ffffffff81436914>] schedule+0x24/0x70
[26160.412264]  [<ffffffffa019ea35>] __cv_destroy+0x1f5/0x210 [spl]
[26160.412267]  [<ffffffff81062790>] ? wake_up_bit+0x30/0x30
[26160.412276]  [<ffffffffa019ea60>] __cv_wait+0x10/0x20 [spl]
[26160.412301]  [<ffffffffa048fa4b>] zil_add_block+0x50b/0xbd0 [zfs]
[26160.412316]  [<ffffffffa04906c2>] zil_commit+0x12/0x20 [zfs]
[26160.412332]  [<ffffffffa049db22>] zrl_owner+0x1102/0x1820 [zfs]
[26160.412334]  [<ffffffff8143645a>] ? __schedule+0x2fa/0x790
[26160.412340]  [<ffffffffa019bba8>] taskq_cancel_id+0x2e8/0x4a0 [spl]
[26160.412342]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.412347]  [<ffffffffa019b9f0>] ? taskq_cancel_id+0x130/0x4a0 [spl]
[26160.412350]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.412353]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412355]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.412358]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412360] INFO: task zvol/2:2180 blocked for more than 120 seconds.
[26160.412361] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.412362] zvol/2          D ffff88022fd12c40     0  2180      2 0x00000000
[26160.412370]  ffff88021f2a5c98 0000000000000046 ffff880226ac0710 ffff88021f2a5fd8
[26160.412373]  ffff88021f2a5fd8 ffff88021f2a5fd8 ffff880226ac0710 ffff8801f18da240
[26160.412380]  ffff8801f18da000 ffff8801f18da248 0000000000000000 ffff88004a9bbce0
[26160.412383] Call Trace:
[26160.412387]  [<ffffffff81436914>] schedule+0x24/0x70
[26160.412395]  [<ffffffffa019ea35>] __cv_destroy+0x1f5/0x210 [spl]
[26160.412402]  [<ffffffff81062790>] ? wake_up_bit+0x30/0x30
[26160.412407]  [<ffffffffa019ea60>] __cv_wait+0x10/0x20 [spl]
[26160.412421]  [<ffffffffa048fa4b>] zil_add_block+0x50b/0xbd0 [zfs]
[26160.412432]  [<ffffffffa04906c2>] zil_commit+0x12/0x20 [zfs]
[26160.412441]  [<ffffffffa049db22>] zrl_owner+0x1102/0x1820 [zfs]
[26160.412445]  [<ffffffff8143645a>] ? __schedule+0x2fa/0x790
[26160.412450]  [<ffffffffa019bba8>] taskq_cancel_id+0x2e8/0x4a0 [spl]
[26160.412456]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.412460]  [<ffffffffa019b9f0>] ? taskq_cancel_id+0x130/0x4a0 [spl]
[26160.412466]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.412469]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412477]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.412479]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412487] INFO: task zvol/4:2182 blocked for more than 120 seconds.
[26160.412488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.412489] zvol/4          D ffff88022fc12c40     0  2182      2 0x00000000
[26160.412497]  ffff88021f2a9c98 0000000000000046 ffff880226ac3880 ffff88021f2a9fd8
[26160.412500]  ffff88021f2a9fd8 ffff88021f2a9fd8 ffff880226ac3880 ffff8801f18da240
[26160.412507]  ffff8801f18da000 ffff8801f18da248 0000000000000000 ffff88004a9bb340
[26160.412510] Call Trace:
[26160.412512]  [<ffffffff81436914>] schedule+0x24/0x70
[26160.412521]  [<ffffffffa019ea35>] __cv_destroy+0x1f5/0x210 [spl]
[26160.412529]  [<ffffffff81062790>] ? wake_up_bit+0x30/0x30
[26160.412534]  [<ffffffffa019ea60>] __cv_wait+0x10/0x20 [spl]
[26160.412547]  [<ffffffffa048fa4b>] zil_add_block+0x50b/0xbd0 [zfs]
[26160.412558]  [<ffffffffa04906c2>] zil_commit+0x12/0x20 [zfs]
[26160.412567]  [<ffffffffa049db22>] zrl_owner+0x1102/0x1820 [zfs]
[26160.412571]  [<ffffffff8143645a>] ? __schedule+0x2fa/0x790
[26160.412577]  [<ffffffffa019bba8>] taskq_cancel_id+0x2e8/0x4a0 [spl]
[26160.412582]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.412587]  [<ffffffffa019b9f0>] ? taskq_cancel_id+0x130/0x4a0 [spl]
[26160.412592]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.412595]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412603]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.412606]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412612] INFO: task zvol/6:2184 blocked for more than 120 seconds.
[26160.412614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.412615] zvol/6          D ffff88022fd12c40     0  2184      2 0x00000000
[26160.412622]  ffff88021f2adc98 0000000000000046 ffff880226baf810 ffff88021f2adfd8
[26160.412626]  ffff88021f2adfd8 ffff88021f2adfd8 ffff880226baf810 ffff8801f18da240
[26160.412634]  ffff8801f18da000 ffff8801f18da248 0000000000000000 ffff88004a9ba420
[26160.412637] Call Trace:
[26160.412640]  [<ffffffff81436914>] schedule+0x24/0x70
[26160.412648]  [<ffffffffa019ea35>] __cv_destroy+0x1f5/0x210 [spl]
[26160.412656]  [<ffffffff81062790>] ? wake_up_bit+0x30/0x30
[26160.412661]  [<ffffffffa019ea60>] __cv_wait+0x10/0x20 [spl]
[26160.412675]  [<ffffffffa048fa4b>] zil_add_block+0x50b/0xbd0 [zfs]
[26160.412685]  [<ffffffffa04906c2>] zil_commit+0x12/0x20 [zfs]
[26160.412695]  [<ffffffffa049db22>] zrl_owner+0x1102/0x1820 [zfs]
[26160.412699]  [<ffffffff8143645a>] ? __schedule+0x2fa/0x790
[26160.412704]  [<ffffffffa019bba8>] taskq_cancel_id+0x2e8/0x4a0 [spl]
[26160.412710]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.412714]  [<ffffffffa019b9f0>] ? taskq_cancel_id+0x130/0x4a0 [spl]
[26160.412720]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.412723]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412731]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.412733]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412740] INFO: task zvol/7:2185 blocked for more than 120 seconds.
[26160.412741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.412742] zvol/7          D ffff88022fd92c40     0  2185      2 0x00000000
[26160.412750]  ffff88021f2afc98 0000000000000046 ffff88021e6eb880 ffff88021f2affd8
[26160.412753]  ffff88021f2affd8 ffff88021f2affd8 ffff88021e6eb880 ffff8801f18da1d8
[26160.412756]  ffff8801f18da000 ffff8801f18da1e0 0000000000000000 ffff88004a9badc0
[26160.412759] Call Trace:
[26160.412762]  [<ffffffff81436914>] schedule+0x24/0x70
[26160.412767]  [<ffffffffa019ea35>] __cv_destroy+0x1f5/0x210 [spl]
[26160.412773]  [<ffffffff81062790>] ? wake_up_bit+0x30/0x30
[26160.412778]  [<ffffffffa019ea60>] __cv_wait+0x10/0x20 [spl]
[26160.412791]  [<ffffffffa048fa4b>] zil_add_block+0x50b/0xbd0 [zfs]
[26160.412802]  [<ffffffffa04906c2>] zil_commit+0x12/0x20 [zfs]
[26160.412811]  [<ffffffffa049db22>] zrl_owner+0x1102/0x1820 [zfs]
[26160.412815]  [<ffffffff8143645a>] ? __schedule+0x2fa/0x790
[26160.412821]  [<ffffffffa019bba8>] taskq_cancel_id+0x2e8/0x4a0 [spl]
[26160.412826]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.412834]  [<ffffffffa019b9f0>] ? taskq_cancel_id+0x130/0x4a0 [spl]
[26160.412836]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.412839]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412847]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.412850]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412857] INFO: task zvol/8:2186 blocked for more than 120 seconds.
[26160.412858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.412860] zvol/8          D ffff88022fc12c40     0  2186      2 0x00000000
[26160.412868]  ffff88021f2b1c58 0000000000000046 ffff88021e6ed4c0 ffff88021f2b1fd8
[26160.412871]  ffff88021f2b1fd8 ffff88021f2b1fd8 ffff88021e6ed4c0 ffff88022fc134a0
[26160.412873]  ffff88009fd45528 ffff88009fd45580 0000000000000001 0000000000000000
[26160.412878] Call Trace:
[26160.412881]  [<ffffffff81436c28>] io_schedule+0x98/0x120
[26160.412886]  [<ffffffffa019e9c7>] __cv_destroy+0x187/0x210 [spl]
[26160.412890]  [<ffffffff81062790>] ? wake_up_bit+0x30/0x30
[26160.412895]  [<ffffffffa019eaa3>] __cv_wait_io+0x13/0x150 [spl]
[26160.412910]  [<ffffffffa0494cf3>] zio_wait+0x123/0x1d0 [zfs]
[26160.412920]  [<ffffffffa048fc2b>] zil_add_block+0x6eb/0xbd0 [zfs]
[26160.412930]  [<ffffffffa04906c2>] zil_commit+0x12/0x20 [zfs]
[26160.412939]  [<ffffffffa049db22>] zrl_owner+0x1102/0x1820 [zfs]
[26160.412943]  [<ffffffff8143645a>] ? __schedule+0x2fa/0x790
[26160.412948]  [<ffffffffa019bba8>] taskq_cancel_id+0x2e8/0x4a0 [spl]
[26160.412953]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.412958]  [<ffffffffa019b9f0>] ? taskq_cancel_id+0x130/0x4a0 [spl]
[26160.412964]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.412967]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412974]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.412977]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.412984] INFO: task zvol/10:2188 blocked for more than 120 seconds.
[26160.412985] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.412986] zvol/10         D ffff88022fd12c40     0  2188      2 0x00000000
[26160.412994]  ffff88021f2b5c98 0000000000000046 ffff88021f2b8000 ffff88021f2b5fd8
[26160.412997]  ffff88021f2b5fd8 ffff88021f2b5fd8 ffff88021f2b8000 ffff8801f18da240
[26160.413000]  ffff8801f18da000 ffff8801f18da248 0000000000000000 ffff88021864f080
[26160.413014] Call Trace:
[26160.413017]  [<ffffffff81436914>] schedule+0x24/0x70
[26160.413022]  [<ffffffffa019ea35>] __cv_destroy+0x1f5/0x210 [spl]
[26160.413026]  [<ffffffff81062790>] ? wake_up_bit+0x30/0x30
[26160.413032]  [<ffffffffa019ea60>] __cv_wait+0x10/0x20 [spl]
[26160.413042]  [<ffffffffa048fa4b>] zil_add_block+0x50b/0xbd0 [zfs]
[26160.413052]  [<ffffffffa04906c2>] zil_commit+0x12/0x20 [zfs]
[26160.413062]  [<ffffffffa049db22>] zrl_owner+0x1102/0x1820 [zfs]
[26160.413077]  [<ffffffffa019bba8>] taskq_cancel_id+0x2e8/0x4a0 [spl]
[26160.413080]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.413085]  [<ffffffffa019b9f0>] ? taskq_cancel_id+0x130/0x4a0 [spl]
[26160.413088]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.413091]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.413094]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.413096]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.413099] INFO: task zvol/12:2190 blocked for more than 120 seconds.
[26160.413100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.413102] zvol/12         D ffff88022fc12c40     0  2190      2 0x00000000
[26160.413105]  ffff88021f2c1c98 0000000000000046 ffff88021f2b8e20 ffff88021f2c1fd8
[26160.413108]  ffff88021f2c1fd8 ffff88021f2c1fd8 ffff88021f2b8e20 ffff8801f18da240
[26160.413111]  ffff8801f18da000 ffff8801f18da248 0000000000000000 ffff88004a9bb760
[26160.413115] Call Trace:
[26160.413118]  [<ffffffff81436914>] schedule+0x24/0x70
[26160.413128]  [<ffffffffa019ea35>] __cv_destroy+0x1f5/0x210 [spl]
[26160.413136]  [<ffffffff81062790>] ? wake_up_bit+0x30/0x30
[26160.413141]  [<ffffffffa019ea60>] __cv_wait+0x10/0x20 [spl]
[26160.413154]  [<ffffffffa048fa4b>] zil_add_block+0x50b/0xbd0 [zfs]
[26160.413164]  [<ffffffffa04906c2>] zil_commit+0x12/0x20 [zfs]
[26160.413173]  [<ffffffffa049db22>] zrl_owner+0x1102/0x1820 [zfs]
[26160.413177]  [<ffffffff8143645a>] ? __schedule+0x2fa/0x790
[26160.413182]  [<ffffffffa019bba8>] taskq_cancel_id+0x2e8/0x4a0 [spl]
[26160.413188]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.413193]  [<ffffffffa019b9f0>] ? taskq_cancel_id+0x130/0x4a0 [spl]
[26160.413198]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.413201]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.413204]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.413206]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.413209] INFO: task zvol/14:2192 blocked for more than 120 seconds.
[26160.413210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.413211] zvol/14         D ffff88022fd12c40     0  2192      2 0x00000000
[26160.413219]  ffff88021f2c5c98 0000000000000046 ffff880226ba9c40 ffff88021f2c5fd8
[26160.413222]  ffff88021f2c5fd8 ffff88021f2c5fd8 ffff880226ba9c40 ffff8801f18da240
[26160.413231]  ffff8801f18da000 ffff8801f18da248 0000000000000000 ffff88004a9ba160
[26160.413234] Call Trace:
[26160.413237]  [<ffffffff81436914>] schedule+0x24/0x70
[26160.413244]  [<ffffffffa019ea35>] __cv_destroy+0x1f5/0x210 [spl]
[26160.413252]  [<ffffffff81062790>] ? wake_up_bit+0x30/0x30
[26160.413257]  [<ffffffffa019ea60>] __cv_wait+0x10/0x20 [spl]
[26160.413270]  [<ffffffffa048fa4b>] zil_add_block+0x50b/0xbd0 [zfs]
[26160.413281]  [<ffffffffa04906c2>] zil_commit+0x12/0x20 [zfs]
[26160.413290]  [<ffffffffa049db22>] zrl_owner+0x1102/0x1820 [zfs]
[26160.413294]  [<ffffffff8143645a>] ? __schedule+0x2fa/0x790
[26160.413300]  [<ffffffffa019bba8>] taskq_cancel_id+0x2e8/0x4a0 [spl]
[26160.413305]  [<ffffffff81071810>] ? wake_up_state+0x10/0x10
[26160.413309]  [<ffffffffa019b9f0>] ? taskq_cancel_id+0x130/0x4a0 [spl]
[26160.413316]  [<ffffffff81061a0d>] kthread+0xcd/0xe0
[26160.413319]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120
[26160.413326]  [<ffffffff8143d7fc>] ret_from_fork+0x7c/0xb0
[26160.413329]  [<ffffffff81061940>] ? kthread_create_on_node+0x120/0x120

The memory allocation in zil_add_block() appears to hang while we have the mutex, which prevents all other zvol threads from making forward progress. This also prevents kswapd from making progress.

This is unusual because swap operations are supposed to be synchronous and therefore, they should bypass the ZIL, but that appears to not be the case. Another unusual aspect of these stack traces is that zrl_owner() is on the stack, but nothing in the code calls it, so it should be deadcode.

Swap operations involve read, write and discard requests, so something appears to be wrong with the backtraces. However, they are fairly consistent. I have reproduced this issue twice. The first time was 3 months ago with Linux 3.6.8 and the second time was today with Linux 3.9.6.

@behlendorf
Copy link
Contributor

@ryao You're right the stacks aren't very clear as to what happened here. I actually suspect the zil_add_block call is bogus and most of the threads are blocked on zil_commit->__cv_wait but it's impossible to say for sure. That would make some sense since it seems like zvol/8:2186 has an outstanding zio. But it's all very unclear.

@ryao
Copy link
Contributor Author

ryao commented Mar 30, 2014

I just got something similar today when putting my system under severe stress without swap.

[87723.068597] INFO: task Chrome_DBThread:5243 blocked for more than 120 seconds.
[87723.068602]       Tainted: G           O 3.13.6 #7
[87723.068604] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87723.068607] Chrome_DBThread D ffff880d9e9fe1d0     0  5243   4960 0x00000000
[87723.068613]  ffff880f8eaabdc0 0000000000000002 000000000000a000 ffff880f9038ce40
[87723.068619]  ffff880f8eaabfd8 ffff880f9038ce40 0000000000000002 0000000000000000
[87723.068624]  ffff880f8eaabd68 0000000000000046 0000000000000000 ffffffff81092417
[87723.068629] Call Trace:
[87723.068640]  [<ffffffff81092417>] ? prepare_to_wait_exclusive+0x27/0x90
[87723.068646]  [<ffffffff81468f5d>] ? _raw_spin_unlock_irqrestore+0x5d/0x70
[87723.068651]  [<ffffffff814677b5>] ? __mutex_unlock_slowpath+0xd5/0x180
[87723.068656]  [<ffffffff810982b6>] ? trace_hardirqs_on_caller+0x16/0x1e0
[87723.068661]  [<ffffffff8109848d>] ? trace_hardirqs_on+0xd/0x10
[87723.068664]  [<ffffffff81464689>] schedule+0x29/0x70
[87723.068675]  [<ffffffffa0293f95>] __cv_destroy+0x1f5/0x240 [spl]
[87723.068679]  [<ffffffff81092760>] ? abort_exclusive_wait+0xb0/0xb0
[87723.068686]  [<ffffffffa0293ff5>] __cv_wait+0x15/0x20 [spl]
[87723.068710]  [<ffffffffa03925db>] zil_add_block+0x53b/0xc80 [zfs]
[87723.068715]  [<ffffffff8109848d>] ? trace_hardirqs_on+0xd/0x10
[87723.068735]  [<ffffffffa0393337>] zil_commit+0x17/0x20 [zfs]
[87723.068755]  [<ffffffffa0389bff>] zfs_fsync+0x7f/0x100 [zfs]
[87723.068774]  [<ffffffffa039de93>] zpl_putpage+0x303/0x8d0 [zfs]
[87723.068779]  [<ffffffff811b7d71>] do_fsync+0x51/0x80
[87723.068783]  [<ffffffff811b8043>] SyS_fdatasync+0x13/0x20
[87723.068787]  [<ffffffff814721e4>] tracesys+0xdd/0xe2
[87723.068793] INFO: task BrowserBlocking:5263 blocked for more than 120 seconds.
[87723.068795]       Tainted: G           O 3.13.6 #7
[87723.068797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87723.068799] BrowserBlocking D ffff880f9038ce40     0  5263   4960 0x00000000
[87723.068807]  ffff880f8d8cfdc0 0000000000000002 000000000000a000 ffff880f8eaf1390
[87723.068812]  ffff880f8d8cffd8 ffff880f8eaf1390 0000000000000002 0000000000000000
[87723.068817]  ffff880f8d8cfd68 0000000000000046 0000000000000000 ffffffff81092417
[87723.068822] Call Trace:
[87723.068827]  [<ffffffff81092417>] ? prepare_to_wait_exclusive+0x27/0x90
[87723.068831]  [<ffffffff81468f5d>] ? _raw_spin_unlock_irqrestore+0x5d/0x70
[87723.068835]  [<ffffffff814677b5>] ? __mutex_unlock_slowpath+0xd5/0x180
[87723.068839]  [<ffffffff810982b6>] ? trace_hardirqs_on_caller+0x16/0x1e0
[87723.068843]  [<ffffffff8109848d>] ? trace_hardirqs_on+0xd/0x10
[87723.068847]  [<ffffffff81464689>] schedule+0x29/0x70
[87723.068855]  [<ffffffffa0293f95>] __cv_destroy+0x1f5/0x240 [spl]
[87723.068859]  [<ffffffff81092760>] ? abort_exclusive_wait+0xb0/0xb0
[87723.068866]  [<ffffffffa0293ff5>] __cv_wait+0x15/0x20 [spl]
[87723.068885]  [<ffffffffa03925db>] zil_add_block+0x53b/0xc80 [zfs]
[87723.068889]  [<ffffffff8109848d>] ? trace_hardirqs_on+0xd/0x10
[87723.068908]  [<ffffffffa0393337>] zil_commit+0x17/0x20 [zfs]
[87723.068926]  [<ffffffffa0389bff>] zfs_fsync+0x7f/0x100 [zfs]
[87723.068945]  [<ffffffffa039de93>] zpl_putpage+0x303/0x8d0 [zfs]
[87723.068949]  [<ffffffff811b7d71>] do_fsync+0x51/0x80
[87723.068953]  [<ffffffff811b8043>] SyS_fdatasync+0x13/0x20
[87723.068956]  [<ffffffff814721e4>] tracesys+0xdd/0xe2
[87723.068961] INFO: task BrowserBlocking:5267 blocked for more than 120 seconds.
[87723.068963]       Tainted: G           O 3.13.6 #7
[87723.068965] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87723.068967] BrowserBlocking D ffff881037020000     0  5267   4960 0x00000000
[87723.068972]  ffff880f8bdc7dc0 0000000000000002 000000000000a000 ffff880f8eaf3ab0
[87723.068976]  ffff880f8bdc7fd8 ffff880f8eaf3ab0 0000000000000002 0000000000000000
[87723.068981]  ffff880f8bdc7d68 0000000000000046 0000000000000000 ffffffff81092417
[87723.068987] Call Trace:
[87723.068991]  [<ffffffff81092417>] ? prepare_to_wait_exclusive+0x27/0x90
[87723.068995]  [<ffffffff81468f5d>] ? _raw_spin_unlock_irqrestore+0x5d/0x70
[87723.068999]  [<ffffffff814677b5>] ? __mutex_unlock_slowpath+0xd5/0x180
[87723.069003]  [<ffffffff810982b6>] ? trace_hardirqs_on_caller+0x16/0x1e0
[87723.069011]  [<ffffffff8109848d>] ? trace_hardirqs_on+0xd/0x10
[87723.069015]  [<ffffffff81464689>] schedule+0x29/0x70
[87723.069023]  [<ffffffffa0293f95>] __cv_destroy+0x1f5/0x240 [spl]
[87723.069026]  [<ffffffff81092760>] ? abort_exclusive_wait+0xb0/0xb0
[87723.069033]  [<ffffffffa0293ff5>] __cv_wait+0x15/0x20 [spl]
[87723.069051]  [<ffffffffa03925db>] zil_add_block+0x53b/0xc80 [zfs]
[87723.069055]  [<ffffffff8109848d>] ? trace_hardirqs_on+0xd/0x10
[87723.069072]  [<ffffffffa0393337>] zil_commit+0x17/0x20 [zfs]
[87723.069090]  [<ffffffffa0389bff>] zfs_fsync+0x7f/0x100 [zfs]
[87723.069108]  [<ffffffffa039de93>] zpl_putpage+0x303/0x8d0 [zfs]
[87723.069112]  [<ffffffff811b7d71>] do_fsync+0x51/0x80
[87723.069115]  [<ffffffff811b8043>] SyS_fdatasync+0x13/0x20
[87723.069119]  [<ffffffff814721e4>] tracesys+0xdd/0xe2
[87723.069124] INFO: task Chrome_HistoryT:5273 blocked for more than 120 seconds.
[87723.069126]       Tainted: G           O 3.13.6 #7
[87723.069128] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87723.069130] Chrome_HistoryT D ffff880ff63fe1d0     0  5273   4960 0x00000000
[87723.069134]  ffff880f89ed1dc0 0000000000000002 000000000000a000 ffff880f8eaf4e40
[87723.069139]  ffff880f89ed1fd8 ffff880f8eaf4e40 0000000000000002 0000000000000000
[87723.069143]  ffff880f89ed1d68 0000000000000046 0000000000000000 ffffffff81092417
[87723.069148] Call Trace:
[87723.069152]  [<ffffffff81092417>] ? prepare_to_wait_exclusive+0x27/0x90
[87723.069156]  [<ffffffff81468f5d>] ? _raw_spin_unlock_irqrestore+0x5d/0x70
[87723.069160]  [<ffffffff814677b5>] ? __mutex_unlock_slowpath+0xd5/0x180
[87723.069164]  [<ffffffff810982b6>] ? trace_hardirqs_on_caller+0x16/0x1e0
[87723.069168]  [<ffffffff8109848d>] ? trace_hardirqs_on+0xd/0x10
[87723.069171]  [<ffffffff81464689>] schedule+0x29/0x70
[87723.069179]  [<ffffffffa0293f95>] __cv_destroy+0x1f5/0x240 [spl]
[87723.069182]  [<ffffffff81092760>] ? abort_exclusive_wait+0xb0/0xb0
[87723.069190]  [<ffffffffa0293ff5>] __cv_wait+0x15/0x20 [spl]
[87723.069207]  [<ffffffffa03925db>] zil_add_block+0x53b/0xc80 [zfs]
[87723.069211]  [<ffffffff8109848d>] ? trace_hardirqs_on+0xd/0x10
[87723.069232]  [<ffffffffa0393337>] zil_commit+0x17/0x20 [zfs]
[87723.069249]  [<ffffffffa0389bff>] zfs_fsync+0x7f/0x100 [zfs]
[87723.069267]  [<ffffffffa039de93>] zpl_putpage+0x303/0x8d0 [zfs]
[87723.069271]  [<ffffffff811b7d71>] do_fsync+0x51/0x80
[87723.069274]  [<ffffffff811b8043>] SyS_fdatasync+0x13/0x20
[87723.069278]  [<ffffffff814721e4>] tracesys+0xdd/0xe2
[87723.069322] INFO: task emerge:17869 blocked for more than 120 seconds.
[87723.069324]       Tainted: G           O 3.13.6 #7
[87723.069326] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87723.069327] emerge          D ffff881037ace1d0     0 17869  17674 0x00000000
[87723.069332]  ffff880053b93c80 0000000000000002 000000000000a000 ffff880d004b1390
[87723.069337]  ffff880053b93fd8 ffff880d004b1390 ffffffff81088d08 ffff880053b93c08
[87723.069341]  ffffffff8108be77 ffff880053b93bf0 ffffffff81086e55 ffff88103efd39c0
[87723.069346] Call Trace:
[87723.069350]  [<ffffffff81088d08>] ? __enqueue_entity+0x78/0x80
[87723.069353]  [<ffffffff8108be77>] ? enqueue_task_fair+0x407/0xb80
[87723.069359]  [<ffffffff81086e55>] ? sched_clock_cpu+0xb5/0x100
[87723.069363]  [<ffffffff81468f38>] ? _raw_spin_unlock_irqrestore+0x38/0x70
[87723.069367]  [<ffffffff81094c1f>] ? trace_hardirqs_off_caller+0x1f/0xc0
[87723.069371]  [<ffffffff81468f42>] ? _raw_spin_unlock_irqrestore+0x42/0x70
[87723.069374]  [<ffffffff81464689>] schedule+0x29/0x70
[87723.069378]  [<ffffffff81463579>] schedule_timeout+0x209/0x310
[87723.069381]  [<ffffffff8146504e>] ? wait_for_common+0x3e/0x170
[87723.069385]  [<ffffffff81468f9c>] ? _raw_spin_unlock_irq+0x2c/0x60
[87723.069389]  [<ffffffff81081ddd>] ? get_parent_ip+0xd/0x50
[87723.069392]  [<ffffffff8146dd4b>] ? preempt_count_sub+0x6b/0xf0
[87723.069396]  [<ffffffff81465127>] wait_for_common+0x117/0x170
[87723.069400]  [<ffffffff81083ad0>] ? wake_up_process+0x40/0x40
[87723.069404]  [<ffffffff8146519d>] wait_for_completion+0x1d/0x20
[87723.069408]  [<ffffffff811b23aa>] sync_inodes_sb+0x15a/0x2a0
[87723.069413]  [<ffffffff811b7e2b>] sync_filesystem+0x5b/0xa0
[87723.069417]  [<ffffffff811b7fc8>] SyS_syncfs+0x48/0x90
[87723.069420]  [<ffffffff814721e4>] tracesys+0xdd/0xe2

I was running make -j12 clean in several Linux kernel directories to free space, emerge-webrsync and repo sync -j64 on the Google Chrome repository. Things started to hang for a few minutes before returning to normalcy. I had the fix in #2212 applied. The problem feels similar, but since I did not have swap, the system did not deadlock.

That being said, cv_wait_common is declared static, which causes us to see __cv_destroy() in the stacks where we should see cv_wait_common() because the compiler is not emitting symbol information for use by dump_stack(). It seems to do that in the mainline kernel. It is not clear to me what the difference is that causes Linux mainline to have this information and us to lack it at this time.

We appear to have something causing us to wait an exorbitant amount of time, but I was not prepared for it, so the partial system freeze until it cleared prevented me from capturing information. I will do what I can to catch it.

@ryao
Copy link
Contributor Author

ryao commented Mar 31, 2014

@behlendorf I was able to capture complete stacks today of this phenomena and there is NOTHING out of the ordinary. I forcibly restarted my system. Pool import proceeded to hang until ZIL replay finished. This took a noticeably long amount of time.

What appears to be happening is that we have a large amount of data in ZIL, someone calls syncfs() and then everything stops on global ZIL write-out. This is related to issue #2190. I think resolving this requires taking a look at how Illumos does this. Offhand, I know that Illumos has a per-filesystem sync, which could partially explain things.

@ryao
Copy link
Contributor Author

ryao commented Mar 31, 2014

Disregard my previous (now deleted) note about stack traces captured using ps -ef; for i in /proc/[0-9]*/stack; do echo $i; cat $i; done;. The stack traces were not committed to storage, so I lost them. My look before they were lost revealed nothing out of the ordinary.

@ryao
Copy link
Contributor Author

ryao commented Jul 11, 2014

This report is too vague to be actionable. I have since corrected stack traces on my system and there are plenty of more actionable reports, so I am going to close this.

@ryao ryao closed this as completed Jul 11, 2014
@behlendorf behlendorf modified the milestone: 0.6.7 Nov 8, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants