
Crash in drm_atomic_helper_commit_hw_done for 4.13 #110

Open
nullr0ute opened this issue Sep 12, 2017 · 3 comments
@nullr0ute

Seeing a new crash regression in the 4.13 kernel on both ARMv7 and aarch64 when running GNOME.

I've included both dumps for reference; they're similar, but the aarch64 one includes some additional detail.

[  383.027151] ------------[ cut here ]------------
[  383.032253] WARNING: CPU: 2 PID: 714 at drivers/gpu/drm/drm_atomic_helper.c:1682 drm_atomic_helper_commit_hw_done+0xb0/0xb8 [drm_kms_helper]
[  383.045070] Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat vc4 snd_soc_core ac97_bus snd_pcm_dmaengine snd_seq smsc95xx snd_seq_device snd_pcm usbnet mii snd_timer snd soundcore drm_kms_helper drm fb_sys_fops crc32_ce syscopyarea sysfillrect sysimgblt brcmfmac brcmutil cfg80211 rfkill bcm2835_rng bcm2835_wdt leds_gpio
[  383.118053]  mmc_block sdhci_iproc sdhci_pltfm dwc2 udc_core sdhci pwm_bcm2835 i2c_bcm2835 bcm2835 bcm2835_dma phy_generic
[  383.129411] CPU: 2 PID: 714 Comm: gnome-shell Tainted: G        W       4.13.1-302.fc27.aarch64 #1
[  383.138519] Hardware name: raspberrypi rpi/rpi, BIOS 2017.09-rc2 08/25/2017
[  383.145604] task: ffff8000114ca700 task.stack: ffff800011b04000
[  383.151764] PC is at drm_atomic_helper_commit_hw_done+0xb0/0xb8 [drm_kms_helper]
[  383.159375] LR is at vc4_atomic_complete_commit+0x7c/0xd8 [vc4]
[  383.165401] pc : [] lr : [] pstate: 20400145
[  383.172924] sp : ffff800011b07b00
[  383.176304] x29: ffff800011b07b00 x28: ffff800011b07d80 
[  383.181735] x27: ffff800011759000 x26: 0000000000000018 
[  383.187163] x25: 0000000000000000 x24: 0000000000000000 
[  383.192595] x23: 0000000000000038 x22: ffff800011d6ba00 
[  383.198025] x21: ffff80000360ae00 x20: 0000000000000002 
[  383.203454] x19: ffff800011d6ba00 x18: 0000000000000020 
[  383.208881] x17: 0000000000001f07 x16: 0000000000000000 
[  383.214301] x15: ffff000008ad0648 x14: 000004d3000004b9 
[  383.219723] x13: 000004b3000004d3 x12: 000004b0000004b0 
[  383.225144] x11: 0000000000000820 x10: 000007d0000007b0 
[  383.230569] x9 : 0000082000000780 x8 : 0000003c00000000 
[  383.235998] x7 : 0000000000000000 x6 : 00000000000034c2 
[  383.241428] x5 : ffff80001158eb00 x4 : ffff00000b8b2114 
[  383.246860] x3 : ffff8000391ff000 x2 : ffff800011d6b970 
[  383.252289] x1 : ffff80003720f700 x0 : ffff8000391f9028 
[  383.257719] Call trace:
[  383.260221] Exception stack(0xffff800011b07920 to 0xffff800011b07a50)
[  383.266783] 7920: ffff800011d6ba00 0001000000000000 00000000021e4000 ffff000000e51fa8
[  383.274756] 7940: 0000000020400145 000000000000003d ffff8000114ca700 ffff8000391ff3e8
[  383.282727] 7960: ffff800011b079a0 ffff00000814a514 ffff8000114ca700 0000000000000001
[  383.290699] 7980: ffff000008ab0f2c ffff8000391ff3c0 ffff000000df4dd8 00000000000051bc
[  383.298667] 79a0: ffff800011b079d0 ffff00000814a5f8 0000000000000140 ffff8000391ff3d0
[  383.306637] 79c0: ffff800011b07a50 ffff000000ddb04c ffff8000391f9028 ffff80003720f700
[  383.314606] 79e0: ffff800011d6b970 ffff8000391ff000 ffff00000b8b2114 ffff80001158eb00
[  383.322575] 7a00: 00000000000034c2 0000000000000000 0000003c00000000 0000082000000780
[  383.330546] 7a20: 000007d0000007b0 0000000000000820 000004b0000004b0 000004b3000004d3
[  383.338513] 7a40: 000004d3000004b9 ffff000008ad0648
[  383.343618] [] drm_atomic_helper_commit_hw_done+0xb0/0xb8 [drm_kms_helper]
[  383.352294] [] vc4_atomic_complete_commit+0x7c/0xd8 [vc4]
[  383.359475] [] vc4_atomic_commit+0x150/0x1c8 [vc4]
[  383.366221] [] drm_atomic_commit+0x54/0x70 [drm]
[  383.372653] [] drm_atomic_helper_connector_dpms+0xe0/0x178 [drm_kms_helper]
[  383.381604] [] drm_mode_connector_set_obj_prop+0x48/0xa0 [drm]
[  383.389404] [] drm_mode_obj_set_property_ioctl+0x14c/0x190 [drm]
[  383.397364] [] drm_ioctl_kernel+0x70/0xe0 [drm]
[  383.403835] [] drm_ioctl+0x1b4/0x400 [drm]
[  383.409617] [] do_vfs_ioctl+0xb0/0x8d8
[  383.415037] [] SyS_ioctl+0x8c/0xa8
[  383.420105] [] el0_svc_naked+0x24/0x28
[  383.425521] ---[ end trace cc9dc8be96ef614d ]---
[  393.437717] [drm:drm_atomic_helper_commit_cleanup_done [drm_kms_helper]] *ERROR* [CRTC:66:crtc-2] flip_done timed out



[  396.137715] ------------[ cut here ]------------
[  396.137883] WARNING: CPU: 1 PID: 865 at drivers/gpu/drm/drm_atomic_helper.c:1682 drm_atomic_helper_commit_hw_done+0x6c/0x94 [drm_kms_helper]
[  396.137900] Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables bnep sunrpc vc4 snd_soc_core ac97_bus snd_pcm_dmaengine snd_seq snd_seq_device snd_pcm snd_timer hci_uart snd btbcm brcmfmac btqca soundcore btintel drm_kms_helper brcmutil bluetooth drm cfg80211 fb_sys_fops syscopyarea ecdh_generic sysfillrect sysimgblt rfkill bcm2835_wdt
[  396.138551]  bcm2835_thermal bcm2835_rng leds_gpio smsc95xx usbnet mii mmc_block sdhci_iproc crc32_arm_ce dwc2 sdhci_pltfm udc_core sdhci pwm_bcm2835 bcm2835 i2c_bcm2835 bcm2835_dma phy_generic
[  396.138717] CPU: 1 PID: 865 Comm: gnome-shell Not tainted 4.13.1-302.fc27.armv7hl #1
[  396.138732] Hardware name: Generic DT based system
[  396.138777] [] (unwind_backtrace) from [] (show_stack+0x18/0x1c)
[  396.138805] [] (show_stack) from [] (dump_stack+0xa0/0xd8)
[  396.138833] [] (dump_stack) from [] (__warn+0xe4/0x104)
[  396.138858] [] (__warn) from [] (warn_slowpath_null+0x24/0x2c)
[  396.139000] [] (warn_slowpath_null) from [] (drm_atomic_helper_commit_hw_done+0x6c/0x94 [drm_kms_helper])
[  396.139214] [] (drm_atomic_helper_commit_hw_done [drm_kms_helper]) from [] (vc4_atomic_complete_commit+0x70/0xbc [vc4])
[  396.139369] [] (vc4_atomic_complete_commit [vc4]) from [] (vc4_atomic_commit+0x180/0x190 [vc4])
[  396.139569] [] (vc4_atomic_commit [vc4]) from [] (drm_atomic_helper_connector_dpms+0xc0/0x13c [drm_kms_helper])
[  396.139987] [] (drm_atomic_helper_connector_dpms [drm_kms_helper]) from [] (drm_mode_connector_set_obj_prop+0x40/0x80 [drm])
[  396.140542] [] (drm_mode_connector_set_obj_prop [drm]) from [] (drm_mode_obj_set_property_ioctl+0xdc/0x124 [drm])
[  396.141081] [] (drm_mode_obj_set_property_ioctl [drm]) from [] (drm_ioctl+0x2c4/0x364 [drm])
[  396.141367] [] (drm_ioctl [drm]) from [] (vfs_ioctl+0x28/0x3c)
[  396.141416] [] (vfs_ioctl) from [] (do_vfs_ioctl+0x9c/0x968)
[  396.141441] [] (do_vfs_ioctl) from [] (SyS_ioctl+0x58/0x74)
[  396.141470] [] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x1c)
[  396.141548] ---[ end trace 76ad8b5fc31ef359 ]---
[  406.514381] [drm:drm_atomic_helper_commit_cleanup_done [drm_kms_helper]] *ERROR* [CRTC:66:crtc-2] flip_done timed out
@lategoodbye

lategoodbye commented Sep 12, 2017

@nullr0ute Since these are kernel warnings, could you please describe the symptoms of this issue and the exact steps to reproduce it?

@nullr0ute
Author

I was just booting Fedora 27 to a GNOME login; it has mesa 7.2.0 plus the latest upstream GNOME development release. I can provide links to images and other details if you need them.

anholt pushed a commit that referenced this issue Nov 13, 2017
commit 692b482 upstream.

Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
lockdep:

 [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
 [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
 [ 1270.473240] -----------------------------------------------------
 [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
 [ 1270.474994]
 [ 1270.474994] and this task is already holding:
 [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
 [ 1270.476046] which would create a new lock dependency:
 [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
 [ 1270.476949]
 [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
 [ 1270.477553]  (&pool->lock/1){-.-.}
 ...
 [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
 [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
 ...
 [ 1270.494735]  Possible interrupt unsafe locking scenario:
 [ 1270.494735]
 [ 1270.495250]        CPU0                    CPU1
 [ 1270.495600]        ----                    ----
 [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
 [ 1270.496295]                                local_irq_disable();
 [ 1270.496753]                                lock(&pool->lock/1);
 [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
 [ 1270.497744]   <Interrupt>
 [ 1270.497948]     lock(&pool->lock/1);

which would cause an irq inversion deadlock if the above lock scenario happens.

The root cause of this safe -> unsafe lock order is the
mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
held.

Unlocking a mutex while holding an irq spinlock was never safe, and this
problem has been around forever, but it went unnoticed because the mutex
is usually only trylocked while holding the irq lock, making actual
failures very unlikely, and the lockdep annotation missed the condition
until the recent b9c16a0 ("locking/mutex: Fix lockdep_assert_held()
fail").

Using a mutex for pool->manager_arb has always been a bit of a stretch.
It is primarily a mechanism to arbitrate managership between workers,
which can easily be done with a pool flag.  The only reason it became
a mutex is that the pool destruction path wants to exclude parallel
managing operations.

This patch replaces the mutex with a new pool flag, POOL_MANAGER_ACTIVE,
and makes the destruction path wait for the current manager on a wait
queue.

v2: Drop unnecessary flag clearing before pool destruction as
    suggested by Boqun.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Josef Bacik <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
anholt pushed a commit that referenced this issue Nov 21, 2017
@lategoodbye

@nullr0ute Is this issue still reproducible with Fedora 29?
