Vulkan hangs after a certain render sequence, and either panics, hangs or loses device #983

Dinnerbone · 2020-10-13T22:13:03Z

Description
It's a little hard to turn this into a small repro case so please forgive the vagueness.

After submitting a frame to wgpu while using the Vulkan backend, Vulkan seems to become unstable, and this manifests itself in a few ways:

A hang of the application
Graphics device crashes and PC dies (this happened to me a few times)
The submit seemingly returns okay but nothing actually happened, and the device will become lost the next time we try to draw a frame

Our application has two ways to reproduce the bug:

When rendering to a window, we repeatedly submit frames as a typical game would. This often just locks up but sometimes will gracefully give you an error about the device being lost.
Rendering one single frame to a texture, saved to disk.

In this second case, we perform the following sequence of events:

Create a texture
Draw a frame to a command encoder
Submit the command encoder to the queue
Copy the texture buffer to disk, much like the capture example in wgpu-rs
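One detail of the last step, as done in the capture example: when a texture is copied into a buffer for readback, each row in the buffer has to be padded to a multiple of 256 bytes (the value of wgpu's `COPY_BYTES_PER_ROW_ALIGNMENT`). The following is a minimal sketch of that arithmetic only. It assumes an RGBA8 target (4 bytes per pixel), the helper names `padded_bytes_per_row` and `strip_row_padding` are hypothetical, and it does not call into wgpu.

```rust
/// Row alignment wgpu requires for texture-to-buffer copies
/// (the value of `wgpu::COPY_BYTES_PER_ROW_ALIGNMENT`).
const COPY_BYTES_PER_ROW_ALIGNMENT: u32 = 256;

/// Assumption: the render target is RGBA8, i.e. 4 bytes per pixel.
const BYTES_PER_PIXEL: u32 = 4;

/// Bytes per row in the readback buffer, rounded up to the required alignment.
fn padded_bytes_per_row(width: u32) -> u32 {
    let unpadded = width * BYTES_PER_PIXEL;
    let align = COPY_BYTES_PER_ROW_ALIGNMENT;
    (unpadded + align - 1) / align * align
}

/// Drop the per-row padding so the pixels can be handed to an image encoder.
/// Assumes `padded` holds `height` full rows of `padded_bytes_per_row(width)` bytes.
fn strip_row_padding(padded: &[u8], width: u32, height: u32) -> Vec<u8> {
    let unpadded = (width * BYTES_PER_PIXEL) as usize;
    let stride = padded_bytes_per_row(width) as usize;
    let mut pixels = Vec::with_capacity(unpadded * height as usize);
    for row in padded.chunks(stride).take(height as usize) {
        pixels.extend_from_slice(&row[..unpadded]);
    }
    pixels
}
```

For a 550-pixel-wide target, for example, each row occupies 2304 bytes in the buffer rather than 2200.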

The submit seemingly returns okay, but the texture is completely empty when we'd expect to see some graphics in it. The application then freezes when dropping wgpu::Instance (at least for me on Windows; this seems to vary). For reference, the image it spits out should be identical to this one.

I've taken a trace of this single-frame capture and had to manually close the toml as the recording can't finish. This seems to freeze when played back, but I'm unable to get renderdoc to play nice and see anything from it.

This worked for us in the past; I think I was running this without issues as recently as 24 days ago. The same code, unchanged, no longer works today.

Repro steps
I haven't been able to create a minimal reproducible example, but you can see it in our project with the following steps:

Grab this swf
Clone Ruffle
cargo run --package=ruffle_desktop -- test.swf if you want to see it visually, with multiple frames
cargo run --package=exporter -- test.swf if you want to see the single frame saved to a texture on disk

You can apply this commit to reduce the amount of rendering done to the bare minimum that still crashes, with that particular swf: Dinnerbone/ruffle@b4f173d

Expected vs observed behavior
I expect to either get an error describing how we're using wgpu wrong, or for it to work :D

Extra materials

A trace.zip of saving one frame to a texture

Platform
Reproduced on Windows.
Only affects Vulkan backend. We're seeing some instability with DX12 but not certain it's related yet.
Reproduced on wgpu 0.6 and gfx-rs/wgpu-rs@e3eadca


984: Fix locking of device lifetime tracker on resource drop r=kvark a=kvark

**Connections**
Fixes the player hang in #983

**Description**
It turns out the current type-level protection from locking device's lifetime tracker is not working properly. TODO is left.

**Testing**
Tested on the trace from #983

Co-authored-by: Dzmitry Malyshau <[email protected]>

Herschel · 2020-10-14T01:42:16Z

I took some time to bisect the driver version where the hang occurs on my machine:
Geforce RTX 2080 Ti
Win 10 64-bit
GeForce Game Ready Driver 452.06 (Aug 17) works
GeForce Game Ready Driver 456.38 (Sep 17) and later hangs
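If an application wanted to warn users about this, the bisect result could be turned into a simple version check. This is only a sketch: the threshold is the earliest driver reported as bad in this thread (the regression may have landed anywhere after 452.06), the function names are hypothetical, and how the driver version string is obtained is left out entirely.

```rust
/// First driver version in this thread reported to hang. The real regression
/// may have landed anywhere after 452.06; this is only the earliest known-bad build.
const FIRST_KNOWN_BAD: (u32, u32) = (456, 38);

/// Parse a GeForce driver version such as "456.38" into (major, minor).
fn parse_driver_version(s: &str) -> Option<(u32, u32)> {
    let (major, minor) = s.trim().split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}

/// `Some(true)` if the version is at or above the first known-bad driver,
/// `Some(false)` if it is older, `None` if the string could not be parsed.
fn is_suspect_driver(version: &str) -> Option<bool> {
    parse_driver_version(version).map(|v| v >= FIRST_KNOWN_BAD)
}
```

Tuples compare lexicographically in Rust, so (452, 6) sorts before (456, 38) without any extra logic.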

Dinnerbone · 2020-10-14T17:25:51Z

It looks like there are validation errors when it panics while running desktop (swap chain + multiple frames). The panic happens about 3 minutes after the first frame submission.

[2020-10-14T17:09:27Z INFO  ruffle_core::player] Loaded SWF version 15, with a resolution of 550x400
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroyImageView-imageView-01026 (1672225264)] : Validation Error: [ VUID-vkDestroyImageView-imageView-01026 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x63ac21f0 | Cannot call vkDestroyImageView on VkImageView 0x731f0f000000000a[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to imageView must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroyImageView-imageView-01026)
    object info: (type: DEVICE, hndl: 2300667417784)
    
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroyFramebuffer-framebuffer-00892 (-617577710)] : Validation Error: [ VUID-vkDestroyFramebuffer-framebuffer-00892 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xdb308312 | Cannot call vkDestroyFramebuffer on VkFramebuffer 0xfba8190000000804[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
    object info: (type: DEVICE, hndl: 2300667417784)
    
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroyFramebuffer-framebuffer-00892 (-617577710)] : Validation Error: [ VUID-vkDestroyFramebuffer-framebuffer-00892 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xdb308312 | Cannot call vkDestroyFramebuffer on VkFramebuffer 0xeaf23a0000000807[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
    object info: (type: DEVICE, hndl: 2300667417784)
    
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroySemaphore-semaphore-01137 (-1588160456)] : Validation Error: [ VUID-vkDestroySemaphore-semaphore-01137 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xa1569838 | Cannot call vkDestroySemaphore on VkSemaphore 0xe81828000000000d[] that is currently in use by a command buffer. The Vulkan spec states: All submitted batches that refer to semaphore must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroySemaphore-semaphore-01137)
    object info: (type: DEVICE, hndl: 2300667417784)
    
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Ok(())`,
 right: `Err(ERROR_DEVICE_LOST)`', C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516:9
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:475
   1: std::panicking::begin_panic_fmt
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:429
   2: gfx_backend_vulkan::{{impl}}::submit<gfx_backend_vulkan::command::CommandBuffer,core::iter::adapters::chain::Chain<core::option::IntoIter<gfx_backend_vulkan::command::CommandBuffer*>, core::iter::adapters::flatten::FlatMap<core::slice::Iter<wgpu_core::id:
             at C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516
   3: wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::queue_submit<wgpu_core::hub::IdentityManagerFactory,gfx_backend_vulkan::Backend>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-3d51dad24d4bec0d\3b76651\wgpu-core\src\device\queue.rs:629
   4: wgpu::backend::direct::{{impl}}::queue_submit<core::iter::adapters::Map<alloc::vec::IntoIter<wgpu::CommandBuffer>, closure-0>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\backend\direct.rs:1507
   5: wgpu::Queue::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\lib.rs:2601
   6: ruffle_render_wgpu::target::{{impl}}::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at .\render\wgpu\src\target.rs:99
   7: ruffle_render_wgpu::{{impl}}::end_frame<ruffle_render_wgpu::target::SwapChainTarget>
             at .\render\wgpu\src\lib.rs:1304
   8: ruffle_core::player::Player::render
             at .\core\src\player.rs:832
   9: ruffle_desktop::run_player::{{closure}}
             at .\desktop\src\main.rs:180

Edit: Actually, that was when testing with #987, at commit kvark@3b76651. One commit before, at kvark@3be2c45, there are no validation errors:

[2020-10-14T17:41:58Z INFO  ruffle_core::player] Loaded SWF version 15, with a resolution of 550x400
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Ok(())`,
 right: `Err(ERROR_DEVICE_LOST)`', C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516:9
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:475
   1: std::panicking::begin_panic_fmt
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:429
   2: gfx_backend_vulkan::{{impl}}::submit<gfx_backend_vulkan::command::CommandBuffer,core::iter::adapters::chain::Chain<core::option::IntoIter<gfx_backend_vulkan::command::CommandBuffer*>, core::iter::adapters::flatten::FlatMap<core::slice::Iter<wgpu_core::id:
             at C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516
   3: wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::queue_submit<wgpu_core::hub::IdentityManagerFactory,gfx_backend_vulkan::Backend>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-3d51dad24d4bec0d\3be2c45\wgpu-core\src\device\queue.rs:629
   4: wgpu::backend::direct::{{impl}}::queue_submit<core::iter::adapters::Map<alloc::vec::IntoIter<wgpu::CommandBuffer>, closure-0>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\backend\direct.rs:1507
   5: wgpu::Queue::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\lib.rs:2601
   6: ruffle_render_wgpu::target::{{impl}}::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at .\render\wgpu\src\target.rs:99
   7: ruffle_render_wgpu::{{impl}}::end_frame<ruffle_render_wgpu::target::SwapChainTarget>
             at .\render\wgpu\src\lib.rs:1304
   8: ruffle_core::player::Player::render
             at .\core\src\player.rs:832
   9: ruffle_desktop::run_player::{{closure}}
             at .\desktop\src\main.rs:180
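
The four validation errors in the first log above all report the same kind of problem: an object (image view, framebuffer, semaphore) was destroyed while a submitted command buffer still referenced it. A common way to avoid that class of error is to defer destruction until the GPU has finished the last submission that used the object. The sketch below illustrates the idea with plain Rust types. `DeferredDestroyer`, `schedule`, and `collect` are hypothetical names, and this is not a description of how wgpu's lifetime tracker is actually implemented.

```rust
use std::collections::VecDeque;

/// Monotonically increasing index assigned to each queue submission.
type SubmissionIndex = u64;

/// Holds resources that the application dropped but the GPU may still be using.
struct DeferredDestroyer<R> {
    /// (last submission that referenced the resource, the resource itself),
    /// kept in the order the resources were scheduled.
    pending: VecDeque<(SubmissionIndex, R)>,
}

impl<R> DeferredDestroyer<R> {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// Record that `resource` may be destroyed once `last_used` has completed.
    fn schedule(&mut self, last_used: SubmissionIndex, resource: R) {
        self.pending.push_back((last_used, resource));
    }

    /// Return every resource whose last use is at or before `completed`,
    /// the newest submission the GPU is known to have finished.
    /// Everything else stays queued.
    fn collect(&mut self, completed: SubmissionIndex) -> Vec<R> {
        let mut ready = Vec::new();
        let mut still_in_use = VecDeque::new();
        for (last_used, resource) in self.pending.drain(..) {
            if last_used <= completed {
                ready.push(resource);
            } else {
                still_in_use.push_back((last_used, resource));
            }
        }
        self.pending = still_in_use;
        ready
    }
}
```

The caller would invoke `collect` after learning (for example from a fence) which submission has completed, and only then destroy the returned objects.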

kvark · 2020-10-14T17:55:27Z

Oh interesting, thank you! I'll try leaving it running for 5 minutes, I guess :)

986: [0.6] Allign stencil reference between state setting and pipeline creation r=cwfitzgerald a=kvark

**Connections**
Found this when running Ruffle from #983

**Description**
There are 2 problems:
1. setting the stencil reference can't be valid if the pipeline doesn't expect this (d'oh)
2. our code that created the pipeline used slightly stricter conditions than the code setting the stencil, so these diverged a tiny bit

**Testing**
Tested on Ruffle

Co-authored-by: Dzmitry Malyshau <[email protected]>

kvark · 2020-10-15T15:19:58Z

Is this still an issue, now that we landed the gfx-memory fix?

Dinnerbone · 2020-10-15T15:43:54Z

We didn't use 0.2.1 for this, we were locked on 0.2.0.

I just upgraded to 0.2.2 just in case but the issue still persists.

I think that was a separate issue that I initially confused with this because cargo install doesn't respect lockfiles by default 😅

kvark · 2020-10-15T15:48:14Z

What exactly are the repro steps now? Run cargo run --package=ruffle_desktop -- test.swf and wait N minutes?

Dinnerbone · 2020-10-15T15:53:54Z

There are two repro steps, but it looks like you need to be on Windows with a GeForce driver >= 456.38.

cargo run --package=ruffle_desktop -- test.swf (this freezes immediately and panics after 3 minutes)
cargo run --package=exporter -- test.swf (this spits out an image, test.png, which is incorrectly empty, and then hangs as the program tries to drop wgpu::Instance)

kvark · 2020-10-15T22:46:50Z

@Dinnerbone finally got to test this on Windows/NV GTX 1050 Ti/Vulkan. It runs fine... although my driver version is 443, which is the latest that Lenovo considers valid for this Thinkpad X1 Extreme. Force-installing anything fresher may invite more trouble than it's worth. Looks like you found a genuine NVidia Vulkan bug. Looking forward to seeing if they respond!

Herschel · 2020-11-11T22:57:12Z

The hang is unfortunately still occurring with the latest 2 Nvidia driver versions, 457.09 and 457.30 (November 2020), and gfx-rs/wgpu-rs@2563f20

Herschel · 2020-11-12T00:10:31Z

Here are minimal repro traces that hang on my machine in player.

wgpu-vulkan-hang.zip

trace_good draws a single red triangle and replays successfully.
trace_bad tries to draw a second red triangle. This displays a blank screen and hangs during recording & playback.
(trace_bad was manually edited to add the closing ] to the RON file).
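
The manual edit mentioned above can be sketched as a small helper. It assumes the trace is a single top-level RON list, so an intact file ends with `]` once trailing whitespace is ignored. `close_truncated_trace` is a hypothetical name, and the helper cannot repair a trace that was cut off in the middle of an entry.

```rust
/// Append the closing `]` to a truncated trace if it is missing.
/// Assumption: the trace is a single top-level RON list, so an intact
/// file ends with `]` once trailing whitespace is ignored.
fn close_truncated_trace(contents: &str) -> String {
    let trimmed = contents.trim_end();
    if trimmed.ends_with(']') {
        // Already closed (or ends in a nested list, which this check cannot tell apart).
        contents.to_string()
    } else {
        format!("{}\n]\n", trimmed)
    }
}
```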

This only occurs with the vulkan backend. The traces using the dx12 backend replay correctly.

The diff between the two traces boils down to the additional buffer creation and render pass.

Removing the buffer copy in trace-bad trace.ron line 442 causes the replay to run successfully (but only displays one triangle):

    CopyBufferToBuffer(
        src: Id(7, 1, Vulkan),
        src_offset: 0,
        dst: Id(1, 1, Vulkan),
        dst_offset: 0,
        size: 64,
    ),
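
As a quick sanity check on that entry: wgpu requires buffer-to-buffer copy offsets and sizes to be multiples of 4 (the value of `COPY_BUFFER_ALIGNMENT`), and the values in the trace satisfy that. The sketch below only restates the arithmetic. `BufferCopy` and `is_copy_aligned` are hypothetical names, not wgpu types, and passing this check says nothing about whether the copy is otherwise valid.

```rust
/// Alignment that wgpu requires for buffer-to-buffer copy offsets and sizes
/// (the value of `wgpu::COPY_BUFFER_ALIGNMENT`).
const COPY_BUFFER_ALIGNMENT: u64 = 4;

/// Hypothetical stand-in for the numeric fields of the trace entry.
struct BufferCopy {
    src_offset: u64,
    dst_offset: u64,
    size: u64,
}

/// True if both offsets and the size are multiples of the copy alignment.
fn is_copy_aligned(copy: &BufferCopy) -> bool {
    [copy.src_offset, copy.dst_offset, copy.size]
        .iter()
        .all(|value| value % COPY_BUFFER_ALIGNMENT == 0)
}
```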

How to create the trace:

git clone -b wgpu-hang-repro https://github.com/Herschel/ruffle
cd ruffle/desktop
cargo run --features=render_trace -- -g vulkan --trace-path=trace-bad test.swf

Running the trace using wgpu/player:
cargo run --features=winit -- trace-bad

Windows 10 64-bit
Nvidia Geforce 2080 Ti, Game Ready Driver 457.30
VulkanSDK 1.2.154.1
gfx-rs/wgpu-rs@2563f20

kvark · 2021-04-13T03:36:54Z

Do you guys have an API trace for a fresh version of wgpu by any chance to reproduce it?

kvark · 2021-04-13T03:46:20Z

Interestingly, the "trace-bad" is replayed without issues here on "GTX 1050 Ti Max-Q" driver version 27.21.14.5256.
@Herschel could you check if you are still seeing it?

cwfitzgerald · 2024-12-11T22:39:38Z

I'm going to close this as out of date. If you are still having issues, please file a new bug.

kvark added the "help required" and "type: bug" labels on Oct 13, 2020

kvark mentioned this issue on Oct 13, 2020: Fix locking of device lifetime tracker on resource drop #984 (merged)

kvark mentioned this issue on Oct 14, 2020: [0.6] Allign stencil reference between state setting and pipeline creation #986 (merged)

kvark added the "external: upstream" label and removed the "help required" label on Oct 15, 2020

Dinnerbone mentioned this issue on Oct 17, 2020: Access violation or debug assertion when using dx12 #993 (closed)

Herschel mentioned this issue on Dec 1, 2020: [desktop] Hang with vulkan backend on Nvidia Geforce driver 456.38+ ruffle-rs/ruffle#1799 (closed)

cwfitzgerald added the "external: driver-bug" label and removed the "external: upstream" label on Dec 1, 2020

cwfitzgerald added the "backend: vulkan" label and removed the "type: bug" label on Jun 5, 2022

cwfitzgerald closed this as not planned on Dec 11, 2024
