
Vulkan hangs after a certain render sequence, and either panics, hangs or loses device #983

Closed
Dinnerbone opened this issue Oct 13, 2020 · 13 comments
Labels
backend: vulkan Issues with Vulkan external: driver-bug A driver is causing the bug, though we may still want to work around it

Comments

@Dinnerbone
Contributor

Dinnerbone commented Oct 13, 2020

Description
It's a little hard to turn this into a small repro case so please forgive the vagueness.

After submitting a frame to wgpu while using the Vulkan backend, Vulkan seems to become unstable, and this manifests itself in a few ways:

  • A hang of the application
  • Graphics device crashes and PC dies (this happened to me a few times)
  • The submit seemingly returns okay but nothing actually happened, and the device will become lost the next time we try to draw a frame

Our application has two ways to reproduce the bug:

  • When rendering to a window, we repeatedly submit frames as a typical game would. This often just locks up but sometimes will gracefully give you an error about the device being lost.
  • Rendering one single frame to a texture, saved to disk.

In this second case, we perform the following sequence of events:

  • Create a texture
  • Draw a frame to a command encoder
  • Submit the command encoder to the queue
  • Copy the texture buffer to disk, much like the capture example in wgpu-rs

The submit seemingly returns okay, but the texture is completely empty when we'd expect to see some graphics in it. The application then freezes when dropping wgpu::Instance (at least for me on Windows; this seems to vary). For reference, the image it spits out should be identical to this one.
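
The last step of that sequence (copying the texture to disk) is where the empty output shows up. As a rough sketch only, assuming a readback buffer whose rows are padded to wgpu's 256-byte copy alignment the way the capture example does it, the row handling and an "is the image blank" check could look like this. The function names are hypothetical and nothing here touches wgpu itself:

```rust
/// Row alignment wgpu requires for buffer/texture copies. This mirrors
/// wgpu::COPY_BYTES_PER_ROW_ALIGNMENT and is hard-coded so the sketch has no dependencies.
const ROW_ALIGNMENT: usize = 256;

/// Bytes per row after rounding up to the copy alignment.
fn padded_bytes_per_row(width: usize, bytes_per_pixel: usize) -> usize {
    let unpadded = width * bytes_per_pixel;
    (unpadded + ROW_ALIGNMENT - 1) / ROW_ALIGNMENT * ROW_ALIGNMENT
}

/// Strip the per-row padding from the bytes of a mapped readback buffer.
fn strip_row_padding(
    padded: &[u8],
    width: usize,
    height: usize,
    bytes_per_pixel: usize,
) -> Vec<u8> {
    let unpadded = width * bytes_per_pixel;
    let stride = padded_bytes_per_row(width, bytes_per_pixel);
    let mut out = Vec::with_capacity(unpadded * height);
    for row in padded.chunks(stride).take(height) {
        // A short final row is copied as far as it goes instead of panicking.
        let end = unpadded.min(row.len());
        out.extend_from_slice(&row[..end]);
    }
    out
}

/// True when every byte is zero, i.e. the "completely empty" output described above.
fn is_blank(pixels: &[u8]) -> bool {
    pixels.iter().all(|&b| b == 0)
}
```

Running is_blank on the stripped pixels before writing the PNG would let the exporter report the failure instead of silently saving an empty image.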

I've taken a trace of this single-frame capture and had to manually close the trace file, as the recording can't finish. The trace also seems to freeze when played back, and I'm unable to get RenderDoc to play nice and show anything from it.

This worked for us in the past; I think I was running this without issues as recently as 24 days ago. The same code, unchanged, no longer works today.

Repro steps
I haven't been able to create a minimal reproducible example, but you can see it in our project with the following steps:

  • Grab this swf
  • Clone Ruffle
  • cargo run --package=ruffle_desktop -- test.swf if you want to see it visually, with multiple frames
  • cargo run --package=exporter -- test.swf if you want to see the single frame saved to a texture on disk

You can apply this commit to reduce the amount of rendering done to the bare minimum that still crashes, with that particular swf: Dinnerbone/ruffle@b4f173d

Expected vs observed behavior
I expect to either get an error describing how we're using wgpu wrong, or for it to work :D

Extra materials

Platform
Reproduced on Windows.
Only affects the Vulkan backend. We're seeing some instability with DX12, but we're not yet certain it's related.
Reproduced on wgpu 0.6 and gfx-rs/wgpu-rs@e3eadca

@kvark kvark added help required We need community help to make this happen. type: bug Something isn't working labels Oct 13, 2020
bors bot added a commit that referenced this issue Oct 13, 2020
984: Fix locking of device lifetime tracker on resource drop r=kvark a=kvark

**Connections**
Fixes the player hang in #983

**Description**
It turns out the current type-level protection from locking device's lifetime tracker is not working properly. TODO is left.

**Testing**
Tested on the trace from #983

Co-authored-by: Dzmitry Malyshau
@Herschel
Contributor

I took some time to bisect the driver version where the hang occurs on my machine:
Geforce RTX 2080 Ti
Win 10 64-bit
GeForce Game Ready Driver 452.06 (Aug 17) works
GeForce Game Ready Driver 456.38 (Sep 17) and later hangs
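
For anyone who wants to warn users about the affected drivers, here is a small sketch that classifies a driver version against the two data points above. It assumes versions are written as "major.minor", and it treats everything strictly between 452.06 and 456.38 as untested, since nothing in this thread covers that range. All names are hypothetical:

```rust
/// Parse an NVIDIA driver version such as "456.38" into (major, minor).
fn parse_driver_version(text: &str) -> Option<(u32, u32)> {
    let (major, minor) = text.trim().split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}

#[derive(Debug, PartialEq)]
enum DriverStatus {
    /// At or below the last version reported to work (452.06).
    KnownGoodOrOlder,
    /// Between the two reported versions; nobody in this thread tested these.
    Untested,
    /// At or above the first version reported to hang (456.38).
    ReportedHanging,
}

fn classify_driver(version: (u32, u32)) -> DriverStatus {
    const LAST_KNOWN_GOOD: (u32, u32) = (452, 6);
    const FIRST_KNOWN_BAD: (u32, u32) = (456, 38);
    // Tuples compare lexicographically: major first, then minor.
    if version >= FIRST_KNOWN_BAD {
        DriverStatus::ReportedHanging
    } else if version <= LAST_KNOWN_GOOD {
        DriverStatus::KnownGoodOrOlder
    } else {
        DriverStatus::Untested
    }
}
```

The classification only reflects what was observed on one machine (RTX 2080 Ti, Windows 10), so it is a hint rather than a guarantee.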

@Dinnerbone
Contributor Author

Dinnerbone commented Oct 14, 2020

It looks like there are validation errors when it panics while running desktop (swap chain + multiple frames). The panic happens about 3 minutes after the first frame submission.

[2020-10-14T17:09:27Z INFO  ruffle_core::player] Loaded SWF version 15, with a resolution of 550x400
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroyImageView-imageView-01026 (1672225264)] : Validation Error: [ VUID-vkDestroyImageView-imageView-01026 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x63ac21f0 | Cannot call vkDestroyImageView on VkImageView 0x731f0f000000000a[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to imageView must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroyImageView-imageView-01026)
    object info: (type: DEVICE, hndl: 2300667417784)
    
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroyFramebuffer-framebuffer-00892 (-617577710)] : Validation Error: [ VUID-vkDestroyFramebuffer-framebuffer-00892 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xdb308312 | Cannot call vkDestroyFramebuffer on VkFramebuffer 0xfba8190000000804[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
    object info: (type: DEVICE, hndl: 2300667417784)
    
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroyFramebuffer-framebuffer-00892 (-617577710)] : Validation Error: [ VUID-vkDestroyFramebuffer-framebuffer-00892 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xdb308312 | Cannot call vkDestroyFramebuffer on VkFramebuffer 0xeaf23a0000000807[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
    object info: (type: DEVICE, hndl: 2300667417784)
    
[2020-10-14T17:12:33Z ERROR gfx_backend_vulkan] 
    VALIDATION [VUID-vkDestroySemaphore-semaphore-01137 (-1588160456)] : Validation Error: [ VUID-vkDestroySemaphore-semaphore-01137 ] Object 0: handle = 0x217aa76d4b8, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xa1569838 | Cannot call vkDestroySemaphore on VkSemaphore 0xe81828000000000d[] that is currently in use by a command buffer. The Vulkan spec states: All submitted batches that refer to semaphore must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.148.0/windows/1.2-extensions/vkspec.html#VUID-vkDestroySemaphore-semaphore-01137)
    object info: (type: DEVICE, hndl: 2300667417784)
    
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Ok(())`,
 right: `Err(ERROR_DEVICE_LOST)`', C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516:9
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:475
   1: std::panicking::begin_panic_fmt
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:429
   2: gfx_backend_vulkan::{{impl}}::submit<gfx_backend_vulkan::command::CommandBuffer,core::iter::adapters::chain::Chain<core::option::IntoIter<gfx_backend_vulkan::command::CommandBuffer*>, core::iter::adapters::flatten::FlatMap<core::slice::Iter<wgpu_core::id:
             at C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516
   3: wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::queue_submit<wgpu_core::hub::IdentityManagerFactory,gfx_backend_vulkan::Backend>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-3d51dad24d4bec0d\3b76651\wgpu-core\src\device\queue.rs:629
   4: wgpu::backend::direct::{{impl}}::queue_submit<core::iter::adapters::Map<alloc::vec::IntoIter<wgpu::CommandBuffer>, closure-0>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\backend\direct.rs:1507
   5: wgpu::Queue::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\lib.rs:2601
   6: ruffle_render_wgpu::target::{{impl}}::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at .\render\wgpu\src\target.rs:99
   7: ruffle_render_wgpu::{{impl}}::end_frame<ruffle_render_wgpu::target::SwapChainTarget>
             at .\render\wgpu\src\lib.rs:1304
   8: ruffle_core::player::Player::render
             at .\core\src\player.rs:832
   9: ruffle_desktop::run_player::{{closure}}
             at .\desktop\src\main.rs:180
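
The validation messages above all state the same rule: an object may only be destroyed once every submission that refers to it has completed. A minimal sketch of that deferred-destruction idea follows. The types are hypothetical and this is not wgpu's actual lifetime tracker, just an illustration of the rule:

```rust
/// Hypothetical stand-in for a GPU object handle (image view, framebuffer, semaphore, ...).
#[derive(Debug, PartialEq, Clone, Copy)]
struct ResourceId(u64);

/// Keeps dropped resources alive until the submission that last used them has finished.
#[derive(Default)]
struct DeferredDestroyer {
    /// Pairs of (index of the last submission using the resource, resource).
    pending: Vec<(u64, ResourceId)>,
}

impl DeferredDestroyer {
    /// Called when the application drops a resource that may still be in flight.
    fn schedule(&mut self, last_used_submission: u64, resource: ResourceId) {
        self.pending.push((last_used_submission, resource));
    }

    /// Called after checking which submission the GPU has finished.
    /// Returns the resources that are now safe to destroy and forgets them.
    fn collect(&mut self, last_completed_submission: u64) -> Vec<ResourceId> {
        let mut ready = Vec::new();
        self.pending.retain(|&(index, resource)| {
            if index <= last_completed_submission {
                ready.push(resource);
                false // remove from the pending list
            } else {
                true // still in use by an unfinished submission
            }
        });
        ready
    }
}
```

If the GPU never finishes a submission (as with the hang here), nothing ever becomes safe to destroy, which is consistent with teardown either waiting forever or tripping these validation errors.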

Edit: Actually, that was when testing with #987, at commit kvark@3b76651. One commit before, at kvark@3be2c45, there are no validation errors:

[2020-10-14T17:41:58Z INFO  ruffle_core::player] Loaded SWF version 15, with a resolution of 550x400
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Ok(())`,
 right: `Err(ERROR_DEVICE_LOST)`', C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516:9
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:475
   1: std::panicking::begin_panic_fmt
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39\/library\std\src\panicking.rs:429
   2: gfx_backend_vulkan::{{impl}}::submit<gfx_backend_vulkan::command::CommandBuffer,core::iter::adapters::chain::Chain<core::option::IntoIter<gfx_backend_vulkan::command::CommandBuffer*>, core::iter::adapters::flatten::FlatMap<core::slice::Iter<wgpu_core::id:
             at C:\Users\dinne\.cargo\registry\src\github.com-1ecc6299db9ec823\gfx-backend-vulkan-0.6.1\src\lib.rs:1516
   3: wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::queue_submit<wgpu_core::hub::IdentityManagerFactory,gfx_backend_vulkan::Backend>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-3d51dad24d4bec0d\3be2c45\wgpu-core\src\device\queue.rs:629
   4: wgpu::backend::direct::{{impl}}::queue_submit<core::iter::adapters::Map<alloc::vec::IntoIter<wgpu::CommandBuffer>, closure-0>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\backend\direct.rs:1507
   5: wgpu::Queue::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at C:\Users\dinne\.cargo\git\checkouts\wgpu-rs-40ea39809c03c5d8\e3eadca\src\lib.rs:2601
   6: ruffle_render_wgpu::target::{{impl}}::submit<alloc::vec::Vec<wgpu::CommandBuffer>>
             at .\render\wgpu\src\target.rs:99
   7: ruffle_render_wgpu::{{impl}}::end_frame<ruffle_render_wgpu::target::SwapChainTarget>
             at .\render\wgpu\src\lib.rs:1304
   8: ruffle_core::player::Player::render
             at .\core\src\player.rs:832
   9: ruffle_desktop::run_player::{{closure}}
             at .\desktop\src\main.rs:180
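
Both backtraces end in an assertion on the result of the backend submit, so a lost device surfaces as a panic rather than an error the application can handle. As a sketch only, with hypothetical types that do not correspond to the real gfx or wgpu API, a submit status could be turned into a recoverable error like this:

```rust
use std::fmt;

/// Hypothetical raw status, standing in for a backend result code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RawStatus {
    Success,
    DeviceLost,
}

/// Hypothetical error type the caller can match on.
#[derive(Debug, PartialEq)]
enum SubmitError {
    DeviceLost,
}

impl fmt::Display for SubmitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SubmitError::DeviceLost => write!(f, "the graphics device was lost during submit"),
        }
    }
}

impl std::error::Error for SubmitError {}

/// Convert the raw status into a Result instead of asserting on it.
fn check_submit(status: RawStatus) -> Result<(), SubmitError> {
    match status {
        RawStatus::Success => Ok(()),
        RawStatus::DeviceLost => Err(SubmitError::DeviceLost),
    }
}

/// A caller can then propagate the failure with `?` and decide how to recover.
fn end_frame(status: RawStatus) -> Result<(), SubmitError> {
    check_submit(status)?;
    Ok(())
}
```

This would not fix the hang itself, but it matches the expectation stated in the report: get an error back rather than a crash.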

@kvark
Member

kvark commented Oct 14, 2020

Oh interesting, thank you! I'll try leaving it running for 5 minutes, I guess :)

bors bot added a commit that referenced this issue Oct 14, 2020
986: [0.6] Allign stencil reference between state setting and pipeline creation r=cwfitzgerald a=kvark

**Connections**
Found this when running Ruffle from #983 

**Description**
There are 2 problems:
  1. setting the stencil reference can't be valid if the pipeline doesn't expect this (d'oh)
  2. our code that created the pipeline used slightly stricter conditions than the code setting the stencil, so these diverged a tiny bit
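
The two problems listed above can be illustrated with a sketch in which pipeline creation and stencil-reference setting share one predicate, so the two code paths cannot drift apart. The types and names below are hypothetical simplifications, not the actual wgpu-core code:

```rust
/// Hypothetical, simplified stencil operations and comparison functions.
#[derive(Clone, Copy, PartialEq)]
enum StencilOp {
    Keep,
    Replace,
}

#[derive(Clone, Copy, PartialEq)]
enum Compare {
    Always,
    Equal,
}

struct StencilFace {
    compare: Compare,
    pass_op: StencilOp,
    fail_op: StencilOp,
}

impl StencilFace {
    /// The single source of truth: does this face ever read or write the reference value?
    fn needs_reference(&self) -> bool {
        self.compare != Compare::Always
            || self.pass_op == StencilOp::Replace
            || self.fail_op == StencilOp::Replace
    }
}

struct Pipeline {
    uses_stencil_reference: bool,
}

/// Pipeline creation records whether the reference is part of its state.
fn create_pipeline(front: &StencilFace, back: &StencilFace) -> Pipeline {
    Pipeline {
        uses_stencil_reference: front.needs_reference() || back.needs_reference(),
    }
}

/// Returns the value to forward to the backend, or None when the pipeline ignores it.
fn stencil_reference_to_set(pipeline: &Pipeline, value: u32) -> Option<u32> {
    if pipeline.uses_stencil_reference {
        Some(value)
    } else {
        None
    }
}
```

Because the setter consults the flag computed at creation time, it can never forward a reference to a pipeline that does not expect one.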

**Testing**
Tested on Ruffle

Co-authored-by: Dzmitry Malyshau
@kvark
Member

kvark commented Oct 15, 2020

Is this still an issue, now that we landed the gfx-memory fix?

@Dinnerbone
Contributor Author

Dinnerbone commented Oct 15, 2020

We didn't use 0.2.1 for this, we were locked on 0.2.0.

I just upgraded to 0.2.2 just in case but the issue still persists.

I think that was a separate issue that I initially confused with this because cargo install doesn't respect lockfiles by default 😅

@kvark
Member

kvark commented Oct 15, 2020

What exactly are the repro steps now? Run cargo run --package=ruffle_desktop -- test.swf and wait N minutes?

@Dinnerbone
Contributor Author

There are two ways to reproduce it, but it looks like you need to be on Windows with a GeForce driver >= 456.38:

  • cargo run --package=ruffle_desktop -- test.swf freezes immediately, then panics after 3 minutes.
  • cargo run --package=exporter -- test.swf writes an image, test.png, which is incorrectly empty, and then hangs as the program tries to drop wgpu::Instance.

@kvark
Member

kvark commented Oct 15, 2020

@Dinnerbone finally got to test this on Windows/NV GTX 1050 Ti/Vulkan. It runs fine... although my driver version is 443, which is the latest that Lenovo considers valid for this Thinkpad X1 Extreme. Force-installing anything fresher may invite more trouble than it's worth. Looks like you found a genuine NVidia Vulkan bug. Looking forward to seeing if they respond!

@kvark kvark added external: upstream Issues happening in lower level APIs or platforms and removed help required We need community help to make this happen. labels Oct 15, 2020
@Herschel
Contributor

Herschel commented Nov 11, 2020

The hang is unfortunately still occurring with the latest two Nvidia driver versions, 457.09 and 457.30 (November 2020), and gfx-rs/wgpu-rs@2563f20.

@Herschel
Contributor

Herschel commented Nov 12, 2020

Here are minimal repro traces that hang on my machine in player.

wgpu-vulkan-hang.zip

trace_good draws a single red triangle and replays successfully.
trace_bad tries to draw a second red triangle. This displays a blank screen and hangs during recording & playback.
(trace_bad was manually edited to add the closing ] to the RON file).
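
The manual fix mentioned above (adding the closing ] so the trace parses) could be automated with a small helper. This is only a sketch: it assumes the trace is a single top-level RON list whose final bracket is missing, and it does not try to repair an action that was cut off halfway. The function name is hypothetical:

```rust
/// Append the closing `]` to a trace whose recording was cut short.
///
/// Assumption: the file is one top-level RON list. A trace that happens to end
/// in a nested `]` would be wrongly treated as already closed.
fn close_truncated_trace(text: &str) -> String {
    let trimmed = text.trim_end();
    if trimmed.ends_with(']') {
        // Looks closed already; return it unchanged.
        text.to_string()
    } else {
        format!("{}\n]\n", trimmed)
    }
}
```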

This only occurs with the vulkan backend. The traces using the dx12 backend replay correctly.

The diff between the two traces boils down to the additional buffer creation and render pass.

Removing the buffer copy in trace-bad trace.ron line 442 causes the replay to run successfully (but only displays one triangle):

    CopyBufferToBuffer(
        src: Id(7, 1, Vulkan),
        src_offset: 0,
        dst: Id(1, 1, Vulkan),
        dst_offset: 0,
        size: 64,
    ),

How to create the trace:

  • git clone -b wgpu-hang-repro https://github.com/Herschel/ruffle
  • cd ruffle/desktop
  • cargo run --features=render_trace -- -g vulkan --trace-path=trace-bad test.swf

Running the trace using wgpu/player:
cargo run --features=winit -- trace-bad

Windows 10 64-bit
Nvidia Geforce 2080 Ti, Game Ready Driver 457.30
VulkanSDK 1.2.154.1
gfx-rs/wgpu-rs@2563f20

@cwfitzgerald cwfitzgerald added external: driver-bug A driver is causing the bug, though we may still want to work around it and removed external: upstream Issues happening in lower level APIs or platforms labels Dec 1, 2020
@kvark
Member

kvark commented Apr 13, 2021

Do you guys have an API trace for a fresh version of wgpu by any chance to reproduce it?

@kvark
Member

kvark commented Apr 13, 2021

Interestingly, the "trace-bad" is replayed without issues here on "GTX 1050 Ti Max-Q" driver version 27.21.14.5256.
@Herschel could you check if you are still seeing it?

@cwfitzgerald cwfitzgerald added backend: vulkan Issues with Vulkan and removed type: bug Something isn't working labels Jun 5, 2022
@cwfitzgerald
Member

I'm going to close this as out of date. If you are still having issues, please file a new bug.

@cwfitzgerald cwfitzgerald closed this as not planned Won't fix, can't repro, duplicate, stale Dec 11, 2024

4 participants