device.create_compute_pipeline hangs #2529

Open
peters-david opened this issue Mar 8, 2022 · 22 comments
Labels: area: correctness (We're behaving incorrectly), backend: metal (Issues with Metal), backend: vulkan (Issues with Vulkan), type: bug (Something isn't working)

Comments

@peters-david

I want to run a compute shader. Everything up to and including device.create_shader_module runs without problems and without validation errors.
The next step is to call device.create_compute_pipeline, which hangs. From the system monitor I see that this call uses around 3 GB of RAM.

I created a repo where you can reproduce this issue: Issue repo

I probably just misunderstood something about WGSL, but I am not sure how to find the problem since naga doesn't give me any errors.

Is this related to the compute shader being rather big? Could you give me some pointers how to find the problem?

Tried this on different systems:
Linux Ubuntu / Mesa Intel Iris Plus Graphics (ICL GT2)
Linux Ubuntu / NVIDIA Quadro M4000
Windows 10 / NVIDIA Quadro M4000

@kvark
Member

kvark commented Mar 9, 2022

So it hangs on 3 different systems, technically? That's definitely unexpected!

@kvark kvark added area: correctness We're behaving incorrectly type: bug Something isn't working labels Mar 9, 2022
@peters-david
Author

Yes, also tested on Ubuntu / Intel HD Graphics 3000 (SNB GT2) with the same result.

@teoxoy teoxoy added this to the WebGPU Specification V1 milestone Dec 5, 2022
@JasonS05

JasonS05 commented May 17, 2023

Any progress on this? I tried running the WebGPU samples on the latest Firefox Nightly a few days ago and all samples with compute shaders seemed to suffer the same issue. No WebGPU graphics displayed (even the portions not requiring compute shaders, if there were any), and the RAM usage was anomalously high. 3 GB sounds about right. I would estimate maybe 4 GB from my memory of what the RAM graph looked like, but that was a few days ago. The system was a laptop with Ubuntu 22.04 and an NVIDIA GPU and an Intel CPU. I think it was probably an NVIDIA RTX 30xx series card, but I don't know exactly which one. The whole computer was only $800 (2020) with 8 GB RAM and 256 GB SSD, so nothing too high-end.

Edit: turns out it was a GTX 1650

@JasonS05

I tested just now on an iMac running Mojave (10.14.6) with a "4 GHz Intel Core i7" and "AMD Radeon R9 M295X 4 GB" and when the WebGPU samples page was open the memory usage of Firefox Nightly rose steadily but slowly without apparent limit. At 9 GB I switched to a different tab and memory usage stopped rising. Then I switched back and it resumed rising. Closing the WebGPU tab instantly dropped the memory usage to only a few hundred MB. The memory leak began when I opened the Cornell Box sample (that one specifically, other compute shader ones didn't do it) and kept leaking even when I switched to other samples, even the simplest one. Only closing the tab cured the leak. This particular sample also gave an error regarding usage of an unsupported texture format (I think "bgra8unorm" or something). Also, in order to get any of the samples to run, I had to enable the "gfx.webgpu.ignore-blocklist" setting in about:config and restart the browser.

@ErichDonGubler
Member

Does this issue also happen in Google Chrome?

@JasonS05

In the WebGPU samples website all samples worked on Chrome on my iMac except the Cornell Box. So the compute capabilities function fine. As for the Cornell Box memory leak, I'll test that tomorrow on Chrome.

@JasonS05

Ok that's strange. Today on Chrome WebGPU isn't working at all on my iMac. I'm just getting TypeError: Cannot read properties of null (reading 'requestDevice'). I even tried enabling WebGPU developer features and no luck. Chrome version is 113.0.5672.92. But I know it definitely, positively worked a few days ago. Either that or I'm seriously hallucinating.

@teoxoy
Member

teoxoy commented Jun 5, 2023

@JasonS05 the issues you are facing might not be related to this bug report. Please try to reproduce this issue by trying to run the repo linked in the description or file a bug here for Firefox issues.

@teoxoy
Member

teoxoy commented Jun 5, 2023

I'm hitting #4393 while trying to run this on the DX12 backend.
On Vulkan, it doesn't hang but takes minutes for the pipeline to be created.

@peters-david was the call to create_compute_pipeline just slow or was it really hanging (never completing)?

@teoxoy teoxoy added the backend: vulkan Issues with Vulkan label Jun 5, 2023
@peters-david
Author

@teoxoy It never completed for me. The longest I waited was around 30 minutes. It may have completed after that but I didn't bother to wait longer.

@teoxoy
Member

teoxoy commented Jun 5, 2023

Did you notice the RAM usage continuously increasing? That's what I noticed while it was creating the pipeline.

Also, this issue is one year old, do you have any new findings?

@peters-david
Author

@teoxoy Yes, RAM usage increased. I haven't really worked on it since then, sorry, but I can test again if it helps.

@teoxoy
Member

teoxoy commented Jun 5, 2023

I see, np. If you'd be able to narrow down the slowness to a specific section of code within wgpu by profiling the test app that would be appreciated.
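
Before reaching for a full profiler, a coarse first step could be to time the suspect call directly. The sketch below uses only the standard library; `timed` is a hypothetical helper (not part of wgpu), and the closure body is a stand-in for the real `device.create_compute_pipeline(...)` call in the test app.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
/// Hypothetical helper for illustration; not part of wgpu.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

fn main() {
    // Stand-in workload. In the test app, this closure would instead wrap
    // the call to `device.create_compute_pipeline(...)`.
    let (sum, elapsed) = timed(|| (0..1_000_000u64).sum::<u64>());
    println!("result = {sum}, took {elapsed:?}");
}
```

Wrapping `create_shader_module` and `create_compute_pipeline` separately in this way would show which of the two calls accounts for the wait, which is a starting point for deciding where to attach a profiler.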

@JasonS05

JasonS05 commented Jun 6, 2023

> @JasonS05 the issues you are facing might not be related to this bug report. Please try to reproduce this issue by trying to run the repo linked in the description or file a bug here for Firefox issues.

I tried compiling the linked repo just now but it had several compile errors. As I am not familiar with Rust, I do not know how to proceed. These are the error messages:


   Compiling test v0.1.0 (/home/jason/Desktop/Coding Stuff/github/peters-david.test)
error[E0308]: mismatched types
    --> src/gpu.rs:19:60
     |
19   |         let instance: wgpu::Instance = wgpu::Instance::new(wgpu::Backends::all());
     |                                        ------------------- ^^^^^^^^^^^^^^^^^^^^^ expected struct `InstanceDescriptor`, found struct `Backends`
     |                                        |
     |                                        arguments to this function are incorrect
     |
note: associated function defined here
    --> /home/jason/.cargo/git/checkouts/wgpu-53e70f8674b08dd4/8b6599b/wgpu/src/lib.rs:1343:12
     |
1343 |     pub fn new(instance_desc: InstanceDescriptor) -> Self {
     |            ^^^

error[E0308]: mismatched types
    --> src/gpu.rs:67:46
     |
67   |       let shader = device.create_shader_module(&wgpu::ShaderModuleDescriptor {
     |  _________________________--------------------_^
     | |                         |
     | |                         arguments to this function are incorrect
68   | |         label: None,
69   | |         source: wgpu::ShaderSource::Wgsl(Cow::Borrowed(include_str!("shader.wgsl"))),
70   | |     });
     | |_____^ expected struct `ShaderModuleDescriptor`, found `&ShaderModuleDescriptor<'_>`
     |
note: associated function defined here
    --> /home/jason/.cargo/git/checkouts/wgpu-53e70f8674b08dd4/8b6599b/wgpu/src/lib.rs:1948:12
     |
1948 |     pub fn create_shader_module(&self, desc: ShaderModuleDescriptor) -> ShaderModule {
     |            ^^^^^^^^^^^^^^^^^^^^
help: consider removing the borrow
     |
67   -     let shader = device.create_shader_module(&wgpu::ShaderModuleDescriptor {
67   +     let shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
     |

error[E0599]: no method named `dispatch` found for struct `ComputePass` in the current scope
   --> src/gpu.rs:130:22
    |
130 |         compute_pass.dispatch(1, 1, 1); // Number of cells to run, the (x,y,z) size of item being processed
    |                      ^^^^^^^^ method not found in `ComputePass<'_>`

error[E0061]: this function takes 2 arguments but 1 argument was supplied
    --> src/gpu.rs:142:54
     |
142  |     let cpu_buffer_out_future = cpu_buffer_out_slice.map_async(wgpu::MapMode::Read);
     |                                                      ^^^^^^^^^--------------------- an argument is missing
     |
note: associated function defined here
    --> /home/jason/.cargo/git/checkouts/wgpu-53e70f8674b08dd4/8b6599b/wgpu/src/lib.rs:2547:12
     |
2547 |     pub fn map_async(
     |            ^^^^^^^^^
help: provide the argument
     |
142  |     let cpu_buffer_out_future = cpu_buffer_out_slice.map_async(wgpu::MapMode::Read, /* value */);
     |                                                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

error[E0277]: `()` is not a future
   --> src/gpu.rs:146:42
    |
146 |     if let Ok(()) = cpu_buffer_out_future.await {
    |                                          ^^^^^^
    |                                          |
    |                                          `()` is not a future
    |                                          help: remove the `.await`
    |
    = help: the trait `Future` is not implemented for `()`
    = note: () must be a future or must implement `IntoFuture` to be awaited
    = note: required for `()` to implement `IntoFuture`

Some errors have detailed explanations: E0061, E0277, E0308, E0599.
For more information about an error, try `rustc --explain E0061`.
error: could not compile `test` due to 5 previous errors

As for filing a bug report with Bugzilla, I do not have an account there so I won't be posting a bug report there for the moment.

@teoxoy
Member

teoxoy commented Jun 6, 2023

Add rev = "0ac9ce002656565ccd05b889f5856f4e2c38fa73" (it was the latest commit on the day the bug was filed) to the wgpu entry in Cargo.toml.
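
For reference, a sketch of what the pinned entry might look like, assuming wgpu is pulled in as a git dependency from the gfx-rs repository (as the cargo output further down the thread suggests). Any other keys already present on that line in the repro's Cargo.toml would stay as they are.

```toml
[dependencies]
# Pin wgpu to the commit that was current when this issue was filed.
wgpu = { git = "https://github.com/gfx-rs/wgpu", rev = "0ac9ce002656565ccd05b889f5856f4e2c38fa73" }
```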

> As for filing a bug report with Bugzilla, I do not have an account there so I won't be posting a bug report there for the moment.

Logging in via GitHub should work, but it's up to you.

@JasonS05

JasonS05 commented Jun 8, 2023

Unfortunately, that made a new error. Something about no suitable version of web-sys. Full error here:


    Updating git repository `https://github.com/gfx-rs/wgpu`
    Updating crates.io index
    Updating git repository `https://github.com/gfx-rs/naga`
    Updating git repository `https://github.com/gfx-rs/metal-rs`
error: failed to select a version for `web-sys`.
    ... required by package `wgpu v0.12.0 (https://github.com/gfx-rs/wgpu?rev=0ac9ce002656565ccd05b889f5856f4e2c38fa73#0ac9ce00)`
    ... which satisfies git dependency `wgpu` of package `test v0.1.0 (/home/jason/Desktop/Coding Stuff/github/peters-david.test)`
versions that meet the requirements `^0.3.53` (locked to 0.3.63) are: 0.3.63

the package `wgpu` depends on `web-sys`, with features: `GpuBufferUsage` but `web-sys` does not have these features.


failed to select a version for `web-sys` which could resolve this conflict

@teoxoy
Member

teoxoy commented Jun 8, 2023

Run cargo clean and delete the Cargo.lock file (at least that's what I did).

@JasonS05

JasonS05 commented Jun 8, 2023

Ok, it works now. When I ran the program it seemed to hang at the described spot with a total system memory usage hovering around 5.8 GiB. After a couple minutes it finished whatever it was doing and the program exited normally leaving the system memory at 3.9 GiB. Running the program again does not reproduce the hang and the whole thing executes in under a second.

This is with my Ubuntu 22.04.2 LTS, GTX 1650 system

@chancehudson

chancehudson commented Nov 18, 2023

I ran into this issue on a MacBook Air with an M1 processor. I found the cause to be a large multi-dimensional array in the workgroup memory space. I made a minimal repro case here: https://github.com/vimwitch/webgpu-hang-repro

Some things I noticed during testing:

  • Problem does not occur for storage memory
  • Problem occurs with single dimensional arrays
  • After waiting for the pipeline to be created once, subsequent creations do not hang. Changing the size of the array causes the next creation to hang. Reverting to the previous value after waiting for the changed size to be created does not result in another hang.
  • Changing shader logic causes the hang to occur again
  • Changing workgroup size does not cause hang to occur again
  • If the shader logic does not touch the array the hang does not occur
  • During the hang system memory and CPU use is unaffected
  • During the hang the program CPU use is 0, memory use is constant at 3.9 MB

Apple M1 MacBook Air, macOS 12.5
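
To check how the stall depends on the array size, one option is to generate the shader source with a configurable workgroup array length and time pipeline creation for each size. The sketch below is not the code from the linked repro: `shader_with_workgroup_array`, `scratch`, and `output_data` are hypothetical names, the array is one-dimensional, and the WGSL uses current attribute syntax, which may need adjusting for the older wgpu revision pinned earlier in the thread.

```rust
/// Builds WGSL source with a workgroup-space array of `len` u32 elements
/// that the entry point both writes and reads, so the array cannot be
/// dropped as unused. Hypothetical helper for illustration only.
fn shader_with_workgroup_array(len: u32) -> String {
    format!(
        r#"var<workgroup> scratch: array<u32, {len}>;

@group(0) @binding(0) var<storage, read_write> output_data: array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_index) i: u32) {{
    scratch[i] = i;
    workgroupBarrier();
    output_data[i] = scratch[i];
}}
"#
    )
}

fn main() {
    // Each generated source would be passed to `create_shader_module` and
    // then `create_compute_pipeline`; here we only print a summary.
    for len in [64u32, 1024, 4096] {
        let source = shader_with_workgroup_array(len);
        println!("// len = {len}, {} bytes of WGSL", source.len());
    }
}
```

The sizes in `main` are arbitrary; 4096 `u32` elements is 16384 bytes, so larger values may be rejected by validation depending on the device's workgroup storage limit.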

@chancehudson

I profiled the repro above: https://share.firefox.dev/3G3Al3W


@teoxoy teoxoy added the backend: metal Issues with Metal label Nov 20, 2023
@Forpee

Forpee commented Dec 15, 2023

Is there any progress on this issue? I'm encountering a similar problem where my program stalls on the device.create_compute_pipeline line.

To me, it looks like the compute pipeline pre-runs the shader on first creation, which causes this long stall before it completes. I noticed that the more time-intensive the functions I ran in my main function, the longer the pipeline took to complete.

@jimblandy
Member

Assigning Teo to try to reproduce, investigate cause, and estimate size.

@teoxoy teoxoy removed their assignment Jul 26, 2024