
CommandEncoder::run_render_pass takes much longer to execute when there are many vertex and index buffers allocated, regardless of the number submitted to the RenderPass #5514

Closed
RedMindZ opened this issue Apr 10, 2024 · 3 comments
Labels
area: performance How fast things go

Comments

@RedMindZ

RedMindZ commented Apr 10, 2024

Description
The time it takes to execute CommandEncoder::run_render_pass is heavily impacted by the total number of allocated vertex and index buffers, regardless of how many of them are actually submitted to a RenderPass.

Expected vs observed behavior

  • With 2 vertex buffers and 2 index buffers allocated and submitted to the render pass, the call takes ~50 microseconds.
  • With ~2000 vertex buffers and ~2000 index buffers allocated and only 2 vertex buffers and 2 index buffers submitted to the render pass, the call takes ~500 microseconds.

This slowdown can also be observed with a large number of unused bind groups.

I have also confirmed with RenderDoc that the exact same render commands are sent to the GPU.

I am not familiar with the internals of wgpu, but I assumed the time it takes to encode a render pass should only depend on the content of the render pass, and be independent of the total number of GPU objects in existence.

Repro steps
Allocate a small number (e.g. 2) of vertex and index buffers, submit them to a RenderPass, and measure how long the RenderPass's Drop implementation takes to run.
Then allocate a large number (e.g. 2000) of vertex and index buffers, submit the same number as before to a RenderPass, and measure the Drop time again. A sketch of this measurement is shown below.
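
For concreteness, here is a minimal sketch of the measurement, assuming a working wgpu 0.19 setup. The names `device`, `queue`, `view`, `pipeline`, `vertex_buffer`, and `index_buffer` are placeholders, not code from the original report:

```rust
use std::time::Instant;

let mut encoder =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
{
    let mut render_pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
        label: None,
        color_attachments: &[Some(wgpu::RenderPassColorAttachment {
            view: &view,
            resolve_target: None,
            ops: wgpu::Operations {
                load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
                store: wgpu::StoreOp::Store,
            },
        })],
        depth_stencil_attachment: None,
        timestamp_writes: None,
        occlusion_query_set: None,
    });
    render_pass.set_pipeline(&pipeline);
    render_pass.set_vertex_buffer(0, vertex_buffer.slice(..));
    render_pass.set_index_buffer(index_buffer.slice(..), wgpu::IndexFormat::Uint16);
    render_pass.draw_indexed(0..6, 0, 0..1);

    // In wgpu 0.19 the pass is actually encoded when the RenderPass is dropped,
    // so time the drop itself.
    let start = Instant::now();
    drop(render_pass);
    println!("RenderPass drop took {:?}", start.elapsed());
}
queue.submit(Some(encoder.finish()));
```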

Extra materials
I can provide the traces if necessary, but they are very large (over 4GB), so let me know if they are actually necessary.

Platform
bevy 0.13 with wgpu 0.19.1 running on Windows 10 version 10.0.19045 with a GTX 1080 and driver version 536.23. Tested with both DX12 and Vulkan.

@Wumpf
Member

Wumpf commented Apr 10, 2024

Interesting, sounds like hash lookups for the underlying vertex/index buffer ids just take a bit longer, or some other cache-miss-heavy thing occurs. I'm not aware of anything else that would make this slower by design. 500µs seems crazy long for this, though.
Minimal repro code would be appreciated if possible!

Wumpf added the area: performance label on Apr 10, 2024
@RedMindZ
Author

RedMindZ commented Apr 11, 2024

I have created a minimal example based on the hello_triangle example here: https://gist.github.com/RedMindZ/eb1033b0b903d35f5cfba6919bcad25d

The important changes are in main.rs, on lines 57-65 and lines 142-145:

  • Lines 57-65 simply allocate 2M vertex buffers, without ever using them.
  • Lines 142-145 measure the time it takes to drop the render pass.

The rest of the lines are effectively the same as the hello_triangle example. You can resize the window to get it to redraw and encode another render pass. With 2M vertex buffers allocated, it takes about 16ms to encode the render pass.
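
A rough sketch of the kind of change on lines 57-65 (this is not the gist's exact code; the buffer size and count literal are illustrative):

```rust
// Allocate a large number of vertex buffers that are never bound to any pass.
// Only their existence matters for reproducing the slowdown.
let _unused_buffers: Vec<wgpu::Buffer> = (0..2_000_000)
    .map(|_| {
        device.create_buffer(&wgpu::BufferDescriptor {
            label: None,
            size: 16, // arbitrary small size
            usage: wgpu::BufferUsages::VERTEX,
            mapped_at_creation: false,
        })
    })
    .collect();
```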

This minimal example let me profile the code effectively, and the profiler points to the UsageScope struct: specifically, to dropping the UsageScope and calling new on it. The call to new sets the size of the buffers and textures fields, each of which allocates a vector of that size.
Since that size is 2M, allocating and dropping the UsageScope struct takes a while.
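
To illustrate the cost pattern (a simplified sketch only, not wgpu's actual tracker code; the struct and field names are made up):

```rust
// The point is that the per-pass tracking state is sized to the total number
// of live resources, so encoding cost scales with how many buffers exist,
// not with how many the pass actually uses.
struct UsageScopeSketch {
    buffer_states: Vec<Option<u32>>, // one slot per allocated buffer id
}

impl UsageScopeSketch {
    fn new(total_buffers: usize) -> Self {
        // With ~2,000,000 live buffers, building a scope like this per render
        // pass allocates (and later frees) a multi-megabyte vector every time.
        Self { buffer_states: vec![None; total_buffers] }
    }
}
```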

It also looks to me like this is already addressed on the main branch (trunk), where UsageScope uses a pool to allocate those vectors, but it is still an issue in 0.19. Would it be reasonable for me to just use the main branch, or is it too unstable?

Edit: Seems like #5414 was created exactly to address this issue.

@Wumpf
Member

Wumpf commented Apr 12, 2024

Ah 🤦, I didn't realize the connection with #5414 despite having reviewed it myself; I was too fixated on it being about allocation and didn't make the connection of when and where said allocations happen. Thank you so much for following up!
That also means we can close this ticket, as it's solved on trunk (please reopen if your testing shows otherwise).

We actually wanted to do a release very soon, but some of the issues on https://github.com/gfx-rs/wgpu/milestone/19 are still blocking. If you're not bothered by that, I'd even encourage you to use trunk - users that do are often the only way we can be reasonably certain that trunk is ready to release.

Wumpf closed this as not planned on Apr 12, 2024