V0.19 seems taking toooo long on preparing BindGroupLayout #5196

Closed
cryscan opened this issue Feb 5, 2024 · 9 comments
Labels
area: performance How fast things go

Comments

@cryscan

cryscan commented Feb 5, 2024

Description
I am trying to upgrade web-rwkv, an LLM inference backend that uses compute shaders, to wgpu v0.19. After upgrading, however, the model gets slower and slower as it runs, and the GPU sits idle most of the time. I suspect that, internally, the CPU side is waiting on something.

This does not happen in v0.18.

Repro steps
Upgrade web-rwkv to v0.19 without touching anything other than adding .into_iter() when selecting adapters, then run the model via

$ cargo run --example -r chat -- -m .\assets\models\rwkv-x060-3b-world-v2-28%trained-20231208-ctx4k.st -t

Expected vs observed behavior

  • Expected: inference runs at a constant speed, since the model is an RNN.
  • Observed: inference gets progressively slower, and GPU usage drops.

Extra materials
I captured a flamegraph (attached) and found that, compared to v0.18, v0.19 spends a lot of time in wgpu::ComputePipeline::get_bind_group_layout.

v0.18

[image: flamegraph for v0.18]

v0.19

[image: flamegraph for v0.19]

Platform

  • OS: Windows 10
  • wgpu: v0.19.1
  • GPU: NVIDIA RTX 3080, RTX 4090
@nical
Contributor

nical commented Feb 5, 2024

@cwfitzgerald looks like this might have been caused by the bind group layout dedup refactor?

@cwfitzgerald
Member

Double-checked the trace. I think this is actually caused by arcanization.

Based on the trace provided, wgpu_core::identity::IdentityValues::alloc seems to be calling max_by on a slice. This is what is taking the time. I think this is one of a few leaks combined with linear behavior during identity value allocation.
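
To illustrate why a leak combined with a linear scan gets progressively slower, here is a minimal sketch. It is not wgpu-core's actual code: `LeakyAllocator`, `alloc`, `free`, and `scan_cost` are hypothetical names, and the only detail taken from the trace is that allocation runs `max_by` over a slice of live IDs. Under that assumption, each allocation visits every ID that was never released, so the total work grows roughly quadratically with the number of leaked allocations.

```rust
/// Hypothetical stand-in for an ID allocator (NOT wgpu-core's real IdentityValues).
struct LeakyAllocator {
    /// IDs currently considered in use; grows forever if callers never free.
    live: Vec<u32>,
    /// Total number of elements visited by all scans so far.
    scanned: u64,
}

impl LeakyAllocator {
    fn new() -> Self {
        Self { live: Vec::new(), scanned: 0 }
    }

    /// Allocates `max(live) + 1`, finding the maximum with a linear scan.
    fn alloc(&mut self) -> u32 {
        self.scanned += self.live.len() as u64;
        let next = self
            .live
            .iter()
            .copied()
            .max_by(|a, b| a.cmp(b))
            .map_or(0, |max| max + 1);
        self.live.push(next);
        next
    }

    /// Releases an ID so later scans no longer visit it.
    fn free(&mut self, id: u32) {
        self.live.retain(|&x| x != id);
    }
}

/// Total elements scanned while performing `n` allocations.
/// With `leak = true`, no ID is ever released.
fn scan_cost(n: u32, leak: bool) -> u64 {
    let mut allocator = LeakyAllocator::new();
    for _ in 0..n {
        let id = allocator.alloc();
        if !leak {
            allocator.free(id);
        }
    }
    allocator.scanned
}

fn main() {
    for n in [1_000u32, 2_000, 4_000] {
        println!(
            "n = {}: scanned with leak = {}, scanned without leak = {}",
            n,
            scan_cost(n, true),
            scan_cost(n, false)
        );
    }
}
```

In this sketch, doubling the number of leaked allocations roughly quadruples the scanned count, while the count stays at zero when every ID is freed before the next allocation.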

@cwfitzgerald
Member

For those hitting this problem, are you calling get_bind_group_layout every frame? To be clear, this is a bug on our side, but calling get_bind_group_layout less often should mitigate the problem.

@nathanielsimard

@cwfitzgerald, we are indeed calling get_bind_group_layout not just every frame, but for every compute kernel that we execute (which can be thousands of times per second or more). See here: https://github.com/tracel-ai/burn/blob/3eab14160875ddaa1d0527247c09d6f37f8c75c7/burn-wgpu/src/compute/server.rs#L333

We are still waiting to have many ComputePipeline instances before creating the compute pass and submitting work to the GPU. We are actually caching the ComputePipeline based on kernel id (compute shader id). Do you think there are any obvious improvements that we should make to reduce CPU overhead and better utilize the GPU?

@cwfitzgerald
Member

If you cache the bind group layouts alongside the compute pipeline, the problem should mostly go away. Currently (with the bugs) I believe performance is ~O(n^2) where n = calls to get_bind_group_layout.
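
A minimal sketch of that caching pattern follows. To stay self-contained it does not depend on the wgpu crate: `Device`, `Pipeline`, `Layout`, `CachedKernel`, and `KernelCache` are hypothetical stand-ins, and `Device::get_bind_group_layout` only mimics a call such as `pipeline.get_bind_group_layout(0)` on a real `wgpu::ComputePipeline`. The idea is to query the layout once, when the pipeline is first created, and store both under the same kernel id.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for wgpu::ComputePipeline.
struct Pipeline {
    kernel_id: u32,
}

/// Hypothetical stand-in for wgpu::BindGroupLayout.
struct Layout {
    kernel_id: u32,
}

/// Hypothetical device that counts how often a layout is queried.
struct Device {
    layout_queries: u32,
}

impl Device {
    fn create_pipeline(&self, kernel_id: u32) -> Pipeline {
        Pipeline { kernel_id }
    }

    /// Mimics `pipeline.get_bind_group_layout(0)`; this is the expensive call.
    fn get_bind_group_layout(&mut self, pipeline: &Pipeline) -> Layout {
        self.layout_queries += 1;
        Layout { kernel_id: pipeline.kernel_id }
    }
}

/// A pipeline stored together with the layout queried at creation time.
struct CachedKernel {
    pipeline: Pipeline,
    layout: Layout,
}

struct KernelCache {
    entries: HashMap<u32, CachedKernel>,
}

impl KernelCache {
    /// Returns the cached entry, creating the pipeline and querying its
    /// layout only the first time a kernel id is seen.
    fn get_or_create(&mut self, device: &mut Device, kernel_id: u32) -> &CachedKernel {
        self.entries.entry(kernel_id).or_insert_with(|| {
            let pipeline = device.create_pipeline(kernel_id);
            let layout = device.get_bind_group_layout(&pipeline);
            CachedKernel { pipeline, layout }
        })
    }
}

/// Runs `dispatches` kernel launches over `distinct_kernels` kernel ids and
/// returns how many layout queries were issued in total.
fn layout_queries_after(dispatches: u32, distinct_kernels: u32) -> u32 {
    let mut device = Device { layout_queries: 0 };
    let mut cache = KernelCache { entries: HashMap::new() };
    for i in 0..dispatches {
        let kernel = cache.get_or_create(&mut device, i % distinct_kernels);
        // Real code would build a bind group from `kernel.layout` and
        // dispatch `kernel.pipeline` here.
        assert_eq!(kernel.pipeline.kernel_id, kernel.layout.kernel_id);
    }
    device.layout_queries
}

fn main() {
    println!(
        "layout queries for 10000 dispatches over 8 kernels: {}",
        layout_queries_after(10_000, 8)
    );
}
```

With this structure, the number of layout queries depends on the number of distinct kernels rather than on the number of dispatches.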

@nathanielsimard

Just to clarify, instead of calling get_bind_group_layout on the cached ComputePipeline when we want to execute it with different buffers, we should cache the BindGroupLayout along with the ComputePipeline when we first create it to avoid the need for the subsequent call to get_bind_group_layout, is that correct?

I actually tried it, and it didn't impact the performance significantly: https://github.com/tracel-ai/burn/blob/57cc3ffe60f8526a218404d433373128c3b24f17/burn-wgpu/src/compute/server.rs#L344

@cwfitzgerald
Member

Yeah, that is what I meant, so that's a bit unexpected. Could you try running cargo flamegraph and uploading the generated flamegraph (like OP did)?

@nathanielsimard

[image: flamegraph (my_flamegraph)]

This was profiled in the middle of a training run, since it needs to run for a few seconds before it becomes slow.

@cwfitzgerald
Member

This should be worked around in 0.19.2, and a full fix should land in 0.20. The leak still exists, but you shouldn't notice.
