[Web] Demucs model won't run in both WASM and WGPU #22031

Closed
gianlourbano opened this issue Sep 9, 2024 · 41 comments
Labels
ep:WebGPU ort-web webgpu provider platform:web issues related to ONNX Runtime web; typically submitted using template

Comments

@gianlourbano

Describe the issue

I converted the model from PyTorch to ONNX as described here, with some issues. The model works in Python with ONNX Runtime, but in wasm/webgpu the runtime dies without reporting an error. The optimized version of the model runs in wasm, but not in webgpu. I don't know whether the problem is related to the model conversion or to the runtime. I have tested with both @latest and @dev.

To reproduce

Here's a link to a sample repo, instructions in README.

Urgency

Urgent, as this project is related to my thesis

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.2, 1.20.0-dev.20240907-ad9afbb042

Execution Provider

'wasm'/'cpu' (WebAssembly CPU), 'webgpu' (WebGPU)

@gianlourbano gianlourbano added the platform:web issues related to ONNX Runtime web; typically submitted using template label Sep 9, 2024
@github-actions github-actions bot added the ep:WebGPU ort-web webgpu provider label Sep 9, 2024
@gyagp

gyagp commented Sep 10, 2024

For the WebGPU EP, the problem is related to the Unsqueeze op. According to the ONNX spec (https://onnx.ai/onnx/operators/onnx__Unsqueeze.html), the axes of Unsqueeze are a list of integers, but in your model axes is just a scalar "1".
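To make the scalar-versus-list distinction concrete, here is a minimal pure-Python sketch of how an Unsqueeze output shape can be computed while tolerating a scalar `axes`. The helper names `normalize_axes` and `unsqueeze_shape` are hypothetical and are not taken from ONNX Runtime; the sketch only follows the shape rule in the ONNX spec linked above.

```python
from typing import List, Sequence, Union


def normalize_axes(axes: Union[int, Sequence[int]]) -> List[int]:
    """Accept a scalar or a sequence and always return a list of ints."""
    if isinstance(axes, int):
        return [axes]
    return list(axes)


def unsqueeze_shape(shape: Sequence[int], axes: Union[int, Sequence[int]]) -> List[int]:
    """Compute the output shape of Unsqueeze for the given input shape."""
    axes_list = normalize_axes(axes)
    out_rank = len(shape) + len(axes_list)
    # Negative axes count from the end of the *output* shape.
    resolved = sorted(a + out_rank if a < 0 else a for a in axes_list)
    if len(set(resolved)) != len(resolved) or any(not 0 <= a < out_rank for a in resolved):
        raise ValueError(f"invalid axes {axes!r} for input rank {len(shape)}")
    dims = iter(shape)
    # Insert a 1 at every requested axis; copy input dims everywhere else.
    return [1 if i in resolved else next(dims) for i in range(out_rank)]


# A scalar axis and a one-element list give the same result.
print(unsqueeze_shape([3, 4], 1))
print(unsqueeze_shape([3, 4], [1]))
```

A kernel that indexes `axes` as a vector without a normalization step like this is the kind of code that can fail when the exporter emits a scalar.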

@gianlourbano
Author

So the problem is related to the dynamo export of torch?

@fs-eire
Contributor

fs-eire commented Sep 11, 2024

Technically the axes should always be a 1D tensor. However, in practice, the CPU code has loosened this restriction:

https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/tensor/unsqueeze.cc#L60-L62

Perhaps WebGPU should have the same behavior as CPU.

#22054

gyagp pushed a commit to gyagp/onnxruntime that referenced this issue Sep 12, 2024
This is to fix issue microsoft#22031 to run model demucs.
For conv-transpose, outputPadding.length could be 1, while spatialRank
is 2. The fix is to append enough 0s to outputPadding.
For conv, the issue is similar. kernelShape.length sometimes could be 1,
while inputs[1].dims.length is 4. The fix is also to append enough 0s to
kernelShape.
fs-eire pushed a commit that referenced this issue Sep 17, 2024
This is to fix issue #22031 to run model demucs.
For conv-transpose, outputPadding.length could be 1, while spatialRank
is 2. The fix is to append enough 0s to outputPadding. For conv, the
issue is similar. kernelShape.length sometimes could be 1, while
inputs[1].dims.length is 4. The fix is also to append enough 0s to
kernelShape.
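The padding fix described in the commit messages above can be sketched in a few lines. This is an illustration only: `pad_with_zeros` is a hypothetical helper, not the function used in the actual change, and it shows nothing beyond "append enough 0s" to reach the expected length.

```python
from typing import List, Sequence


def pad_with_zeros(values: Sequence[int], target_len: int) -> List[int]:
    """Return a copy of `values` with trailing zeros appended up to `target_len`."""
    padded = list(values)
    # A negative difference yields an empty list, so longer inputs pass through unchanged.
    padded.extend([0] * (target_len - len(padded)))
    return padded


# conv-transpose case from the commit message: outputPadding has length 1, spatialRank is 2.
output_padding = pad_with_zeros([1], 2)

# conv case: kernelShape has length 1 while the weight tensor has 4 dims,
# i.e. 2 spatial dims after the two channel dims (the value 8 is made up).
kernel_shape = pad_with_zeros([8], 4 - 2)

print(output_padding, kernel_shape)
```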
@gianlourbano
Author

@gyagp With the latest 1.20.0-dev.20240917-afd642a194, which should include both fixes, I still cannot run the model in WebGPU; the runtime just aborts after displaying the WebGPU experimental warning.

@gyagp

gyagp commented Sep 19, 2024

I also hit an issue with the latest code, and I will take a closer look.
BTW, I manually modified the model earlier to work around the Unsqueeze issue, and that model seems to run. I uploaded it to https://huggingface.co/webai-community/models/tree/main (click "download file" next to demucs.onnx).

@gianlourbano
Author

gianlourbano commented Sep 19, 2024

Your model successfully runs with the latest @dev, with these timings (60 s of audio in 10 s chunks):

wasm:
step 0: 12656 ms
step 1: 12864 ms
step 2: 13211 ms
step 3: 13164 ms
step 4: 13643 ms
step 5: 13687 ms

wgpu:
step 0: 10226 ms
step 1: 9612 ms
step 2: 9628 ms
step 3: 9647 ms
step 4: 9600 ms
step 5: 9562 ms

onnx python cpu:
step 0: 4.9 s
step 1: 4.9 s
step 2: 4.6 s
step 3: 4.9 s
step 4: 4.8 s
step 5: 4.6 s

On ryzen 4600H
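For a quick comparison, the step times above can be averaged per backend. The sketch below only restates the reported numbers; the variable names are made up for illustration.

```python
from statistics import mean

# Step times copied from the comment above (one step = one 10 s chunk).
wasm_ms = [12656, 12864, 13211, 13164, 13643, 13687]
wgpu_ms = [10226, 9612, 9628, 9647, 9600, 9562]
python_cpu_ms = [s * 1000 for s in [4.9, 4.9, 4.6, 4.9, 4.8, 4.6]]

averages = {
    "wasm": mean(wasm_ms),
    "wgpu": mean(wgpu_ms),
    "onnx python cpu": mean(python_cpu_ms),
}

for name, avg in averages.items():
    print(f"{name}: {avg:.1f} ms per step")

# How much faster wgpu is than wasm, and how far both are from native CPU.
wgpu_vs_wasm = averages["wasm"] / averages["wgpu"]
wgpu_vs_python = averages["wgpu"] / averages["onnx python cpu"]
print(f"wasm / wgpu: {wgpu_vs_wasm:.2f}x, wgpu / python cpu: {wgpu_vs_python:.2f}x")
```

On these numbers WebGPU is roughly 1.4x faster than wasm, but still about 2x slower than the native Python CPU run on the same machine.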

@gianlourbano
Author

I have also tried on a macbook m1 pro with an average wgpu step of ~2.8s

gyagp pushed a commit to gyagp/onnxruntime that referenced this issue Sep 29, 2024
While allowing axes in unsqueeze to be scalar, its shape couldn't be
always accessed like a vector. This PR fixes issue microsoft#22031 so that the
original model could run well.
fs-eire pushed a commit that referenced this issue Sep 30, 2024
While allowing axes in unsqueeze to be scalar, its shape couldn't be
always accessed like a vector. This PR fixes issue #22031 so that the
original model could run well.
@gianlourbano
Author

@gyagp After implementing pre- and post-processing for demixing a whole track, I have noticed that the wgpu outputs are very different from the wasm ones. In wasm the model works as expected, while on the GPU the stems are all mixed up, apart from the bass one: I suspect the lower frequencies are preserved while something strange happens to the higher ones. Maybe an error in some kernel?

If you want, I can upload the stems of a 10 s chunk for wasm/wgpu inference somewhere, so you can see the difference. I'm certain the problem is not in the pre/post-processing, as the raw outputs of the model differ between the two backends.

Also, any update on the MatMul problem?

@gyagp

gyagp commented Oct 18, 2024

Sorry to hear that you got different results from wasm and WebGPU. If you can upload your case somewhere, I can take a look next week.
What's the MatMul problem?

@gianlourbano
Author

I'll update the sample repo in this issue so that it computes on the same random arrays both on wasm and wgpu, to demonstrate that the outputs are different based on the backend used.
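A backend comparison of this kind usually boils down to feeding both backends the same seeded input and checking the outputs element-wise within a tolerance. Below is a minimal pure-Python sketch under that assumption; the function names and the tolerance values are illustrative and are not taken from the sample repo.

```python
import random
from typing import List, Sequence


def make_input(n: int, seed: int) -> List[float]:
    """Deterministic pseudo-random input so both backends see identical data."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat outputs."""
    if len(a) != len(b):
        raise ValueError("outputs have different lengths")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def outputs_match(a: Sequence[float], b: Sequence[float],
                  atol: float = 1e-4, rtol: float = 1e-3) -> bool:
    """True if every element of `a` is close to the reference `b`."""
    if len(a) != len(b):
        raise ValueError("outputs have different lengths")
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


# Stand-ins for flattened model outputs from the two backends.
wasm_out = [0.10, -0.25, 0.80]
wgpu_out = [0.10, -0.25, 0.35]
print(max_abs_diff(wasm_out, wgpu_out), outputs_match(wgpu_out, wasm_out))
```

Small differences from floating-point reordering are expected between CPU and GPU, so an exact equality check would be too strict; a large `max_abs_diff` points to a real kernel or pipeline problem.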

The MatMul problem is the one you mentioned here, i.e., the performance of the wgpu model is not that great.

@gyagp

gyagp commented Oct 18, 2024

Ah, sorry that it's a bit buried by other tasks. I will ask someone from my team to look into it next week.

qjia7 added a commit to qjia7/onnxruntime that referenced this issue Oct 29, 2024
BUG microsoft#22031

Optimize below two situations:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid transpose if not necessary.

The overall time of demucs model becomes 106.36 ms from 154.60 ms on my
dGPUs with this PR and PR microsoft#22577
guschmue pushed a commit that referenced this issue Oct 30, 2024
BUG #22031

Optimize below two situations:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid transpose if not necessary.

The overall time of demucs model becomes 106.36 ms from 154.60 ms on my
dGPUs with this PR and PR #22577
@gianlourbano
Author

gianlourbano commented Oct 31, 2024

Thank you very much @qjia7! On my MacBook Pro M1 the step time is now 1.9 s, down from 2.8 s. I'm still seeing wrong outputs for the model in wgpu, while in wasm it works fine. If you want, I can upload some stems to the sample repo so you can see the difference.

@qjia7
Contributor

qjia7 commented Nov 1, 2024

@gianlourbano Will look at the wrong outputs issue. And the optimization isn't over yet. There are still several places that need to be optimized.

@qjia7
Contributor

qjia7 commented Nov 1, 2024

@gianlourbano I debugged this model. The incorrect result occurs because the MatMul shader key is not unique, which results in the wrong compute pipeline being loaded. PR #22536 may fix the issue. You can try it once this PR is merged.
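The failure mode described here, a cache key that does not capture everything a compiled pipeline depends on, can be reproduced with a toy cache. Everything in this sketch is hypothetical: the `PipelineCache` class, the key functions, and the `tile` parameter are stand-ins and do not reflect how ONNX Runtime actually builds its shader keys.

```python
from typing import Callable, Dict, Hashable, Tuple

Config = Tuple[str, Tuple[int, ...], int]  # (op name, shape, tile size)


class PipelineCache:
    """Toy pipeline cache; a 'pipeline' is just the config it was compiled for."""

    def __init__(self, key_fn: Callable[[str, Tuple[int, ...], int], Hashable]) -> None:
        self._key_fn = key_fn
        self._cache: Dict[Hashable, Config] = {}
        self.compilations = 0

    def get(self, op: str, shape: Tuple[int, ...], tile: int) -> Config:
        key = self._key_fn(op, shape, tile)
        if key not in self._cache:
            self.compilations += 1
            self._cache[key] = (op, shape, tile)
        return self._cache[key]


# Weak key: ignores the tile size, so two different shaders collide.
weak = PipelineCache(lambda op, shape, tile: (op, len(shape)))
weak.get("MatMul", (3448, 1, 512), tile=8)
wrong = weak.get("MatMul", (1, 3448, 512), tile=32)

# Unique key: includes everything the generated shader depends on.
strong = PipelineCache(lambda op, shape, tile: (op, shape, tile))
strong.get("MatMul", (3448, 1, 512), tile=8)
right = strong.get("MatMul", (1, 3448, 512), tile=32)

print(wrong, right)
```

With the weak key the second request silently receives the pipeline compiled for the first configuration, which runs without error but produces wrong numbers, consistent with the symptom reported in this thread.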

guschmue pushed a commit that referenced this issue Nov 1, 2024
### Description
BUG #22031

In the demucs model, there are lots of MatMul ops with shapes like
below:
`input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32,
output[0]: [3448,1,1536] | float32`

We can see that for this kind of shape, the batch size is a big value,
but M = 1. Our current algorithm is based on [M, N] to partition tiles,
which is not efficient for such kind of shapes. This PR reshapes the
inputs to improve the matmul performance.
Before:  [3448,1,512] x [512,1536] =  [3448,1,1536]
After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536] , then the output
can be reshaped to [3448, 1, 1536]

The overall MatMul time in demucs model becomes 1778.45 ms from 4418.17
ms on my iGPUs.

---------

Co-authored-by: Yulong Wang
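The reshape trick in the commit message above relies on the fact that a batched matmul with M = 1 is the same computation as one ordinary matmul over the stacked rows. The following pure-Python sketch checks that equivalence on tiny shapes; the function names are illustrative and nested lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def matmul_2d(a: Matrix, b: Matrix) -> Matrix:
    """Plain [M, K] x [K, N] matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batched_matmul(a3: List[Matrix], b: Matrix) -> List[Matrix]:
    """[B, M, K] x [K, N] -> [B, M, N], one small matmul per batch entry."""
    return [matmul_2d(a2, b) for a2 in a3]


def reshaped_matmul(a3: List[Matrix], b: Matrix) -> List[Matrix]:
    """Same result for M == 1: treat [B, 1, K] as [1, B, K], multiply once, reshape back."""
    if any(len(a2) != 1 for a2 in a3):
        raise ValueError("reshape trick shown here requires M == 1")
    rows = [a2[0] for a2 in a3]      # [B, K], i.e. the single batch of [1, B, K]
    out = matmul_2d(rows, b)         # [B, N]
    return [[row] for row in out]    # back to [B, 1, N]


a = [[[1, 2]], [[3, 4]], [[5, 6]]]   # shape [3, 1, 2]
b = [[1, 2], [3, 4]]                 # shape [2, 2]
print(batched_matmul(a, b) == reshaped_matmul(a, b))
```

The arithmetic is identical in both paths; what changes is that a tiling scheme partitioned over [M, N] sees M = 3448 instead of M = 1, which is where the reported speedup comes from.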
qjia7 added a commit to qjia7/onnxruntime that referenced this issue Nov 4, 2024
BUG microsoft#22031

The total Gemm time in demucs model becomes 181.14 ms from over 1000 ms
on my iGPUs.
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this issue Nov 19, 2024
…microsoft#22709)

microsoft#22031

For reduce related ops, we should increase workgroupSize to improve
parallelism if only one workgroup is dispatched.

The total ReduceMean time becomes 8.98 ms from 77.79 ms on my iGPUs.
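Why a larger workgroup helps when only one workgroup is dispatched can be seen from a rough cost model: each thread first accumulates its share of the reduced axis sequentially, so more threads means a shorter sequential loop. The sketch below is an assumption-laden illustration, not the heuristic from the PR; the sizes 64 and 256 and both function names are made up.

```python
import math


def choose_workgroup_size(num_workgroups: int,
                          default_size: int = 64,
                          max_size: int = 256) -> int:
    """Hypothetical rule: use the largest workgroup when only one is dispatched."""
    return max_size if num_workgroups == 1 else default_size


def sequential_steps(reduce_len: int, workgroup_size: int) -> int:
    """Elements each thread accumulates before the in-workgroup combine step."""
    return math.ceil(reduce_len / workgroup_size)


reduce_len = 4096
for groups in (1, 8):
    size = choose_workgroup_size(groups)
    print(groups, size, sequential_steps(reduce_len, size))
```

With many workgroups in flight the GPU is already busy, so the default size is kept; with a single workgroup, the extra threads are otherwise idle hardware.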
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this issue Nov 19, 2024
BUG microsoft#22031 

The overall time of ConvTranspose in Demucs model becomes 517.41 ms from
1415.65 ms on my iGPUs.
@gyagp

gyagp commented Nov 29, 2024

@gianlourbano, Google improved the perf via https://issues.chromium.org/issues/379009123. Could you please try the latest Chrome Canary to check the performance? @qjia7 and I may not have access to a MacBook at the moment.

@gianlourbano
Author

@gyagp @qjia7 I can confirm that on Canary the performance has returned to normal, and is even faster, I would say (average step of 1.2 seconds). Thank you very much for your help!

guschmue pushed a commit that referenced this issue Dec 2, 2024
…#22709)

#22031

For reduce related ops, we should increase workgroupSize to improve
parallelism if only one workgroup is dispatched.

The total ReduceMean time becomes 8.98 ms from 77.79 ms on my iGPUs.
guschmue pushed a commit that referenced this issue Dec 2, 2024
BUG #22031 

The overall time of ConvTranspose in Demucs model becomes 517.41 ms from
1415.65 ms on my iGPUs.
@fs-eire
Contributor

fs-eire commented Dec 18, 2024

Closing the issue, as the corresponding fixes and features have been merged and verified.

@fs-eire fs-eire closed this as completed Dec 18, 2024