
[webgpu] Optimize matmulnbits with M > 1 #23102

Merged: 6 commits merged into microsoft:main on Dec 17, 2024

Conversation

qjia7 (Contributor) commented Dec 13, 2024

This is the webgpu native ep implementation of #23092.

I used https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype for testing, and applied fs-eire/ort-webgpu-nodejs-chatapp-prototype#2 to print the first-token time.

The results are below.
The latest main branch:
Intel Arc Graphics
```
659 tokens in 24.8sec, 26.57 tokens/sec
    Decoding first token with input 449 tokens: 13.0 sec
    Decoding remaining 210 tokens:
        11.8 sec
        17.79 tokens/sec
```
NV RTX 2000
```
659 tokens in 14.4sec, 45.85 tokens/sec
    Decoding first token with input 449 tokens: 7.3 sec
    Decoding remaining 210 tokens:
        7.0 sec
        29.81 tokens/sec
```

With this PR:
Intel Arc Graphics
```
657 tokens in 20.6sec, 31.92 tokens/sec
    Decoding first token with input 449 tokens: 8.5 sec
    Decoding remaining 208 tokens:
        12.1 sec
        17.23 tokens/sec
```
NV RTX 2000
```
659 tokens in 11.4sec, 57.93 tokens/sec
    Decoding first token with input 449 tokens: 4.1 sec
    Decoding remaining 210 tokens:
        7.2 sec
        28.98 tokens/sec
```

From the data above, you can see that with this PR the first-token time improves on both the Intel (13.0 s -> 8.5 s) and the NV (7.3 s -> 4.1 s) GPUs.
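
For context on what the kernel computes: MatMulNBits multiplies float activations A (shape M x K, with M > 1 during prompt/prefill processing and M == 1 during decoding) by a blockwise 4-bit quantized weight matrix B that is dequantized on the fly. Below is a minimal NumPy sketch of that math; it assumes 4-bit weights, low-nibble-first packing, one scale per block, and an implicit zero point of 8, and it only illustrates the operation the tiled WebGPU shader in this PR has to compute, not the shader itself.

```python
import numpy as np

def matmulnbits_reference(A, B_packed, scales, K, N, block_size=32):
    """Illustrative reference for the math behind MatMulNBits (blockwise
    4-bit weight quantization). This is NOT the ORT WebGPU kernel; it only
    shows what the tiled shader has to compute.

    A:        (M, K) float activations. M > 1 during prefill, M == 1 during decode.
    B_packed: (N, K // 2) uint8, two 4-bit weights per byte (low nibble first, assumed).
    scales:   (N, K // block_size) float, one scale per block along K.
    Assumes no zero-point tensor, i.e. an implicit zero point of 8.
    """
    # Unpack two 4-bit quantized values from every byte of B.
    lo = (B_packed & 0x0F).astype(np.float32)
    hi = (B_packed >> 4).astype(np.float32)
    B_q = np.empty((N, K), dtype=np.float32)
    B_q[:, 0::2] = lo
    B_q[:, 1::2] = hi

    # Dequantize block by block along K: w = (q - 8) * scale.
    B_deq = np.empty_like(B_q)
    for b in range(K // block_size):
        blk = slice(b * block_size, (b + 1) * block_size)
        B_deq[:, blk] = (B_q[:, blk] - 8.0) * scales[:, b:b + 1]

    # Output is (M, N): every row of A is multiplied by the same dequantized B.
    return A @ B_deq.T
```

Intuitively, during prefill A has many rows (M > 1), so a tiled kernel can load and dequantize a tile of B once and reuse it for all rows of A in the tile; that reuse is roughly where the first-token speedups above come from, whereas the previous shader was written around the M == 1 (decode) case.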

qjia7 (Contributor, Author) commented Dec 13, 2024

@sushraja-msft @sushanthr Currently, I have only tested this on my laptop with dual GPUs; the data is in the description above. Please help verify it on your side to see whether you get similar results, since our GPUs and benchmarks are not the same.

cc @guschmue @fs-eire This PR still needs further refactoring to reduce some duplicated code. For now it is just for verification.

sushraja-msft (Contributor)


Ran your change on my Intel Xe laptop; it is faster than mine 👏 (55 tk/s vs. 44 tk/s with mine).
We should land yours, and it's okay to remove my implementation. @guschmue @qjia7

```
C:\model_benchmark>model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       9.10299e+06
        avg (tokens/s): 55.0369                                <<<<
        p50 (us):       9.09658e+06
        stddev (us):    13042.6
        n:              5 * 501 token(s)
Token generation:
        avg (us):       79482.3
        avg (tokens/s): 12.5814
        p50 (us):       79505.4
        stddev (us):    2280.06
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       18.0841
        avg (tokens/s): 55297.3
        p50 (us):       14.4
        stddev (us):    24.9088
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       19199.6
        p50 (ms):       19200.1
        stddev (ms):    20.0724
        n:              5
Peak working set size (bytes): 5470642176
WebGPU device lost (2): Device was destroyed.
```
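
As a quick sanity check, the reported tokens/s values follow directly from the averaged timings above (the variable names below are only for the arithmetic):

```python
prompt_tokens = 501
prompt_avg_us = 9.10299e6                       # "Prompt processing" avg (us)
print(prompt_tokens / (prompt_avg_us / 1e6))    # ~55.04 tokens/s (reported 55.0369)

gen_avg_us = 79482.3                            # "Token generation" avg (us) per token
print(1e6 / gen_avg_us)                         # ~12.58 tokens/s (reported 12.5814)
```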

guschmue (Contributor)

Very cool, JiaJia. I can run it on a bunch of machines.

guschmue (Contributor)

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

guschmue (Contributor)

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

Azure Pipelines successfully started running 2 pipeline(s).

guschmue (Contributor)

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

guschmue (Contributor)

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

guschmue added the ep:WebGPU (ort-web webgpu provider) label on Dec 14, 2024
qjia7 (Contributor, Author) commented Dec 16, 2024

@guschmue @fs-eire This is ready for review. Please take a look, thanks.

guschmue (Contributor)

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

guschmue (Contributor)

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

guschmue (Contributor)

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

Azure Pipelines successfully started running 2 pipeline(s).

guschmue (Contributor)

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

guschmue (Contributor)

/azp run ONNX Runtime Web CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

guschmue merged commit 0981bbf into microsoft:main on Dec 17, 2024. 77 checks passed.
guschmue pushed a commit that referenced this pull request Dec 20, 2024
### Description
After the prefill-time optimization in #23102, it turns out that always using the tiled matmulnbits program with block_size = 32 gives better performance for the Phi-3 model, even on discrete GPUs.

Phi-3 improves from 32.82 tokens/sec to 42.64 tokens/sec in easy mode on my NV RTX 2000 GPU.
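
A purely hypothetical sketch of the dispatch decision this follow-up describes (the function name is illustrative, not the actual ORT code): after this change the tiled matmulnbits program is chosen whenever block_size is 32, for decode as well as prefill, rather than only when M > 1.

```python
def use_tiled_matmulnbits(m: int, block_size: int) -> bool:
    # Hypothetical illustration: the tiled program with block_size == 32 is
    # preferred even when m == 1 (decode) and even on discrete GPUs,
    # not only for prefill (m > 1).
    return block_size == 32
```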
guschmue pushed a commit that referenced this pull request Dec 20, 2024
guschmue pushed a commit that referenced this pull request Dec 20, 2024