
Add first token's time #2

Merged (2 commits into fs-eire:main on Dec 16, 2024)
Conversation

@qjia7 (Contributor) commented Dec 13, 2024

No description provided.

@fs-eire (Owner) commented Dec 13, 2024

is it still WIP?

@qjia7 qjia7 changed the title [WIP] add first token's time add first token's time Dec 16, 2024
@qjia7 qjia7 changed the title add first token's time Add first token's time Dec 16, 2024
@qjia7 (Contributor, Author) commented Dec 16, 2024

> is it still WIP?

Ready for review. Thanks.

@qjia7 qjia7 requested a review from fs-eire December 16, 2024 01:37
@fs-eire fs-eire merged commit c8212f1 into fs-eire:main Dec 16, 2024
guschmue pushed a commit to microsoft/onnxruntime that referenced this pull request Dec 17, 2024
This is the WebGPU native EP implementation of #23092.

I used https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype for testing, with fs-eire/ort-webgpu-nodejs-chatapp-prototype#2 applied to print the first-token time.

The results are shown below.

Latest main branch:
Intel Arc Graphics
```
659 tokens in 24.8sec, 26.57 tokens/sec
    Decoding first token with input 449 tokens: 13.0 sec
    Decoding remaining 210 tokens:
        11.8 sec
        17.79 tokens/sec
```
NV RTX 2000
```
659 tokens in 14.4sec, 45.85 tokens/sec
    Decoding first token with input 449 tokens: 7.3 sec
    Decoding remaining 210 tokens:
        7.0 sec
        29.81 tokens/sec
```

-------------------------------------------------------------------------
With this PR:
Intel Arc Graphics
```
657 tokens in 20.6sec, 31.92 tokens/sec
    Decoding first token with input 449 tokens: 8.5 sec
    Decoding remaining 208 tokens:
        12.1 sec
        17.23 tokens/sec
```
NV RTX 2000
```
659 tokens in 11.4sec, 57.93 tokens/sec
    Decoding first token with input 449 tokens: 4.1 sec
    Decoding remaining 210 tokens:
        7.2 sec
        28.98 tokens/sec
```

From the above data, this PR improves the first-token time on both the Intel (13.0 s -> 8.5 s) and NV (7.3 s -> 4.1 s) GPUs.
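The measurement being added is time-to-first-token (TTFT): how long the prefill of the input prompt takes before the first generated token arrives, as distinct from the per-token decode throughput. As a rough illustration only (not the actual patch from the prototype repo), a minimal JavaScript sketch of timing the first token in a streaming loop, where `generateTokens` is a hypothetical stand-in for the model's decode loop:

```javascript
// Hypothetical async token generator standing in for the model's
// decode loop; each iteration simulates one decode step.
async function* generateTokens(n) {
  for (let i = 0; i < n; i++) {
    await new Promise((resolve) => setTimeout(resolve, 10));
    yield `tok${i}`;
  }
}

// Consume the stream, recording time-to-first-token separately
// from the total generation time.
async function runWithTiming(n) {
  const start = Date.now();
  let firstTokenTime = null;
  let count = 0;
  for await (const tok of generateTokens(n)) {
    if (firstTokenTime === null) {
      firstTokenTime = Date.now() - start; // time to first token
    }
    count++;
  }
  const totalMs = Date.now() - start;
  return { count, firstTokenTime, totalMs };
}

runWithTiming(5).then(({ count, firstTokenTime, totalMs }) => {
  const tps = (count / (totalMs / 1000)).toFixed(2);
  console.log(
    `${count} tokens in ${(totalMs / 1000).toFixed(1)}sec, ${tps} tokens/sec; ` +
    `first token: ${(firstTokenTime / 1000).toFixed(2)} sec`
  );
});
```

Splitting the timing this way is what makes the comparison above possible: the PR speeds up the prefill (first token), while the remaining per-token decode rate stays roughly the same.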
guschmue pushed a commit to microsoft/onnxruntime that referenced this pull request Dec 20, 2024
tarekziade pushed a commit to tarekziade/onnxruntime that referenced this pull request Jan 10, 2025