-
-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[unlimited:waifu2x] Multithreading is possible but not configured properly #34
Comments
It's astonishing how fast unlimited:waifu2x can get in Firefox with 12 threads. Seems Firefox really is the best at WebAssembly JIT. It's possible to make it even faster by making the models compatible with ONNX runtime's WebGL or WebGPU backends, so that they can be executed on the GPU, just like with CUDA. In fact, the WebGPU backend might already be compatible (but I have not looked into this yet) Compatibility mostly consists of removing operators that WebGL doesn't support, like ConstantOfShape. Some optimizers can already recognize and remove these. But you also have to adjust int64 values so that they fit in 32 bits ( This can be done manually in a python debugger (as I have successfully done for some models). And not all the actual int64 types have to be converted to int32 (although the int64 casts need to be removed), they just need to fit in an int32. I have successfully gotten some of the utility models to load in ONNX runtime's WebGL backend, but unfortunately, this isn't very useful because the utility models are mostly precalculations, and the most expensive part is the actual upscaling model, which uses operators that WebGL doesn't support, namely I'm also looking into seeing if I can actually add support for ConstantOfShape into the WebGL backend myself, but of course this is not very easy since I cannot build ONNX runtime from source yet. Maybe I will modify the minified JS (hehehe....). My personal version of unlimited:waifu2x is based on a TypeScript translation/rewrite of reverse engineered minified code. |
Thank you for sharing.
I tried this before but did not applied it as it was slower than the original code on chrome.
I gave up on using WebGL backend because of the many unsupported functions.
I tried it recently but it did not work yet. microsoft/onnxruntime#15796 As of now, I am hoping to get WebGPU backend to work. |
This is clearly not true anymore
I also tried modifying that script. But int32 conversion is not required, only reducing magnitude of the values. And full conversion causes the model to fail validation anyway, because some operators require int64 attributes.
Good to know~
You're right, it doesn't. This issue itself is about WASM multithreading, not WebGL (that was just a slightly related comment).
Absolutely |
Also pytorch version (cli/server and training) is running on 16-bit float (half float). |
It should be sufficient to convert the input tensor to float16 and back each time you run the model. You can probably use these converter functions and use Uint16Array tensors as float16. Then use a model that expects float16. I will probably perform my own experiments once my codebase is functional |
OK, I have confirmed that it is faster with multithreading.
I may have made that mistake before, as it gets very slow when DevTools is open. One thing that is not great is that all javascript files must be hosted locally to enable SharedArrayBuffer. |
You mean vendored (on your server that serves the correct HTTP headers)? You should have been doing that anyway. You should not depend on CDNs for your website's main functionality. You host the models on your server so why not host the runtime to execute them? |
Ahh, I remember. One of the reasons I didn't use it is because it would not work with Google Analytics or Adsense. |
Why not? Can't you vendor those scripts as well? |
It needs to load scripts from third-party servers. |
If you are ok with only being compatible with Chrome 96 and higher, setting https://chromestatus.com/feature/4918234241302528 The header works on my chrome and SharedArrayBuffer exists with it. But this does not enable multithreading in firefox (firefox does not support it). Also try adding the |
For now, I have not been able to get Adsense to work with cross-origin isolation env. |
Is there any way to get firefox support as well? |
- Inference is defined as an async method, but, it blocks. After a couple days of trying all avenues and looking at sample apps, it looks like it is synchronous in that it will consume the attention of the thread the `await session.run` is called on. - Using Squadron to handle multi-threading didn't work. Now that the JS function in index.html is loading the model and passing it to a worker, it's possible it might. - In any case, this shows exactly how to set up a worker that A) does inference without blocking UI rendering B) allows Dart code to `await` the result without blocking UI - This process was frustrating and fraught, there's a surprising lack of info and examples around ONNX web. Most seem to consume it via diffusers.js/transformers.js. ONNX web was a separate library from the rest of the ONNX runtime until sometime around late 2022. The examples still use that library, and the examples use simple enough models that it's hard to catch whether they are blocking without falling back to dev tools. - Its absolutely crucial when debugging speed locally to make sure you're loading the ONNX version you expect (i.e. wasm AND threaded AND simd). The easiest way to check is network loads in Dev Tools, sort by size, and look for the .wasm file to A) be loaded B) include wasm, simd, and threaded in the filename. - Two things can prevent that: -- CORS nonsense with Flutter serving itself in debug mode: --- see here, nagadomi/nunif#34 --- note that the extension became adware, you should have Chrome set up its permissions such that it isn't run until you click it. Also, note that you have to do that each time the Flutter web app in debug mode's port changes. -- MIME type issues --- Even after that, I would see errors in console logs about the MIME type of the .wasm being incorrect and starting with the wrong bytes. That, again, seems due to local Flutter serving of the web app. To work around that, you can download the WASM files from the same CDN folder that hosts ort.min.js (see worker.js) and also in worker.js, remove the // in front of ort.env.wasm.wasmPaths = "". That indicates you've placed the WASM files next to index.html, which you should. Note you just need the 4 .wasm files, no more, from the CDN. Some performance review notes: - `webgpu` as execution provider completely errors out, says "JS executor not supported in the ONNX version" (1.16.3) - `webgl` throws "Cannot read properties of null (reading 'irVersion')" - Tested perf by varying wasm / simd / thread and thread count on M2 MacBook Air 16 GB ram, Chrome 120 - Landed on simd & thread count = 1/2 of cores as best performing -- first # is minilm l6v2, second is minilm l6v3, average inference time for 200 / 400 words -- 4 threads: 526 ms / 2196 ms -- simd 4 threads: 86 ms / 214 ms -- simd 8 threads: 106 ms / 260 ms -- simd 128 threads: 2879 ms / skipped -- simd navigator.hardwareConcurrency threads (8): 107 ms / 222 ms
- Inference is defined as an async method, but, it blocks. After a couple days of trying all avenues and looking at sample apps, it looks like it is synchronous in that it will consume the attention of the thread the `await session.run` is called on. - Using Squadron to handle multi-threading didn't work. Now that the JS function in index.html is loading the model and passing it to a worker, it's possible it might. - In any case, this shows exactly how to set up a worker that A) does inference without blocking UI rendering B) allows Dart code to `await` the result without blocking UI - This process was frustrating and fraught, there's a surprising lack of info and examples around ONNX web. Most seem to consume it via diffusers.js/transformers.js. ONNX web was a separate library from the rest of the ONNX runtime until sometime around late 2022. The examples still use that library, and the examples use simple enough models that it's hard to catch whether they are blocking without falling back to dev tools. - Its absolutely crucial when debugging speed locally to make sure you're loading the ONNX version you expect (i.e. wasm AND threaded AND simd). The easiest way to check is network loads in Dev Tools, sort by size, and look for the .wasm file to A) be loaded B) include wasm, simd, and threaded in the filename. - Two things can prevent that: -- CORS nonsense with Flutter serving itself in debug mode: --- see here, nagadomi/nunif#34 --- note that the extension became adware, you should have Chrome set up its permissions such that it isn't run until you click it. Also, note that you have to do that each time the Flutter web app in debug mode's port changes. -- MIME type issues --- Even after that, I would see errors in console logs about the MIME type of the .wasm being incorrect and starting with the wrong bytes. That, again, seems due to local Flutter serving of the web app. To work around that, you can download the WASM files from the same CDN folder that hosts ort.min.js (see worker.js) and also in worker.js, remove the // in front of ort.env.wasm.wasmPaths = "". That indicates you've placed the WASM files next to index.html, which you should. Note you just need the 4 .wasm files, no more, from the CDN. Some performance review notes: - `webgpu` as execution provider completely errors out, says "JS executor not supported in the ONNX version" (1.16.3) - `webgl` throws "Cannot read properties of null (reading 'irVersion')" - Tested perf by varying wasm / simd / thread and thread count on M2 MacBook Air 16 GB ram, Chrome 120 - Landed on simd & thread count = 1/2 of cores as best performing -- first # is minilm l6v2, second is minilm l6v3, average inference time for 200 / 400 words -- 4 threads: 526 ms / 2196 ms -- simd 4 threads: 86 ms / 214 ms -- simd 8 threads: 106 ms / 260 ms -- simd 128 threads: 2879 ms / skipped -- simd navigator.hardwareConcurrency threads (8): 107 ms / 222 ms
Problem
ONNX runtime supports multithreaded model execution, and it will automatically be enabled.
However, that can only happen when
SharedArrayBuffer
is available, which requires these HTTP headers to be set:Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin
https://unlimited.waifu2x.net does not send these headers, so ONNX runtime cannot use multiple threads. I will perform an experiment to show that this is a mistake.
Experiment
I will add these headers for testing by using a Chrome extension.
These headers will make
SharedArrayBuffer
available, and ONNX runtime will automatically use multiple threads.Parameters for the experiment
swin_unet.art_scan
3
(highest)1
(1x)256
(console:tile size = 256
)0
(disabled)false
(no alpha channel)Performed using the version of unlimited:waifu2x that is currently live at https://unlimited.waifu2x.net.
Result of the experiment
Chromium
1 main thread that performs the execution (no changes)
388556.5769042969 ms (approx. 9251.347069149926 ms per tile)
12 worker threads that perform the execution (with headers)
143964.38818359375 ms (approx. 3427.723528180805 ms per tile)
Using 12 threads divides the time taken by 2.698977030408252, a 2.7x improvement.
Firefox
1 main thread that performs the execution (no changes)
DNF (slow); 109955ms for 3 tiles; estimated 1539370ms for 42 tiles (approx. 36651ms per tile)169983ms (approx. 4047.214285714286ms per tile)
12 worker threads that perform the execution (enabled
dom.postMessage.sharedArrayBuffer.bypassCOOP_COEP.insecure.enabled
)DNF (slow); 147402ms for 18 tiles; estimated 343938ms for 42 tiles (approx. 8189ms per tile)58643ms (approx. 1396.261904761905ms per tile)
Using 12 threads divides the time taken by
4.4757194610656572.898606824343911, a 2.9x improvement, even larger than Chromium.Implementation steps
ort.env.wasm.numThreads = navigator.hardwareConcurrency
before initialization, or else it will default to only 4 threadsThe text was updated successfully, but these errors were encountered: