major difference in performance between sync and async methods to download tensor data #6687
Comments
We once did some investigation into why there is a big perf diff between WebGL data() and dataSync(). More details of our investigation can be found at https://docs.google.com/document/d/13Up4P7Or9uliiU2RZfzOk1aQ5yjFSjG2M6k16BkQn4U/edit
thanks for the explanation

the difference is definitely smaller, but still noticeable at ~10%. looking forward to the new timeout method
Thanks for the quick try! It seems the 4ms delay in setTimeout() contributes most of the gap between async and sync calls. Once that's solved, I think another gap comes from polling, as you mentioned. More specifically, the browser needs to check the result at some frequency; if that's too frequent, much power will be drained. In the current Chrome implementation, the frequency of the task scheduler is about 1ms (the exact implementation is more complicated), which is the general cost of async calls.
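The 4ms setTimeout clamp mentioned above is easy to observe directly. Below is a small hypothetical benchmark (my sketch, not TFJS code) that measures the real delay of successive nested `setTimeout(fn, 0)` calls; in Chrome the delays jump to roughly 4ms after the fifth nesting level, while Node.js applies a much smaller minimum:

```javascript
// Measure the actual delay of repeatedly nested setTimeout(fn, 0) calls.
// In Chrome, once the nesting level exceeds 5, each timeout is clamped
// to a minimum of ~4ms; Node.js does not apply the 4ms clamp.
function measureNestedTimeouts(levels) {
  return new Promise((resolve) => {
    const delays = [];
    let last = performance.now();
    const step = () => {
      const now = performance.now();
      delays.push(now - last);
      last = now;
      if (delays.length >= levels) resolve(delays);
      else setTimeout(step, 0);
    };
    setTimeout(step, 0);
  });
}

measureNestedTimeouts(10).then((delays) => {
  // In a browser console, the later entries are each >= ~4ms.
  console.log(delays.map((d) => d.toFixed(1)).join(' '));
});
```

Running this in a browser console versus Node.js makes the clamp (and why a custom timer helps) visible without any TFJS involvement.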
@gyagp Thank you for the detailed explanation. We hope to unify the API between WebGL and WebGPU without a major performance discrepancy, so we're looking forward to the new timeout mechanism. Thanks.
@vladmandic I have a PR (under review) at #6694. Could you help give it a try?
after enabling it, i'm clearly not using wasm here at all. my guess is that the issue is because you're doing

`(window as any).setTimeoutWPM(tryFn, nextBackoff);`

registering a function directly on a global namespace is a bad practice
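A hypothetical alternative to patching the global object (my sketch, not the PR's actual code) is to keep the custom timer in module scope and export it:

```javascript
// Sketch: expose the custom timer from module scope instead of writing
// it onto window/globalThis. The body here just delegates to the native
// setTimeout; a real implementation would use a postMessage-based trick
// to dodge the nested-setTimeout clamp.
const customTimers = {
  setTimeoutCustom(fn, delay) {
    return setTimeout(fn, delay);
  },
};

// Callers import customTimers instead of reaching for a global.
customTimers.setTimeoutCustom(() => console.log('tick'), 0);
```

This keeps the helper discoverable through normal imports and avoids collisions with other scripts that share the page's global namespace.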
tensor with shape [1, 720, 720, 3], running model inference 500 times and calculating the average time for tensor download:

repo used for test: https://github.com/vladmandic/anime
Thank you so much @vladmandic! This is not a huge improvement, but the 4ms delay is real and critical for an application that targets 60FPS, or cares about every millisecond. I will polish my PR and try to get it merged. After that, I will investigate the gap between data() and dataSync() again to see if it can be closed further.
very true. i've tried it and the performance difference is much bigger - 15-20% overall performance improvement, and that is huge:

repo used for test: https://github.com/vladmandic/human
I have updated the PR, and it's under final review. @vladmandic, could I ask you to test it again? I don't want to introduce any regression for your cases. By the way, I changed the flag name, so you need to run `tf.env().set('USE_SETTIMEOUTCUSTOM', true);` first.
as soon as it's enabled, the issue is that worker.js does:

```js
self.onmessage = (e) => {
  ...
  } else {
    err('worker.js received unknown command ' + e.data.cmd);
    err(e.data);
  }
};
```

i don't think that is a bug in your code, but bad behavior from the wasm worker script, as it should not treat unknown messages as errors. the bigger problem is that the WASM worker script is generated by EMScripten and only lightly modified by the TFJS build process, and it seems this is directly related.
@vladmandic

If you want to try this, you could follow these steps:

My code clip is:

I'm not sure what the difference is between this sample and your sample. Do you have any ideas? And is it possible for me to run your sample locally so I could do more profiling work? If so, your guidance would be appreciated.
@vladmandic, for background, I asked @shaoboyan to investigate the diff between data() and dataSync() (116ms vs. 94ms) even when we use the custom setTimeout, as you reported below.
i just redid all the tests - seems there is quite a difference depending on:

results:
- download tensor data after
- download tensor data after
- download tensor data for manually created/modified tensor:
other than that, btw, this is my code for waiting on the webgl sync fence:

```typescript
async function syncWait(gl: WebGL2RenderingContext): Promise<number> {
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  if (!sync) return 0;
  const ts = performance.now();
  return new Promise((resolve) => {
    const loop = () => {
      const status = gl.clientWaitSync(sync, gl.SYNC_FLUSH_COMMANDS_BIT, 0);
      if (status === gl.WAIT_FAILED) throw new Error('clientWaitSync: wait failed');
      else if (status === gl.ALREADY_SIGNALED || status === gl.CONDITION_SATISFIED) {
        gl.deleteSync(sync);
        resolve(performance.now() - ts);
      } else {
        setTimeout(loop, 0);
      }
    };
    loop();
  });
}
```

and a simplified usage example would be:

```typescript
const tensor = model.execute(input);
await syncWait(tf.backend().getGPGPUContext().gl);
const timeStamp = performance.now();
const data = await tensor.data();
totalTime += (performance.now() - timeStamp);
log({ avgDataTime: totalTime / ++numFrames });
```
@vladmandic
@shaoboyan i'm not sure what the ask is - i've already done that: not just several warmup rounds, but 100, and then measuring the average between rounds 100 to 200. yes, the gap still exists, but a) it's much smaller, up to the point of being close enough, and b) it's due to the webgl pipeline also performing timer checks, which interferes with the browser's handling of ticks. my much bigger concern is the blocking issue of the new code being in conflict with the wasm worker script.
@vladmandic Thanks. I misunderstood your tests. I think this explanation resolves my concern.
I tried the e2e benchmark in TFJS on both Windows and macOS, but I couldn't repro the issue you mentioned. I double-checked the emscripten source code (_bazel_user/qpnosdr3/external/emscripten_bin_win/emscripten/src/library_pthread.js) we use now; it's still the problematic one, and I think its version is 3.1.4. Emscripten fixed this problem (emscripten-core/emscripten#16450) on Mar 10, and the first release after the fix was 3.1.8 on Mar 24.
@gyagp i'm not sure upgrading |
@gyagp any progress on emscripten?
@vladmandic Sorry, I was occupied by some other things. @mattsoulanille, do you have any suggestions on comment 17?
@gyagp Maybe we can register our listener before Emscripten's and suppress propagation of our event? However, I'm not sure if this is possible in a webworker created by Emscripten. Ideally, there would be something similar to CustomEvents that we could use as a unique event (we can't use …).

I also updated Emscripten to 3.1.20, although this might not solve the issue.
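The "register our listener first and suppress propagation" idea can be sketched outside a real worker. This is a hypothetical model only: a plain `EventTarget` stands in for the worker's global scope, and `__customTimeout` is an invented sentinel property, not actual TFJS or Emscripten code:

```javascript
// A plain EventTarget stands in for the worker's global scope.
const scope = new EventTarget();
const handled = [];

// Our listener, registered first: swallow messages carrying our sentinel
// so the later-registered handler never sees them.
scope.addEventListener('message', (e) => {
  if (e.data && e.data.__customTimeout) {
    e.stopImmediatePropagation();
    handled.push('custom-timer');
  }
});

// Stand-in for Emscripten's own message handler, registered second.
scope.addEventListener('message', () => handled.push('emscripten'));

// Dispatch one of "our" messages and one ordinary message.
const ours = new Event('message');
ours.data = { __customTimeout: true };
scope.dispatchEvent(ours);

const other = new Event('message');
other.data = { cmd: 'run' };
scope.dispatchEvent(other);

console.log(handled); // ['custom-timer', 'emscripten']
```

Whether this ordering can actually be guaranteed inside a worker that Emscripten itself spawns is exactly the open question above.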
@mattsoulanille @gyagp i just created an emscripten issue for this, as it is really something that should be patched there - imo anything you do here is just a workaround to avoid bad emscripten behavior
@mattsoulanille Thanks, and let's wait for Emscripten's feedback at emscripten-core/emscripten#17844.
Will MessageChannel fit your needs here?
MessageChannel looks like it will work. Thanks @hujiajie for pointing it out to us! @gyagp, we should try this out as a drop-in replacement.

A quick test:

```js
lastTime = new Date().getTime();
i = 0;
function time() {
  i++;
  if (i % 1000 === 0) {
    currentTime = new Date().getTime();
    delta = currentTime - lastTime;
    console.log(`1000 iterations in ${delta} ms. ${delta / 1000} ms per iteration`);
    lastTime = currentTime;
  }
}
channel = new MessageChannel();
channel.port1.onmessage = () => {
  channel.port2.postMessage('hello');
  time();
};
channel.port2.postMessage('hello');
```

This looks like it has no 4ms delay (while the same code under a setTimeout 'loop' does).
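Wrapped as a zero-delay scheduler, the MessageChannel trick above looks roughly like this (my sketch of the idea, not TFJS's actual implementation):

```javascript
// Schedule fn to run as a macrotask with no 4ms nested-timeout clamp,
// using a MessageChannel instead of setTimeout(fn, 0).
function postTask(fn) {
  const channel = new MessageChannel();
  channel.port1.onmessage = () => {
    channel.port1.close(); // one-shot: release the channel so nothing leaks
    fn();
  };
  channel.port2.postMessage(null);
}

postTask(() => console.log('ran without the nested-setTimeout clamp'));
```

A production version would reuse one channel with a callback queue rather than allocating a channel per call; the one-shot form just keeps the sketch self-contained.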
imo, |
@mattsoulanille @gyagp emscripten pr has been merged to main, so now it should be just a question of upgrading to new version once its out. |
Thanks for the update @vladmandic, and for the extra info.
Sorry for the late response. MessageChannel also works, and I updated my test (https://github.com/webatintel/webatintel.github.io/blob/main/timer.html#L44) to confirm this. Given @vladmandic already fixed the issue in Emscripten upstream (thank you so much for the effort!), let's keep the current solution.
@mattsoulanille fyi,
Thanks for the ping @vladmandic. I checked emsdk's github a few days ago, so I'm not sure how I missed the 3.1.22 update. I've created #6856.
@vladmandic, emsdk has been upgraded to 3.1.23 in tfjs's master branch. Could you please check this against your esbuild build infrastructure when you get a chance? Thanks! |
@mattsoulanille @gyagp sure!
@vladmandic Thanks for the double check!
If the setTimeout nesting level is greater than 5 and timeout is less than 4ms, timeout will be clamped to 4ms, which hurts the perf. A custom setTimeout is provided to mitigate the perf impact. BUG: #6687 Co-authored-by: Na Li <[email protected]>
closing as resolved - thanks!
when using the `webgl` backend, `tensor.dataSync()` is about 35% faster than `await tensor.data()` when returning larger amounts of data (in my case, the model takes a 720x720x3 image as input and produces a 720x720x3 image as output)

actual timings are ~100ms for `tensor.dataSync()` and ~140ms for `await tensor.data()`

note that there is nothing else executing; this is a single active loop

i wouldn't mind using `dataSync()`, but then again any sync calls are not compatible with the `webgpu` backend, and maintaining two separate codepaths for `webgl` and `webgpu` is a no-go

btw, the model in question can be found at https://github.com/vladmandic/anime

environment: tfjs 3.19.0 on chrome 103