webgpu: Fix bank conflicts #6152
Conversation
Force-pushed from 2159f25 to c685ff5
@xhcao @axinging @haoyunfeix @gyagp Please take a look, thanks.
@@ -31,7 +31,7 @@ ENV.registerFlag('WEBGPU_CPU_FORWARD', () => true);
 /**
  * Thread register block size for matmul kernel.
  */
-ENV.registerFlag('WEBGPU_MATMUL_WORK_PER_THREAD', () => 4);
+ENV.registerFlag('WEBGPU_MATMUL_WORK_PER_THREAD', () => 2);
@xhcao @axinging @haoyunfeix @gyagp I changed the workgroup size from (8, 8, 1) to (16, 16, 1) to better align with the bank size of 16. However, due to the maximum shared memory size limitation, I had to reduce the work-per-thread size to 2.
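For context, a rough back-of-the-envelope check of the workgroup shared-memory budget. This is only a sketch: the two-square-tile layout and the 16 KB per-workgroup limit are assumptions for illustration, not taken from the actual MatMulPackedProgram source.

```ts
// Rough shared-memory budget for a tiled matmul workgroup.
// Assumptions (not from the actual kernel): two square tiles of 32-bit
// floats (one per input operand) and a 16 KB per-workgroup memory limit.
const BYTES_PER_FLOAT = 4;

function sharedMemoryBytes(workgroupDim: number, workPerThread: number): number {
  const tileDim = workgroupDim * workPerThread;         // tile edge in elements
  const oneTile = tileDim * tileDim * BYTES_PER_FLOAT;  // one square tile
  return 2 * oneTile;                                   // tiles for both inputs
}

console.log(sharedMemoryBytes(8, 4));   // 8192 bytes: old (8, 8, 1) workgroup, WPT 4
console.log(sharedMemoryBytes(16, 4));  // 32768 bytes: (16, 16, 1) workgroup, WPT 4 -> over a 16 KB limit
console.log(sharedMemoryBytes(16, 2));  // 8192 bytes: (16, 16, 1) workgroup, WPT 2 -> fits again
```

Under those assumptions, keeping work-per-thread at 4 with the larger workgroup would need roughly 32 KB of workgroup memory, which is why the flag's default drops to 2 in this revision.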
Removed this part of the changes so that this PR only focuses on fixing bank conflicts. Please take another look, thanks.
Force-pushed from c685ff5 to 0b35bec
Closing this in favor of #6862.
This PR fixes the shared memory bank conflict issue in MatMulPackedProgram. Previously, it used interleaved loads, which caused bank conflicts; this PR switches to sequential accesses to resolve them.
Before:
After:
With this change, Conv2DDerInputMMProgram, Conv2DMMProgram, and MatMulPackedProgram each improve by roughly 20% or more. For example, the total time of Conv2DBackpropInput in hand_detector drops from 11.26 ms to 7.46 ms, and the total time of Conv2D in AutoML Image drops from 5.61 ms to 4.05 ms.
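To illustrate the access-pattern difference being described, here is a simplified, hypothetical model of shared-memory banking. The 16-bank figure follows the review comment above, and the index expressions are stand-ins for illustration, not the real MatMulPackedProgram indexing.

```ts
// Hypothetical bank-conflict model: consecutive 32-bit words map to
// consecutive banks, and lanes that hit the same bank in one load are
// serialized. (16 banks assumed, per the review comment; indexing is
// simplified and not taken from the actual kernel.)
const NUM_BANKS = 16;

function bank(wordIndex: number): number {
  return wordIndex % NUM_BANKS;
}

// Worst-case number of lanes hitting the same bank when each of the 16
// lanes in a row issues one load described by `indexForLane`.
function maxConflict(indexForLane: (lane: number) => number): number {
  const hits = new Map<number, number>();
  for (let lane = 0; lane < 16; lane++) {
    const b = bank(indexForLane(lane));
    hits.set(b, (hits.get(b) ?? 0) + 1);
  }
  return Math.max(...hits.values());
}

// Interleaved: lane i reads element i * 4 (stride = work-per-thread).
// Lanes 0, 4, 8, 12 all land on bank 0 -> 4-way conflict, loads serialized.
console.log(maxConflict(lane => lane * 4)); // 4

// Sequential: lane i reads element base + i -> every lane hits a distinct bank.
console.log(maxConflict(lane => lane));     // 1
```

With a stride equal to the work-per-thread size, several lanes map to the same bank and their loads are serialized; consecutive indices spread across banks, which is the sequential accessing this PR switches to.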