[Perf] [metal] Support TLS and SIMD group reduction for range-for kernels #1358
Codecov Report
@@ Coverage Diff @@
## master #1358 +/- ##
=======================================
Coverage 85.57% 85.57%
=======================================
Files 19 19
Lines 3368 3368
Branches 623 623
=======================================
Hits 2882 2882
Misses 356 356
Partials 130 130
Awesome! Btw, do you observe any performance difference when you turn on/off SIMD reduction in the epilogue?
Ah yep. The number I gave was basically comparing SIMD on vs. off. TBH TLS itself didn't help much here, since I don't seem to have implemented grid-strided loops for range-for kernels.
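For readers unfamiliar with the epilogue reduction being discussed, here is a rough sketch of the idea in plain Python. It is a simulation only, not the Metal codegen from this PR: `SIMD_WIDTH = 32`, `simd_group_sum`, and `reduce_with_simd_groups` are hypothetical names and values picked for illustration. Each group of lanes combines its thread-local partials with a shuffle-down style tree, so only one atomic add per SIMD group is needed instead of one per thread.

```python
SIMD_WIDTH = 32  # assumed SIMD group width; the real width is hardware-dependent


def simd_group_sum(lanes):
    """Tree-reduce one SIMD group's per-lane values, shuffle-down style."""
    vals = list(lanes)
    assert len(vals) == SIMD_WIDTH
    offset = SIMD_WIDTH // 2
    while offset > 0:
        # Every lane i adds the value currently held by lane i + offset.
        # Lanes whose partner is out of range add 0; only lane 0 matters.
        vals = [
            vals[i] + (vals[i + offset] if i + offset < SIMD_WIDTH else 0)
            for i in range(SIMD_WIDTH)
        ]
        offset //= 2
    return vals[0]  # lane 0 now holds the total for the whole group


def reduce_with_simd_groups(per_thread_partials):
    """Return (total, atomics): one simulated atomic add per SIMD group."""
    total = 0
    atomics = 0
    for start in range(0, len(per_thread_partials), SIMD_WIDTH):
        group = list(per_thread_partials[start:start + SIMD_WIDTH])
        group += [0] * (SIMD_WIDTH - len(group))  # pad the last, partial group
        total += simd_group_sum(group)  # stands in for lane 0's atomic add
        atomics += 1
    return total, atomics
```

With 100 thread-local partials this issues 4 simulated atomics rather than 100, which is the kind of saving the SIMD on/off comparison is measuring.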
Ah I see. I thought the numbers were no TLS vs. TLS. It's interesting to see that SIMD gives such a great improvement. On CUDA the improvement was small. I guess we will need both. If your grid size is small, then grid-strided loops can significantly reduce the number of atomics needed.
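To make the point about grid-strided loops concrete, here is a small Python simulation that just counts atomic adds. It is a sketch under assumptions, not Taichi or Metal code: `count_atomics` and its parameters are hypothetical, and the "atomic add" is modelled as a plain counter increment.

```python
def count_atomics(n_elements, grid_size, grid_strided):
    """Sum 0..n_elements-1 and return (total, number_of_atomic_adds)."""
    data = range(n_elements)
    total = 0
    atomics = 0
    if not grid_strided:
        # One logical thread per element: every element costs one atomic add.
        for x in data:
            total += x
            atomics += 1
    else:
        # A fixed grid of threads, each striding over the range.
        for tid in range(grid_size):
            local = 0  # thread-local accumulator (the TLS part)
            for i in range(tid, n_elements, grid_size):
                local += data[i]
            total += local  # a single atomic add per thread in the epilogue
            atomics += 1
    return total, atomics
```

For 10,000 elements and a grid of 256 threads, the strided variant needs 256 atomics instead of 10,000. The smaller the grid relative to the range, the larger the saving.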
Good point! I will try switching to that later. (Also I probably still need some time to clean up this PR, so as to simplify the process of passing
@yuanming-hu FYI, I ran another
`misc/benchmark_reduction.py`: no perf difference using `i32`. I switched to `f32`, and had to reduce the tensor size from `1024*1024*1024` (4GB no longer fits into `i32`) to `1024*1024*16`. Reduction duration went from `~7s` -> `0.2s`.

`mpm_langrangian_force.py` got about 4x improvement.

This PR isn't small, but a big part of it comes from the fact that we cannot just enable SIMD group by default. It might not be available on OS X `<= 10.14`.

Apple claims that SIMD group is supported in compute kernels since Metal Shading Language 2.0 (released on OS X `10.13`), but my test showed that we had to compile it at `>= MSL 2.1` (https://developer.apple.com/documentation/metal/mtllanguageversion/version2_1).

Example:
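The version gating described above might be sketched roughly as follows. This is a hedged illustration, not the code in this PR and not a Taichi or Metal API: `simd_group_reduction_enabled` and `pick_epilogue` are hypothetical helpers, and the `(2, 1)` threshold simply encodes the observation above that compiling at MSL 2.1 or later was required.

```python
MIN_MSL_FOR_SIMD_GROUP = (2, 1)  # assumption taken from the test result above


def simd_group_reduction_enabled(msl_version):
    """msl_version is a (major, minor) tuple, e.g. (2, 1)."""
    return tuple(msl_version) >= MIN_MSL_FOR_SIMD_GROUP


def pick_epilogue(msl_version):
    """Choose how thread-local partial results get combined (hypothetical)."""
    if simd_group_reduction_enabled(msl_version):
        return "simd_group_reduce_then_atomic"
    # Fall back to one atomic per thread when SIMD group ops are unavailable.
    return "atomic_per_thread"
```

Keeping the check in one place means the rest of the codegen only has to ask which epilogue to emit, rather than reasoning about OS or MSL versions itself.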
Related issue = #576