[TLS] Thread local storage for optimized reduction #576
Here's a good presentation on optimizing reductions on GPU :)
@yuanming-hu Thank you for creating this issue! One particularly important case of reduction happens on param gradient accumulation during backprop. Consider the code:
Running the below benchmarks gives the following results (after the warm-up run):
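The benchmark code and its results are not preserved in this excerpt. As an illustration of the pattern it exercises, here is a plain-Python sketch (all names hypothetical) of why accumulating a gradient into one shared location is slow: every update contends on the same slot, whereas thread-local partial sums touch shared state only once at the end.

```python
# Hypothetical sketch, not the original Taichi benchmark: many workers
# accumulate into one shared "gradient" slot (the contended pattern),
# versus each worker keeping a thread-local partial sum merged once.
import threading
from concurrent.futures import ThreadPoolExecutor

N_WORKERS = 8
ITERS = 10_000

# Contended reduction: every update hits one shared location under a
# lock, standing in for a global atomic add on the GPU.
shared_grad = 0.0
lock = threading.Lock()

def accumulate_shared(_):
    global shared_grad
    for _ in range(ITERS):
        with lock:                  # every thread contends here
            shared_grad += 1.0

# Thread-local reduction: each worker sums privately, then one merge.
def accumulate_local(_):
    local = 0.0                     # thread-local partial sum
    for _ in range(ITERS):
        local += 1.0
    return local

with ThreadPoolExecutor(N_WORKERS) as ex:
    list(ex.map(accumulate_shared, range(N_WORKERS)))
    partials = list(ex.map(accumulate_local, range(N_WORKERS)))

total_local = sum(partials)         # single combine step at the end
assert shared_grad == total_local == N_WORKERS * ITERS
```

Both variants compute the same sum; the difference is how many times the shared state is touched (N_WORKERS × ITERS times versus N_WORKERS times).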
Nice!
Why is the grad kernel slower (although still pretty fast)?
Accumulating grad to …
Thanks for pointing this out! The benchmark is very meaningful. On CPU, the … Of course, the systematic solution is to add thread-local storage/shared memory to the IR, which I believe I'll have some time for in May…
As a temporary solution, I managed to get a decent speed-up by reducing the contention with a number of copies of the param matrix.
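The workaround above can be sketched in plain Python (names and sizes hypothetical): keep K copies of the gradient, have each worker target one copy so that only a K-sized group contends on any slot, and sum the copies once at the end.

```python
# Sketch of the "multiple copies" workaround: K gradient copies spread
# the contention; one final reduction over the copies recovers the sum.
import threading
from concurrent.futures import ThreadPoolExecutor

K = 4                       # number of gradient copies (hypothetical)
N_WORKERS = 8
ITERS = 5_000

grad_copies = [0.0] * K
locks = [threading.Lock() for _ in range(K)]

def worker(tid):
    slot = tid % K                       # each worker targets one copy
    for _ in range(ITERS):
        with locks[slot]:                # contention only within each group
            grad_copies[slot] += 1.0

with ThreadPoolExecutor(N_WORKERS) as ex:
    list(ex.map(worker, range(N_WORKERS)))

final_grad = sum(grad_copies)            # one final reduction over the copies
assert final_grad == N_WORKERS * ITERS
```

The trade-off is K× extra memory for the duplicated gradient, in exchange for roughly K× less contention on each slot.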
Forgot to mention: benchmarks were run on an Nvidia P100 (CUDA).
An update from the Metal side when using SIMD reductions (equivalent to CUDA warp-level reductions). There's no obvious perf difference when doing global reduction on integer atomic types. For float types, it was about 49x faster (for the particular case I benchmarked). I guess the current atomic add impl for floats is not that efficient: see taichi/backends/metal/shaders/helpers.metal.h, lines 53 to 67 at d9c9616.
The benchmark case is to sum 65536 floats. Here are the results:
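The measured results are not preserved in this excerpt, but the shape of a SIMD-group (warp-level) reduction can be sketched in plain Python: lanes combine values pairwise in log2(n) rounds instead of issuing n serial atomic adds. This stands in for what Metal's SIMD-group shuffle operations do in hardware; the code below is a conceptual sketch, not the Metal implementation.

```python
# Conceptual stand-in for a SIMD-group reduction: pairwise tree
# combine over 65536 floats in log2(65536) = 16 rounds.
n = 65536
vals = [1.0] * n                    # the benchmark case: sum 65536 floats

step = n // 2
while step > 0:
    for i in range(step):           # "lane" i reads lane i + step
        vals[i] += vals[i + step]
    step //= 2                      # halve the active lanes each round

assert vals[0] == 65536.0           # full sum lands in lane 0
```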
Very cool!
One possibility is that the Metal compiler already does the optimization for you on integers, since integer operations are associative. Note that the compiler is not allowed to change floating-point operation order unless you use …
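The associativity point can be demonstrated in a few lines: reordering a floating-point sum can change the result, while an exact integer sum cannot, which is why a compiler may auto-vectorize integer reductions but must leave float order alone by default.

```python
# Floating-point addition is not associative: reordering changes
# the result once rounding kicks in. Integer addition is exact.
a, b, c = 1e16, 1.0, -1e16

left_to_right = (a + b) + c     # 1e16 + 1.0 rounds back to 1e16, so sum is 0.0
reordered     = (a + c) + b     # cancel the big terms first, then add 1.0

assert left_to_right == 0.0
assert reordered == 1.0

# The same sum over exact integers gives 1 in any order.
assert (10**16 + 1) - 10**16 == 1
```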
Systematically resolving this issue is non-trivial. Maybe it's a good chance to add IR support for thread-local storage and scratchpad (shared) memory extension as well.
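What such an IR extension could lower a reduction to can be sketched as a two-level hierarchy (structure hypothetical, in plain Python): each thread accumulates into thread-local storage, each block tree-reduces its threads' partials in scratchpad (shared) memory, and only one add per block touches the global result.

```python
# Hedged sketch of a TLS + scratchpad reduction hierarchy:
# per-thread partials -> per-block shared-memory tree reduce ->
# one "atomic" add per block on the global sum.
BLOCK = 64
N = 4096
data = [1.0] * N

global_sum = 0.0
for block_start in range(0, N, BLOCK):
    shared = [0.0] * BLOCK                   # "shared memory" scratchpad
    for t in range(BLOCK):                   # per-thread (TLS) accumulation
        shared[t] = data[block_start + t]
    stride = BLOCK // 2
    while stride > 0:                        # tree reduce within the block
        for t in range(stride):
            shared[t] += shared[t + stride]
        stride //= 2
    global_sum += shared[0]                  # one global add per block

assert global_sum == float(N)
```

With this structure the global location is touched N / BLOCK times instead of N times, which is the contention reduction the issue is after.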