
[TLS] Thread local storage for optimized reduction #576

Closed
yuanming-hu opened this issue Mar 10, 2020 · 7 comments · Fixed by #1941
Labels
feature request Suggest an idea on this project

Comments

@yuanming-hu
Member

Systematically resolving this issue is non-trivial. It may also be a good chance to add IR support for thread-local storage and scratchpad (shared) memory.

@yuanming-hu added the feature request label Mar 10, 2020
@KLozes
Collaborator

KLozes commented Mar 11, 2020

Here's a good presentation on optimizing reductions on GPUs :)
gpu_reduction.pdf
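For context, the central trick in that kind of material is tree (pairwise) reduction: halve the stride each round so a sum over n elements takes log2(n) steps instead of n serialized atomic adds. A rough Python sketch of the access pattern (illustrative only, not an actual GPU kernel; assumes a power-of-two length):

```python
def tree_reduce(vals):
    # Pairwise-halving reduction, the shape of a warp/SIMD-level sum:
    # in each round, slot i accumulates slot i + stride, then stride halves.
    vals = list(vals)
    stride = len(vals) // 2
    while stride > 0:
        for i in range(stride):
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]

print(tree_reduce(range(8)))  # 28
```

On real hardware each round runs in parallel across lanes, which is where the speed-up comes from.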

@znah
Contributor

znah commented Apr 25, 2020

@yuanming-hu Thank you for creating this issue!

One particularly important case of reduction happens on param gradient accumulation during backprop. Consider the code:

import taichi as ti

ti.init(arch=ti.cuda)

N = 2**20
a = ti.Matrix(4, 4, dt=ti.f32, shape=(), needs_grad=False)
b = ti.Matrix(4, 4, dt=ti.f32, shape=(), needs_grad=True)
x = ti.Vector(4, dt=ti.f32, shape=N, needs_grad=True)
y = ti.Vector(4, dt=ti.f32, shape=N)
@ti.kernel
def f(w: ti.template()):
  for p in ti.grouped(x):
    y[p] = w@x[p]

Running the benchmarks below gives the following results (after a warm-up run):

%%time
for i in range(1000): f(a)
ti.sync()
# CPU times: user 98.4 ms, sys: 44.7 ms, total: 143 ms
# Wall time: 146 ms

Nice!

%%time
for i in range(1000): f.grad(a)
ti.sync()
# CPU times: user 240 ms, sys: 165 ms, total: 405 ms
# Wall time: 405 ms

Why is the grad kernel slower? (Though it is still pretty fast.)

%%time
for i in range(1000): f.grad(b)
ti.sync()
# CPU times: user 18 s, sys: 12.6 s, total: 30.6 s
# Wall time: 30.4 s

Accumulating grad to b kills the performance.
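(For context on why this case is so slow: the gradient of y = w @ x with respect to w is a sum of per-element outer products, so every parallel thread ends up atomically accumulating into the same 4x4 matrix. A NumPy sketch of that reduction, with made-up data sizes:)

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
x = rng.standard_normal((N, 4))       # inputs
grad_y = rng.standard_normal((N, 4))  # upstream gradients

# Every element contributes an outer product to the SAME 4x4 gradient,
# so a parallel backward kernel turns into N atomic adds on one matrix.
grad_w = np.zeros((4, 4))
for p in range(N):
    grad_w += np.outer(grad_y[p], x[p])

# Sanity check: the loop above is just this one matrix product.
assert np.allclose(grad_w, grad_y.T @ x)
```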

@yuanming-hu
Member Author

Thanks for pointing this out! The benchmark is very meaningful. On CPU, the atomic_adds are really slow, since we have to implement them using an atomicCAS + while loop. On GPU, hardware atomics are faster, but atomic contention is still an issue.

Of course, the systematic solution is to add thread-local storage/shared memory to the IR, which I believe I'll have some time for in May...
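(A rough Python model of the atomicCAS + while-loop pattern mentioned above. The `AtomicCell` / `compare_and_swap` names are illustrative, not Taichi internals, and a `threading.Lock` stands in for the hardware CAS instruction:)

```python
import threading

class AtomicCell:
    """Holds one float; the only write primitive is compare_and_swap."""
    def __init__(self, value=0.0):
        self._value = value
        self._lock = threading.Lock()  # models the atomicity of hardware CAS

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Store `new` only if the cell still holds `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def atomic_fetch_add(cell, operand):
    # The CAS + while-loop pattern: retry until no other thread raced us.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + operand):
            return old

cell = AtomicCell()
threads = [threading.Thread(target=lambda: [atomic_fetch_add(cell, 1.0)
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.load())  # 4000.0
```

Under contention, each thread may spin through many failed CAS attempts, which is exactly why this is slow compared to a hardware float atomic add.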

@znah
Contributor

znah commented Apr 26, 2020

As a temporary workaround, I managed to get a decent speed-up by reducing contention with a number of copies of the param matrix.

N = 2**20
D = 256
w = ti.Matrix(4, 4, dt=ti.f32, shape=(D,), needs_grad=True)
x = ti.Vector(4, dt=ti.f32, shape=N, needs_grad=True)
y = ti.Vector(4, dt=ti.f32, shape=N)

@ti.kernel
def f():
  for i in x:
    y[i] = w[0]@x[i]

@ti.kernel
def f_strided():
  for i in x:
    y[i] = w[i%D]@x[i]


%%time
for i in range(1000): f.grad()
ti.sync()
# CPU times: user 18.3 s, sys: 11.9 s, total: 30.2 s
# Wall time: 30.1 s

%%time
for i in range(1000): f_strided.grad()
ti.sync()
# CPU times: user 776 ms, sys: 462 ms, total: 1.24 s
# Wall time: 1.24 s
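(The same striping trick can be modeled outside Taichi: accumulate into D stripes so concurrent adds spread across D cells, then do one final sum. A NumPy sketch with illustrative sizes:)

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 2**10, 8
contrib = rng.standard_normal((N, 4, 4))  # per-element gradient contributions

# Contended version: every "thread" adds into one shared accumulator.
grad_single = np.zeros((4, 4))
for i in range(N):
    grad_single += contrib[i]

# Strided version: element i adds into stripe i % D, then a cheap
# final pass of D adds folds the stripes together.
grad_strided = np.zeros((D, 4, 4))
for i in range(N):
    grad_strided[i % D] += contrib[i]
grad = grad_strided.sum(axis=0)

assert np.allclose(grad, grad_single)
```

Thread-local storage takes this to its limit: one private accumulator per thread, merged once at the end, so the hot loop has no contention at all.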

@znah
Contributor

znah commented Apr 26, 2020

Forgot to mention: the benchmarks were run on an Nvidia P100 (CUDA).

@yuanming-hu yuanming-hu changed the title Optimized reduction [TLS] Thread local storage for optimized reduction Jun 21, 2020
@k-ye
Member

k-ye commented Jun 23, 2020

An update from the Metal side when using SIMD reductions (equivalent to CUDA warp-level reductions).

There's no obvious perf difference when doing global reduction on integer atomic types. For float types, SIMD was about 49x faster (for the particular case I benchmarked). I guess the current atomic add impl for floats is not that efficient:

float fatomic_fetch_add(device float *dest, const float operand) {
  // A huge hack! Metal does not support atomic floating point numbers
  // natively.
  bool ok = false;
  float old_val = 0.0f;
  while (!ok) {
    old_val = *dest;
    float new_val = (old_val + operand);
    ok = atomic_compare_exchange_weak_explicit(
        (device atomic_int *)dest, (thread int *)(&old_val),
        *((thread int *)(&new_val)), metal::memory_order_relaxed,
        metal::memory_order_relaxed);
  }
  return old_val;
}

The benchmark case is to sum 65536 floats. Here are the results:

  • naive:
    • Samples: [0.056786, 0.052332, 0.054000, 0.054563, 0.051821, 0.051795, 0.052055, 0.052787, 0.052526, 0.052136]
    • Avg: 0.0531s
  • SIMD:
    • Samples: [0.004288, 0.000835, 0.000750, 0.000751, 0.000752, 0.000724, 0.000706, 0.000692, 0.000704, 0.000726]
    • Avg: 0.00109s

@yuanming-hu
Member Author

Very cool!

There's no obvious perf difference when doing global reduction on integer atomics types.

One possibility is that the Metal compiler already does the optimization for you on integers, since integer operations are associative. Note that the compiler is not allowed to change floating-point operation order unless you use --fastmath, so programmers have to specify SIMD reduction manually.
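(A quick Python illustration of why the compiler cannot reorder float sums: the result genuinely depends on association order because of rounding:)

```python
# Floating-point addition is not associative: with doubles, the spacing
# between representable values near 1e16 is 2, so adding 1.0 to -1e16
# rounds away entirely.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # (1e16 - 1e16) + 1.0 == 1.0
right = a + (b + c)  # 1e16 + (-1e16)     == 0.0, since -1e16 + 1.0 rounds to -1e16

print(left, right)  # 1.0 0.0
```

Integer sums never hit this, which is why the compiler is free to tree-reduce them automatically.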
