
[TLS] Thread local storage for optimized reduction #576

Closed
yuanming-hu opened this issue Mar 10, 2020 · 7 comments · Fixed by #1941
Labels
feature request Suggest an idea on this project

Comments

@yuanming-hu
Member

Systematically resolving this issue is non-trivial. It may also be a good chance to add IR support for thread-local storage and scratchpad (shared) memory.

@yuanming-hu added the feature request label Mar 10, 2020
@KLozes
Collaborator

KLozes commented Mar 11, 2020

Here's a good presentation on optimizing reductions on GPUs :)
gpu_reduction.pdf
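For context, the central trick in that kind of material is tree (pairwise) reduction: halve the stride each round so a sum over n elements takes log2(n) steps instead of n serialized atomic adds. A rough Python sketch of the access pattern (illustrative only, not an actual GPU kernel; assumes a power-of-two length):

```python
def tree_reduce(vals):
    # Pairwise-halving reduction, the shape of a warp/SIMD-level sum:
    # in each round, slot i accumulates slot i + stride, then stride halves.
    vals = list(vals)
    stride = len(vals) // 2
    while stride > 0:
        for i in range(stride):
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]

print(tree_reduce(range(8)))  # 28
```

On real hardware each round runs in parallel across lanes, which is where the speed-up comes from.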

@znah
Contributor

znah commented Apr 25, 2020

@yuanming-hu Thank you for creating this issue!

One particularly important case of reduction happens on param gradient accumulation during backprop. Consider the code:

import taichi as ti

ti.init(arch=ti.cuda)

N = 2**20
a = ti.Matrix(4, 4, dt=ti.f32, shape=(), needs_grad=False)
b = ti.Matrix(4, 4, dt=ti.f32, shape=(), needs_grad=True)
x = ti.Vector(4, dt=ti.f32, shape=N, needs_grad=True)
y = ti.Vector(4, dt=ti.f32, shape=N)
@ti.kernel
def f(w: ti.template()):
  for p in ti.grouped(x):
    y[p] = w@x[p]

Running the benchmarks below gives the following results (after a warm-up run):

%%time
for i in range(1000): f(a)
ti.sync()
# CPU times: user 98.4 ms, sys: 44.7 ms, total: 143 ms
# Wall time: 146 ms

Nice!

%%time
for i in range(1000): f.grad(a)
ti.sync()
# CPU times: user 240 ms, sys: 165 ms, total: 405 ms
# Wall time: 405 ms

Why is the grad kernel slower? (Though it is still pretty fast.)

%%time
for i in range(1000): f.grad(b)
ti.sync()
# CPU times: user 18 s, sys: 12.6 s, total: 30.6 s
# Wall time: 30.4 s

Accumulating grad to b kills the performance.
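(For context on why this case is so slow: the gradient of y = w @ x with respect to w is a sum of per-element outer products, so every parallel thread ends up atomically accumulating into the same 4x4 matrix. A NumPy sketch of that reduction, with made-up data sizes:)

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
x = rng.standard_normal((N, 4))       # inputs
grad_y = rng.standard_normal((N, 4))  # upstream gradients

# Every element contributes an outer product to the SAME 4x4 gradient,
# so a parallel backward kernel turns into N atomic adds on one matrix.
grad_w = np.zeros((4, 4))
for p in range(N):
    grad_w += np.outer(grad_y[p], x[p])

# Sanity check: the loop above is just this one matrix product.
assert np.allclose(grad_w, grad_y.T @ x)
```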

@yuanming-hu
Member Author

Thanks for pointing this out! The benchmark is very meaningful. On CPU, the atomic_adds are really slow, since we have to implement them using an atomicCAS + while loop. On GPU, hardware atomics are faster, but atomic contention is still an issue.

Of course, the systematic solution is to add thread-local storage/shared memory to the IR, which I believe I'll have some time for in May...
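(A rough Python model of the atomicCAS + while-loop pattern mentioned above. The `AtomicCell` / `compare_and_swap` names are illustrative, not Taichi internals, and a `threading.Lock` stands in for the hardware CAS instruction:)

```python
import threading

class AtomicCell:
    """Holds one float; the only write primitive is compare_and_swap."""
    def __init__(self, value=0.0):
        self._value = value
        self._lock = threading.Lock()  # models the atomicity of hardware CAS

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Store `new` only if the cell still holds `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def atomic_fetch_add(cell, operand):
    # The CAS + while-loop pattern: retry until no other thread raced us.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + operand):
            return old

cell = AtomicCell()
threads = [threading.Thread(target=lambda: [atomic_fetch_add(cell, 1.0)
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.load())  # 4000.0
```

Under contention, each thread may spin through many failed CAS attempts, which is exactly why this is slow compared to a hardware float atomic add.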

@znah
Contributor

znah commented Apr 26, 2020

As a temporary workaround, I managed to get a decent speed-up by reducing contention with a number of copies of the param matrix.

N = 2**20
D = 256
w = ti.Matrix(4, 4, dt=ti.f32, shape=(D,), needs_grad=True)
x = ti.Vector(4, dt=ti.f32, shape=N, needs_grad=True)
y = ti.Vector(4, dt=ti.f32, shape=N)

@ti.kernel
def f():
  for i in x:
    y[i] = w[0]@x[i]

@ti.kernel
def f_strided():
  for i in x:
    y[i] = w[i%D]@x[i]


%%time
for i in range(1000): f.grad()
ti.sync()
# CPU times: user 18.3 s, sys: 11.9 s, total: 30.2 s
# Wall time: 30.1 s

%%time
for i in range(1000): f_strided.grad()
ti.sync()
# CPU times: user 776 ms, sys: 462 ms, total: 1.24 s
# Wall time: 1.24 s
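(The same striping trick can be modeled outside Taichi: accumulate into D stripes so concurrent adds spread across D cells, then do one final sum. A NumPy sketch with illustrative sizes:)

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 2**10, 8
contrib = rng.standard_normal((N, 4, 4))  # per-element gradient contributions

# Contended version: every "thread" adds into one shared accumulator.
grad_single = np.zeros((4, 4))
for i in range(N):
    grad_single += contrib[i]

# Strided version: element i adds into stripe i % D, then a cheap
# final pass of D adds folds the stripes together.
grad_strided = np.zeros((D, 4, 4))
for i in range(N):
    grad_strided[i % D] += contrib[i]
grad = grad_strided.sum(axis=0)

assert np.allclose(grad, grad_single)
```

Thread-local storage takes this to its limit: one private accumulator per thread, merged once at the end, so the hot loop has no contention at all.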

@znah
Contributor

znah commented Apr 26, 2020

Forgot to mention: the benchmarks were run on an Nvidia P100 (CUDA).

@yuanming-hu yuanming-hu changed the title Optimized reduction [TLS] Thread local storage for optimized reduction Jun 21, 2020
@k-ye
Member

k-ye commented Jun 23, 2020

An update from the Metal side when using SIMD reductions (equivalent to CUDA warp-level reductions).

There's no obvious perf difference when doing global reduction on integer atomic types. For float types, SIMD was about 49x faster (for the particular case I benchmarked). I guess the current atomic add impl for floats is not that efficient:

float fatomic_fetch_add(device float *dest, const float operand) {
  // A huge hack! Metal does not support atomic floating point numbers
  // natively.
  bool ok = false;
  float old_val = 0.0f;
  while (!ok) {
    old_val = *dest;
    float new_val = (old_val + operand);
    ok = atomic_compare_exchange_weak_explicit(
        (device atomic_int *)dest, (thread int *)(&old_val),
        *((thread int *)(&new_val)), metal::memory_order_relaxed,
        metal::memory_order_relaxed);
  }
  return old_val;
}

The benchmark case is to sum 65536 floats. Here are the results:

  • naive:
    • Samples: [0.056786, 0.052332, 0.054000, 0.054563, 0.051821, 0.051795, 0.052055, 0.052787, 0.052526, 0.052136]
    • Avg: 0.0531s
  • SIMD:
    • Samples: [0.004288, 0.000835, 0.000750, 0.000751, 0.000752, 0.000724, 0.000706, 0.000692, 0.000704, 0.000726]
    • Avg: 0.00109s

@yuanming-hu
Member Author

Very cool!

There's no obvious perf difference when doing global reduction on integer atomics types.

One possibility is that the Metal compiler already does the optimization for you on integers, since integer operations are associative. Note that the compiler is not allowed to change floating-point operation order unless you use --fastmath, so programmers have to specify SIMD reduction manually.
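(A quick Python illustration of why the compiler cannot reorder float sums: the result genuinely depends on association order because of rounding:)

```python
# Floating-point addition is not associative: with doubles, the spacing
# between representable values near 1e16 is 2, so adding 1.0 to -1e16
# rounds away entirely.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # (1e16 - 1e16) + 1.0 == 1.0
right = a + (b + c)  # 1e16 + (-1e16)     == 0.0, since -1e16 + 1.0 rounds to -1e16

print(left, right)  # 1.0 0.0
```

Integer sums never hit this, which is why the compiler is free to tree-reduce them automatically.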
