
[Perf] Thread local storage for range-for reductions on CPUs #1296

Merged: 16 commits into taichi-dev:master on Jun 22, 2020

Conversation

yuanming-hu (Member) commented Jun 21, 2020:

Related issue = #576

Benchmark:

import taichi as ti

ti.init(print_ir=True, kernel_profiler=True)

N = 1024 * 1024

a = ti.var(ti.i32, shape=N)
tot = ti.var(ti.i32, shape=())


@ti.kernel
def fill():
    for i in a:
        a[i] = i


@ti.kernel
def reduce():
    for i in a:
        tot[None] += a[i]


fill()

for i in range(10):
    reduce()

ground_truth = 10 * N * (N - 1) // 2 % 2**32  # integer division keeps the value exact
assert tot[None] % 2**32 == ground_truth
ti.kernel_profiler_print()

Before: 17ms
After: 0.48ms (35x faster)
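The speedup comes from replacing one atomic update per loop iteration with a thread-local accumulator that is merged into the global once per thread. A minimal sketch of that pattern in plain Python (illustration only; names like `reduce_chunk` are hypothetical and this is not Taichi's generated code):

```python
import threading

N = 1 << 20
data = list(range(N))
total = 0
lock = threading.Lock()

def reduce_chunk(chunk):
    global total
    # prologue: zero-initialize the thread-local accumulator
    local = 0
    # body: accumulate locally, with no synchronization on the hot path
    for x in chunk:
        local += x
    # epilogue: a single locked (atomic) merge per thread
    with lock:
        total += local

threads = [threading.Thread(target=reduce_chunk, args=(data[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After joining, `total` equals `N * (N - 1) // 2`, matching the serial sum.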



Design doc:

[design doc image attached]

codecov bot commented Jun 21, 2020:

Codecov Report

Merging #1296 into master will increase coverage by 0.86%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #1296      +/-   ##
==========================================
+ Coverage   84.72%   85.59%   +0.86%     
==========================================
  Files          18       18              
  Lines        3274     3283       +9     
  Branches      616      621       +5     
==========================================
+ Hits         2774     2810      +36     
+ Misses        365      345      -20     
+ Partials      135      128       -7     
Impacted Files Coverage Δ
python/taichi/lang/ops.py 92.83% <0.00%> (-0.83%) ⬇️
python/taichi/lang/common_ops.py 93.86% <0.00%> (+3.25%) ⬆️
python/taichi/lang/matrix.py 90.13% <0.00%> (+4.12%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

xumingkuan (Contributor) left a comment:

I'll take a pass tomorrow.

Co-authored-by: Ye Kuang <[email protected]>
Co-authored-by: xumingkuan <[email protected]>
xumingkuan (Contributor) left a comment:

LGTM in general!

We may need something like the following code in maybe_same_address for better aliasing analysis:

  if (var1->is<ThreadLocalPtrStmt>() || var2->is<ThreadLocalPtrStmt>()) {
    if (!var1->is<ThreadLocalPtrStmt>() || !var2->is<ThreadLocalPtrStmt>())
      return false;
    return var1->as<ThreadLocalPtrStmt>()->offset ==
           var2->as<ThreadLocalPtrStmt>()->offset;
  }
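The rule being proposed can be rendered as a tiny standalone sketch in Python (illustration only, not Taichi's actual C++ implementation; `Ptr`, `is_tls`, and `offset` are hypothetical names): thread-local memory is private to a thread, so a TLS pointer can never alias global memory, and two TLS pointers alias exactly when their offsets match.

```python
from collections import namedtuple

# Hypothetical stand-in for a pointer statement in the IR
Ptr = namedtuple("Ptr", ["is_tls", "offset"])

def maybe_same_address(var1, var2):
    if var1.is_tls or var2.is_tls:
        # A TLS pointer never aliases a non-TLS (global) pointer.
        if not (var1.is_tls and var2.is_tls):
            return False
        # Two TLS pointers alias iff they refer to the same offset.
        return var1.offset == var2.offset
    return True  # conservative answer for all other pointer kinds
```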

  for (auto dest : atomic_destinations) {
    // check if there is any other global load/store/atomic operations
    auto related_global_mem_ops =
        irpass::analysis::gather_statements(offload, [&](Stmt *stmt) {
Contributor:

When gather_statements is redesigned, we can let it support early-reject -- in many cases, we only need to know if the returned std::vector is empty.
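A sketch of the early-reject idea in Python (names are illustrative, not Taichi's API): when the caller only checks whether any statement matches, scanning can stop at the first hit instead of materializing the whole list.

```python
def gather_statements(stmts, test):
    # current behavior: materialize the full list of matches
    return [s for s in stmts if test(s)]

def any_statement(stmts, test):
    # early-reject variant: stops scanning at the first match
    return any(test(s) for s in stmts)

stmts = list(range(1_000_000))
# Both agree on "is the result non-empty?", but the second
# returns immediately once a match is found.
full = bool(gather_statements(stmts, lambda s: s == 0))
fast = any_statement(stmts, lambda s: s == 0)
```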

yuanming-hu (Member, author):

Yeah, that sounds like a good specialization of the map-reduce design.

if (offload->prologue == nullptr) {
  offload->prologue = std::make_unique<Block>();
}
irpass::analysis::clone(dest);
Contributor:

Why is the return value unused here?

yuanming-hu (Member, author):

Good point - I guess this line is useless.

yuanming-hu (Member, author):

We may need something like the following code in maybe_same_address for better aliasing analysis:

  if (var1->is<ThreadLocalPtrStmt>() || var2->is<ThreadLocalPtrStmt>()) {
    if (!var1->is<ThreadLocalPtrStmt>() || !var2->is<ThreadLocalPtrStmt>())
      return false;
    return var1->as<ThreadLocalPtrStmt>()->offset ==
           var2->as<ThreadLocalPtrStmt>()->offset;
  }

Sounds good - we should do that in a future PR next week.

xumingkuan (Contributor) left a comment:

LGTM now. Cool!

Comment on lines +10 to +12
bool is_atomic_op_linear(AtomicOpType op_type) {
return op_type == AtomicOpType::add || op_type == AtomicOpType::sub;
}
KLozes (Collaborator):

Why only add/sub? I'm sure we would see huge speedups applying this to max/min and others as well. I think we will need to load the original value into the thread-local pointer in the prologue instead of zero-filling.

KLozes (Collaborator) commented Jun 21, 2020:

Maybe just a nit, but the $ numbers are all out of order in the prologue and epilogue:

kernel {
  $0 = offloaded range_for(0, 1048576) block_dim=adaptive  
  prologue {
    <i32 x1> $64 = thread local ptr (offset = 0 B)
    <i32 x1> $65 = const [0]
    <i32 x1> $66 : global store [$64 <- $65]
  }
  body {
    <i32 x1> $1 = thread local ptr (offset = 0 B)
    <i32 x1> $2 = loop $0 index 0
    <gen*x1> $3 = get root
    <i32 x1> $4 = const [0]
    <gen*x1> $5 = [S0root][root]::lookup($3, $4) activate = false
    <gen*x1> $6 = get child [S0root->S1dense] $5
    <gen*x1> $7 = [S1dense][dense]::lookup($6, $2) activate = false
    <i32*x1> $8 = get child [S1dense->S2place_i32] $7
    <i32 x1> $9 = global load $8
    <i32 x1> $10 = global load $1
    <i32 x1> $11 = add $10 $9
    <i32 x1> $12 : global store [$1 <- $11]
  }
  epilogue {
    <i32 x1> $68 = thread local ptr (offset = 0 B)
    <i32 x1> $69 = global load $68
    <gen*x1> $82 = get root
    <i32 x1> $98 = const [0]
    <gen*x1> $84 = [S0root][root]::lookup($82, $98) activate = false
    <gen*x1> $85 = get child [S0root->S3dense] $84
    <gen*x1> $87 = [S3dense][dense]::lookup($85, $98) activate = false
    <i32*x1> $88 = get child [S3dense->S4place_i32] $87
    <i32 x1> $71 = atomic add($88, $69)
  }
}

yuanming-hu (Member, author):

Why only add/sub? I'm sure we would see huge speed ups applying this to max/min and others as well.

Yeah we can add that later. I'm using add/sub just because it's easier to test.

I think we will need to load the original value to thread local pointer in the prologue instead of zero fill.

Actually I think zero-fill leads to the correct behavior. We are only accumulating local contributions here, and local contributions start with 0.

KLozes (Collaborator) commented Jun 21, 2020:

Right, I just meant you may need to load the original value for the min/max cases. Sounds good though, thanks!
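The point about min/max can be sketched concretely: extending the transform beyond add/sub only changes the prologue's initial value to the reduction's identity element, so no load of the old global value is needed. A hypothetical illustration (not Taichi's implementation; `tls_reduce` and `REDUCTIONS` are made-up names):

```python
import operator

# op -> (combine function, identity element for the prologue fill)
REDUCTIONS = {
    "add": (operator.add, 0),
    "max": (max, float("-inf")),
    "min": (min, float("inf")),
}

def tls_reduce(values, op_name):
    op, identity = REDUCTIONS[op_name]
    local = identity          # prologue: identity-fill (zero-fill for add)
    for v in values:          # body: purely thread-local accumulation
        local = op(local, v)
    return local              # epilogue: merge into the global atomically
```

Zero-fill is just the special case where the identity of addition is 0; max/min use -inf/+inf instead.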

Contributor:

the $ numbers are all out of order for the prologue and epilogue

Maybe we should just use BasicStmtVisitor instead of IRVisitor for re_id...

KLozes (Collaborator) left a comment:

LGTM here! I'm not super familiar with IR transforms yet since I haven't done much work on them, though.

k-ye (Member) left a comment:

Great!

@yuanming-hu yuanming-hu merged commit dac7724 into taichi-dev:master Jun 22, 2020