
[Perf] Thread local storage for range-for reductions on CPUs #1296

Merged: 16 commits into taichi-dev:master on Jun 22, 2020

Conversation

yuanming-hu (Member) commented Jun 21, 2020:

Related issue = #576

Benchmark:

import taichi as ti

ti.init(print_ir=True, kernel_profiler=True)

N = 1024 * 1024

a = ti.var(ti.i32, shape=N)
tot = ti.var(ti.i32, shape=())


@ti.kernel
def fill():
    for i in a:
        a[i] = i


@ti.kernel
def reduce():
    for i in a:
        tot[None] += a[i]


fill()

for i in range(10):
    reduce()

ground_truth = 10 * N * (N - 1) // 2 % 2**32  # integer division keeps the value exact
assert tot[None] % 2**32 == ground_truth
ti.kernel_profiler_print()

Before: 17ms
After: 0.48ms (35x faster)
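The speedup comes from replacing one atomic update per loop iteration with a thread-local accumulator that is merged into the global once per thread. A minimal sketch of that pattern in plain Python (illustration only; names like `reduce_chunk` are hypothetical and this is not Taichi's generated code):

```python
import threading

N = 1 << 20
data = list(range(N))
total = 0
lock = threading.Lock()

def reduce_chunk(chunk):
    global total
    # prologue: zero-initialize the thread-local accumulator
    local = 0
    # body: accumulate locally, with no synchronization on the hot path
    for x in chunk:
        local += x
    # epilogue: a single locked (atomic) merge per thread
    with lock:
        total += local

threads = [threading.Thread(target=reduce_chunk, args=(data[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After joining, `total` equals `N * (N - 1) // 2`, matching the serial sum.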



Design doc:

[design doc image attached]

codecov bot commented Jun 21, 2020:

Codecov Report

Merging #1296 into master will increase coverage by 0.86%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #1296      +/-   ##
==========================================
+ Coverage   84.72%   85.59%   +0.86%     
==========================================
  Files          18       18              
  Lines        3274     3283       +9     
  Branches      616      621       +5     
==========================================
+ Hits         2774     2810      +36     
+ Misses        365      345      -20     
+ Partials      135      128       -7     
Impacted Files Coverage Δ
python/taichi/lang/ops.py 92.83% <0.00%> (-0.83%) ⬇️
python/taichi/lang/common_ops.py 93.86% <0.00%> (+3.25%) ⬆️
python/taichi/lang/matrix.py 90.13% <0.00%> (+4.12%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

xumingkuan (Contributor) left a comment:

I'll take a pass tomorrow.

Co-authored-by: Ye Kuang <[email protected]>
Co-authored-by: xumingkuan <[email protected]>
xumingkuan (Contributor) left a comment:

LGTM in general!

We may need something like the following code in maybe_same_address for better aliasing analysis:

  if (var1->is<ThreadLocalPtrStmt>() || var2->is<ThreadLocalPtrStmt>()) {
    if (!var1->is<ThreadLocalPtrStmt>() || !var2->is<ThreadLocalPtrStmt>())
      return false;
    return var1->as<ThreadLocalPtrStmt>()->offset ==
           var2->as<ThreadLocalPtrStmt>()->offset;
  }
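The rule being proposed can be rendered as a tiny standalone sketch in Python (illustration only, not Taichi's actual C++ implementation; `Ptr`, `is_tls`, and `offset` are hypothetical names): thread-local memory is private to a thread, so a TLS pointer can never alias global memory, and two TLS pointers alias exactly when their offsets match.

```python
from collections import namedtuple

# Hypothetical stand-in for a pointer statement in the IR
Ptr = namedtuple("Ptr", ["is_tls", "offset"])

def maybe_same_address(var1, var2):
    if var1.is_tls or var2.is_tls:
        # A TLS pointer never aliases a non-TLS (global) pointer.
        if not (var1.is_tls and var2.is_tls):
            return False
        # Two TLS pointers alias iff they refer to the same offset.
        return var1.offset == var2.offset
    return True  # conservative answer for all other pointer kinds
```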

  for (auto dest : atomic_destinations) {
    // check if there is any other global load/store/atomic operations
    auto related_global_mem_ops =
        irpass::analysis::gather_statements(offload, [&](Stmt *stmt) {
Contributor:

When gather_statements is redesigned, we can let it support early-reject -- in many cases, we only need to know if the returned std::vector is empty.
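A sketch of the early-reject idea in Python (names are illustrative, not Taichi's API): when the caller only checks whether any statement matches, scanning can stop at the first hit instead of materializing the whole list.

```python
def gather_statements(stmts, test):
    # current behavior: materialize the full list of matches
    return [s for s in stmts if test(s)]

def any_statement(stmts, test):
    # early-reject variant: stops scanning at the first match
    return any(test(s) for s in stmts)

stmts = list(range(1_000_000))
# Both agree on "is the result non-empty?", but the second
# returns immediately once a match is found.
full = bool(gather_statements(stmts, lambda s: s == 0))
fast = any_statement(stmts, lambda s: s == 0)
```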

yuanming-hu (Member, author):

Yeah, that sounds like a good specialization of the map-reduce design.

if (offload->prologue == nullptr) {
  offload->prologue = std::make_unique<Block>();
}
irpass::analysis::clone(dest);
Contributor:

Why is the return value unused here?

yuanming-hu (Member, author):

Good point - I guess this line is useless.

yuanming-hu (Member, author):

We may need something like the following code in maybe_same_address for better aliasing analysis:

  if (var1->is<ThreadLocalPtrStmt>() || var2->is<ThreadLocalPtrStmt>()) {
    if (!var1->is<ThreadLocalPtrStmt>() || !var2->is<ThreadLocalPtrStmt>())
      return false;
    return var1->as<ThreadLocalPtrStmt>()->offset ==
           var2->as<ThreadLocalPtrStmt>()->offset;
  }

Sounds good - we should do that in a future PR next week.

xumingkuan (Contributor) left a comment:

LGTM now. Cool!

Comment on lines +10 to +12
bool is_atomic_op_linear(AtomicOpType op_type) {
return op_type == AtomicOpType::add || op_type == AtomicOpType::sub;
}
KLozes (Collaborator):

Why only add/sub? I'm sure we would see huge speedups applying this to max/min and others as well. I think we will need to load the original value into the thread-local pointer in the prologue instead of zero-filling.

KLozes (Collaborator) commented Jun 21, 2020:

Maybe just a nit, but the $ numbers are all out of order in the prologue and epilogue:

kernel {
  $0 = offloaded range_for(0, 1048576) block_dim=adaptive  
  prologue {
    <i32 x1> $64 = thread local ptr (offset = 0 B)
    <i32 x1> $65 = const [0]
    <i32 x1> $66 : global store [$64 <- $65]
  }
  body {
    <i32 x1> $1 = thread local ptr (offset = 0 B)
    <i32 x1> $2 = loop $0 index 0
    <gen*x1> $3 = get root
    <i32 x1> $4 = const [0]
    <gen*x1> $5 = [S0root][root]::lookup($3, $4) activate = false
    <gen*x1> $6 = get child [S0root->S1dense] $5
    <gen*x1> $7 = [S1dense][dense]::lookup($6, $2) activate = false
    <i32*x1> $8 = get child [S1dense->S2place_i32] $7
    <i32 x1> $9 = global load $8
    <i32 x1> $10 = global load $1
    <i32 x1> $11 = add $10 $9
    <i32 x1> $12 : global store [$1 <- $11]
  }
  epilogue {
    <i32 x1> $68 = thread local ptr (offset = 0 B)
    <i32 x1> $69 = global load $68
    <gen*x1> $82 = get root
    <i32 x1> $98 = const [0]
    <gen*x1> $84 = [S0root][root]::lookup($82, $98) activate = false
    <gen*x1> $85 = get child [S0root->S3dense] $84
    <gen*x1> $87 = [S3dense][dense]::lookup($85, $98) activate = false
    <i32*x1> $88 = get child [S3dense->S4place_i32] $87
    <i32 x1> $71 = atomic add($88, $69)
  }
}

yuanming-hu (Member, author):

Why only add/sub? I'm sure we would see huge speed ups applying this to max/min and others as well.

Yeah we can add that later. I'm using add/sub just because it's easier to test.

I think we will need to load the original value to thread local pointer in the prologue instead of zero fill.

Actually I think zero-fill leads to the correct behavior. We are only accumulating local contributions here, and local contributions start with 0.

KLozes (Collaborator) commented Jun 21, 2020:

Right, I just meant you may need to load the original value for the min/max cases. Sounds good though, thanks!
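The point about min/max can be sketched concretely: extending the transform beyond add/sub only changes the prologue's initial value to the reduction's identity element, so no load of the old global value is needed. A hypothetical illustration (not Taichi's implementation; `tls_reduce` and `REDUCTIONS` are made-up names):

```python
import operator

# op -> (combine function, identity element for the prologue fill)
REDUCTIONS = {
    "add": (operator.add, 0),
    "max": (max, float("-inf")),
    "min": (min, float("inf")),
}

def tls_reduce(values, op_name):
    op, identity = REDUCTIONS[op_name]
    local = identity          # prologue: identity-fill (zero-fill for add)
    for v in values:          # body: purely thread-local accumulation
        local = op(local, v)
    return local              # epilogue: merge into the global atomically
```

Zero-fill is just the special case where the identity of addition is 0; max/min use -inf/+inf instead.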

Contributor:

the $ numbers are all out of order for the prologue and epilogue

Maybe we should just use BasicStmtVisitor instead of IRVisitor for re_id...

KLozes (Collaborator) left a comment:

LGTM here! I'm not super familiar with IR transforms yet since I haven't done much work on them, though.

k-ye (Member) left a comment:

Great!

@yuanming-hu yuanming-hu merged commit dac7724 into taichi-dev:master Jun 22, 2020