-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Perf] Thread local storage for range-for reductions on CPUs #1296
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1296 +/- ##
==========================================
+ Coverage 84.72% 85.59% +0.86%
==========================================
Files 18 18
Lines 3274 3283 +9
Branches 616 621 +5
==========================================
+ Hits 2774 2810 +36
+ Misses 365 345 -20
+ Partials 135 128 -7
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a pass tomorrow.
Co-authored-by: Ye Kuang <[email protected]> Co-authored-by: xumingkuan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general!
We may need something like the following code in maybe_same_address
for better aliasing analysis:
if (var1->is<ThreadLocalPtrStmt>() || var2->is<ThreadLocalPtrStmt>()) {
if (!var1->is<ThreadLocalPtrStmt>() || !var2->is<ThreadLocalPtrStmt>())
return false;
return var1->as<ThreadLocalPtrStmt>()->offset ==
var2->as<ThreadLocalPtrStmt>()->offset;
}
for (auto dest : atomic_destinations) { | ||
// check if there is any other global load/store/atomic operations | ||
auto related_global_mem_ops = | ||
irpass::analysis::gather_statements(offload, [&](Stmt *stmt) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When gather_statements
is redesigned, we can let it support early-reject -- in many cases, we only need to know if the returned std::vector
is empty
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds like a good specialization of the map-reduce
design.
if (offload->prologue == nullptr) { | ||
offload->prologue = std::make_unique<Block>(); | ||
} | ||
irpass::analysis::clone(dest); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the return value unused here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - I guess this line is useless.
Co-authored-by: xumingkuan <[email protected]>
Sounds good - we should do that in a future PR next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now. Cool!
bool is_atomic_op_linear(AtomicOpType op_type) { | ||
return op_type == AtomicOpType::add || op_type == AtomicOpType::sub; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only add/sub? I'm sure we would see huge speed ups applying this to max/min and others as well. I think we will need to load the original value to thread local pointer in the prologue instead of zero fill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just nit, but the $ numbers are all out of order for the prologue and epilogue
kernel {
$0 = offloaded range_for(0, 1048576) block_dim=adaptive
prologue {
<i32 x1> $64 = thread local ptr (offset = 0 B)
<i32 x1> $65 = const [0]
<i32 x1> $66 : global store [$64 <- $65]
}
body {
<i32 x1> $1 = thread local ptr (offset = 0 B)
<i32 x1> $2 = loop $0 index 0
<gen*x1> $3 = get root
<i32 x1> $4 = const [0]
<gen*x1> $5 = [S0root][root]::lookup($3, $4) activate = false
<gen*x1> $6 = get child [S0root->S1dense] $5
<gen*x1> $7 = [S1dense][dense]::lookup($6, $2) activate = false
<i32*x1> $8 = get child [S1dense->S2place_i32] $7
<i32 x1> $9 = global load $8
<i32 x1> $10 = global load $1
<i32 x1> $11 = add $10 $9
<i32 x1> $12 : global store [$1 <- $11]
}
epilogue {
<i32 x1> $68 = thread local ptr (offset = 0 B)
<i32 x1> $69 = global load $68
<gen*x1> $82 = get root
<i32 x1> $98 = const [0]
<gen*x1> $84 = [S0root][root]::lookup($82, $98) activate = false
<gen*x1> $85 = get child [S0root->S3dense] $84
<gen*x1> $87 = [S3dense][dense]::lookup($85, $98) activate = false
<i32*x1> $88 = get child [S3dense->S4place_i32] $87
<i32 x1> $71 = atomic add($88, $69)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only add/sub? I'm sure we would see huge speed ups applying this to max/min and others as well.
Yeah we can add that later. I'm using add/sub just because it's easier to test.
I think we will need to load the original value to thread local pointer in the prologue instead of zero fill.
Actually I think zero-fill leads to the correct behavior. We are only accumulating local contributions here, and local contributions start with 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I just meant you may need to load the original value for min/max cases. Sounds good though thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the $ numbers are all out of order for the prologue and epilogue
Maybe we should just use BasicStmtVisitor
instead of IRVisitor
for re_id
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM here! I'm not super familiar with IR transforms yet since I have done much work on them though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
Related issue = #576
Benchmark:
Before: 17ms
After: 0.48ms (35x faster)
[Click here for the format server]
Design doc: