Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Opt] Dead store and stack-related operation elimination by control-flow graph #1324

Merged
merged 9 commits into from
Jun 26, 2020

Conversation

xumingkuan
Copy link
Contributor

@xumingkuan xumingkuan commented Jun 25, 2020

Related issue = #926 #656
Related PR = #932 #1248
Replaces #859 #857

Change in build_cfg:

  • Add an empty final_node. Then assume there are reads to all global pointers after the final node (just like there are writes to all global pointers before the start node), for more convenient analysis on global pointers.
  • Tell each CFGNode if they are parallel executed (i.e. in an offloaded range_for/struct_for's body).

Change in CFGNode:

Change in compile_to_offloads:

  • Remove the variable_optimization pass. (Faster compilation!)

Benchmark:

Compilation time (#926 (comment)):
6.872m -> 1.164m
(codegen_kernel_statements: 29087 -> 28573)

Number of statements:

benchmark20200625_2

The only case that becomes worse is:

@ti.all_archs
def test_offload_with_cross_block_locals3():
    ret = ti.var(ti.f32, shape=())

    @ti.kernel
    def ker():
        s = 1
        t = s
        for i in range(10):
            s += i
        ret[None] = t

    ker()

    assert ret[None] == 1

Previously, we didn't consider global temp as global ptrs, so we could simplify the kernel to ret[None] = 1...

[Click here for the format server]


@xumingkuan
Copy link
Contributor Author

We can weaken an AtomicOpStmt to a load if there are no loads after it:

$1 = alloca
$2 = local store $1 <- const[10]
$3 = atomic add($1, const[5]) (can be replaced with local load $1)
$4 = print $3

We can do this optimization not only when atomic->dest is an alloca but also when it's a global temp -- as long as there are no loads after it before the next store.

However, we can't do this optimization if it's parallel executed:

$0 = offloaded range_for(0, 8) {
  $1 = global temp [offset=0]
  $2 = atomic add($1, const[5]) (cannot be replaced!)
  $3 = global ptr a
  $4 = global store $3 <- $2
}

@xumingkuan
Copy link
Contributor Author

How to deal with the "load" in offloaded range for?

kernel {
  $0 = offloaded
  body {
    <i32*x1> $1 = global ptr [S4place_i32], index [] activate=false
    <i32 x1> $2 = global load $1
    <i32*x1> $3 = global tmp var (offset = 0 B)
    <i32*x1> $4 : global store [$3 <- $2]
    <i32*x1> $5 = global ptr [S6place_i32], index [] activate=false
    <i32 x1> $6 = global load $5
    <i32*x1> $7 = global tmp var (offset = 4 B)
    <i32*x1> $8 : global store [$7 <- $6]
  }
  $9 = offloaded range_for(tmp(offset=0B), tmp(offset=4B)) block_dim=adaptive
  body {
    <i32 x1> $10 = loop $9 index 0
    <i32*x1> $11 = global ptr [S2place_i32], index [] activate=true
    <i32 x1> $12 = atomic add($11, $10)
  }
}

If we don't deal with it, the global stores will be eliminated.

We need something like this:

void visit(OffloadedStmt *stmt) override {
if (stmt->task_type == stmt->range_for) {
TI_ASSERT(!maybe_run);
if (!stmt->const_begin) {
TI_ASSERT(state_machines.find(stmt->begin_offset) !=
state_machines.end());
state_machines[stmt->begin_offset].load();
}
if (!stmt->const_end) {
TI_ASSERT(state_machines.find(stmt->end_offset) !=
state_machines.end());
state_machines[stmt->end_offset].load();
}
}

@yuanming-hu Do you have any ideas? I think we probably need to traverse the IR to find a global temp of the same offset with offload->begin_offset, and add an std::vector for each CFGNode denoting the loads not representable in the node...

@xumingkuan xumingkuan added the large changeset The PR contains a large changeset and reviewing may take times label Jun 25, 2020
@xumingkuan
Copy link
Contributor Author

Are there better ways to support offloaded range fors without doing implicit loads...?

@codecov
Copy link

codecov bot commented Jun 25, 2020

Codecov Report

Merging #1324 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1324   +/-   ##
=======================================
  Coverage   85.48%   85.48%           
=======================================
  Files          19       19           
  Lines        3375     3375           
  Branches      630      630           
=======================================
  Hits         2885     2885           
  Misses        358      358           
  Partials      132      132           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a434d3...00d1666. Read the comment docs.

@xumingkuan xumingkuan changed the title [Opt] Dead store and dead stack-operation elimination by control-flow graph [Opt] Dead store and stack-related operation elimination by control-flow graph Jun 25, 2020
@xumingkuan xumingkuan marked this pull request as ready for review June 25, 2020 20:58
@xumingkuan xumingkuan mentioned this pull request Jun 25, 2020
18 tasks
Copy link
Member

@yuanming-hu yuanming-hu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! LGTM. It's a lot of work. Thank you so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
large changeset The PR contains a large changeset and reviewing may take times
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants