
IR optimization should not eliminate stmts across the offloaded task boundaries? #729

Closed · k-ye opened this issue Apr 8, 2020 · 21 comments
Labels: ir (IR related issues), potential bug (Something that looks like a bug but not yet confirmed)

k-ye commented Apr 8, 2020

Describe the bug

I noticed that the latest IR passes can eliminate constants across offloaded task boundaries. This unfortunately broke the Metal backend. (Note that this is not broken at HEAD; it only broke my local work, since I'm replacing Metal's own passes with compile_to_offloads().)

On Metal, each offloaded task is compiled into a separate kernel function, and a ConstStmt is translated directly to const auto {stmt->name()} = {stmt->val.stringify()};.

I wonder if this optimization applies only to constants. If so, maybe I can predefine all the ConstStmts as global constants in the generated source code. But if it also applies to non-constants, then it's probably a real bug. Also, it doesn't break the LLVM backends. Is that because LLVM inlines the constants directly before producing the machine code?

Log/Screenshots

Detected from test_local_atomics.py::test_implicit_local_atomic_or
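For context, the test body looks roughly like this (a reconstruction from the IR dumps below, not the verbatim test; the field A and its exact layout are assumptions):

import taichi as ti

ti.init()

A = ti.var(ti.i32, shape=())  # layout assumed; the IR below shows it under a dense SNode

@ti.kernel
def func():
    s = 0                 # -> const [0] stored to a global tmp var
    for i in range(10):   # -> offloaded range_for(0, 10)
        s |= 2 ** i       # -> pow, then an implicit atomic bit_or
    A[None] = s           # -> global load + SNode lookups + global store

func()
assert A[None] == 2 ** 10 - 1  # OR of 2^0 .. 2^9 == 1023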

Good

kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]  <--
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
  }
  $6 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $7 = const [2]
    <i32 x1> $8 = loop index 0
    <i32 x1> $9 = pow $7 $8
    <i32*x1> $10 = global tmp var (offset = 0 B)
    <i32 x1> $11 = atomic bit_or($10, $9)
  }
  $12 = offloaded  {
    <i32*x1> $13 = global tmp var (offset = 0 B)
    <i32 x1> $14 = global load $13
    <gen*x1> $15 = get root
    <i32 x1> $16 = const [0]  <-- OK
    <gen*x1> $17 = [S0root][root]::lookup($15, $16) activate = false
    <gen*x1> $18 = get child [S0root->S1dense] $17
    <gen*x1> $19 = [S1dense][dense]::lookup($18, $16) activate = false
    <i32*x1> $20 = get child [S1dense->S2place_i32] $19
    <i32*x1> $21 : global store [$20 <- $14]
  }
}

Broken

  • Note that $2 in the first offloaded task is referenced in the third one.
kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]  <--
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
  }
  $6 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $7 = const [2]
    <i32 x1> $8 = loop index 0
    <i32 x1> $9 = pow $7 $8
    <i32*x1> $10 = global tmp var (offset = 0 B)
    <i32 x1> $11 = atomic bit_or($10, $9)
  }
  $12 = offloaded  {
    <i32*x1> $13 = global tmp var (offset = 0 B)
    <i32 x1> $14 = global load $13
    <gen*x1> $15 = get root
    <gen*x1> $16 = [S0root][root]::lookup($15, $2) activate = false  <-- refers to $2!
    <gen*x1> $17 = get child [S0root->S1dense] $16
    <gen*x1> $18 = [S1dense][dense]::lookup($17, $2) activate = false  <-- refers to $2!
    <i32*x1> $19 = get child [S1dense->S2place_i32] $18
    <i32*x1> $20 : global store [$19 <- $14]
  }
}
k-ye added the "potential bug" and "ir" labels and changed the title from "IR optimization should not eliminate constants across offloaded tasks?" to "IR optimization should not eliminate stmts across the offloaded task boundaries?" on Apr 8, 2020.
archibate commented Apr 8, 2020

Maybe introduced in #510? This problem would never occur if we treated OffloadedStmt as a block, like IfStmt.
Or would this also break with IfStmt?
Also, this should break the OpenGL tests too, but it didn't...

k-ye commented Apr 8, 2020

This problem would never occur if we treated OffloadedStmt as a block, like IfStmt.

I disagree. We have to launch each offloaded task as a separate kernel so that the GPU backends can properly establish memory-access dependencies between those kernels. Otherwise, if two offloaded tasks in the same kernel/shader write and read the same element, it becomes very difficult to keep the data synchronized across GPU threads.
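As a minimal sketch (assuming two fields x and y; an illustration, not code from this issue): each top-level for below is compiled into its own offloaded task, and launching the tasks as separate GPU kernels is what guarantees that every write from the first task is visible before the second one starts.

import taichi as ti

ti.init(arch=ti.metal)  # or ti.opengl / ti.cuda

x = ti.var(ti.i32, shape=16)
y = ti.var(ti.i32, shape=16)

@ti.kernel
def two_tasks():
    for i in x:          # offloaded task 1: writes x in parallel
        x[i] = i
    for i in x:          # offloaded task 2: must observe every write from task 1
        y[i] = x[i] * 2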

Or would this also break with IfStmt?

Probably not.

Also, this should break the OpenGL tests too, but it didn't...

As I mentioned, it's not broken at HEAD, only in my local repo.

yuanming-hu commented

Thanks for pointing this out. I find that if I disable the recently introduced advanced optimizations, I get:

kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
  }
  $6 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $7 = const [2]
    <i32 x1> $8 = loop index 0
    <i32 x1> $9 = pow $7 $8
    <i32*x1> $10 = global tmp var (offset = 0 B)
    <i32 x1> $11 = atomic bit_or($10, $9)
  }
  $12 = offloaded  {
    <i32*x1> $13 = global tmp var (offset = 0 B)
    <i32 x1> $14 = global load $13
    <gen*x1> $15 = get root
    <i32 x1> $16 = linearized(ind {}, stride {})
  <gen*x1> $17 = [S0root][root]::lookup($15, $16) activate = false # refers to $16 instead of $2
    <gen*x1> $18 = get child [S0root->S1dense] $17
  <gen*x1> $19 = [S1dense][dense]::lookup($18, $16) activate = false # refers to $16 instead of $2
    <i32*x1> $20 = get child [S1dense->S2place_i32] $19
    <i32*x1> $21 : global store [$20 <- $14]
  }
}

Could you confirm that setting

const bool advanced_optimization = true;

to false makes this issue go away on your end? @k-ye That would help us narrow down the potential sources. Thanks.

yuanming-hu commented

@xumingkuan could you take a look at this? Thanks.

xumingkuan commented

Thanks for pointing this out -- I wasn't aware of optimization across offloaded task boundaries. Maybe it would be better to rename it to OffLoadedTask to show explicitly that it's not a plain statement.

However, I didn't write an optimization that eliminates constants, so I still need to find where the bug was introduced.

Is compile_to_offloads() called before each of the backends in the latest IR?

xumingkuan commented Apr 8, 2020

If we remove all advanced optimizations except lowering Linearized, the bug still exists. It was introduced in this simplify pass (in Simplified II):

before simplify
==========
kernel {
  <i32 x1> $0 = alloca
  <i32 x1> $1 = const [0]
  <i32 x1> $2 : local store [$0 <- $1]
  <i32 x1> $3 = alloca
  <i32 x1> $4 = const [10]
  for $3 in range($1, $4, step 1) {
    <i32 x1> $6 = const [2]
    <i32 x1> $7 = local load [ [$3[0]]]
    <i32 x1> $8 = pow $6 $7
    <i32 x1> $9 = atomic bit_or($0, $8)
  }
  <i32 x1> $10 = local load [ [$0[0]]]
  <i32*x1> $11 = global ptr [S2place_i32], index [] activate=true
  <gen*x1> $12 = get root
  <i32 x1> $13 = linearized(ind {}, stride {})
  <gen*x1> $14 = [S0root][root]::lookup($12, $13) activate = false
  <gen*x1> $15 = get child [S0root->S1dense] $14
  <i32 x1> $16 = linearized(ind {}, stride {})
  <gen*x1> $17 = [S1dense][dense]::lookup($15, $16) activate = false
  <i32*x1> $18 = get child [S1dense->S2place_i32] $17
  <i32 x1> $19 = shuffle $18[0]
  <i32 x1> $20 : global store [$19 <- $10]
}
==========
after simplify
==========
kernel {
  <i32 x1> $0 = alloca
  <i32 x1> $1 = const [0]
  <i32 x1> $2 : local store [$0 <- $1]
  <i32 x1> $3 = alloca
  <i32 x1> $4 = const [10]
  for $3 in range($1, $4, step 1) {
    <i32 x1> $6 = const [2]
    <i32 x1> $7 = local load [ [$3[0]]]
    <i32 x1> $8 = pow $6 $7
    <i32 x1> $9 = atomic bit_or($0, $8)
  }
  <i32 x1> $10 = local load [ [$0[0]]]
  <i32*x1> $11 = global ptr [S2place_i32], index [] activate=true
  <gen*x1> $12 = get root
  <gen*x1> $14 = [S0root][root]::lookup($12, $1) activate = false
  <gen*x1> $15 = get child [S0root->S1dense] $14
  <gen*x1> $17 = [S1dense][dense]::lookup($15, $1) activate = false
  <i32*x1> $18 = get child [S1dense->S2place_i32] $17
  <i32 x1> $20 : global store [$18 <- $10]
}
==========

They are in the same block, so it would be natural to eliminate the duplicate const [0]s. However, the block is later divided into several offloaded tasks.

@yuanming-hu do you have any idea how to systematically deal with this?

xumingkuan commented

Shall we copy the constants across offloaded blocks in offload()? Are there any other issues?
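A toy sketch of this idea in plain Python (the Stmt class and list-of-lists task representation are simplifications, not Taichi's actual C++ pass code): since consts are pure, cloning one into every task that uses it never changes semantics.

from dataclasses import dataclass, field

@dataclass
class Stmt:
    op: str                                 # e.g. "const", "lookup", "store"
    operands: list = field(default_factory=list)

def localize_constants(tasks):
    # For each offloaded task, clone any const that is used here but
    # defined in another task, then rewrite the uses so that no operand
    # crosses a task boundary.
    for task in tasks:
        defined = {id(s) for s in task}
        clones = {}
        for s in task:
            for i, op in enumerate(s.operands):
                if isinstance(op, Stmt) and op.op == "const" and id(op) not in defined:
                    if id(op) not in clones:
                        clones[id(op)] = Stmt("const", list(op.operands))
                    s.operands[i] = clones[id(op)]
        task[:0] = list(clones.values())    # define the clones at the top of the task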

yuanming-hu commented

Sorry, I was in a meeting. Let me see what could be wrong here.

yuanming-hu commented Apr 8, 2020

Looks like a bug in offload. I'll fix that tonight.

k-ye commented Apr 8, 2020

Thank you all for the help! Yes, I will take a look at the advanced optimizations after work...

yuanming-hu commented

With the fix in #730 I get

==========
kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 4 B)
    <i32*x1> $5 : global store [$4 <- $2]
    <i32*x1> $6 = global tmp var (offset = 0 B)
    <i32*x1> $7 = global tmp var (offset = 4 B)
    <i32 x1> $8 = global load $7
    <i32*x1> $9 : global store [$6 <- $8]
  }
  $10 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $11 = const [2]
    <i32 x1> $12 = loop index 0
    <i32 x1> $13 = pow $11 $12
    <i32*x1> $14 = global tmp var (offset = 0 B)
    <i32 x1> $15 = atomic bit_or($14, $13)
  }
  $16 = offloaded  {
    <i32*x1> $17 = global tmp var (offset = 0 B)
    <i32 x1> $18 = global load $17
    <gen*x1> $19 = get root
    <i32*x1> $20 = global tmp var (offset = 4 B)
    <i32 x1> $21 = global load $20
    <gen*x1> $22 = [S0root][root]::lookup($19, $21) activate = false
    <gen*x1> $23 = get child [S0root->S1dense] $22
    <i32*x1> $24 = global tmp var (offset = 4 B)
    <i32 x1> $25 = global load $24
    <gen*x1> $26 = [S1dense][dense]::lookup($23, $25) activate = false
    <i32*x1> $27 = get child [S1dense->S2place_i32] $26
    <i32*x1> $28 : global store [$27 <- $18]
  }
}
==========

which looks more reasonable...

xumingkuan commented

(Some random thoughts for further optimization: are the following two IRs the same?)

$0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
}

and

$0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]
    <i32*x1> $3 : global store [$1 <- $2]
}

yuanming-hu commented

Yes - clearly we have more opportunities for optimizations now...

k-ye commented Apr 9, 2020

(Sorry, I don't have time for real coding right now.)

With the fix in #730 I get

One concern I have is that this turns a compile-time constant into a runtime one, which now needs to read global memory.

FYI, yesterday I got around this problem by predefining all the constants at the global level. Thanks to SSA + unique stmt names, this worked fine across different offloaded tasks. I think this can be done for GLSL as well? However, that puts restrictions on how backends can be implemented. Maybe it would be better to split ConstStmt into a ConstDefStmt and a ConstRefStmt.
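A minimal sketch of that workaround (the real Metal codegen is C++; the helper and parameter names here are illustrative): every ConstStmt in the kernel is hoisted to file scope in the generated source, so any per-task kernel function can reference it by its unique SSA name.

def emit_metal_source(consts, task_bodies):
    # consts: [(ssa_name, metal_type, literal)], e.g. ("tmp2", "int32_t", "0")
    # task_bodies: the already-generated body string of each offloaded task
    lines = [f"constant {ty} {name} = {lit};" for name, ty, lit in consts]
    for i, body in enumerate(task_bodies):
        lines.append(f"kernel void offloaded_task_{i}(device char *root [[buffer(0)]]) {{")
        lines.append(body)
        lines.append("}")
    return "\n".join(lines)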

That said, I think we should patch in #730 first, then create a separate issue to track the performance impact of this change?

(Some random thoughts for further optimization: are the following two IRs the same?)

I don't think so? What if another kernel doesn't know about global $1 and reads global $4?

Maybe it would be better to rename it to OffLoadedTask to explicitly show that it's not a plain statement.

+N. I think this has come up a few times now while working on the IR...

xumingkuan commented

(Some random thoughts for further optimization: are the following two IRs the same?)

I don't think so? What if another kernel doesn't know about global $1 and reads global $4?

I think other kernels cannot access the labels $1 or $4 in this kernel. Within the other offloads of this kernel, however, we should also change all global $4 to global $1 when this optimization is done.

xumingkuan commented

With the fix in #730 I get

One concern I have is that this turns a compile-time constant into a runtime one, which now needs to read global memory.

Oh yes... It's indeed non-trivial to optimize them back into compile-time constants: we don't know whether other kernels modify the global memory between the offloads of this kernel.

k-ye commented Apr 9, 2020

however, we should also change all global $4 to global $1 when this optimization is done.

Sorry for my confusing terms. By "kernel" I was referring to an offloaded task within the same Taichi kernel... I agree with your claim, but it seems like there would be lots of edge cases, e.g. some offloaded task reads global ($4 + 0). (I don't think this is actually a possible output in Taichi, but it's not forbidden in theory.)

xumingkuan commented

however, we should also change all global $4 to global $1 when this optimization is done.

Sorry for my confusing terms. By "kernel" I was referring to an offloaded task within the same Taichi kernel... I agree with your claim, but it seems like there would be lots of edge cases, e.g. some offloaded task reads global ($4 + 0). (I don't think this is actually a possible output in Taichi, but it's not forbidden in theory.)

"change all global $4 to global $1" could be done with irpass::replace_all_usages_with(). It automatically handles these cases.

k-ye commented Apr 9, 2020

Ah, I see where I was thinking wrong. I just noticed that in the example you gave, both $1 and $4 point to the same byte offset... Sorry about that.

k-ye commented Apr 9, 2020

Re #729 (comment)

Right, after disabling it I got the same IR, and Metal works fine.

k-ye commented Apr 14, 2020

Closed by #730.
