
IR optimization should not eliminate stmts across the offloaded task boundaries? #729

Closed · k-ye opened this issue Apr 8, 2020 · 21 comments
Labels: ir (IR related issues), potential bug (Something that looks like a bug but not yet confirmed)

k-ye commented Apr 8, 2020

Describe the bug

I noticed that the latest IR passes can eliminate constants across offloaded task boundaries. This unfortunately broke the Metal backend. (Note that this is not broken at HEAD; it only broke my local work, since I'm replacing Metal's own passes with compile_to_offloads().)

On Metal, each offloaded task is compiled into a separate kernel function, and a ConstStmt is translated directly to const auto {stmt->name()} = {stmt->val.stringify()};.

I wonder if this optimization applies only to constants. If so, maybe I can predefine all the ConstStmts as global constants in the generated source code. But if it also applies to non-constants, then it's probably a real bug. Also, it doesn't break the LLVM backends. Is that because LLVM inlines the constants directly before producing the machine code?

Log/Screenshots

Detected from test_local_atomics.py::test_implicit_local_atomic_or
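For context, the test body looks roughly like this (a reconstruction from the IR dumps below, not the verbatim test; the field A and its exact layout are assumptions):

import taichi as ti

ti.init()

A = ti.var(ti.i32, shape=())  # layout assumed; the IR below shows it under a dense SNode

@ti.kernel
def func():
    s = 0                 # -> const [0] stored to a global tmp var
    for i in range(10):   # -> offloaded range_for(0, 10)
        s |= 2 ** i       # -> pow, then an implicit atomic bit_or
    A[None] = s           # -> global load + SNode lookups + global store

func()
assert A[None] == 2 ** 10 - 1  # OR of 2^0 .. 2^9 == 1023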

Good

kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]  <--
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
  }
  $6 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $7 = const [2]
    <i32 x1> $8 = loop index 0
    <i32 x1> $9 = pow $7 $8
    <i32*x1> $10 = global tmp var (offset = 0 B)
    <i32 x1> $11 = atomic bit_or($10, $9)
  }
  $12 = offloaded  {
    <i32*x1> $13 = global tmp var (offset = 0 B)
    <i32 x1> $14 = global load $13
    <gen*x1> $15 = get root
    <i32 x1> $16 = const [0]  <-- OK
    <gen*x1> $17 = [S0root][root]::lookup($15, $16) activate = false
    <gen*x1> $18 = get child [S0root->S1dense] $17
    <gen*x1> $19 = [S1dense][dense]::lookup($18, $16) activate = false
    <i32*x1> $20 = get child [S1dense->S2place_i32] $19
    <i32*x1> $21 : global store [$20 <- $14]
  }
}

Broken

  • Note that $2 in the first offloaded task is referenced in the third one.
kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]  <--
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
  }
  $6 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $7 = const [2]
    <i32 x1> $8 = loop index 0
    <i32 x1> $9 = pow $7 $8
    <i32*x1> $10 = global tmp var (offset = 0 B)
    <i32 x1> $11 = atomic bit_or($10, $9)
  }
  $12 = offloaded  {
    <i32*x1> $13 = global tmp var (offset = 0 B)
    <i32 x1> $14 = global load $13
    <gen*x1> $15 = get root
    <gen*x1> $16 = [S0root][root]::lookup($15, $2) activate = false  <-- refers to $2!
    <gen*x1> $17 = get child [S0root->S1dense] $16
    <gen*x1> $18 = [S1dense][dense]::lookup($17, $2) activate = false  <-- refers to $2!
    <i32*x1> $19 = get child [S1dense->S2place_i32] $18
    <i32*x1> $20 : global store [$19 <- $14]
  }
}
k-ye added the "potential bug" and "ir" labels and changed the title from "IR optimization should not eliminate constants across offloaded tasks?" to "IR optimization should not eliminate stmts across the offloaded task boundaries?" on Apr 8, 2020.
archibate commented Apr 8, 2020

Maybe introduced in #510? This problem would never occur if we treated OffloadedStmt as a block, like IfStmt.
Or would this also break with IfStmt?
Also, this should break the OpenGL tests too, but it didn't...

k-ye commented Apr 8, 2020

This problem would never occur if we treated OffloadedStmt as a block, like IfStmt.

I disagree. We have to launch each offloaded task as a separate kernel so that the GPU backends can properly establish memory-access dependencies between those kernels. Otherwise, if two offloaded tasks in the same kernel/shader write and read the same element, it becomes very difficult to keep the data synchronized across GPU threads.
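As a minimal sketch (assuming two fields x and y; an illustration, not code from this issue): each top-level for below is compiled into its own offloaded task, and launching the tasks as separate GPU kernels is what guarantees that every write from the first task is visible before the second one starts.

import taichi as ti

ti.init(arch=ti.metal)  # or ti.opengl / ti.cuda

x = ti.var(ti.i32, shape=16)
y = ti.var(ti.i32, shape=16)

@ti.kernel
def two_tasks():
    for i in x:          # offloaded task 1: writes x in parallel
        x[i] = i
    for i in x:          # offloaded task 2: must observe every write from task 1
        y[i] = x[i] * 2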

Or would this also break with IfStmt?

Probably not.

Also, this should break the OpenGL tests too, but it didn't...

As I mentioned, it's not broken at HEAD, only in my local repo.

yuanming-hu commented

Thanks for pointing this out. I find that if I disable the recently introduced advanced optimizations, I get:

kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
  }
  $6 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $7 = const [2]
    <i32 x1> $8 = loop index 0
    <i32 x1> $9 = pow $7 $8
    <i32*x1> $10 = global tmp var (offset = 0 B)
    <i32 x1> $11 = atomic bit_or($10, $9)
  }
  $12 = offloaded  {
    <i32*x1> $13 = global tmp var (offset = 0 B)
    <i32 x1> $14 = global load $13
    <gen*x1> $15 = get root
    <i32 x1> $16 = linearized(ind {}, stride {})
  <gen*x1> $17 = [S0root][root]::lookup($15, $16) activate = false # refers to $16 instead of $2
    <gen*x1> $18 = get child [S0root->S1dense] $17
  <gen*x1> $19 = [S1dense][dense]::lookup($18, $16) activate = false # refers to $16 instead of $2
    <i32*x1> $20 = get child [S1dense->S2place_i32] $19
    <i32*x1> $21 : global store [$20 <- $14]
  }
}

Could you confirm that setting

const bool advanced_optimization = true;

to false makes this issue go away on your end? @k-ye That would help us narrow down the potential sources. Thanks.

yuanming-hu commented

@xumingkuan could you take a look at this? Thanks.

xumingkuan commented

Thanks for pointing this out -- I wasn't aware of optimization across offloaded task boundaries. Maybe it would be better to rename it to OffLoadedTask to show explicitly that it's not a plain statement.

However, I didn't write an optimization that eliminates constants, so I still need to find where the bug was introduced.

Is compile_to_offloads() called before each of the backends in the latest IR?

xumingkuan commented Apr 8, 2020

If we remove all advanced optimizations except lowering Linearized, the bug still exists. It was introduced in this simplify pass (in Simplified II):

before simplify
==========
kernel {
  <i32 x1> $0 = alloca
  <i32 x1> $1 = const [0]
  <i32 x1> $2 : local store [$0 <- $1]
  <i32 x1> $3 = alloca
  <i32 x1> $4 = const [10]
  for $3 in range($1, $4, step 1) {
    <i32 x1> $6 = const [2]
    <i32 x1> $7 = local load [ [$3[0]]]
    <i32 x1> $8 = pow $6 $7
    <i32 x1> $9 = atomic bit_or($0, $8)
  }
  <i32 x1> $10 = local load [ [$0[0]]]
  <i32*x1> $11 = global ptr [S2place_i32], index [] activate=true
  <gen*x1> $12 = get root
  <i32 x1> $13 = linearized(ind {}, stride {})
  <gen*x1> $14 = [S0root][root]::lookup($12, $13) activate = false
  <gen*x1> $15 = get child [S0root->S1dense] $14
  <i32 x1> $16 = linearized(ind {}, stride {})
  <gen*x1> $17 = [S1dense][dense]::lookup($15, $16) activate = false
  <i32*x1> $18 = get child [S1dense->S2place_i32] $17
  <i32 x1> $19 = shuffle $18[0]
  <i32 x1> $20 : global store [$19 <- $10]
}
==========
after simplify
==========
kernel {
  <i32 x1> $0 = alloca
  <i32 x1> $1 = const [0]
  <i32 x1> $2 : local store [$0 <- $1]
  <i32 x1> $3 = alloca
  <i32 x1> $4 = const [10]
  for $3 in range($1, $4, step 1) {
    <i32 x1> $6 = const [2]
    <i32 x1> $7 = local load [ [$3[0]]]
    <i32 x1> $8 = pow $6 $7
    <i32 x1> $9 = atomic bit_or($0, $8)
  }
  <i32 x1> $10 = local load [ [$0[0]]]
  <i32*x1> $11 = global ptr [S2place_i32], index [] activate=true
  <gen*x1> $12 = get root
  <gen*x1> $14 = [S0root][root]::lookup($12, $1) activate = false
  <gen*x1> $15 = get child [S0root->S1dense] $14
  <gen*x1> $17 = [S1dense][dense]::lookup($15, $1) activate = false
  <i32*x1> $18 = get child [S1dense->S2place_i32] $17
  <i32 x1> $20 : global store [$18 <- $10]
}
==========

They are in the same block, so it would be natural to eliminate the duplicate const [0]s. However, the block is later divided into several offloaded tasks.

@yuanming-hu do you have any idea how to systematically deal with this?

xumingkuan commented

Shall we copy the constants across offloaded blocks in offload()? Are there any other issues?
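A toy sketch of this idea in plain Python (the Stmt class and list-of-lists task representation are simplifications, not Taichi's actual C++ pass code): since consts are pure, cloning one into every task that uses it never changes semantics.

from dataclasses import dataclass, field

@dataclass
class Stmt:
    op: str                                 # e.g. "const", "lookup", "store"
    operands: list = field(default_factory=list)

def localize_constants(tasks):
    # For each offloaded task, clone any const that is used here but
    # defined in another task, then rewrite the uses so that no operand
    # crosses a task boundary.
    for task in tasks:
        defined = {id(s) for s in task}
        clones = {}
        for s in task:
            for i, op in enumerate(s.operands):
                if isinstance(op, Stmt) and op.op == "const" and id(op) not in defined:
                    if id(op) not in clones:
                        clones[id(op)] = Stmt("const", list(op.operands))
                    s.operands[i] = clones[id(op)]
        task[:0] = list(clones.values())    # define the clones at the top of the task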

yuanming-hu commented

Sorry, I was in a meeting. Let me see what could be wrong here.

yuanming-hu commented Apr 8, 2020

Looks like a bug in offload. I'll fix that tonight.

k-ye commented Apr 8, 2020

Thank you all for the help! Yes, I will take a look at the advanced optimizations after work...

yuanming-hu commented

With the fix in #730 I get

==========
kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 4 B)
    <i32*x1> $5 : global store [$4 <- $2]
    <i32*x1> $6 = global tmp var (offset = 0 B)
    <i32*x1> $7 = global tmp var (offset = 4 B)
    <i32 x1> $8 = global load $7
    <i32*x1> $9 : global store [$6 <- $8]
  }
  $10 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $11 = const [2]
    <i32 x1> $12 = loop index 0
    <i32 x1> $13 = pow $11 $12
    <i32*x1> $14 = global tmp var (offset = 0 B)
    <i32 x1> $15 = atomic bit_or($14, $13)
  }
  $16 = offloaded  {
    <i32*x1> $17 = global tmp var (offset = 0 B)
    <i32 x1> $18 = global load $17
    <gen*x1> $19 = get root
    <i32*x1> $20 = global tmp var (offset = 4 B)
    <i32 x1> $21 = global load $20
    <gen*x1> $22 = [S0root][root]::lookup($19, $21) activate = false
    <gen*x1> $23 = get child [S0root->S1dense] $22
    <i32*x1> $24 = global tmp var (offset = 4 B)
    <i32 x1> $25 = global load $24
    <gen*x1> $26 = [S1dense][dense]::lookup($23, $25) activate = false
    <i32*x1> $27 = get child [S1dense->S2place_i32] $26
    <i32*x1> $28 : global store [$27 <- $18]
  }
}
==========

which looks more reasonable...

xumingkuan commented

(Some random thoughts for further optimization: are the following two IRs the same?)

$0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
}

and

$0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]
    <i32*x1> $3 : global store [$1 <- $2]
}

yuanming-hu commented

Yes - clearly we have more opportunities for optimizations now...

k-ye commented Apr 9, 2020

(Sorry, I don't have time for real coding right now.)

With the fix in #730 I get

One concern I have is that this turns a compile-time constant into a runtime one, which now needs to read global memory.

FYI, yesterday I got around this problem by predefining all the constants at the global level. Thanks to SSA + unique stmt names, this worked fine across different offloaded tasks. I think this can be done for GLSL as well? However, that puts restrictions on how backends can be implemented. Maybe it would be better to split ConstStmt into a ConstDefStmt and a ConstRefStmt.
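A minimal sketch of that workaround (the real Metal codegen is C++; the helper and parameter names here are illustrative): every ConstStmt in the kernel is hoisted to file scope in the generated source, so any per-task kernel function can reference it by its unique SSA name.

def emit_metal_source(consts, task_bodies):
    # consts: [(ssa_name, metal_type, literal)], e.g. ("tmp2", "int32_t", "0")
    # task_bodies: the already-generated body string of each offloaded task
    lines = [f"constant {ty} {name} = {lit};" for name, ty, lit in consts]
    for i, body in enumerate(task_bodies):
        lines.append(f"kernel void offloaded_task_{i}(device char *root [[buffer(0)]]) {{")
        lines.append(body)
        lines.append("}")
    return "\n".join(lines)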

That said, I think we should patch in #730 first, then create a separate issue to track the performance impact of this change?

(Some random thoughts for further optimization: are the following two IRs the same?)

I don't think so? What if another kernel doesn't know about global $1 and reads global $4?

Maybe it would be better to rename it to OffLoadedTask to explicitly show that it's not a plain statement.

+N. I think this has come up a few times now while working on the IR...

xumingkuan commented

(Some random thoughts for further optimization: are the following two IRs the same?)

I don't think so? What if another kernel doesn't know about global $1 and reads global $4?

I think other kernels cannot access the labels $1 or $4 in this kernel. Within the other offloads of this kernel, however, we should also change all global $4 to global $1 when this optimization is done.

xumingkuan commented

With the fix in #730 I get

One concern I have is that this turns a compile-time constant into a runtime one, which now needs to read global memory.

Oh yes... It's indeed non-trivial to optimize them back into compile-time constants: we don't know whether other kernels modify the global memory between the offloads of this kernel.

k-ye commented Apr 9, 2020

however, we should also change all global $4 to global $1 when this optimization is done.

Sorry for my confusing terms. By "kernel" I was referring to an offloaded task within the same Taichi kernel... I agree with your claim, but it seems like there would be lots of edge cases, e.g. some offloaded task reads global ($4 + 0). (I don't think this is actually a possible output in Taichi, but it's not forbidden in theory.)

xumingkuan commented

however, we should also change all global $4 to global $1 when this optimization is done.

Sorry for my confusing terms. By "kernel" I was referring to an offloaded task within the same Taichi kernel... I agree with your claim, but it seems like there would be lots of edge cases, e.g. some offloaded task reads global ($4 + 0). (I don't think this is actually a possible output in Taichi, but it's not forbidden in theory.)

"change all global $4 to global $1" could be done with irpass::replace_all_usages_with(). It automatically handles these cases.

k-ye commented Apr 9, 2020

Ah, I see where I was thinking wrong. I just noticed that in the example you gave, both $1 and $4 point to the same byte offset... Sorry about that.

k-ye commented Apr 9, 2020

Re #729 (comment)

Right, after disabling it I got the same IR, and Metal works fine.

k-ye commented Apr 14, 2020

Closed by #730.
