IR optimization should not eliminate stmts across the offloaded task boundaries? #729
Comments
Maybe introduced in #510? This problem would never occur if we treated OffloadedStmt as a block like IfStmt.
I disagree. We have to launch each offloaded task as a separate kernel, so that the GPU backends can properly establish the memory access dependencies between these kernels. Otherwise, if you have two offloaded tasks in the same kernel/shader that write/read the same element, it becomes very difficult to keep the data synchronized across GPU threads.
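As a rough illustration of why the task boundary matters, here is a hand-written, MSL-flavored sketch; the kernel names and the 8-element buffer are made up for this example and are not what Taichi actually emits:

```cpp
// Hypothetical generated Metal code for one Taichi kernel with two offloaded
// tasks. The host enqueues the two kernel functions one after another, and
// that launch boundary is what guarantees task 1 observes every write that
// task 0 made to the buffer. A single fused kernel function could not provide
// this ordering, since there is no barrier spanning all threadgroups.
#include <metal_stdlib>
using namespace metal;

kernel void mtl_k0001_task0(device int *x [[buffer(0)]],
                            uint i [[thread_position_in_grid]]) {
  x[i] = int(i);       // task 0: each thread writes its own element
}

kernel void mtl_k0001_task1(device int *x [[buffer(0)]],
                            uint i [[thread_position_in_grid]]) {
  x[i] += x[7 - i];    // task 1: reads elements written by *other* threads
                       // in task 0 (a grid of 8 threads is assumed)
}
```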
Probably not.
As I mentioned, it's not broken at HEAD, only in my local repo.
Thanks for pointing this out. I find that if I disable the recently introduced
Could you confirm that by setting
to false, this issue is gone on your end? @k-ye This will help us narrow down the potential sources. Thanks.
@xumingkuan could you take a look at this? Thanks.
Thanks for pointing this out -- I wasn't aware of the optimization across offloaded task boundaries. Maybe it would be better to rename it to
However, I didn't write an optimization to eliminate constants, so I still need to find where the bug was introduced.
If we remove all advanced optimizations but lowering
They are in the same block, so it would be natural to eliminate duplicates. @yuanming-hu do you have any idea how to systematically deal with this?
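For intuition, here is a toy, self-contained C++ sketch (these are not Taichi's real IR classes) of how a de-duplication pass running over one flat block ends up rewiring a use in the second task to a constant defined in the first:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Toy statement: just enough structure to show the effect.
struct Stmt {
  std::string op;   // "task_begin", "const", or "store"
  int value = 0;    // payload for "const"
  int operand = -1; // index of the statement a "store" reads from
};

int main() {
  // $0..$5: two offloaded tasks, each defining the constant 1.
  std::vector<Stmt> block = {
      {"task_begin"}, {"const", 1}, {"store", 0, 1},
      {"task_begin"}, {"const", 1}, {"store", 0, 4}};

  std::map<int, int> first_def; // constant value -> first defining index
  std::map<int, int> replace;   // duplicate index -> surviving index
  for (int i = 0; i < (int)block.size(); i++) {
    if (block[i].op != "const") continue;
    if (first_def.count(block[i].value) == 0)
      first_def[block[i].value] = i;
    else
      replace[i] = first_def[block[i].value]; // crosses the boundary at $3!
  }
  for (auto &s : block)
    if (s.op == "store" && replace.count(s.operand))
      s.operand = replace[s.operand];

  // After the pass, $5 reads $1, i.e. a constant defined in the other task.
  for (int i = 0; i < (int)block.size(); i++)
    std::printf("$%d: %s operand=$%d\n", i, block[i].op.c_str(),
                block[i].operand);
  return 0;
}
```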
Shall we copy the constants across offloaded blocks in
Sorry, I was in a meeting. Let me see what could be wrong here.
Looks like a bug in
Thank you all for the help! Yes, I will take a look around the advanced optimizations after work...
With the fix in #730 I get
which looks more reasonable... |
(Some random thoughts for further optimization: are the following two IRs the same?)
and
Yes - clearly we have more opportunities for optimizations now...
(Sorry, I don't have time for real coding now.)
One concern I have is that this turns a compile-time constant into a runtime one, and it now needs to read global memory. FYI, yesterday I got around this problem by predefining all the constants at the global level. Thanks to SSA + unique stmt names, this worked fine across different offloaded tasks. I think this can be done for GLSL as well? However, that puts restrictions on how backends can be implemented. Maybe it would be better to split into
That said, I think we should patch in #730, then create a separate issue to track the performance impact of this?
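Concretely, what I mean by predefining constants at the global level is something like the following MSL-flavored sketch (names are made up; this is not the exact code the Metal backend generates):

```cpp
// Every ConstStmt is emitted once at program scope in the `constant` address
// space, so any per-task kernel function can reference it by its unique SSA
// name, no matter which offloaded task the optimizer left the definition in.
// Program-scope `constant` values with literal initializers remain
// compile-time constants, so this avoids a global-memory load.
#include <metal_stdlib>
using namespace metal;

constant int tmp2 = 1;  // was `const auto tmp2 = 1;` inside the first task

kernel void mtl_k0001_task0(device int *x [[buffer(0)]],
                            uint i [[thread_position_in_grid]]) {
  x[i] = tmp2;          // visible here
}

kernel void mtl_k0001_task2(device int *x [[buffer(0)]],
                            uint i [[thread_position_in_grid]]) {
  x[i] |= tmp2;         // and here, with no per-function definition needed
}
```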
I don't think so? What if another kernel doesn't know about
+N. I think this has come up a few times now when we work on the IR...
I think other kernels cannot access the labels
Oh yes... It's indeed non-trivial to optimize them back into compile-time constants. We don't know whether other kernels modify the global memory between the offloads in this kernel.
Sorry for my confusing terms. By kernel I was referring to an offloaded task within the same Taichi kernel... I agree with your claim, but it seems like there would be lots of edge cases... E.g., some offloaded task reads
"change all global $4 to global $1" could be done with |
Ah, I know where my thinking went wrong. I just noticed that in the example you gave, both
Right, after disabling it I got the same IR, and Metal is fine.
Closed by #730.
Describe the bug
I noticed that the latest IR passes could eliminate a constant across the offloaded task boundaries. This unfortunately broke the Metal backend. (Note that this is not broken at `HEAD`; it only broke my local work, as I'm replacing Metal's own passes with `compile_to_offloads()`.)

On Metal, each offloaded task is created as a separate kernel function, and a `ConstStmt` is just translated to `const auto {stmt->name()} = {stmt->val.stringify()};`.
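To make the failure concrete, here is a hand-written, MSL-flavored sketch of the kind of source this produces (the kernel and variable names are invented for illustration, not the backend's exact output), and it intentionally does not compile:

```cpp
// The ConstStmt is emitted inside the first task's kernel function, but after
// the cross-task elimination the third task still refers to that name, which
// does not exist in its own function scope, so the generated shader fails to
// compile.
#include <metal_stdlib>
using namespace metal;

kernel void mtl_k0001_task0(device int *x [[buffer(0)]],
                            uint i [[thread_position_in_grid]]) {
  const auto tmp2 = 1;  // ConstStmt $2, local to this function
  x[i] = tmp2;
}

kernel void mtl_k0001_task2(device int *x [[buffer(0)]],
                            uint i [[thread_position_in_grid]]) {
  x[i] |= tmp2;         // error: use of undeclared identifier 'tmp2'
}
```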
I wonder if this optimization is only done for constants? If so, maybe I can predefine all the `ConstStmt`s as global constants in the generated source code. But if it is also applied to non-constant statements, then it's probably going to be a real bug. Also, it doesn't break the LLVM backends. Is this because LLVM inlines the constants directly before producing the machine code?

Log/Screenshots
Detected from `test_local_atomics.py::test_implicit_local_atomic_or`.

Good:

Broken:

`$2` in the first offloaded task is referenced in the third one.