Manually fuse some broadcast expressions in diagnostic edmf #3366
Conversation
Force-pushed from 6de0cbb to b32468f
Is this comment still relevant:

```julia
# Using constant exponents in broadcasts allocate, so we use
# local_geometry_halflevel.J * local_geometry_halflevel.J instead.
# See ClimaCore.jl issue #1126.
```

?
It's only relevant when
Force-pushed from b32468f to 4a528bc
The invalidations failure is unrelated.
Thank you! Could you explain more on how this is better for performance?
Yes, of course: These broadcast expressions are inside a loop over the number of vertical levels, and each broadcast expression results in a CUDA kernel launch. Most of our kernels are memory bound (reading variables into registers, and storing them), not compute bound, so we can estimate their performance by counting reads and writes. Here is an example:

Code block A:

```julia
@. x = y          # 1 write, 1 read
@. x = x + 2*y    # 1 write, 2 reads
```

Code block B:

```julia
@. x = y + 2*y    # 1 write, 1 read
```

Assuming reads and writes are equally expensive (they are, roughly), the total reads/writes in code blocks (A, B) are (5, 2). So, code block B should be ~2.5x faster than A.

To give a bit more balance to this explanation: one case where splitting kernels can be beneficial is when a kernel is very complex (for lack of a concrete example, I'll skip giving one). In such cases, the kernel may need so many registers that register spilling occurs, and the resulting memory access can be so inefficient that splitting it into smaller, simpler kernels is faster, despite the redundant loads/stores.
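The counting argument above can be sketched as a tiny cost model. This is a hypothetical back-of-envelope script, not code from the PR; `kernel_cost` is an illustrative helper, and the "reads + writes" cost is the memory-bound approximation described above.

```python
# Back-of-envelope model: each broadcast kernel is memory bound, so its
# cost is approximated by (array reads + array writes) per kernel launch.

def kernel_cost(reads, writes):
    # Assume reads and writes are (roughly) equally expensive.
    return reads + writes

# Code block A: two kernel launches.
#   @. x = y        -> 1 write, 1 read
#   @. x = x + 2*y  -> 1 write, 2 reads
cost_A = kernel_cost(reads=1, writes=1) + kernel_cost(reads=2, writes=1)

# Code block B: one fused kernel launch.
#   @. x = y + 2*y  -> 1 write, 1 read (y is only loaded once)
cost_B = kernel_cost(reads=1, writes=1)

print(cost_A, cost_B, cost_A / cost_B)  # 5 2 2.5
```

Under this model, block B does 2 memory operations where block A does 5, hence the ~2.5x estimate.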
Actually, #2951 (comment) may be an example where breaking kernels up resulted in a speedup; those were pretty complex functions, though.
Thanks @charleskawczynski! This is helpful. Just to make sure I understand: by wrapping the computation into a function, the compiler internally fuses the broadcasted expressions?
The way I think about it is that this change moves from having multiple broadcasted expressions (multiple dots) to only one, which is faster for the reasons well described above.
We don't technically have to put it into a function. For example:

```julia
foo(y, z) = y + z + z*y - z^2*y
@. x = foo(y, z)
```

and

```julia
@. x = y + z + z*y - z^2*y
```

will result in the same number of reads/writes. The key ingredient is that there will only be a single read/write per variable in a single broadcast expression. Peeking under the hood, the reason is that within one broadcast expression each variable is read into a register once, and that register is reused for every occurrence of the variable. If we instead split the expression:

```julia
@. x = y + z
@. x += z*y - z^2*y
```

then the compiler cannot hoist those reads/writes, and we pay for every required read/write per broadcast expression.
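To make the fused-vs-split comparison concrete, here is a hypothetical sketch (again a back-of-envelope model, not PR code; `broadcast_cost` is an illustrative helper) that counts one load per distinct input and one store per distinct output of each broadcast expression, matching the "single read/write per variable" rule above.

```python
# Model: within one fused broadcast, each variable is loaded (or stored)
# at most once, no matter how many times it appears in the expression.

def broadcast_cost(read_vars, write_vars):
    # One load per distinct input, one store per distinct output.
    return len(set(read_vars)) + len(set(write_vars))

# Fused:  @. x = y + z + z*y - z^2*y   -> reads {y, z}, writes {x}
fused = broadcast_cost(["y", "z", "z", "y", "z", "y"], ["x"])

# Split:  @. x = y + z                 -> reads {y, z},    writes {x}
#         @. x += z*y - z^2*y          -> reads {x, y, z}, writes {x}
split = (broadcast_cost(["y", "z"], ["x"])
         + broadcast_cost(["x", "y", "z"], ["x"]))

print(fused, split)  # 3 7
```

So the split version pays for 7 memory operations where the fused version pays for 3, before even counting the extra kernel-launch overhead.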
Thanks, @charleskawczynski. I started a build here to measure the overall impact! |
It's probably not huge, but piling on a handful of these may add up. |
This PR manually fuses some broadcast expressions, which will lead to `Nv`-fewer kernel launches per fused expression. I think we should probably strive to manually fuse more of these, where it's easy/possible. This is a good example in that we're computing a variable and then simply updating it in subsequent broadcasts.

cc @sriharshakandala