Manually fuse some broadcast expressions in diagnostic edmf #3366
Conversation
Force-pushed from 6de0cbb to b32468f
Is this comment still relevant:

```julia
# Using constant exponents in broadcasts allocate, so we use
# local_geometry_halflevel.J * local_geometry_halflevel.J instead.
# See ClimaCore.jl issue #1126.
```

?
It's only relevant when
Force-pushed from b32468f to 4a528bc
The invalidations failure is unrelated.
Thank you! Could you explain more on how this is better for performance?
Yes, of course: These broadcast expressions are inside a loop over the number of vertical levels, and each broadcast expression results in a CUDA kernel launch. Most of our kernels are memory bound (reading variables into registers, and storing them), not compute bound, so we can estimate their performance by counting reads and writes. Here is an example:

Code block A:

```julia
@. x = y          # 1 write, 1 read
@. x = x + 2*y    # 1 write, 2 reads
```

Code block B:

```julia
@. x = y + 2*y    # 1 write, 1 read
```

Assuming reads and writes are equally expensive (they are, roughly), the total reads/writes in code blocks (A, B) are (5, 2). So, code block B should be ~2.5x faster than A.

To give a bit more balance to this explanation: one case where splitting kernels can be beneficial is when a kernel is very complex (for lack of a concrete example, I'll skip giving one). In such cases, the kernel may need so many registers that register spilling occurs, and the resulting memory access can be so inefficient that splitting it into smaller, simpler kernels is faster, despite the redundant loads/stores.
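The counting argument above can be sketched as a tiny cost model. This is a hypothetical back-of-envelope script, not code from the PR; `kernel_cost` is an illustrative helper, and the "reads + writes" cost is the memory-bound approximation described above.

```python
# Back-of-envelope model: each broadcast kernel is memory bound, so its
# cost is approximated by (array reads + array writes) per kernel launch.

def kernel_cost(reads, writes):
    # Assume reads and writes are (roughly) equally expensive.
    return reads + writes

# Code block A: two kernel launches.
#   @. x = y        -> 1 write, 1 read
#   @. x = x + 2*y  -> 1 write, 2 reads
cost_A = kernel_cost(reads=1, writes=1) + kernel_cost(reads=2, writes=1)

# Code block B: one fused kernel launch.
#   @. x = y + 2*y  -> 1 write, 1 read (y is only loaded once)
cost_B = kernel_cost(reads=1, writes=1)

print(cost_A, cost_B, cost_A / cost_B)  # 5 2 2.5
```

Under this model, block B does 2 memory operations where block A does 5, hence the ~2.5x estimate.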
Actually, #2951 (comment) may be an example where breaking kernels up resulted in a speedup; those were pretty complex functions, though.
Thanks @charleskawczynski! This is helpful. Just to make sure I understand: by wrapping the computation into a function, the compiler internally fuses the broadcasted expressions?
The way I think about it is that this change moves from having multiple broadcasted expressions (multiple dots) to only one, which is faster for the reasons well described above.
We don't technically have to put it into a function. For example:

```julia
foo(y, z) = y + z + z*y - z^2*y
@. x = foo(y, z)
```

and

```julia
@. x = y + z + z*y - z^2*y
```

will result in the same number of reads/writes. The key ingredient is that there will only be a single read/write per variable in a single broadcast expression. Peeking under the hood, the reason is that within one broadcast expression each variable is read into a register once, and that register is reused for every occurrence of the variable. If we instead split the expression:

```julia
@. x = y + z
@. x += z*y - z^2*y
```

then the compiler cannot hoist those reads/writes, and we pay for every required read/write per broadcast expression.
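To make the fused-vs-split comparison concrete, here is a hypothetical sketch (again a back-of-envelope model, not PR code; `broadcast_cost` is an illustrative helper) that counts one load per distinct input and one store per distinct output of each broadcast expression, matching the "single read/write per variable" rule above.

```python
# Model: within one fused broadcast, each variable is loaded (or stored)
# at most once, no matter how many times it appears in the expression.

def broadcast_cost(read_vars, write_vars):
    # One load per distinct input, one store per distinct output.
    return len(set(read_vars)) + len(set(write_vars))

# Fused:  @. x = y + z + z*y - z^2*y   -> reads {y, z}, writes {x}
fused = broadcast_cost(["y", "z", "z", "y", "z", "y"], ["x"])

# Split:  @. x = y + z                 -> reads {y, z},    writes {x}
#         @. x += z*y - z^2*y          -> reads {x, y, z}, writes {x}
split = (broadcast_cost(["y", "z"], ["x"])
         + broadcast_cost(["x", "y", "z"], ["x"]))

print(fused, split)  # 3 7
```

So the split version pays for 7 memory operations where the fused version pays for 3, before even counting the extra kernel-launch overhead.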
Thanks, @charleskawczynski. I started a build here to measure the overall impact! |
It's probably not huge, but piling on a handful of these may add up. |
This PR manually fuses some broadcast expressions, which will lead to `Nv`-fewer kernel launches per fused expression. I think we should probably strive to manually fuse more of these, where it's easy/possible. This is a good example in that we're computing a variable and then simply updating it in subsequent broadcasts.

cc @sriharshakandala