
Manually fuse some broadcast expressions in diagnostic edmf #3366

Merged · 1 commit · Oct 8, 2024

Conversation

charleskawczynski
Member

This PR manually fuses some broadcast expressions, which will lead to Nv fewer kernel launches per fused expression (Nv being the number of vertical levels). I think that we should probably strive to manually fuse more of these where it's easy/possible. This is a good example in that we're computing a variable and then simply updating it in subsequent broadcasts.

cc @sriharshakandala.

Member

@Sbozzolo Sbozzolo left a comment


Is this comment still relevant:

            # Using constant exponents in broadcasts allocate, so we use
            # local_geometry_halflevel.J * local_geometry_halflevel.J instead.
            # See ClimaCore.jl issue #1126.

?

@charleskawczynski
Member Author

Is this comment still relevant...

It's only relevant when ^ appears directly in a broadcast expression; if ^ is inside a function call within a broadcast expression, then there are no allocation issues.
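To illustrate the distinction, here is a minimal sketch (hypothetical variable names; the allocation behavior of a literal exponent in a broadcast is as described in ClimaCore.jl issue #1126):

```julia
# Hypothetical arrays, for illustration only.
J = rand(10)
out = similar(J)

# Potentially allocating: the constant exponent sits directly in the broadcast.
# @. out = J^2

# Workaround used in the code comment quoted above: spell out the product.
@. out = J * J

# Also fine: `^` is hidden inside a function call within the broadcast.
sq(x) = x^2
@. out = sq(J)
```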

@charleskawczynski
Member Author

The invalidations failure is unrelated

@charleskawczynski charleskawczynski added this pull request to the merge queue Oct 8, 2024
Member

@szy21 szy21 left a comment


Thank you! Could you explain more on how this is better for performance?

Merged via the queue into main with commit 59fd53d Oct 8, 2024
15 of 16 checks passed
@charleskawczynski charleskawczynski deleted the ck/manually_fuse branch October 8, 2024 17:41
@charleskawczynski
Member Author

Thank you! Could you explain more on how this is better for performance?

Yes, of course:

These broadcast expressions are inside a loop over the number of vertical levels, and each broadcast expression results in a CUDA kernel launch. Most of our kernels are memory bound (reading variables into registers, and storing them), not compute bound, so we can estimate their performance by counting reads and writes. Here is an example:

Code block A:
@. x = y         # 1 write, 1 read
@. x = x + 2*y   # 1 write, 2 reads

Code block B:
@. x = y + 2*y   # 1 write, 1 read

Assuming reads and writes are equally expensive (they are, roughly), the total reads/writes in code blocks A and B are 5 and 2, respectively. So, code block B should be ~2.5x faster than A.
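As a back-of-envelope sketch (illustrative only, not from the PR itself), the estimate above can be written out as:

```julia
# Memory transactions per broadcast, assuming reads and writes cost the same.
traffic_A = (1 + 1) + (1 + 2)  # @. x = y (1 write, 1 read); @. x = x + 2*y (1 write, 2 reads)
traffic_B = 1 + 1              # @. x = y + 2*y (1 write, 1 read)
estimated_speedup = traffic_A / traffic_B  # 5 / 2 = 2.5
```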

To give a bit more balance to this explanation: one case where splitting kernels can be beneficial is when a kernel is very complex (for lack of a concrete example, I'll skip giving one). In such cases, the kernel may need many registers, resulting in register spilling, and this can lead to memory access so inefficient that splitting the work into smaller and simpler kernels may be faster, despite the redundant loads/stores.

@charleskawczynski
Member Author

Actually, #2951 (comment) may be an example where breaking kernels up resulted in a speedup; those were pretty complex functions, though.

@szy21
Member

szy21 commented Oct 8, 2024

Thanks @charleskawczynski! This is helpful. Just to make sure I understand, by wrapping the computation into a function, the compiler internally fuses the broadcasted expressions?

@Sbozzolo
Member

Sbozzolo commented Oct 8, 2024

Thanks @charleskawczynski! This is helpful. Just to make sure I understand, by wrapping the computation into a function, the compiler internally fuses the broadcasted expressions?

The way I think about it is that this change moves from having multiple broadcast expressions (multiple dots) to only one (which is faster for the reasons well described above).

@charleskawczynski
Member Author

charleskawczynski commented Oct 8, 2024

Thanks @charleskawczynski! This is helpful. Just to make sure I understand, by wrapping the computation into a function, the compiler internally fuses the broadcasted expressions?

We don't technically have to put it into a function. For example:

foo(y, z) = y + z + z*y - z^2*y
@. x = foo(y, z)

and

@. x = y + z + z*y - z^2*y

will result in the same number of reads/writes. The key ingredient is that there will only be a single read/write per variable in a single broadcast expression.

Peeking under the hood, the reason is that all variables, and all getindex calls for those variables, exist inside the same CUDA kernel, so the compiler can hoist these getindex calls. If we have two broadcast expressions:

@. x = y + z
@. x += z*y - z^2*y

then the compiler cannot hoist those reads/writes, and we pay for every required read/write per broadcast expression.
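To make the equivalence concrete, here is a small sketch (hypothetical arrays) showing that the fused and split forms compute the same result while differing in the number of broadcast passes:

```julia
# Hypothetical data, for illustration only.
y = rand(5); z = rand(5)
x_fused = similar(y); x_split = similar(y)

# One broadcast pass: y and z are each read once, x is written once.
@. x_fused = y + z + z*y - z^2*y

# Two broadcast passes: y and z are each read twice, x is written twice
# (and read once in the second pass).
@. x_split = y + z
@. x_split += z*y - z^2*y

x_fused ≈ x_split  # true: same result, different memory traffic
```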

@sriharshakandala
Member

Thanks, @charleskawczynski. I started a build here to measure the overall impact!
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/348#_

@charleskawczynski
Member Author

It's probably not huge, but piling on a handful of these may add up.
