Improve FieldVector broadcasting, and other potential speed-ups #275

dennisYatunin · 2021-10-29T00:55:05Z

I spent some time generating flame graphs for my tests, and I found a quick way to significantly speed up the ODE solve---forcing broadcasts over FieldVectors to fall back on broadcasts over the underlying parent arrays. Specifically, OrdinaryDiffEq.jl does element-wise operations over the solution vector and other similar vectors, which in our case are all FieldVectors. As the flame graphs below show, if the broadcasting happens according to the current code in ClimaCore.jl, these operations over FieldVectors are the slowest part of the computation.

To quickly fix this, I made the following patch:

import Base: copyto!
using Base.Broadcast: Broadcasted, BroadcastStyle
transform_broadcasted(bc::Broadcasted{Fields.FieldVectorStyle}, symb, axes) =
    Broadcasted(bc.f, map(arg -> transform_broadcasted(arg, symb, axes), bc.args), axes)
transform_broadcasted(fv::Fields.FieldVector, symb, axes) =
    parent(getproperty(fv, symb))
transform_broadcasted(x, symb, axes) = x
@inline function Base.copyto!(
    dest::Fields.FieldVector,
    bc::Broadcasted{Fields.FieldVectorStyle},
)
    for symb in propertynames(dest)
        p = parent(getproperty(dest, symb))
        copyto!(p, transform_broadcasted(bc, symb, axes(p)))
    end
    return dest
end

I initially tried to avoid using parent() and used Fields.field_values() instead. However, this resulted in errors being thrown; e.g., if w1 and w2 are both DataLayouts of Geometry.Cartesian3Vectors, then w1 ./ w2 throws an error, and this sort of operation is performed internally by OrdinaryDiffEq.jl.

In addition, I wanted to avoid assuming that every broadcasted object has the same axes as the destination parent arrays. I initially wrote the first method of transform_broadcasted() as

using Base.Broadcast: broadcasted
transform_broadcasted(bc::Broadcasted{Fields.FieldVectorStyle}, symb) =
    broadcasted(bc.f, map(arg -> transform_broadcasted(arg, symb, axes), bc.args))

However, this caused a lot of unnecessary memory allocations, and did not speed things up as much as the version shown above. Ideally, the solution to this issue would properly deal with nested broadcast axes without unnecessary allocations. One idea I had was to change the definition of axes(::FieldVector) to be a tuple of the axes of the underlying parent arrays, in which case that first method of transform_broadcasted() could be

transform_broadcasted(bc::Broadcasted{Fields.FieldVectorStyle}, symb, index) =
    Broadcasted(bc.f, map(arg -> transform_broadcasted(arg, symb, axes), bc.args), bc.axes[index])

Putting aside the problem that my temporary patch will break in more complex scenarios, I will now show my flame graph results. Here are the flame graphs for my IMEX solver, which I ran for 5000 seconds of simulation time:

The graph on the left is without my patch, and the graph on the right is with my patch. I've added colored blocks to the graph that correspond to particular points in the computation:

blue---the C function madvise()
purple---the C function memmove()
orange---the Julia function Core.Compiler.typeinf()
yellow---inscrutable C functions that have "llvm" in their names
green---the actual call to OrdinaryDiffEq.solve!()

The orange and yellow blocks appear to be caused by type instability (they don't change due to the broadcasting patch, and they are the same size no matter how many times the code is run). I'm not sure if this is something we can fix by changing how FieldVectors are defined, or if this is just a problem with OrdinaryDiffEq.jl. They are very significant parts of the computation, though, so we should look into this.

The blue and purple blocks are probably related to memory allocations and garbage collection, since they shrink along with the green block because of the broadcasting patch.

Here are the same flame graphs, but zoomed in on the green blocks:

The new colored blocks are:

brown---broadcasts over FieldVectors
green---fill!() called on FieldVectors
red---recursivecopy!() called on FieldVectors
purple---calls to the implicit tendency function
pink---calls to the non-implicit tendency function
blue---calls to the full tendency function
gray---the Schur complement linear solve
white---the Jacobian (Wfact or Wfact_t) update

The broadcasting patch clearly transforms the brown blocks from the dominant part of the computation to an insignificant part of the computation. If we correctly generalize this patch, then we could similarly make the green and red blocks insignificant.

It is worth noting that the gray and white blocks are significantly smaller than the pink, purple, and blue blocks. This is because, in this particular example, the gray and white blocks only involve broadcasts over parent arrays, while the purple, pink, and blue blocks involve broadcasts over fields. I've since rewritten my implicit tendency function to only use parent arrays, which made the purple block two orders of magnitude smaller. So, there is still quite a bit of optimization that can be done for broadcasting over fields.

@simonbyrne @jakebolewski

The text was updated successfully, but these errors were encountered:

bischtob · 2021-11-03T22:00:23Z

@kpamnany

bischtob · 2021-11-05T20:51:58Z

@dennisYatunin, what is the potential speedup here? It looks like a factor of 2, but you had some analysis that suggested a higher factor. In order to prioritize this, it would be good to know the max we can expect.

kpamnany · 2021-11-05T21:05:37Z

Right, not sure if I should prioritize this ahead of distributed ClimaCore. I should be able to take a look at this late next week or so.

dennisYatunin · 2021-11-05T21:24:57Z

@bischtob @kpamnany Sorry, I should have been clearer about the potential benefits. The issue above illustrates three slowdowns---FieldVector broadcasts, Field broadcasts, and (possibly) type instability. The patch I give above speeds up the FieldVector broadcasts by ~2 orders of magnitude, though this only speeds up the overall computation by a factor of 2--3 because of the other two slowdowns. I also mentioned that I sped up the Field broadcasts by the same ~2 orders of magnitude by just rewriting them in terms of parent arrays, though I did not show the resulting flame graphs because that is not an acceptable long-term solution. My current computation is more than an order of magnitude faster than the original version, and one of the biggest things still slowing it down is the (possible) type instability.

555: Performance patch r=charleskawczynski a=charleskawczynski This is a peel off PR from #473. This PR adds a few ``@inbounds`` to some loops, and a performance patch for ClimaCore fields. See [ClimaCore.jl's 275](CliMA/ClimaCore.jl#275) Co-authored-by: Charles Kawczynski <[email protected]>

charleskawczynski · 2023-09-05T17:48:56Z

I think this was closed by #985.

dennisYatunin added the performance label Oct 29, 2021

charleskawczynski mentioned this issue Nov 10, 2021

Performance patch CliMA/TurbulenceConvection.jl#555

Merged

jakebolewski added the enhancement New feature or request label Dec 8, 2021

This was referenced Jan 27, 2022

Define copyto! for diagonalized field vectors #478

Merged

Use better recursion pattern for instantiate, avoid getindex #480

Merged

charleskawczynski mentioned this issue Oct 7, 2022

Fix FieldVector broadcast allocations #985

Merged

charleskawczynski closed this as completed Sep 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve FieldVector broadcasting, and other potential speed-ups #275

Improve FieldVector broadcasting, and other potential speed-ups #275

dennisYatunin commented Oct 29, 2021

bischtob commented Nov 3, 2021

bischtob commented Nov 5, 2021 •

edited

Loading

kpamnany commented Nov 5, 2021

dennisYatunin commented Nov 5, 2021

charleskawczynski commented Sep 5, 2023

Improve FieldVector broadcasting, and other potential speed-ups #275

Improve FieldVector broadcasting, and other potential speed-ups #275

Comments

dennisYatunin commented Oct 29, 2021

bischtob commented Nov 3, 2021

bischtob commented Nov 5, 2021 • edited Loading

kpamnany commented Nov 5, 2021

dennisYatunin commented Nov 5, 2021

charleskawczynski commented Sep 5, 2023

bischtob commented Nov 5, 2021 •

edited

Loading