Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve FieldVector broadcasting, and other potential speed-ups #275

Closed
dennisYatunin opened this issue Oct 29, 2021 · 5 comments
Closed
Labels
enhancement New feature or request performance

Comments

@dennisYatunin
Copy link
Member

I spent some time generating flame graphs for my tests, and I found a quick way to significantly speed up the ODE solve---forcing broadcasts over FieldVectors to fall back on broadcasts over the underlying parent arrays. Specifically, OrdinaryDiffEq.jl does element-wise operations over the solution vector and other similar vectors, which in our case are all FieldVectors. As the flame graphs below show, if the broadcasting happens according to the current code in ClimaCore.jl, these operations over FieldVectors are the slowest part of the computation.

To quickly fix this, I made the following patch:

import Base: copyto!
using Base.Broadcast: Broadcasted, BroadcastStyle
transform_broadcasted(bc::Broadcasted{Fields.FieldVectorStyle}, symb, axes) =
    Broadcasted(bc.f, map(arg -> transform_broadcasted(arg, symb, axes), bc.args), axes)
transform_broadcasted(fv::Fields.FieldVector, symb, axes) =
    parent(getproperty(fv, symb))
transform_broadcasted(x, symb, axes) = x
@inline function Base.copyto!(
    dest::Fields.FieldVector,
    bc::Broadcasted{Fields.FieldVectorStyle},
)
    for symb in propertynames(dest)
        p = parent(getproperty(dest, symb))
        copyto!(p, transform_broadcasted(bc, symb, axes(p)))
    end
    return dest
end

I initially tried to avoid using parent() and used Fields.field_values() instead. However, this resulted in errors being thrown; e.g., if w1 and w2 are both DataLayouts of Geometry.Cartesian3Vectors, then w1 ./ w2 throws an error, and this sort of operation is performed internally by OrdinaryDiffEq.jl.

In addition, I wanted to avoid assuming that every broadcasted object has the same axes as the destination parent arrays. I initially wrote the first method of transform_broadcasted() as

using Base.Broadcast: broadcasted
transform_broadcasted(bc::Broadcasted{Fields.FieldVectorStyle}, symb) =
    broadcasted(bc.f, map(arg -> transform_broadcasted(arg, symb, axes), bc.args))

However, this caused a lot of unnecessary memory allocations, and did not speed things up as much as the version shown above. Ideally, the solution to this issue would properly deal with nested broadcast axes without unnecessary allocations. One idea I had was to change the definition of axes(::FieldVector) to be a tuple of the axes of the underlying parent arrays, in which case that first method of transform_broadcasted() could be

transform_broadcasted(bc::Broadcasted{Fields.FieldVectorStyle}, symb, index) =
    Broadcasted(bc.f, map(arg -> transform_broadcasted(arg, symb, axes), bc.args), bc.axes[index])

Putting aside the problem that my temporary patch will break in more complex scenarios, I will now show my flame graph results. Here are the flame graphs for my IMEX solver, which I ran for 5000 seconds of simulation time:
full_flame_graphs_overlay
The graph on the left is without my patch, and the graph on the right is with my patch. I've added colored blocks to the graph that correspond to particular points in the computation:

  • blue---the C function madvise()
  • purple---the C function memmove()
  • orange---the Julia function Core.Compiler.typeinf()
  • yellow---inscrutable C functions that have "llvm" in their names
  • green---the actual call to OrdinaryDiffEq.solve!()

The orange and yellow blocks appear to be caused by type instability (they don't change due to the broadcasting patch, and they are the same size no matter how many times the code is run). I'm not sure if this is something we can fix by changing how FieldVectors are defined, or if this is just a problem with OrdinaryDiffEq.jl. They are very significant parts of the computation, though, so we should look into this.

The blue and purple blocks are probably related to memory allocations and garbage collection, since they shrink along with the green block because of the broadcasting patch.

Here are the same flame graphs, but zoomed in on the green blocks:
zoomed_flame_graphs_overlay
The new colored blocks are:

  • brown---broadcasts over FieldVectors
  • green---fill!() called on FieldVectors
  • red---recursivecopy!() called on FieldVectors
  • purple---calls to the implicit tendency function
  • pink---calls to the non-implicit tendency function
  • blue---calls to the full tendency function
  • gray---the Schur complement linear solve
  • white---the Jacobian (Wfact or Wfact_t) update

The broadcasting patch clearly transforms the brown blocks from the dominant part of the computation to an insignificant part of the computation. If we correctly generalize this patch, then we could similarly make the green and red blocks insignificant.

It is worth noting that the gray and white blocks are significantly smaller than the pink, purple, and blue blocks. This is because, in this particular example, the gray and white blocks only involve broadcasts over parent arrays, while the purple, pink, and blue blocks involve broadcasts over fields. I've since rewritten my implicit tendency function to only use parent arrays, which made the purple block two orders of magnitude smaller. So, there is still quite a bit of optimization that can be done for broadcasting over fields.

@simonbyrne @jakebolewski

@bischtob
Copy link
Contributor

bischtob commented Nov 3, 2021

@kpamnany

@bischtob
Copy link
Contributor

bischtob commented Nov 5, 2021

@dennisYatunin, what is the potential speedup here? It looks like a factor of 2, but you had some analysis that suggested a higher factor. In order to prioritize this, it would be good to know the max we can expect.

@kpamnany
Copy link
Contributor

kpamnany commented Nov 5, 2021

Right, not sure if I should prioritize this ahead of distributed ClimaCore. I should be able to take a look at this late next week or so.

@dennisYatunin
Copy link
Member Author

@bischtob @kpamnany Sorry, I should have been clearer about the potential benefits. The issue above illustrates three slowdowns---FieldVector broadcasts, Field broadcasts, and (possibly) type instability. The patch I give above speeds up the FieldVector broadcasts by ~2 orders of magnitude, though this only speeds up the overall computation by a factor of 2--3 because of the other two slowdowns. I also mentioned that I sped up the Field broadcasts by the same ~2 orders of magnitude by just rewriting them in terms of parent arrays, though I did not show the resulting flame graphs because that is not an acceptable long-term solution. My current computation is more than an order of magnitude faster than the original version, and one of the biggest things still slowing it down is the (possible) type instability.

bors bot added a commit to CliMA/TurbulenceConvection.jl that referenced this issue Nov 10, 2021
555: Performance patch r=charleskawczynski a=charleskawczynski

This is a peel off PR from #473. This PR adds a few ``@inbounds`` to some loops, and a performance patch for ClimaCore fields. See [ClimaCore.jl's 275](CliMA/ClimaCore.jl#275)

Co-authored-by: Charles Kawczynski <[email protected]>
@jakebolewski jakebolewski added the enhancement New feature or request label Dec 8, 2021
@charleskawczynski
Copy link
Member

I think this was closed by #985.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

No branches or pull requests

5 participants