Improve FieldVector broadcasting, and other potential speed-ups #275
Comments
@dennisYatunin, what is the potential speedup here? It looks like a factor of 2, but you had some analysis that suggested a higher factor. In order to prioritize this, it would be good to know the max we can expect.
Right, not sure if I should prioritize this ahead of distributed ClimaCore. I should be able to take a look at this late next week or so.
@bischtob @kpamnany Sorry, I should have been clearer about the potential benefits. The issue above illustrates three slowdowns: `FieldVector` broadcasts, `Field` broadcasts, and (possibly) type instability. The patch I give above speeds up the `FieldVector` broadcasts by ~2 orders of magnitude, though this only speeds up the overall computation by a factor of 2 to 3 because of the other two slowdowns. I also mentioned that I sped up the `Field` broadcasts by the same ~2 orders of magnitude by just rewriting them in terms of parent arrays, though I did not show the resulting flame graphs because that is not an acceptable long-term solution. My current computation is more than an order of magnitude faster than the original version, and one of the biggest things still slowing it down is the (possible) type instability.
555: Performance patch r=charleskawczynski a=charleskawczynski This is a peel off PR from #473. This PR adds a few ``@inbounds`` to some loops, and a performance patch for ClimaCore fields. See [ClimaCore.jl's 275](CliMA/ClimaCore.jl#275) Co-authored-by: Charles Kawczynski <[email protected]>
I think this was closed by #985.
I spent some time generating flame graphs for my tests, and I found a quick way to significantly speed up the ODE solve: forcing broadcasts over `FieldVector`s to fall back on broadcasts over the underlying parent arrays. Specifically, OrdinaryDiffEq.jl does element-wise operations over the solution vector and other similar vectors, which in our case are all `FieldVector`s. As the flame graphs below show, if the broadcasting happens according to the current code in ClimaCore.jl, these operations over `FieldVector`s are the slowest part of the computation.
To quickly fix this, I made the following patch:
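The patch itself is not preserved in this copy of the issue, and the code below is not that patch. It is only a minimal sketch of the general idea (redirecting a broadcast over a wrapper type to ordinary broadcasts over the arrays it wraps), written against a toy type so that it runs without ClimaCore.jl. `ToyVector`, `ToyStyle`, and `unwrap` are hypothetical names, and overloading `materialize!` rather than `copyto!` is a simplification chosen to keep the sketch short; it makes no attempt to be type-stable.

```julia
import Base.Broadcast: Broadcasted, BroadcastStyle, DefaultArrayStyle,
                       broadcasted, materialize!

# Hypothetical stand-in for a FieldVector: a named collection of plain arrays.
struct ToyVector{T<:NamedTuple}
    parts::T
end

# A dedicated broadcast style, so that any expression containing a ToyVector
# can be intercepted before the generic broadcast machinery runs.
struct ToyStyle <: BroadcastStyle end
Base.broadcastable(v::ToyVector) = v
BroadcastStyle(::Type{<:ToyVector}) = ToyStyle()
BroadcastStyle(s::ToyStyle, ::DefaultArrayStyle) = s  # ToyStyle wins over scalars and arrays

# Recursively swap every ToyVector in the lazy broadcast tree for the plain
# array stored under `name`; scalars and other arguments pass through unchanged.
unwrap(x, name::Symbol) = x
unwrap(v::ToyVector, name::Symbol) = getfield(v.parts, name)
unwrap(bc::Broadcasted, name::Symbol) =
    broadcasted(bc.f, map(arg -> unwrap(arg, name), bc.args)...)

# `dest .= expr` lowers to `materialize!(dest, broadcasted(...))`, so this
# method runs one ordinary array broadcast per named component.
function materialize!(dest::ToyVector, bc::Broadcasted{ToyStyle})
    for name in keys(dest.parts)
        materialize!(getfield(dest.parts, name), unwrap(bc, name))
    end
    return dest
end

a = ToyVector((u = [1.0, 2.0], w = [3.0, 4.0, 5.0]))
b = ToyVector((u = [10.0, 20.0], w = [30.0, 40.0, 50.0]))
dest = ToyVector((u = zeros(2), w = zeros(3)))

# The fused expression is evaluated component by component on the plain arrays.
dest .= a .+ 2.0 .* b
```

Because each component is unwrapped before the broadcast is rebuilt, the components may have different lengths, and the sketch never needs to define `axes` for the wrapper. It only handles fused expressions assigned with `.=`; cases such as `dest .= a` or allocating broadcasts would need additional methods.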
I initially tried to avoid using `parent()` and used `Fields.field_values()` instead. However, this resulted in errors being thrown; e.g., if `w1` and `w2` are both `DataLayout`s of `Geometry.Cartesian3Vector`s, then `w1 ./ w2` throws an error, and this sort of operation is performed internally by OrdinaryDiffEq.jl.
In addition, I wanted to avoid assuming that every broadcasted object has the same axes as the destination parent arrays. I initially wrote the first method of `transform_broadcasted()` as
However, this caused a lot of unnecessary memory allocations, and did not speed things up as much as the version shown above. Ideally, the solution to this issue would properly deal with nested broadcast axes without unnecessary allocations. One idea I had was to change the definition of `axes(::FieldVector)` to be a tuple of the axes of the underlying parent arrays, in which case that first method of `transform_broadcasted()` could be
Putting aside the problem that my temporary patch will break in more complex scenarios, I will now show my flame graph results. Here are the flame graphs for my IMEX solver, which I ran for 5000 seconds of simulation time:
The graph on the left is without my patch, and the graph on the right is with my patch. I've added colored blocks to the graph that correspond to particular points in the computation:
- `madvise()`
- `memmove()`
- `Core.Compiler.typeinf()`
- `OrdinaryDiffEq.solve!()`

The orange and yellow blocks appear to be caused by type instability (they don't change due to the broadcasting patch, and they are the same size no matter how many times the code is run). I'm not sure if this is something we can fix by changing how `FieldVector`s are defined, or if this is just a problem with OrdinaryDiffEq.jl. They are very significant parts of the computation, though, so we should look into this.
The blue and purple blocks are probably related to memory allocations and garbage collection, since they shrink along with the green block because of the broadcasting patch.
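One generic way to test a suspicion of type instability like the one described above is `Test.@inferred`, which throws when the inferred return type of a call differs from the actual one. The sketch below uses two hypothetical containers (`LooseState`, `TightState`) and a hypothetical function `total`; it says nothing about where the instability in `FieldVector` or OrdinaryDiffEq.jl actually lies.

```julia
using Test: @inferred

# Hypothetical container with an abstractly typed field: the compiler
# cannot know the concrete array type stored in `data`.
struct LooseState
    data::AbstractVector
end

# Same container, but the field type is a parameter, so it is concrete
# for any given instance.
struct TightState{V<:AbstractVector}
    data::V
end

total(s) = sum(s.data)

loose = LooseState([1.0, 2.0, 3.0])
tight = TightState([1.0, 2.0, 3.0])

# Inference succeeds here, so @inferred simply returns the value.
tight_total = @inferred total(tight)

# Here the inferred return type is wider than the actual Float64,
# so @inferred throws; the error is caught and recorded.
loose_is_unstable = try
    @inferred total(loose)
    false
catch
    true
end
```

For interactive use, `@code_warntype total(loose)` prints the same information, with abstractly inferred types highlighted.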
Here are the same flame graphs, but zoomed in on the green blocks:
The new colored blocks are:
- `FieldVector`s
- `fill!()` called on `FieldVector`s
- `recursivecopy!()` called on `FieldVector`s
- (`Wfact` or `Wfact_t`) update

The broadcasting patch clearly transforms the brown blocks from the dominant part of the computation to an insignificant part of the computation. If we correctly generalize this patch, then we could similarly make the green and red blocks insignificant.
It is worth noting that the gray and white blocks are significantly smaller than the pink, purple, and blue blocks. This is because, in this particular example, the gray and white blocks only involve broadcasts over parent arrays, while the purple, pink, and blue blocks involve broadcasts over fields. I've since rewritten my implicit tendency function to only use parent arrays, which made the purple block two orders of magnitude smaller. So, there is still quite a bit of optimization that can be done for broadcasting over fields.
@simonbyrne @jakebolewski