Laplace equation benchmark performance #1168
Track performance of the laplace equation benchmark. Discussion here:
https://groups.google.com/d/topic/julia-dev/VDlr3_6Szos/discussion
Benchmark has been added to the perf2 suite: https://github.com/JuliaLang/julia/tree/master/test/perf2
I don't think the vectorized version will ever be as fast as the other one.
Yes, I was just thinking the same. The vectorized version does far too much work, but there is certainly room for improvement. -viral
Original multi-language performance comparison:
Devectorized julia for the same number of iterations (2^15) now runs in 3.8 seconds for me. Does numpy implement views and end up evaluating lazily in this case, hence being much faster? Since octave ends up taking the same amount of time, I presume that we just have a lot of room for improvement in julia.
Yes, numpy indexing with scalars and slices (ranges) always produces views.
Does octave do the same? I didn't think so - since it is matching numpy's performance. I'll write the benchmark in a few other languages for comparison.
I would imagine that it does copy-on-write, since it strives to emulate matlab. But since I'm not an octave user, I really don't know.
I really think it's arguable that our slices should be views.
+1
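For illustration, a minimal sketch of the copy-versus-view distinction under discussion, using the `view`/`@view` machinery that later Julia versions ship (not available when these comments were written):

```julia
A = collect(reshape(1:12, 3, 4))

s = A[1:2, 1:2]        # indexing with ranges copies: `s` owns its memory
s[1, 1] = 99
A[1, 1] == 1           # true; the parent array is untouched

v = @view A[1:2, 1:2]  # a SubArray aliasing A's memory, numpy-style
v[1, 1] = 99
A[1, 1] == 99          # true; writes through the view mutate the parent
```

Under "slices are views by default", the first form would behave like the second, which is what numpy does.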
I could get on board with that. We either have to make all ops on …
In my currently-stalled rewrite of the Image library, I have a substitute brewing:

```julia
type Counter{N}
    max::NTuple{N, Int}
end
Counter(sz) = Counter{length(sz)}(to_tuple(sz))

function start{N}(c::Counter{N})
    state = ones(Int, N)
    state[1] = 0  # because of start/done/next sequence, start out one behind
    return state
end

function done(c::Counter, state)
    # we do the increment as part of "done" to make exit-testing more efficient
    state[1] += 1
    i = 1
    max = c.max
    while state[i] > max[i] && i < length(state)
        state[i] = 1
        i += 1
        state[i] += 1
    end
    state[end] > c.max[end]
end

function next(c::Counter, state)
    return state, state
end
```

This can be pretty fast, particularly if you handle the inner loop like this:

```julia
szv = [sz...]  # vector from tuple
szv[1] = 1
for c = Counter(szv)
    cprod = prod(c)
    for i = 1:sz[1]
        p[index] = i*cprod
        index += 1
    end
end
```

Then for most applications the performance is essentially indistinguishable from linear indexing. (However, there are traps: this is the usage that prompted me to report performance issue 73d8ca4.) Assuming it could be generalized to work on general ranges, and to return a linear index, would that help? I think in my initial version of this I tried returning both the linear index and the coordinate representation, and also ran into performance problems due to tuple creation, but memory of exactly what was going on is now hazy.
Should have specified: on the inner loop I was demoing how you'd set each element to the product of its coordinates.
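As a point of comparison, here is a sketch of how that same "product of coordinates" demo reads with the `CartesianIndices` iterator that later Julia versions provide for exactly this kind of N-dimensional iteration (syntax is Julia 1.0-era, well after this thread):

```julia
p = zeros(Int, 3, 4, 5)
for c in CartesianIndices(p)
    p[c] = prod(Tuple(c))  # each element set to the product of its coordinates
end
```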
This seems to rely on making the first dimension a special case for performance, or at least assuming it is large enough for the inner loop to dominate. I think we need something more drastic. We may also need to depend on better compiler optimizations like allocating tuples on the stack or in registers.
Yes, that's basically true. OK, might one be able to generate a user-friendly version of …
This is turning into a full-blown discussion. Perhaps best to move it over to julia-dev to invite some more ideas?
I would rather invite anyone to have the discussion here, so that all the perf issues can be tracked and addressed in one place.
On the pure scalar indexing bit, we are still a factor of 1.5x-2x slower than C. Until we close that gap through better code generation and other optimizations, vectorized julia code will remain that much slower than languages whose vectorized operations are implemented in C.
There was a mailing list discussion a while back that found that bounds-checking accounts for some of this. Code with bounds-checks disabled (by commenting out two lines in codegen.cpp) ran in about 75% of the time of the code with bounds-checks, pretty consistently on three different machines (mine, Krys', and Bill's, if memory serves). Not a huge performance difference, but if you're thinking about factors of 2, it accounts for some chunk of it.
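For a present-day reproduction of that experiment: the `@inbounds` macro (added after this comment was written) disables bounds checks for a marked block, with no need to edit codegen.cpp. A minimal sketch, with illustrative function names:

```julia
function sumcols(A)
    s = 0.0
    for j in 1:size(A, 2), i in 1:size(A, 1)
        s += A[i, j]  # bounds-checked access
    end
    return s
end

function sumcols_inbounds(A)
    s = 0.0
    @inbounds for j in 1:size(A, 2), i in 1:size(A, 1)
        s += A[i, j]  # check elided; the caller promises indices are valid
    end
    return s
end
```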
I think with bounds-checking removed the code generated in loops was about the same as what clang generates for C/C++. You just have to feed the JIT engine enough code to be able to optimize it.
Using views as the default behavior seems appealing for a few reasons. Now that we have many more examples of array types (sparse, distributed, etc.), we can maybe come up with something that unifies them better and has good performance.
I ran the Laplace benchmark on master:
It would be nice to figure out if this is a problem of too many temporaries, and to what extent we need to rethink our indexing by using views and such.
I am marking this as 0.2 just to figure out whether there is something we can do to improve our performance for vectorized code by the next release.
It's definitely a problem of too many temporaries, as a simple sprofile run will show: out of 1104 captures at `perf2/laplace.jl; laplace_iter_vec; line: 25`, there were 970 at `array.jl; +; line: 875`. That line allocates the output array. (Out of those 970, 963 were in …)

However, I'd guess that this is not the kind of thing we're likely to solve in 2-3 weeks. I'd also tentatively suggest that Julia's inlining behavior may be a higher priority. The reason is simple: often a careful programmer can "write around" the library's current tendency to allocate temporaries. It can be inconvenient, but it's usually possible. Case in point: …

But coming from me this is all hot air, because ATM I don't feel up to the challenge of tackling this myself.
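A sketch of the "write around" idiom being described, with hypothetical names: instead of an expression like `c = a + b`, which allocates a fresh result array on every evaluation, the programmer updates a preallocated buffer in an explicit loop.

```julia
# add!: in-place addition into a caller-supplied buffer (illustrative name)
function add!(c, a, b)
    for i = 1:length(a)
        c[i] = a[i] + b[i]  # writes into `c`; no temporary array is created
    end
    return c
end

a = rand(1000); b = rand(1000)
c = similar(a)   # allocate once, outside any hot loop
add!(c, a, b)
```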
I think that's a good point, @timholy. cc: @JeffBezanson, who is, of course, already following this thread.
I concur that it will not be solved in 2 weeks just by setting a milestone. But I have plans. I have been ruminating a lot about julia's theory of "what is a function", and I may be starting to get somewhere.
Sounds exciting! Here, have a nice cup of tea to help your ruminations.
Right, I didn't expect anything to be solved in 2 weeks, but even a rough plan for tackling this would be an important signal for everyone. Also, as @JeffBezanson has often said, the best way to avoid GC is to not allocate memory. Perhaps we can revisit our indexing along these lines, as discussed in this thread.
57e010e helps both laplace_vec and laplace_devec significantly (but does not fix the gap between them).
I just reran this benchmark on julia.mit.edu, and find that octave takes 0.8 seconds, whereas julia takes 2.2 seconds.
I added a subarray version here, and it is significantly slower. I guess we have to wait for the tuple overhaul. In any case, I was wondering if we can characterize why the vectorized version is so much slower (subarrays or otherwise) compared to the devectorized version. I was kind of hoping that @carnaval's GC work would help here, and I wonder if a better GC will help (perhaps by tuning some knobs in the new GC), or if the gains need to come from elsewhere.
I thought the new GC already helped here significantly, though of course it does not close the whole gap. We should work on subarrays.
If you see my older comment and the new results, the difference went from 13x to 10x. The new GC helped significantly on …
I tried a subarray version a while ago, and it was a little faster. Can you link to your code?
I have checked in the subarray version into laplace.jl. With the tuple overhaul now in place, there are significant gains, but still too much memory allocation.
I should add: technically, with the new GC the performance problems are presumably more an issue of cache efficiency than memory allocation per se. When you allocate a temporary …
Cache behavior is especially bad if you are allocating temps in a loop, because you will get fresh (read: not cached) memory at each iteration up until collection, and then start all over again.
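A sketch of the access pattern being described, with illustrative names: allocating inside the loop walks through fresh, cold memory every iteration, while a buffer hoisted out of the loop stays cache-warm. (The `.=` fused-broadcast syntax used here postdates part of this thread.)

```julia
function scale_naive!(u, n)
    for k = 1:n
        tmp = 2.0 .* u   # fresh array every iteration: cold memory each time
        u .= tmp
    end
    return u
end

function scale_hoisted!(u, n)
    tmp = similar(u)     # one buffer, reused: stays in cache across iterations
    for k = 1:n
        tmp .= 2.0 .* u  # fused broadcast writes in place; no allocation
        u .= tmp
    end
    return u
end
```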
Sounds like something to look forward to!
I am really looking forward to it too!
Hopefully the performance of the vectorized version will be fixed by #17623, assuming we change the code to use the fused dot-broadcast syntax (`.=` and friends).
Syntactic fusion seems like a completely fair solution to this performance issue.
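For readers landing here later, a minimal illustration of what the syntactic fusion from #17623 buys (a sketch; the lowered loop in the comment is schematic, not the exact generated code):

```julia
x = rand(100)
y = similar(x)
# The entire dotted right-hand side fuses into a single loop, writing
# directly into `y` with no intermediate arrays; roughly equivalent to
#   for i in eachindex(x); y[i] = 2*x[i] + 1; end
y .= 2 .* x .+ 1
```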
```julia
julia> function laplace!(u, Niter, dx2, dy2)
           Ny, Nx = size(u)
           c = 1 / (2*(dx2+dy2))
           u0 = @view u[2:Ny-1, 2:Nx-1]
           u1 = @view u[1:Ny-2, 2:Nx-1]
           u2 = @view u[3:Ny,   2:Nx-1]
           u3 = @view u[2:Ny-1, 1:Nx-2]
           u4 = @view u[2:Ny-1, 3:Nx]
           for i = 1:Niter
               u0 .= ((u1 .+ u2) .* dy2 .+ (u3 .+ u4) .* dx2) .* c
           end
           return u0
       end
laplace! (generic function with 1 method)

julia> u = zeros(150,150); u[1,:] = 1
1

julia> @time laplace!(u, 2^15, 0.01, 0.01);
  7.999779 seconds (301.50 k allocations: 10.248 MB, 0.58% gc time)

julia> @time laplace!(u, 2^15, 0.01, 0.01);
  7.747673 seconds (10 allocations: 496 bytes)
```
In comparison, an unrolled version does slightly better, which is not too unexpected — there is some overhead (less than a factor of 2) to using all of those views and generic broadcasting iteration:

```julia
julia> function laplace_iter!(u, Niter, dx2, dy2)
           Ny, Nx = size(u)
           c = 1 / (2*(dx2+dy2))
           for iter = 1:Niter
               for i = 2:size(u,1)-1, j = 2:size(u,2)-1
                   u[i,j] = ((u[i-1,j]+u[i+1,j])*dx2 + (u[i,j-1]+u[i,j+1])*dy2)*c
               end
           end
           return u
       end
laplace_iter! (generic function with 1 method)

julia> @time laplace_iter!(u, 2^15, 0.01, 0.01);
  4.281638 seconds (7.69 k allocations: 327.032 KB)

julia> @time laplace_iter!(u, 2^15, 0.01, 0.01);
  4.221090 seconds (5 allocations: 176 bytes)
```

But I think the vectorized version is fast enough that we can now close this?