poor performance of the code using sub #5100

Closed
alsam opened this issue Dec 10, 2013 · 11 comments
@alsam commented Dec 10, 2013

This is a continuation of #5084

The mail thread:

https://groups.google.com/forum/#!topic/julia-dev/NOF6MA6tb9Y

The code below exposes poor performance:

function main()
      const ncells = 10000

##      u    = Array(Float64, ncells+4)
##      x    = Array(Float64, ncells+1)
##      flux = Array(Float64, ncells+1)

      au    = Array(Float64, ncells+4)
      ax    = Array(Float64, ncells+1)
      aflux = Array(Float64, ncells+1)

      u    = sub(au,       4:ncells+1)
      x    = sub(ax,       2:ncells)
      flux = sub(aflux,    2:ncells)

      const nsteps   =  10000

      istep = 0
      while istep < nsteps
        for ic= 0 : (ncells-1)
            ##u[ic+3] -= (flux[ic+2]-flux[ic+1]) / (x[ic+2]-x[ic+1])
            u[ic] -= (flux[ic+1]-flux[ic]) / (x[ic+1]-x[ic])
        end

        istep = istep+1
      end

end # main

# jit compile main
main()

@time main()

elapsed time: 38.070276549 seconds (23509047972 bytes allocated)

Array-based version (commented above by ##):

elapsed time: 3.756609233 seconds (246868 bytes allocated)

I.e. the sub version shows a 10x slowdown (the expected slowdown should be less than a factor of 2), and the enormous number of allocated bytes points to type-inference problems.

Thanks!
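An aside on the diagnosis above: allocation counts are indeed a handy inference smoke test. A minimal sketch in modern Julia (where `sub` has since been replaced by `view`; the kernel and sizes below are illustrative, not from the issue): a type-stable kernel over a view should allocate nothing per call once compiled, so a huge byte count from `@time` hints at an inference problem.

```julia
# Minimal sketch (modern Julia; `view` replaced the old `sub`).
# A type-stable kernel over a view allocates nothing per call once compiled.
function kernel(v)
    s = 0.0
    for x in v
        s += x            # no boxing here if the element type is inferred
    end
    return s
end

const v = view(ones(100), 2:99)  # const so the global is concretely typed

kernel(v)                        # compile first
@show @allocated kernel(v)       # a type-stable kernel reports 0 bytes
```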

@simonster (Member)

At least part of the problem is that sub can't infer the dimensionality N of the SubArray. This is easy to fix for vectors, but might be harder to fix for the general n-dimensional case.

@timholy (Member) commented Dec 11, 2013

Thanks, Simon. @alsam, FYI this would not have been an issue if the SubArrays were passed as an argument to a function that was just the "algorithmic" part of what you were doing. Julia optimizes at the level of functions, so these type inference issues would not be such a problem. Separating the allocation from the algorithm would also facilitate your comparison of Arrays vs SubArrays.

@alsam (Author) commented Dec 11, 2013

Sure, thanks Simon for the fix.
Hi Tim, I actually hit a serious performance issue when creating SubArrays from a separate function.
Looks like an inference problem again. I will work on producing a testcase.
There is also another performance issue with Array vs. SubArray; I will elaborate after studying it (it is elusive, but anyway the performance impact is smaller than before :)

@ivarne (Member) commented Dec 11, 2013

I think Tim meant that you could have "hidden" the type inference issues by doing the calculations in a separate function:

a = sub(...)
do_calculation(a)

not

a = my_subarray_creation()
#do calculation

That way the do_calculation() function gets specialized for the concrete type of a, and the costly runtime dispatch happens once each time do_calculation is called, not on every step of the calculation.
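The pattern described above can be sketched concretely. This is a sketch in modern Julia, where `sub` was replaced by 1-based `view`; the array sizes and values are illustrative, and the shifted indexing of the original report is replaced by plain 1-based indexing behind the barrier:

```julia
# Function-barrier sketch (modern Julia; `view` replaced `sub`, views are 1-based).
# The kernel is generic over its argument types, so Julia compiles a method
# specialized for the concrete view types; dynamic dispatch happens once per
# call, not once per loop iteration.
function do_calculation!(u, flux, x)
    for ic in eachindex(u)
        u[ic] -= (flux[ic+1] - flux[ic]) / (x[ic+1] - x[ic])
    end
    return u
end

au    = zeros(16)
aflux = collect(1.0:13.0)       # flux[i+1] - flux[i] == 1.0
ax    = collect(0.0:2.0:24.0)   # x[i+1]    - x[i]    == 2.0

u    = view(au, 1:12)           # views created in the (possibly type-unstable) outer scope
flux = view(aflux, 1:13)
x    = view(ax, 1:13)

do_calculation!(u, flux, x)     # barrier: specialization kicks in here
```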

@alsam (Author) commented Dec 11, 2013

Yeah, sure, got it and see the difference.
Actually I was talking about another problem: the function below has serious performance problems, most likely with inference:

function farray(T, r::Range1{Int64}...)
    dims = map((x) -> length(x), r)
    array = Array(T, dims)
    sub_indices = map((x) -> -minimum(x) + 2 : maximum(x), r)
    sub(array, sub_indices)
end

Do you happen to know how to improve the performance? I'm quite new to Julia.
Thanks!

@alsam (Author) commented Dec 11, 2013

Got it!
I left the function farray above intact but moved all computations on the SubArrays into a separate function.
The performance is almost the same as with the sub calls inlined into the code.
Now there is only a factor-of-4 slowdown for SubArrays compared to Arrays, instead of the initial 100x!

@ivarne (Member) commented Dec 11, 2013

PS. As I guessed, this function was not the bottleneck in the computation, but here you have it anyway.

That function is very flexible, but unfortunately I can't guess the types of the variables, and apparently Julia can't either. If this is a performance-critical function in your program, I would consider specializing it.

  1. I would write more explicit code, using comprehensions instead of map and lambda functions. Currently lambda functions do not get inlined, so that might be a performance bottleneck.
  2. I would ensure that the compiler can infer the dimensions of array from the type information. The length of r is not part of the type information, so you could write separate methods for each possible number of ranges.

Unfortunately I don't understand what your function tries to do, but here is an attempt at something better for the 2-argument version.

function farray(T, r1::Range1{Int64},r2::Range1{Int64})
    array = Array(T, length(r1),length(r2))
    sub_indices = [-minimum(x) + 2 : maximum(x) for x in (r1,r2)]
    sub(array, sub_indices[1], sub_indices[2])
end
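For what it's worth, a dimension-generic variant can also be made inferable by keeping the ranges in a Vararg tuple, so that their number stays in the type domain. This is only a sketch in modern Julia (assuming `view` in place of `sub` and `UnitRange{Int}` in place of the old `Range1{Int64}`), keeping the original `-minimum(x) + 2` offset:

```julia
# Sketch of a dimension-generic, inferable farray (modern Julia).
# Mapping over the Vararg tuple keeps the number of ranges in the type
# domain, so the dimensionality of the resulting view is known to the
# compiler instead of being decided at runtime.
function farray(::Type{T}, rs::UnitRange{Int}...) where {T}
    array = Array{T}(undef, map(length, rs))
    view(array, map(r -> (-minimum(r) + 2):maximum(r), rs)...)
end
```

Note that modern views are 1-based, so this reproduces the construction but not negative-base indexing; the OffsetArrays.jl package is the usual route to Fortran-style bases in current Julia.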

@alsam (Author) commented Dec 11, 2013

Sure, the function farray is not a bottleneck at all, as it is called only three times to construct three arrays. Still, when I used it instead of inlined sub calls, a serious penalty was observed until I moved the computations into a separate function; this seems to be due to dynamic dispatch, as you explained.

The function constructs negatively based (and zero-based) Fortran-like arrays:

      u    = farray(Float64, -2:ncells+1)
      x    = farray(Float64,  0:ncells)
      flux = farray(Float64,  0:ncells)

instead of

      u    = Array(Float64, ncells+4)
      x    = Array(Float64, ncells+1)
      flux = Array(Float64, ncells+1)

In the latter case the index has to be shifted in the index expressions, i.e.

         u[ic+3] -= (flux[ic+2]-flux[ic+1]) / (x[ic+2]-x[ic+1])

instead of more natural

         u[ic] -= (flux[ic+1]-flux[ic]) / (x[ic+1]-x[ic])

Arrays with negative/zero bases are used here for ghost cells to accommodate boundary conditions.

I'll try with specializations to see a difference.

Thanks!

@alsam (Author) commented Dec 11, 2013

I have the following synthetic testcase:

function mykern(r, s, a, x, y)
    const nsteps = 25000 
    istep = 0

    ifirst = minimum(r)
    ilast  = maximum(r)
    while istep < nsteps #&& t < tmax

        for i = ifirst : ilast-1
            s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i])
        end

        for i = ifirst : ilast
            x[i] = a * s[i-1]
        end

        for i = ifirst : ilast
            y[i] = a * x[i-1]
        end

        for i = ifirst : ilast-1
            s[i] = s[i] - (x[i+1] - x[i]) / (y[i+1] - y[i])
        end
        istep=istep + 1
    end 
    s
end 

const siz = 10^4
const a = e^pi - pi^e

s = rand(siz)
x = rand(siz)
y = rand(siz)

s1 = sub(s, 1:(siz - 1))
x1 = sub(x, 1:(siz - 1))
y1 = sub(y, 1:(siz - 1))

s2 = sub(s, 2:siz)
x2 = sub(x, 2:siz)
y2 = sub(y, 2:siz)

s3 = sub(s, 4:siz)

# warming up
# mykern( 2:(siz - 3), s, a, x,  y  )
# mykern( 2:(siz - 3), s, a, x1, y1 )
# mykern( 3:(siz - 2), s, a, x2, y2 )
# @time mykern( 2:(siz - 3), s, a, x,  y )

# the real test

println("follow up: mykern( 2:(siz - 3), s, a, x, y )")
@time mykern( 2:(siz - 3), s, a, x,  y  )
println("follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )")
@time mykern( 2:(siz - 3), s1, a, x1, y1 )
println("follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@time mykern( 3:(siz - 2), s2, a, x2, y2 )
@time mykern( 3:(siz - 2), s1, a, x2, y2 )
@time mykern( 5:(siz - 5), s3, a, x2, y2 )

and timings for the testcase:

julia shifted2.jl 
follow up: mykern( 2:(siz - 3), s, a, x, y )
elapsed time: 2.847626459 seconds (611712 bytes allocated)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
elapsed time: 8.850332117 seconds (704324 bytes allocated)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
elapsed time: 8.779391172 seconds (88 bytes allocated)
elapsed time: 8.804737883 seconds (88 bytes allocated)
elapsed time: 8.807666941 seconds (88 bytes allocated)

I.e. the slowdown from using subs is about 3-4x. On a bigger testcase a slowdown of up to 5x was observed.

Is that expected? Does it make sense to create a new issue for this performance improvement?
Thanks!

@timholy (Member) commented Dec 11, 2013

Yes, we'd still like to erase the remaining hit, so please do file a new issue. But now you're getting into territory where the fixes may not come as quickly. If you want to help address it, profiling is the next step.
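For reference, the profiling step mentioned above can be done with Julia's built-in sampling profiler. A minimal sketch in modern Julia (the `work` function here is an illustrative stand-in for mykern, not code from the issue):

```julia
using Profile

# Illustrative workload standing in for mykern.
function work(n)
    s = 0.0
    for i in 1:n
        s += sqrt(i)
    end
    return s
end

work(10)               # run once so JIT compilation doesn't dominate the profile
Profile.clear()
@profile work(10^8)    # sample the hot loop
Profile.print(maxdepth = 8)   # per-line sample counts show where the time goes
```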

@alsam (Author) commented Dec 12, 2013

Sure, thanks.

Submitted a new issue:
#5117

4 participants