poor performance of the code using sub #5100

Closed
alsam opened this issue Dec 10, 2013 · 11 comments
@alsam commented Dec 10, 2013

This is a continuation of #5084

The mail thread:

https://groups.google.com/forum/#!topic/julia-dev/NOF6MA6tb9Y

The code below exposes poor performance:

function main()
      const ncells = 10000

##      u    = Array(Float64, ncells+4)
##      x    = Array(Float64, ncells+1)
##      flux = Array(Float64, ncells+1)

      au    = Array(Float64, ncells+4)
      ax    = Array(Float64, ncells+1)
      aflux = Array(Float64, ncells+1)

      u    = sub(au,       4:ncells+1)
      x    = sub(ax,       2:ncells)
      flux = sub(aflux,    2:ncells)

      const nsteps   =  10000

      istep = 0
      while istep < nsteps
        for ic= 0 : (ncells-1)
            ##u[ic+3] -= (flux[ic+2]-flux[ic+1]) / (x[ic+2]-x[ic+1])
            u[ic] -= (flux[ic+1]-flux[ic]) / (x[ic+1]-x[ic])
        end

        istep = istep+1
      end

end # main

# jit compile main
main()

@time main()

elapsed time: 38.070276549 seconds (23509047972 bytes allocated)

Array-based version (commented above by ##):

elapsed time: 3.756609233 seconds (246868 bytes allocated)

I.e. the sub version shows a 10x slowdown (the expected slowdown should be less than a factor of 2), and the enormous number of allocated bytes points to type-inference problems.

Thanks!
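An aside on the diagnosis above: allocation counts are indeed a handy inference smoke test. A minimal sketch in modern Julia (where `sub` has since been replaced by `view`; the kernel and sizes below are illustrative, not from the issue): a type-stable kernel over a view should allocate nothing per call once compiled, so a huge byte count from `@time` hints at an inference problem.

```julia
# Minimal sketch (modern Julia; `view` replaced the old `sub`).
# A type-stable kernel over a view allocates nothing per call once compiled.
function kernel(v)
    s = 0.0
    for x in v
        s += x            # no boxing here if the element type is inferred
    end
    return s
end

const v = view(ones(100), 2:99)  # const so the global is concretely typed

kernel(v)                        # compile first
@show @allocated kernel(v)       # a type-stable kernel reports 0 bytes
```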

@simonster (Member)

At least part of the problem is that sub can't infer the dimensionality N of the SubArray. This is easy to fix for vectors, but might be harder to fix for the general n-dimensional case.

@timholy (Member) commented Dec 11, 2013

Thanks, Simon. @alsam, FYI this would not have been an issue if the SubArrays were passed as an argument to a function that was just the "algorithmic" part of what you were doing. Julia optimizes at the level of functions, so these type inference issues would not be such a problem. Separating the allocation from the algorithm would also facilitate your comparison of Arrays vs SubArrays.

@alsam (Author) commented Dec 11, 2013

Sure, thanks Simon for the fix.
Hi Tim, I actually hit a serious performance issue when creating SubArrays from a separate function.
Looks like an inference problem again. I will work on producing a testcase.
There is also another performance issue with Array vs. SubArray; I will elaborate after studying it (it is elusive, but anyway the performance impact is smaller than before :)

@ivarne (Member) commented Dec 11, 2013

I think Tim meant that you could have "hidden" the type inference issues by doing the calculations in a separate function:

a = sub(...)
do_calculation(a)

not

a = my_subarray_creation()
#do calculation

That way the do_calculation() function gets specialized for the concrete type of a, and the costly runtime dispatch happens once each time do_calculation is called, not on every step of the calculation.
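The pattern described above can be sketched concretely. This is a sketch in modern Julia, where `sub` was replaced by 1-based `view`; the array sizes and values are illustrative, and the shifted indexing of the original report is replaced by plain 1-based indexing behind the barrier:

```julia
# Function-barrier sketch (modern Julia; `view` replaced `sub`, views are 1-based).
# The kernel is generic over its argument types, so Julia compiles a method
# specialized for the concrete view types; dynamic dispatch happens once per
# call, not once per loop iteration.
function do_calculation!(u, flux, x)
    for ic in eachindex(u)
        u[ic] -= (flux[ic+1] - flux[ic]) / (x[ic+1] - x[ic])
    end
    return u
end

au    = zeros(16)
aflux = collect(1.0:13.0)       # flux[i+1] - flux[i] == 1.0
ax    = collect(0.0:2.0:24.0)   # x[i+1]    - x[i]    == 2.0

u    = view(au, 1:12)           # views created in the (possibly type-unstable) outer scope
flux = view(aflux, 1:13)
x    = view(ax, 1:13)

do_calculation!(u, flux, x)     # barrier: specialization kicks in here
```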

@alsam (Author) commented Dec 11, 2013

Yeah, sure, got it and see the difference.
Actually I was talking about another problem: the function below has serious performance problems, most likely with inference:

function farray(T, r::Range1{Int64}...)
    dims = map((x) -> length(x), r)
    array = Array(T, dims)
    sub_indices = map((x) -> -minimum(x) + 2 : maximum(x), r)
    sub(array, sub_indices)
end

Do you happen to know how to improve the performance? I'm quite new to Julia.
Thanks!

@alsam (Author) commented Dec 11, 2013

Got it!
I left the function farray above intact but moved all computations on the SubArrays into a separate function.
The performance is almost the same as with the sub calls inlined into the code.
Now there is only a factor-of-4 slowdown for SubArrays compared to Arrays, instead of the initial 100x!

@ivarne (Member) commented Dec 11, 2013

PS. As I guessed, this function was not the bottleneck in the computation, but here you have it anyway.

That function is very flexible, but unfortunately I can't guess the types of the variables, and apparently Julia can't either. If this is a performance-critical function in your program, I would consider specializing it.

  1. I would write more explicit code, using comprehensions instead of map and lambda functions. Currently lambda functions do not get inlined, so that might be a performance bottleneck.
  2. I would ensure that the compiler can infer the dimensions of array from the type information. The length of r is not part of the type information, so you could write separate methods for each possible number of ranges.

Unfortunately I don't understand what your function tries to do, but here is an attempt at something better for the 2-argument version.

function farray(T, r1::Range1{Int64},r2::Range1{Int64})
    array = Array(T, length(r1),length(r2))
    sub_indices = [-minimum(x) + 2 : maximum(x) for x in (r1,r2)]
    sub(array, sub_indices[1], sub_indices[2])
end
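For what it's worth, a dimension-generic variant can also be made inferable by keeping the ranges in a Vararg tuple, so that their number stays in the type domain. This is only a sketch in modern Julia (assuming `view` in place of `sub` and `UnitRange{Int}` in place of the old `Range1{Int64}`), keeping the original `-minimum(x) + 2` offset:

```julia
# Sketch of a dimension-generic, inferable farray (modern Julia).
# Mapping over the Vararg tuple keeps the number of ranges in the type
# domain, so the dimensionality of the resulting view is known to the
# compiler instead of being decided at runtime.
function farray(::Type{T}, rs::UnitRange{Int}...) where {T}
    array = Array{T}(undef, map(length, rs))
    view(array, map(r -> (-minimum(r) + 2):maximum(r), rs)...)
end
```

Note that modern views are 1-based, so this reproduces the construction but not negative-base indexing; the OffsetArrays.jl package is the usual route to Fortran-style bases in current Julia.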

@alsam (Author) commented Dec 11, 2013

Sure, the function farray is not a bottleneck at all, as it is called only three times to construct three arrays. Still, when I used it instead of inlined sub calls, a serious penalty was observed until I moved the computations into a separate function; this seems to be due to dynamic dispatch, as you explained.

The function constructs negatively based (and zero-based) Fortran-like arrays:

      u    = farray(Float64, -2:ncells+1)
      x    = farray(Float64,  0:ncells)
      flux = farray(Float64,  0:ncells)

instead of

      u    = Array(Float64, ncells+4)
      x    = Array(Float64, ncells+1)
      flux = Array(Float64, ncells+1)

In the latter case the index has to be shifted in the index expressions, i.e.

         u[ic+3] -= (flux[ic+2]-flux[ic+1]) / (x[ic+2]-x[ic+1])

instead of more natural

         u[ic] -= (flux[ic+1]-flux[ic]) / (x[ic+1]-x[ic])

Arrays with negative/zero bases are used here for ghost cells to accommodate boundary conditions.

I'll try with specializations to see a difference.

Thanks!

@alsam (Author) commented Dec 11, 2013

I have the following synthetic testcase:

function mykern(r, s, a, x, y)
    const nsteps = 25000 
    istep = 0

    ifirst = minimum(r)
    ilast  = maximum(r)
    while istep < nsteps #&& t < tmax

        for i = ifirst : ilast-1
            s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i])
        end

        for i = ifirst : ilast
            x[i] = a * s[i-1]
        end

        for i = ifirst : ilast
            y[i] = a * x[i-1]
        end

        for i = ifirst : ilast-1
            s[i] = s[i] - (x[i+1] - x[i]) / (y[i+1] - y[i])
        end
        istep=istep + 1
    end 
    s
end 

const siz = 10^4
const a = e^pi - pi^e

s = rand(siz)
x = rand(siz)
y = rand(siz)

s1 = sub(s, 1:(siz - 1))
x1 = sub(x, 1:(siz - 1))
y1 = sub(y, 1:(siz - 1))

s2 = sub(s, 2:siz)
x2 = sub(x, 2:siz)
y2 = sub(y, 2:siz)

s3 = sub(s, 4:siz)

# warming up
# mykern( 2:(siz - 3), s, a, x,  y  )
# mykern( 2:(siz - 3), s, a, x1, y1 )
# mykern( 3:(siz - 2), s, a, x2, y2 )
# @time mykern( 2:(siz - 3), s, a, x,  y )

# the real test

println("follow up: mykern( 2:(siz - 3), s, a, x, y )")
@time mykern( 2:(siz - 3), s, a, x,  y  )
println("follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )")
@time mykern( 2:(siz - 3), s1, a, x1, y1 )
println("follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@time mykern( 3:(siz - 2), s2, a, x2, y2 )
@time mykern( 3:(siz - 2), s1, a, x2, y2 )
@time mykern( 5:(siz - 5), s3, a, x2, y2 )

and timings for the testcase:

julia shifted2.jl 
follow up: mykern( 2:(siz - 3), s, a, x, y )
elapsed time: 2.847626459 seconds (611712 bytes allocated)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
elapsed time: 8.850332117 seconds (704324 bytes allocated)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
elapsed time: 8.779391172 seconds (88 bytes allocated)
elapsed time: 8.804737883 seconds (88 bytes allocated)
elapsed time: 8.807666941 seconds (88 bytes allocated)

I.e. the slowdown from using subs is about 3-4x. On a bigger testcase a slowdown of up to 5x was observed.

Is that expected? Does it make sense to create a new issue for this performance improvement?
Thanks!

@timholy (Member) commented Dec 11, 2013

Yes, we'd still like to erase the remaining hit, so please do file a new issue. But now you're getting into territory where the fixes may not come as quickly. If you want to help address it, profiling is the next step.
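For reference, the profiling step mentioned above can be done with Julia's built-in sampling profiler. A minimal sketch in modern Julia (the `work` function here is an illustrative stand-in for mykern, not code from the issue):

```julia
using Profile

# Illustrative workload standing in for mykern.
function work(n)
    s = 0.0
    for i in 1:n
        s += sqrt(i)
    end
    return s
end

work(10)               # run once so JIT compilation doesn't dominate the profile
Profile.clear()
@profile work(10^8)    # sample the hot loop
Profile.print(maxdepth = 8)   # per-line sample counts show where the time goes
```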

@alsam (Author) commented Dec 12, 2013

Sure, thanks.

Submitted a new issue:
#5117

4 participants