Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dot product with sub() slow for small ranges #15961

Closed
xlambein opened this issue Apr 20, 2016 · 3 comments
Closed

Dot product with sub() slow for small ranges #15961

xlambein opened this issue Apr 20, 2016 · 3 comments

Comments

@xlambein
Copy link

Hello,

I'm a new user of Julia, I've been using it for the past 3 months and I recently started using it for university.

Right now I'm doing a project in which I have matrices of size p×n, with p small and n large, and I need to perform hundreds of thousands of dot products between random columns of these matrices. My code being too slow, I was looking for a bottleneck and found that it was because U[:,i] actually performed a copy. I tried using sub but the results weren't satisfying either. I decided to test different ways of computing my dot product, and found that the fastest way to do it was to do a simple loop with an accumulator.

Here is my benchmark code:

dot1(U, V, i, j) = dot(U[:,i], V[:,j])
dot2(U, V, i, j) = sum([U[k,i]*V[k,j] for k=1:size(U,1)])
dot3(U, V, i, j) = dot(sub(U, :, i), sub(V, :, j))

function dot4(U, V, i, j)
    r = 0.0
    for k=1:size(U,1)
        r += U[k,i]*V[k,j]
    end
    r
end

for (p, n) in [(25, 1000000), (250, 100000), (2500, 10000)]
    U, V = randn(p, 10000), randn(p, 10000)

    # Compile functions and check that they work
    @assert dot1(U, V, 1, 1)  dot2(U, V, 1, 1)  dot3(U, V, 1, 1)  dot4(U, V, 1, 1)

    println("(p, n) = ($p, $n)")
    @time for k=1:n dot1(U, V, rand(1:10000), rand(1:10000)) end
    @time for k=1:n dot2(U, V, rand(1:10000), rand(1:10000)) end
    @time for k=1:n dot3(U, V, rand(1:10000), rand(1:10000)) end
    @time for k=1:n dot4(U, V, rand(1:10000), rand(1:10000)) end
end

And here is the result, on my computer:

(p, n) = (25, 1000000)
  0.664159 seconds (7.80 M allocations: 729.302 MB, 8.76% gc time)
  0.369080 seconds (1000.00 k allocations: 320.435 MB, 1.80% gc time)
  0.350421 seconds (8.00 M allocations: 274.658 MB, 3.06% gc time)
  0.227205 seconds
(p, n) = (250, 100000)
  0.200553 seconds (779.60 k allocations: 405.572 MB, 15.58% gc time)
  0.183650 seconds (100.00 k allocations: 198.364 MB, 10.59% gc time)
  0.124294 seconds (800.00 k allocations: 27.466 MB, 2.45% gc time)
  0.092410 seconds
(p, n) = (2500, 10000)
  0.154928 seconds (117.95 k allocations: 383.880 MB, 30.14% gc time)
  0.147903 seconds (20.00 k allocations: 191.345 MB, 7.82% gc time)
  0.039056 seconds (80.00 k allocations: 2.747 MB)
  0.065859 seconds

For large values of p, the observed behaviour is the one I expected: dot with sub is fastest. But for smaller values, it's barely faster than a list comprehension, and the loop with an accumulator is fastest. I don't know if this is the expected behaviour for sub, but it seems like a performance bug. I suppose there must be an overhead associated with sub. Is there a way to index an array in a "C-pointer-like" way, without overhead? That would most certainly fix the issue.

@KristofferC
Copy link
Member

KristofferC commented Apr 20, 2016

What you are looking for is #14955, see also https://groups.google.com/forum/#!msg/julia-users/6Z1kJ5LrSMU/_HcWLsubBAAJ

Also, fyi, this will make your loop version a bit faster (so it is not slower than sub for large p):

function dot5{T}(U::Matrix{T}, V::Matrix{T}, i, j)
    s = zero(T)
    @inbounds @simd for k in 1:size(U,1)
        s += U[k, i] * V[k, j]
    end
    return s
end

@timholy
Copy link
Member

timholy commented Apr 20, 2016

It's considerably better if you time a function rather than at global scope, see below (requires julia-0.5 to be efficient). But still far from ideal.

This can't be fixed without #12205; basically the issue is that the SubArray type needs to hold a reference to the parent array, and since that's not a bitstype each object gets allocated on the heap rather than the stack.

Modified test code:

dot1(U, V, i, j) = dot(U[:,i], V[:,j])
dot2(U, V, i, j) = sum([U[k,i]*V[k,j] for k=1:size(U,1)])
dot3(U, V, i, j) = dot(sub(U, :, i), sub(V, :, j))

function dot4(U, V, i, j)
    r = 0.0
    for k=1:size(U,1)
        r += U[k,i]*V[k,j]
    end
    r
end

function rundot(f, U, V, n)
    size(U) == size(V) || throw(DimensionMismatch("Matrices have different sizes"))
    rng = 1:size(U, 2)
    s = 0.0
    for k = 1:n
        s += f(U, V, rand(rng), rand(rng))
    end
    s
end

for (p, n) in [(25, 1000000), (250, 100000), (2500, 10000)]
    U, V = randn(p, 10000), randn(p, 10000)

    # Compile functions and check that they work
    @assert dot1(U, V, 1, 1)   dot2(U, V, 1, 1)   dot3(U, V, 1, 1)   dot4(U, V, 1, 1)
    for f in (dot1, dot2, dot3, dot4)
        rundot(f, U, V, 1)
    end

    println("(p, n) = ($p, $n)")
    @time rundot(dot1, U, V, n)
    @time rundot(dot2, U, V, n)
    @time rundot(dot3, U, V, n)
    @time rundot(dot4, U, V, n)
end

Results:

(p, n) = (25, 1000000)
  0.606342 seconds (5.90 M allocations: 700.342 MB, 14.30% gc time)
  0.315316 seconds (1.00 M allocations: 320.435 MB, 6.49% gc time)
  0.379915 seconds (4.00 M allocations: 152.588 MB, 4.69% gc time)
  0.182975 seconds (1 allocation: 16 bytes)
(p, n) = (250, 100000)
  0.543606 seconds (589.69 k allocations: 420.985 MB, 34.70% gc time)
  0.269937 seconds (100.00 k allocations: 207.520 MB, 12.57% gc time)
  0.078260 seconds (400.00 k allocations: 15.259 MB, 3.41% gc time)
  0.083623 seconds (1 allocation: 16 bytes)
(p, n) = (2500, 10000)
  0.247148 seconds (78.97 k allocations: 384.201 MB, 53.79% gc time)
  0.076361 seconds (20.00 k allocations: 191.803 MB, 8.75% gc time)
  0.032036 seconds (40.00 k allocations: 1.526 MB)
  0.055625 seconds (1 allocation: 16 bytes)

@timholy
Copy link
Member

timholy commented Apr 20, 2016

Since there's nothing to do and this will be "automatically" fixed by #12205, I'm closing this.

@timholy timholy closed this as completed Apr 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants