Dot product with sub() slow for small ranges #15961

xlambein · 2016-04-20T10:30:55Z

Hello,

I'm a new user of Julia, I've been using it for the past 3 months and I recently started using it for university.

Right now I'm doing a project in which I have matrices of size p×n, with p small and n large, and I need to perform hundreds of thousands of dot products between random columns of these matrices. My code being too slow, I was looking for a bottleneck and found that it was because U[:,i] actually performed a copy. I tried using sub but the results weren't satisfying either. I decided to test different ways of computing my dot product, and found that the fastest way to do it was to do a simple loop with an accumulator.

Here is my benchmark code:

dot1(U, V, i, j) = dot(U[:,i], V[:,j])
dot2(U, V, i, j) = sum([U[k,i]*V[k,j] for k=1:size(U,1)])
dot3(U, V, i, j) = dot(sub(U, :, i), sub(V, :, j))

function dot4(U, V, i, j)
    r = 0.0
    for k=1:size(U,1)
        r += U[k,i]*V[k,j]
    end
    r
end

for (p, n) in [(25, 1000000), (250, 100000), (2500, 10000)]
    U, V = randn(p, 10000), randn(p, 10000)

    # Compile functions and check that they work
    @assert dot1(U, V, 1, 1) ≈ dot2(U, V, 1, 1) ≈ dot3(U, V, 1, 1) ≈ dot4(U, V, 1, 1)

    println("(p, n) = ($p, $n)")
    @time for k=1:n dot1(U, V, rand(1:10000), rand(1:10000)) end
    @time for k=1:n dot2(U, V, rand(1:10000), rand(1:10000)) end
    @time for k=1:n dot3(U, V, rand(1:10000), rand(1:10000)) end
    @time for k=1:n dot4(U, V, rand(1:10000), rand(1:10000)) end
end

And here is the result, on my computer:

(p, n) = (25, 1000000)
  0.664159 seconds (7.80 M allocations: 729.302 MB, 8.76% gc time)
  0.369080 seconds (1000.00 k allocations: 320.435 MB, 1.80% gc time)
  0.350421 seconds (8.00 M allocations: 274.658 MB, 3.06% gc time)
  0.227205 seconds
(p, n) = (250, 100000)
  0.200553 seconds (779.60 k allocations: 405.572 MB, 15.58% gc time)
  0.183650 seconds (100.00 k allocations: 198.364 MB, 10.59% gc time)
  0.124294 seconds (800.00 k allocations: 27.466 MB, 2.45% gc time)
  0.092410 seconds
(p, n) = (2500, 10000)
  0.154928 seconds (117.95 k allocations: 383.880 MB, 30.14% gc time)
  0.147903 seconds (20.00 k allocations: 191.345 MB, 7.82% gc time)
  0.039056 seconds (80.00 k allocations: 2.747 MB)
  0.065859 seconds

For large values of p, the observed behaviour is the one I expected: dot with sub is fastest. But for smaller values, it's barely faster than a list comprehension, and the loop with an accumulator is fastest. I don't know if this is the expected behaviour for sub, but it seems like a performance bug. I suppose there must be an overhead associated with sub. Is there a way to index an array in a "C-pointer-like" way, without overhead? That would most certainly fix the issue.

The text was updated successfully, but these errors were encountered:

KristofferC · 2016-04-20T10:56:30Z

What you are looking for is #14955, see also https://groups.google.com/forum/#!msg/julia-users/6Z1kJ5LrSMU/_HcWLsubBAAJ

Also, fyi, this will make your loop version a bit faster (so it is not slower than sub for large p):

function dot5{T}(U::Matrix{T}, V::Matrix{T}, i, j)
    s = zero(T)
    @inbounds @simd for k in 1:size(U,1)
        s += U[k, i] * V[k, j]
    end
    return s
end

timholy · 2016-04-20T11:32:12Z

It's considerably better if you time a function rather than at global scope, see below (requires julia-0.5 to be efficient). But still far from ideal.

This can't be fixed without #12205; basically the issue is that the SubArray type needs to hold a reference to the parent array, and since that's not a bitstype each object gets allocated on the heap rather than the stack.

Modified test code:

dot1(U, V, i, j) = dot(U[:,i], V[:,j])
dot2(U, V, i, j) = sum([U[k,i]*V[k,j] for k=1:size(U,1)])
dot3(U, V, i, j) = dot(sub(U, :, i), sub(V, :, j))

function dot4(U, V, i, j)
    r = 0.0
    for k=1:size(U,1)
        r += U[k,i]*V[k,j]
    end
    r
end

function rundot(f, U, V, n)
    size(U) == size(V) || throw(DimensionMismatch("Matrices have different sizes"))
    rng = 1:size(U, 2)
    s = 0.0
    for k = 1:n
        s += f(U, V, rand(rng), rand(rng))
    end
    s
end

for (p, n) in [(25, 1000000), (250, 100000), (2500, 10000)]
    U, V = randn(p, 10000), randn(p, 10000)

    # Compile functions and check that they work
    @assert dot1(U, V, 1, 1)   dot2(U, V, 1, 1)   dot3(U, V, 1, 1)   dot4(U, V, 1, 1)
    for f in (dot1, dot2, dot3, dot4)
        rundot(f, U, V, 1)
    end

    println("(p, n) = ($p, $n)")
    @time rundot(dot1, U, V, n)
    @time rundot(dot2, U, V, n)
    @time rundot(dot3, U, V, n)
    @time rundot(dot4, U, V, n)
end

Results:

(p, n) = (25, 1000000)
  0.606342 seconds (5.90 M allocations: 700.342 MB, 14.30% gc time)
  0.315316 seconds (1.00 M allocations: 320.435 MB, 6.49% gc time)
  0.379915 seconds (4.00 M allocations: 152.588 MB, 4.69% gc time)
  0.182975 seconds (1 allocation: 16 bytes)
(p, n) = (250, 100000)
  0.543606 seconds (589.69 k allocations: 420.985 MB, 34.70% gc time)
  0.269937 seconds (100.00 k allocations: 207.520 MB, 12.57% gc time)
  0.078260 seconds (400.00 k allocations: 15.259 MB, 3.41% gc time)
  0.083623 seconds (1 allocation: 16 bytes)
(p, n) = (2500, 10000)
  0.247148 seconds (78.97 k allocations: 384.201 MB, 53.79% gc time)
  0.076361 seconds (20.00 k allocations: 191.803 MB, 8.75% gc time)
  0.032036 seconds (40.00 k allocations: 1.526 MB)
  0.055625 seconds (1 allocation: 16 bytes)

timholy · 2016-04-20T11:32:48Z

Since there's nothing to do and this will be "automatically" fixed by #12205, I'm closing this.

timholy closed this as completed Apr 20, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dot product with sub() slow for small ranges #15961

Dot product with sub() slow for small ranges #15961

xlambein commented Apr 20, 2016

KristofferC commented Apr 20, 2016 •

edited

Loading

timholy commented Apr 20, 2016

timholy commented Apr 20, 2016

Dot product with sub() slow for small ranges #15961

Dot product with sub() slow for small ranges #15961

Comments

xlambein commented Apr 20, 2016

KristofferC commented Apr 20, 2016 • edited Loading

timholy commented Apr 20, 2016

timholy commented Apr 20, 2016

KristofferC commented Apr 20, 2016 •

edited

Loading