
Multiplication of sparse matrices with diagonal matrices is slow (fix suggested) #28934

Closed
Pbellive opened this issue Aug 28, 2018 · 2 comments · Fixed by #29045
Comments

@Pbellive
Contributor

Hi all,
It's great to see all the work that's gone into the linear algebra support for 1.0! I've recently noticed one small area where I think significant performance is being left on the table. Namely, multiplication of sparse matrices with diagonal matrices. This was also mentioned last month on discourse. Multiplication between diagonal matrices and sparse matrices is currently defined at:

function (*)(D::Diagonal, A::SparseMatrixCSC)
    T = Base.promote_op(*, eltype(D), eltype(A))
    mul!(LinearAlgebra.copy_oftype(A, T), D, A)
end
function (*)(A::SparseMatrixCSC, D::Diagonal)
    T = Base.promote_op(*, eltype(D), eltype(A))
    mul!(LinearAlgebra.copy_oftype(A, T), A, D)
end

These methods fall back to fairly generic matrix multiplication rather than to any of the specialized methods defined in stdlib/LinearAlgebra/src/diagonal.jl. I've written a couple of potential replacements; the versions for right-multiplying a sparse matrix by a diagonal matrix are benchmarked below. Please let me know if there is a reason something similar hasn't been implemented already. If it's just an oversight, I can prepare a PR to add the optimized methods. The benchmarking was run on a fresh clone of master from this morning.

julia> versioninfo()
Julia Version 1.1.0-DEV.129
Commit 59c7cedfc4 (2018-08-28 15:47 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)

To benchmark, I wrote the following script and saved it to a file bnchmrkDiag.jl:

using LinearAlgebra,SparseArrays,BenchmarkTools

# Optimized matrix-matrix product. The underlying in-place multiplication is non-allocating.
function MTimesDiag(A::SparseMatrixCSC, D::Diagonal)
    T = Base.promote_op(*, eltype(D), eltype(A))
    C = LinearAlgebra.copy_oftype(A, T)
    MTimesDiag!(C, D)
end

function MTimesDiag!(A::SparseMatrixCSC, D::Diagonal)
    n = size(A, 2)
    if length(D.diag) != n
        error("Incompatible sizes")
    end

    for ir = 1:n
        j1 = A.colptr[ir]
        j2 = A.colptr[ir+1] - 1
        dgir = D.diag[ir]
        for ic = j1:j2
            A.nzval[ic] *= dgir
        end
    end
    return A
end

# Alternative implementation using rmul!. It's much better than what you currently get
# with * but doesn't compete with the above method.
function MTimesDiagRmul(A::SparseMatrixCSC, D::Diagonal)
    T = Base.promote_op(*, eltype(D), eltype(A))
    C = LinearAlgebra.copy_oftype(A, T)
    rmul!(C, D)
end

n = 1000;
B = sprandn(n,n,1e-3);
#n = 100000;
#B = sprandn(n,n,5e-4);
D = Diagonal(randn(n));

# Allocating versions
b1 = @benchmarkable $B * $D
b2 = @benchmarkable MTimesDiagRmul($B,$D)
b3 = @benchmarkable MTimesDiag($B,$D)

#in place versions
b4 = @benchmarkable rmul!(C,$D) setup=(C = copy($B))
b5 = @benchmarkable MTimesDiag!(C,$D) setup=(C = copy($B))
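As a quick sanity check (not part of the original script), the in-place diagonal scaling can be verified against the dense product using the stdlib `rmul!`, which performs the same column-by-column scaling that `MTimesDiag!` implements:

```julia
using LinearAlgebra, SparseArrays

A = sprandn(50, 50, 0.1)
D = Diagonal(randn(50))
C = copy(A)
rmul!(C, D)  # in-place right-multiplication by a Diagonal (stdlib)
@assert Matrix(C) ≈ Matrix(A) * Matrix(D)  # agrees with the dense product
```

Scaling the stored nonzeros in place leaves the sparsity structure untouched, which is why the non-allocating versions can reuse `colptr` and `rowval` directly.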

Then, from the REPL, I ran:

julia> include("bnchmrkDiag.jl")
Benchmark(evals=1, seconds=5.0, samples=10000)

julia> run(b1)
BenchmarkTools.Trial: 
  memory estimate:  24.13 KiB
  allocs estimate:  10
  --------------
  minimum time:     750.135 ms (0.00% GC)
  median time:      755.622 ms (0.00% GC)
  mean time:        756.598 ms (0.00% GC)
  maximum time:     763.333 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1

julia> run(b2)
BenchmarkTools.Trial: 
  memory estimate:  120.09 KiB
  allocs estimate:  30
  --------------
  minimum time:     36.090 μs (0.00% GC)
  median time:      38.475 μs (0.00% GC)
  mean time:        50.252 μs (17.26% GC)
  maximum time:     41.074 ms (99.82% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(b3)
BenchmarkTools.Trial: 
  memory estimate:  23.80 KiB
  allocs estimate:  4
  --------------
  minimum time:     3.568 μs (0.00% GC)
  median time:      4.060 μs (0.00% GC)
  mean time:        10.303 μs (55.05% GC)
  maximum time:     41.750 ms (99.95% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(b4)
BenchmarkTools.Trial: 
  memory estimate:  96.30 KiB
  allocs estimate:  26
  --------------
  minimum time:     34.150 μs (0.00% GC)
  median time:      36.624 μs (0.00% GC)
  mean time:        45.483 μs (16.58% GC)
  maximum time:     41.385 ms (99.83% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(b5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.841 μs (0.00% GC)
  median time:      2.016 μs (0.00% GC)
  mean time:        2.137 μs (0.00% GC)
  maximum time:     21.772 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

As you can see, the simple B * D is incredibly slow compared to the others. As you would expect, the differences become more significant for larger matrices. I re-ran the benchmarks with

n = 100000;
B = sprandn(n,n,5e-4);

For reference, that gave nnz(B)/n = 50.00124. I got impatient and killed the B * D benchmark after about a minute. The results for the other methods were

julia> run(b2)
BenchmarkTools.Trial: 
  memory estimate:  160.93 MiB
  allocs estimate:  48
  --------------
  minimum time:     90.440 ms (15.96% GC)
  median time:      123.624 ms (38.25% GC)
  mean time:        110.487 ms (29.57% GC)
  maximum time:     139.199 ms (41.40% GC)
  --------------
  samples:          46
  evals/sample:     1

julia> run(b3)
BenchmarkTools.Trial: 
  memory estimate:  77.06 MiB
  allocs estimate:  7
  --------------
  minimum time:     36.896 ms (10.34% GC)
  median time:      37.501 ms (10.50% GC)
  mean time:        38.607 ms (12.12% GC)
  maximum time:     89.270 ms (62.05% GC)
  --------------
  samples:          130
  evals/sample:     1

julia> run(b4)
BenchmarkTools.Trial: 
  memory estimate:  83.88 MiB
  allocs estimate:  41
  --------------
  minimum time:     47.182 ms (1.35% GC)
  median time:      52.857 ms (9.70% GC)
  mean time:        53.374 ms (11.09% GC)
  maximum time:     88.956 ms (46.19% GC)
  --------------
  samples:          46
  evals/sample:     1

julia> run(b5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.496 ms (0.00% GC)
  median time:      4.910 ms (0.00% GC)
  mean time:        4.945 ms (0.00% GC)
  maximum time:     6.768 ms (0.00% GC)
  --------------
  samples:          128
  evals/sample:     1

You see about the same thing for left multiplication by a diagonal matrix.

Thanks, Patrick

@Pbellive Pbellive changed the title Multiplication of sparse matrices with diagonal matrices Multiplication of sparse matrices with diagonal matrices is slow (fix suggested) Aug 28, 2018
@GiggleLiu
Contributor

This bug is caused by incorrect method dispatch. The specialized method is declared as

function mul!(C::SparseMatrixCSC, A::SparseMatrixCSC, D::Diagonal{<:Vector})

but there is no Diagonal{<:Vector} type in play here; it should be Diagonal{T, <:Vector} where T.
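To illustrate the mismatch (a minimal check, not from the original comment): in `Diagonal{<:Vector}` the `<:Vector` constrains the *element* type, so an ordinary diagonal matrix of floats never matches the declared signature:

```julia
using LinearAlgebra

D = Diagonal(randn(3))                   # Diagonal{Float64, Vector{Float64}}
@assert !(D isa Diagonal{<:Vector})      # element type Float64 is not a Vector
@assert D isa Diagonal{<:Any, <:Vector}  # the intended constraint: Vector storage
```

Because the specialized method never matches, dispatch falls through to the generic `mul!` path seen in the profile below.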

Here are the first several lines of the profile:

1061 ./task.jl:259; (::getfield(REPL, Symbol("##28#29")){REP...
 1061 ...lib/v1.0/REPL/src/REPL.jl:117; macro expansion
  1061 ...lib/v1.0/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
   1061 ./boot.jl:319; eval(::Module, ::Any)
    1061 ....0/Profile/src/Profile.jl:25; top-level scope
     1061 ...arseArrays/src/linalg.jl:135; *(::SparseMatrixCSC{Float64,Int64}, :...
      1061 ...arAlgebra/src/matmul.jl:172; mul!
       1061 ...arAlgebra/src/matmul.jl:578; generic_matmatmul!(::SparseMatrixCSC...
        2   ...arAlgebra/src/matmul.jl:0; _generic_matmatmul!(::SparseMatrixCS...
        119 ...arAlgebra/src/matmul.jl:638; _generic_matmatmul!(::SparseMatrixCS...
         119 ...arAlgebra/src/matmul.jl:480; copy_transpose!
          1  ./sort.jl:0; copy_transpose!(::Array{Float64,2},...
          9  ./sort.jl:177; copy_transpose!(::Array{Float64,2},...
          1  ...gebra/src/transpose.jl:183; copy_transpose!(::Array{Float64,2},...
          4  ...gebra/src/transpose.jl:196; copy_transpose!(::Array{Float64,2},...
          84 ...gebra/src/transpose.jl:197; copy_transpose!(::Array{Float64,2},...

As you can see, the call is not dispatched to the specialized method, as @Pbellive mentioned.

@Pbellive
Contributor Author

Pbellive commented Sep 4, 2018

Thanks @GiggleLiu. I missed that the correct implementation already existed in linalg.jl; we just need to change the type signatures to fix the bug. Nice and easy. Preparing a PR now.
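A hedged sketch of what the corrected method might look like once the `Diagonal` parameterization is fixed. `rdiag_mul!` is an illustrative name, not the actual stdlib function; the real fix in linalg.jl amounts to the same signature change applied to the existing `mul!` methods:

```julia
using LinearAlgebra, SparseArrays

# Sketch only: a specialized sparse-times-diagonal kernel whose signature now
# actually matches Diagonal{Float64, Vector{Float64}} and friends.
function rdiag_mul!(C::SparseMatrixCSC, A::SparseMatrixCSC, D::Diagonal{<:Any, <:Vector})
    size(A, 2) == length(D.diag) || error("Incompatible sizes")
    d = D.diag
    @inbounds for j in 1:size(A, 2)
        dj = d[j]
        for k in A.colptr[j]:(A.colptr[j+1] - 1)
            C.nzval[k] = A.nzval[k] * dj  # scale column j of A by D[j, j]
        end
    end
    return C
end

A = sprandn(100, 100, 0.05)
D = Diagonal(randn(100))
C = copy(A)  # C shares A's sparsity structure, so only nzval needs updating
rdiag_mul!(C, A, D)
```

With a matching signature, `A * D` dispatches to this O(nnz) column-scaling loop instead of the generic dense `_generic_matmatmul!` path seen in the profile.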
