
Asymmetric speed of in-place sparse*dense matrix product #29956

Closed
carstenbauer opened this issue Nov 8, 2018 · 11 comments
Labels
performance Must go faster sparse Sparse arrays

Comments

@carstenbauer
Member

carstenbauer commented Nov 8, 2018

First reported here.

julia> using BenchmarkTools, SparseArrays, LinearAlgebra

julia> A = sprand(100,100,0.01);

julia> B = rand(100,100);

julia> C = A*B;

julia> @btime $C = $A*$B;
  18.550 μs (2 allocations: 78.20 KiB)

julia> @btime $C = $B*$A; # other way around
  20.097 μs (2 allocations: 78.20 KiB)

julia> @btime mul!($C,$A,$B);
  14.531 μs (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  825.506 μs (6 allocations: 336 bytes)

This performance asymmetry, which already existed on 0.6, is due not (just) to the CSC format of sparse matrices, but to the fact that mul! falls back to the most generic method in LinearAlgebra:

julia> @which mul!(C, A, B)
[...] SparseArrays\src\linalg.jl:110

julia> @which mul!(C, B, A)
[...] LinearAlgebra\src\matmul.jl:172

The simplest fix is to copy the Base.:* method and adjust it into a mul! version:

import LinearAlgebra.mul!
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row=1:mX, col = 1:A.n, k=A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end

which gives

julia> @btime mul!($C,$A,$B);
  16.077 μs (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  18.241 μs (0 allocations: 0 bytes)

Note that PR #24045 might fix this (I haven't looked into it in detail). However, since that PR has been lying around for over a year now, maybe we should add an intermediate fix?

@JimMajor

Actually, even more speedup can be achieved for dense*sparse by arranging the loops properly.

Using the same setup:

julia> using BenchmarkTools, SparseArrays, LinearAlgebra

julia> A = sprand(100,100,0.01);

julia> B = rand(100,100);

julia> C = A*B;

The function proposed above:

import LinearAlgebra.mul!
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row=1:mX, col = 1:A.n, k=A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end

gives:

julia> @btime mul!($C,$B,$A);
  21.778 μs (0 allocations: 0 bytes)

while:

function mul_alt!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        ki = rowval[k]
        kv = nzval[k]
        # Innermost loop walks C and X down a column (column-major friendly).
        for multivec_row = 1:mX
            C[multivec_row, col] += X[multivec_row, ki] * kv
        end
    end
    C
end

gives:

julia> @btime mul_alt!($C,$B,$A);
  4.624 μs (0 allocations: 0 bytes)
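The gain comes from memory-access order: Julia arrays are column-major, so the innermost loop should run down a column, as mul_alt! does. The same effect can be demonstrated on a plain dense array (a minimal sketch; the helper names sum_by_columns and sum_by_rows are ours, not from the thread):

```julia
# Column-major traversal: the inner loop moves down a column,
# touching memory contiguously.
function sum_by_columns(M::AbstractMatrix)
    s = zero(eltype(M))
    @inbounds for j in axes(M, 2), i in axes(M, 1)
        s += M[i, j]
    end
    s
end

# Row-major traversal: the inner loop strides across a row,
# jumping by the column length on every access.
function sum_by_rows(M::AbstractMatrix)
    s = zero(eltype(M))
    @inbounds for i in axes(M, 1), j in axes(M, 2)
        s += M[i, j]
    end
    s
end

M = rand(2000, 2000)
sum_by_columns(M) ≈ sum_by_rows(M)  # same result, different speed
```

Benchmarking the two with @btime on a large matrix shows the column-wise version running noticeably faster, for the same reason the reordered dense*sparse kernel beats the original loop order.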

@carstenbauer
Member Author

@Sacha0 Do you think (in the light of your PR #24045) that it is worth creating a PR for this?

@Sacha0
Member

Sacha0 commented Nov 22, 2018


Hi Carsten! Great meeting you this past summer at JuliaCon! :)

Do you think (in the light of your PR #24045) that it is worth creating a PR for this?

Apologies, I'm not sure I understand the question --- are you implicitly asking about the timescale on which #24045 might rise from the grave? :)

@carstenbauer
Member Author

Hi Carsten! Great meeting you this past summer at JuliaCon! :)

Ditto!

[...] are you implicitly asking about the timescale on which #24045 might rise from the grave? :)

Basically, yes :) As you can see in my first post, the current performance is pretty bad, and the suggested fix is simple and effective (~200x speedup for this example). If your PR will supersede this any time soon, though, it might not be worth the effort; that's why I'm asking.

It'd be great to see a revival of #24045, of course :)

@tkf
Member

tkf commented Nov 22, 2018

@crstnbr There is a discussion (JuliaLang/LinearAlgebra.jl#473) about adding an API for "combined multiply-add" C = αAB + βC, and I wrote PR #29634 for it. If you are going to write a new method (which, BTW, sounds like a great addition!), it would be nice if it used this API. SparseArrays already has this API via 5-argument mul!, so I think you don't need to wait for #29634 to implement it.
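The loop-reordered dense*sparse kernel from earlier in the thread could be adapted to that 5-argument C = αXA + βC convention roughly as follows (a sketch only; the name mul_alt5! is ours, and the α/β handling is an assumption about the convention, not code from any PR):

```julia
using SparseArrays, LinearAlgebra

# Sketch: loop-reordered dense*sparse kernel in the 5-argument
# multiply-add style, computing C = alpha*X*A + beta*C in place.
function mul_alt5!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC,
                   alpha::Number, beta::Number)
    mX, nX = size(X)
    nX == size(A, 1) || throw(DimensionMismatch())
    # Scale (or clear) the existing contents of C by beta.
    if beta != 1
        beta == 0 ? fill!(C, zero(eltype(C))) : rmul!(C, beta)
    end
    rv = rowvals(A)
    nz = nonzeros(A)
    @inbounds for col = 1:size(A, 2), k in nzrange(A, col)
        ki = rv[k]
        kv = nz[k] * alpha
        # Innermost loop walks C and X down a column (column-major friendly).
        for row = 1:mX
            C[row, col] += X[row, ki] * kv
        end
    end
    C
end
```

With alpha = 1 and beta = 0 this reproduces the plain mul_alt! behavior, while other coefficients give the fused multiply-add without extra allocations.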

@Sacha0
Member

Sacha0 commented Nov 24, 2018

Basically, yes :) [...] It'd be great to see a revival of #24045

#24045 has seen enough buzz recently that I might prioritize resurrecting it over other things in the next few weeks. That said, I can't make any promises what with other obligations :). Best!

@Sacha0
Member

Sacha0 commented Dec 30, 2018

(A little update: I ended up working through the holiday, so likely it'll be a few more weeks. Best! S)

@mbauman mbauman added performance Must go faster sparse Sparse arrays labels Dec 31, 2018
@carstenbauer
Member Author

Hey @Sacha0. Any progress on this? If #24045 takes much longer, maybe we should fix the issue here temporarily as suggested above? It's kind of sad that this performance asymmetry still exists in 1.2/1.3.

@Sacha0
Member

Sacha0 commented Aug 20, 2019

Hey @crstnbr! :) Regrettably reality has sunk in: I will not have any bandwidth to work on things Julia for the foreseeable future. That being the case, it would be fantastic to see someone else push #24045 over the line. Formerly all that was necessary was transforming method signatures to use Adjoint/Transpose. Perhaps @KristofferC would still be up for the task? :) Best!

@goggle
Contributor

goggle commented Sep 13, 2019

Has the original issue been solved?
Redoing the benchmarks (on the current master) gives me:

julia> @btime $C = $A*$B;
  22.556 μs (2 allocations: 78.20 KiB)

julia> @btime $C = $B*$A;
  29.979 μs (2 allocations: 78.20 KiB)

julia> @btime mul!($C,$A,$B);
  19.144 μs (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  21.595 μs (0 allocations: 0 bytes)

@dkarrasch
Member

Has the original issue been solved?

Yes, multiplication no longer falls back to a generic matmul method; it now dispatches to

function mul!(C::StridedVecOrMat, X::AdjOrTransStridedOrTriangularMatrix, adjA::Adjoint{<:Any,<:AbstractSparseMatrixCSC}, α::Number, β::Number)
