
Asymmetric speed of in-place sparse*dense matrix product #29956

Closed
carstenbauer opened this issue Nov 8, 2018 · 11 comments
Labels
performance Must go faster sparse Sparse arrays

Comments

@carstenbauer
Member

carstenbauer commented Nov 8, 2018

First reported here.

julia> using BenchmarkTools, SparseArrays, LinearAlgebra

julia> A = sprand(100,100,0.01);

julia> B = rand(100,100);

julia> C = A*B;

julia> @btime $C = $A*$B;
  18.550 μs (2 allocations: 78.20 KiB)

julia> @btime $C = $B*$A; # other way around
  20.097 μs (2 allocations: 78.20 KiB)

julia> @btime mul!($C,$A,$B);
  14.531 μs (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  825.506 μs (6 allocations: 336 bytes)

This performance asymmetry, which already existed on 0.6, is due not (just) to the CSC format of sparse matrices, but to the fact that mul! falls back to the most generic method in LinearAlgebra:

julia> @which mul!(C, A, B)
[...] SparseArrays\src\linalg.jl:110

julia> @which mul!(C, B, A)
[...] LinearAlgebra\src\matmul.jl:172

The simplest fix is to copy the Base.:* method and adjust it into a mul! version:

import LinearAlgebra.mul!
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row=1:mX, col = 1:A.n, k=A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end

which gives

julia> @btime mul!($C,$A,$B);
  16.077 μs (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  18.241 μs (0 allocations: 0 bytes)

Note that PR #24045 might fix this (I haven't looked into it in detail). However, since that PR has been lying around for over a year now, maybe we should add an intermediate fix?

@JimMajor

Actually, even more speedup can be achieved for dense*sparse by arranging the loops properly.

Using the same setup:

julia> using BenchmarkTools, SparseArrays, LinearAlgebra

julia> A = sprand(100,100,0.01);

julia> B = rand(100,100);

julia> C = A*B;

The function proposed above:

import LinearAlgebra.mul!
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row=1:mX, col = 1:A.n, k=A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end

gives:

julia> @btime mul!($C,$B,$A);
  21.778 μs (0 allocations: 0 bytes)

while:

function mul_alt!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        ki = rowval[k]
        kv = nzval[k]
        # Innermost loop walks C and X down a column (column-major friendly).
        for multivec_row = 1:mX
            C[multivec_row, col] += X[multivec_row, ki] * kv
        end
    end
    C
end

gives:

julia> @btime mul_alt!($C,$B,$A);
  4.624 μs (0 allocations: 0 bytes)
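The gain comes from memory-access order: Julia arrays are column-major, so the innermost loop should run down a column, as mul_alt! does. The same effect can be demonstrated on a plain dense array (a minimal sketch; the helper names sum_by_columns and sum_by_rows are ours, not from the thread):

```julia
# Column-major traversal: the inner loop moves down a column,
# touching memory contiguously.
function sum_by_columns(M::AbstractMatrix)
    s = zero(eltype(M))
    @inbounds for j in axes(M, 2), i in axes(M, 1)
        s += M[i, j]
    end
    s
end

# Row-major traversal: the inner loop strides across a row,
# jumping by the column length on every access.
function sum_by_rows(M::AbstractMatrix)
    s = zero(eltype(M))
    @inbounds for i in axes(M, 1), j in axes(M, 2)
        s += M[i, j]
    end
    s
end

M = rand(2000, 2000)
sum_by_columns(M) ≈ sum_by_rows(M)  # same result, different speed
```

Benchmarking the two with @btime on a large matrix shows the column-wise version running noticeably faster, for the same reason the reordered dense*sparse kernel beats the original loop order.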

@carstenbauer
Member Author

@Sacha0 Do you think (in the light of your PR #24045) that it is worth creating a PR for this?

@Sacha0
Member

Sacha0 commented Nov 22, 2018


Hi Carsten! Great meeting you this past summer at JuliaCon! :)

Do you think (in the light of your PR #24045) that it is worth creating a PR for this?

Apologies, I'm not sure I understand the question --- are you implicitly asking about the timescale on which #24045 might rise from the grave? :)

@carstenbauer
Member Author

Hi Carsten! Great meeting you this past summer at JuliaCon! :)

Ditto!

[...] are you implicitly asking about the timescale on which #24045 might rise from the grave? :)

Basically, yes :) As you can see in my first post, the current performance is pretty bad, and the suggested fix is simple and effective (~200x speedup for this example). If your PR will supersede this any time soon, though, it might not be worth the effort; that's why I'm asking.

It'd be great to see a revival of #24045, of course :)

@tkf
Member

tkf commented Nov 22, 2018

@crstnbr There is a discussion (JuliaLang/LinearAlgebra.jl#473) about adding an API for "combined multiply-add" C = αAB + βC, and I wrote PR #29634 for it. If you are going to write a new method (which, BTW, sounds like a great addition!), it would be nice if it used this API. SparseArrays already has this API via 5-argument mul!, so I think you don't need to wait for #29634 to implement it.
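The loop-reordered dense*sparse kernel from earlier in the thread could be adapted to that 5-argument C = αXA + βC convention roughly as follows (a sketch only; the name mul_alt5! is ours, and the α/β handling is an assumption about the convention, not code from any PR):

```julia
using SparseArrays, LinearAlgebra

# Sketch: loop-reordered dense*sparse kernel in the 5-argument
# multiply-add style, computing C = alpha*X*A + beta*C in place.
function mul_alt5!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC,
                   alpha::Number, beta::Number)
    mX, nX = size(X)
    nX == size(A, 1) || throw(DimensionMismatch())
    # Scale (or clear) the existing contents of C by beta.
    if beta != 1
        beta == 0 ? fill!(C, zero(eltype(C))) : rmul!(C, beta)
    end
    rv = rowvals(A)
    nz = nonzeros(A)
    @inbounds for col = 1:size(A, 2), k in nzrange(A, col)
        ki = rv[k]
        kv = nz[k] * alpha
        # Innermost loop walks C and X down a column (column-major friendly).
        for row = 1:mX
            C[row, col] += X[row, ki] * kv
        end
    end
    C
end
```

With alpha = 1 and beta = 0 this reproduces the plain mul_alt! behavior, while other coefficients give the fused multiply-add without extra allocations.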

@Sacha0
Member

Sacha0 commented Nov 24, 2018

Basically, yes :) [...] It'd be great to see a revival of #24045

#24045 has seen enough buzz recently that I might prioritize resurrecting it over other things in the next few weeks. That said, I can't make any promises what with other obligations :). Best!

@Sacha0
Member

Sacha0 commented Dec 30, 2018

(A little update: I ended up working through the holiday, so likely it'll be a few more weeks. Best! S)

@mbauman mbauman added performance Must go faster sparse Sparse arrays labels Dec 31, 2018
@carstenbauer
Member Author

Hey @Sacha0. Any progress on this? If #24045 takes much longer, maybe we should fix the issue here temporarily as suggested above? It's kind of sad that this performance asymmetry still exists in 1.2/1.3.

@Sacha0
Member

Sacha0 commented Aug 20, 2019

Hey @crstnbr! :) Regrettably reality has sunk in: I will not have any bandwidth to work on things Julia for the foreseeable future. That being the case, it would be fantastic to see someone else push #24045 over the line. Formerly all that was necessary was transforming method signatures to use Adjoint/Transpose. Perhaps @KristofferC would still be up for the task? :) Best!

@goggle
Contributor

goggle commented Sep 13, 2019

Has the original issue been solved?
Redoing the benchmarks (on the current master) gives me:

julia> @btime $C = $A*$B;
  22.556 μs (2 allocations: 78.20 KiB)

julia> @btime $C = $B*$A;
  29.979 μs (2 allocations: 78.20 KiB)

julia> @btime mul!($C,$A,$B);
  19.144 μs (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  21.595 μs (0 allocations: 0 bytes)

@dkarrasch
Member

Has the original issue been solved?

Yes, multiplication no longer falls back to a generic matmul method; it now dispatches to

function mul!(C::StridedVecOrMat, X::AdjOrTransStridedOrTriangularMatrix, adjA::Adjoint{<:Any,<:AbstractSparseMatrixCSC}, α::Number, β::Number)
