
[WIP] improved dense-sparse and sparse-dense matrix multiplication kernels #24045

Closed
wants to merge 2 commits

Conversation

@Sacha0 (Member) commented Oct 7, 2017

This pull request provides improved dense-sparse matrix multiplication kernels. Equivalent kernels for sparse-dense matrix multiplication to follow. (Edit: Equivalent kernels for sparse-dense matrix multiplication have landed.) Further optimization of the trickier cases in this pull request (see comments) also to follow.

Performance in a machine learning application from @pevnak motivated this pull request (see slack/#linalg). Specifically, IIRC TensorFlow outperforms Flux on master on his problem, whereas with these kernels (and the forthcoming sparse-dense equivalents) the tables turn.

Comprehensive benchmarks forthcoming (via BaseBenchmarks PR JuliaCI/BaseBenchmarks.jl#128).

Best!
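
The dense-sparse case typically follows the usual column-oriented strategy for multiplying a dense matrix by a CSC sparse matrix: for each stored entry B[k, j], accumulate B[k, j] * A[:, k] into C[:, j], so the innermost loop streams over contiguous dense columns. Below is a minimal sketch under that assumption (illustrative only; `densesparse_mul!` is a hypothetical name, not the method added by this PR):

```julia
using SparseArrays

# Sketch of a column-oriented dense * sparse-CSC product: C = A * B,
# with A dense (m x k), B sparse CSC (k x n), C dense (m x n).
function densesparse_mul!(C::Matrix{T}, A::Matrix{T}, B::SparseMatrixCSC{T}) where {T}
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("A and B have incompatible sizes"))
    size(C) == (size(A, 1), size(B, 2)) || throw(DimensionMismatch("C has the wrong size"))
    fill!(C, zero(T))
    rows = rowvals(B)
    vals = nonzeros(B)
    for j in 1:size(B, 2)              # column of B (and of C)
        for p in nzrange(B, j)         # stored entries in column j of B
            k = rows[p]                # B[k, j] is a stored entry
            v = vals[p]
            @inbounds for i in 1:size(A, 1)
                C[i, j] += A[i, k] * v # axpy along a contiguous column of A
            end
        end
    end
    return C
end
```

As a quick sanity check, `densesparse_mul!(zeros(size(A, 1), size(B, 2)), A, B)` should agree with `A * B` up to floating-point rounding.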

@Sacha0 added the "linear algebra", "potential benchmark", "performance", and "sparse" labels on Oct 7, 2017
@Sacha0 force-pushed the dsmult branch 3 times, most recently from cf44d90 to 302917c on October 8, 2017 at 19:24
@Sacha0 changed the title from "[WIP] improved dense-sparse matrix multiplication kernels" to "[WIP] improved dense-sparse and sparse-dense matrix multiplication kernels" on Oct 8, 2017
@Sacha0 (Member Author) commented Oct 8, 2017

Equivalent kernels for sparse-dense matrix multiplication are now live in the second commit. I have a third commit that consolidates the dense-sparse and sparse-dense methods, but would like to run nanosoldier before and after pushing that commit to check that the consolidation does not impact performance.

@ararslan (Member)

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator)

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @ararslan

@Sacha0 (Member Author) commented Oct 10, 2017

From nanosoldier's logs,

ERROR: LoadError: MethodError: no method matching h5d_write(::Int32, ::Int32, ::Base.ReinterpretArray{UInt8,1,HDF5.HDF5ReferenceObj,Array{HDF5.HDF5ReferenceObj,1}})
Closest candidates are:
  h5d_write(::Any, ::Any, ::Any, !Matched::Any, !Matched::Any, !Matched::Any) at /home/nanosoldier/.julia/v0.7/HDF5/src/HDF5.jl:2053
  h5d_write(::Int32, ::Int32, !Matched::String) at /home/nanosoldier/.julia/v0.7/HDF5/src/HDF5.jl:1922
  h5d_write(::Int32, ::Int32, !Matched::Array{S<:String,N} where N) where S<:String at /home/nanosoldier/.julia/v0.7/HDF5/src/HDF5.jl:1929
  ...
Stacktrace:
 [1] writearray(::HDF5.HDF5Dataset, ::Int32, ::Base.ReinterpretArray{UInt8,1,HDF5.HDF5ReferenceObj,Array{HDF5.HDF5ReferenceObj,1}}) at /home/nanosoldier/.julia/v0.7/HDF5/src/HDF5.jl:1807
 [2] #_write#17(::Array{Any,1}, ::Function, ::JLD.JldFile, ::String, ::Array{Any,1}, ::JLD.JldWriteSession) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:574
 [3] _write(::JLD.JldFile, ::String, ::Array{Any,1}, ::JLD.JldWriteSession) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:559
 [4] #write#14(::Array{Any,1}, ::Function, ::JLD.JldFile, ::String, ::Any, ::JLD.JldWriteSession) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:512
 [5] #jldopen#9(::Bool, ::Bool, ::Bool, ::Function, ::String, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:178
 [6] (::getfield(JLD, Symbol("#kw##jldopen")))(::Array{Any,1}, ::typeof(JLD.jldopen), ::String, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./<missing>:0
 [7] #jldopen#10(::Bool, ::Bool, ::Bool, ::Function, ::String, ::String) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:231
 [8] (::getfield(JLD, Symbol("#kw##jldopen")))(::Array{Any,1}, ::typeof(JLD.jldopen), ::String, ::String) at ./<missing>:0
 [9] #jldopen#11(::Array{Any,1}, ::Function, ::getfield(JLD, Symbol("##34#35")){String,Dict{String,String},Tuple{String,BenchmarkTools.BenchmarkGroup}}, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:241
 [10] (::getfield(JLD, Symbol("#kw##jldopen")))(::Array{Any,1}, ::typeof(JLD.jldopen), ::Function, ::String, ::String) at ./<missing>:0
 [11] #save#33(::Bool, ::Bool, ::Function, ::FileIO.File{FileIO.DataFormat{:JLD}}, ::String, ::Any, ::Any, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:1220
 [12] save(::FileIO.File{FileIO.DataFormat{:JLD}}, ::String, ::Dict{String,String}, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:1217
 [13] #save#14(::Array{Any,1}, ::Function, ::String, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/FileIO/src/loadsave.jl:61
 [14] save(::String, ::String, ::Dict{String,String}, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/FileIO/src/loadsave.jl:61
 [15] save(::String, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/BenchmarkTools/src/serialization.jl:36
 [16] include_relative(::Module, ::String) at ./loading.jl:533
 [17] include(::Module, ::String) at ./sysimg.jl:14
 [18] process_options(::Base.JLOptions) at ./client.jl:325
 [19] _start() at ./client.jl:391
in expression starting at /home/nanosoldier/workdir/tmpra50k1/benchscript.jl:28

which looks like #23750 (comment)? (Update: The JLD issue appears to be resolved by JuliaIO/JLD.jl#192.)

@ararslan (Member)

Yep, that's a JLD issue that just got fixed. Let's try this again.

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@Sacha0 (Member Author) commented Oct 13, 2017

@nanosoldier runbenchmarks("sparse" && "matmul", vs=":master")

@Sacha0 (Member Author) commented Oct 13, 2017

Nanosoldier reported meaningful regressions only in the kernel for A(t|c)_mul_B[!]([dense,] sparse, dense). Local testing indicated that an added @simd annotation, replacement of a z += x*y with a z = muladd(x, y, z), and the type assertion in CjAjB::eltype(C) = zero(eltype(C)) each negatively impacted performance, so I've removed each. If these changes fix the A(t|c)_mul_B[!]([dense,] sparse, dense) regressions per nanosoldier, I plan to merge this pull request and defer consolidation and additional improvements to later pull requests. Best!
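
For concreteness, the inner-loop variants being compared here look roughly like the following (a hedged illustration; the function names and loop shape are illustrative, not the PR's actual kernel):

```julia
# Plain accumulation, the form kept after the local testing described above.
function accum_plain!(C, A, v, k, j)
    @inbounds for i in axes(A, 1)
        C[i, j] += A[i, k] * v
    end
    return C
end

# The @simd + muladd form that local testing suggested was slower here and was removed.
function accum_muladd!(C, A, v, k, j)
    @inbounds @simd for i in axes(A, 1)
        C[i, j] = muladd(A[i, k], v, C[i, j])
    end
    return C
end
```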

@nanosoldier (Collaborator)

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: stored type BenchmarkTools.ParametersPreV006 does not match currently loaded type

Logs and partial data can be found here
cc @ararslan

@andreasnoack (Member)

Fixes JuliaLang/LinearAlgebra.jl#443

@KristofferC (Member)

@nanosoldier runbenchmarks("sparse" && "matmul", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@ararslan (Member) commented Nov 8, 2017

The sparse matmul benchmarks have been pretty noisy on master, so it's kind of hard to determine how much is related and how much is noise.

@ViralBShah (Member) commented Mar 4, 2018

Should we try to bring back these improvements?

@Sacha0 (Member Author) commented Mar 4, 2018

I'd love to. Regrettably I won't have time to see this through for the foreseeable future. Best!

@KristofferC (Member) commented Apr 10, 2018

Is there anything "big" that is left to do here @Sacha0? Getting these in would be very nice, so if it is just a matter of rebasing (A_mul_B! -> mul! etc.), checking Nanosoldier, and doing some small performance tweaks, I could look at it.
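
For reference, the rename mentioned above maps the old three-letter methods onto mul! with lazy Adjoint/Transpose wrappers; a small sketch with illustrative sizes:

```julia
using LinearAlgebra, SparseArrays

A = sprand(100, 100, 0.05)   # sparse operand
B = rand(100, 100)           # dense operand
C = zeros(100, 100)

mul!(C, A, B)                # was A_mul_B!(C, A, B)
mul!(C, transpose(A), B)     # was At_mul_B!(C, A, B)
mul!(C, adjoint(A), B)       # was Ac_mul_B!(C, A, B)
```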

@Sacha0 (Member Author) commented Apr 10, 2018

> Is there anything "big" that is left to do here @Sacha0? Getting these in would be very nice, so if it is just a matter of rebasing (A_mul_B! -> mul! etc.), checking Nanosoldier, and doing some small performance tweaks, I could look at it.

[Insert hazy memory disclaimer here.] Much thanks for looking into resuscitating this pull request! :) Rebasing through the lazy adjoint/transpose change and associated cleanup should make this mergeworthy again. Of course there remains potential downstream work, but none of it pressing. (If this pull request nonetheless remains open then, I hope to polish it off after submitting my dissertation and figuring out immediate next steps.) Best!

@jebej (Contributor) commented May 9, 2018

It would be amazing to have this land :)

@carstenbauer (Member) commented Nov 6, 2018

This would fix https://discourse.julialang.org/t/asymmetric-speed-of-in-place-sparse-dense-matrix-product/10256/4 I guess.

@mschauer (Contributor)

Currently, B*A, A*B, and B'*A create dense matrices and perform reasonably, though not perfectly. Here A is full and B is sparse.

A*B' freezes for some time while it creates a full sparse matrix, which is likely unwanted. I just note that this PR would fix this too (I think this problem is less subtle than the other performance issues mentioned).
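
A small script to reproduce the behavior described above (sizes and density are illustrative):

```julia
using LinearAlgebra, SparseArrays

A = rand(2000, 2000)            # full (dense)
B = sprand(2000, 2000, 0.01)    # sparse

@time B * A     # dense result, reasonable
@time A * B     # dense result, reasonable
@time B' * A    # dense result, reasonable
@time A * B'    # the slow case described above
```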

@ViralBShah (Member)

@Sacha0 Would it be possible for you to revive this PR?

@Sacha0 (Member Author) commented Dec 17, 2018

> @Sacha0 Would it be possible for you to revive this PR?

That's the hope! :)

@ViralBShah (Member)

Thank you. :-)

@vtjnash (Member) commented Apr 13, 2021

IIUC, this was revived in #38876, but we might still need the tests from here, as it seems that was merged without tests? @Sacha0, @dkarrasch, or @ViralBShah, would you be able to confirm and close or update this, as appropriate?

@dkarrasch (Member)

We are covering all but one case according to https://codecov.io/gh/JuliaLang/julia/src/master/stdlib/SparseArrays/src/linalg.jl. I'll add one as soon as I get to it. Other than that, this could be closed. The remaining aspect of this PR was to (silently) materialize certain adjoints/transposes, but in JuliaLang/LinearAlgebra.jl#822 I found that this can be advantageous in some cases and not in others, so it doesn't look like a good general solution, and materializing should be left to the user.
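
A hedged illustration of leaving materialization to the user: rather than the library copying behind the scenes, the caller opts in explicitly when it pays off (sizes are illustrative):

```julia
using LinearAlgebra, SparseArrays

A = rand(1000, 1000)
B = sprand(1000, 1000, 0.01)

y1 = A * B'          # lazy adjoint: dispatches to the kernel for the wrapped sparse matrix
y2 = A * copy(B')    # explicit materialization: builds the transposed sparse matrix first
```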

@vtjnash closed this Apr 13, 2021