
Specialisation of mul! for left multiplication with scalars #173

Merged: 40 commits into master, Jun 2, 2022

Conversation

@krcools (Contributor) commented Mar 31, 2022

In LinearAlgebra, mul! has specialisations for the case where s in mul!(C, s, B, alpha, beta) is a scalar. This is needed to make mul! act essentially like in-place addition (add! does not exist, but add!(C, B) would be equivalent to mul!(C, 1, B, 1, 1)).
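The scalar 5-arg semantics can be checked directly with plain arrays; a minimal sketch using only the LinearAlgebra stdlib:

```julia
using LinearAlgebra

# 5-arg mul!(C, A, B, α, β) computes C = A*B*α + C*β in place.
# With a scalar first argument s, mul!(C, s, B, α, β) therefore updates
# C .= s .* B .* α .+ C .* β without allocating a temporary.
C = ones(2, 2)
B = fill(2.0, 2, 2)
mul!(C, 3, B, 1, 1)   # each entry: 3*2*1 + 1*1 = 7
```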

I suggest echoing the treatment of this special but important case in LinearMaps. As motivation: what I want is to convert a LinearCombination to a Matrix with minimal overhead:

function convert_to_dense(A::LinearMaps.LinearCombination)
    M = zeros(eltype(A), axes(A))
    for map in A.maps
        # accumulate in place: M = 1*map*1 + M*1, i.e. M .+= map, without temporaries
        mul!(M, 1, map, 1, 1)
    end
    return M
end

as opposed to

function convert_to_dense(A::LinearMaps.LinearCombination)
    M = zeros(eltype(A), axes(A))
    for map in A.maps
        M .+= Matrix(map)
    end
    return M
end

The latter does a lot of spurious copying and allocation, and it does not allow for clever specialisations when map is a WrappedMap around a sparse matrix.

Closes #181.

@codecov (bot) commented Mar 31, 2022

Codecov Report

Merging #173 (c4a9745) into master (98843dd) will increase coverage by 0.13%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #173      +/-   ##
==========================================
+ Coverage   98.40%   98.54%   +0.13%     
==========================================
  Files          15       15              
  Lines        1193     1307     +114     
==========================================
+ Hits         1174     1288     +114     
  Misses         19       19              
Impacted Files Coverage Δ
src/LinearMaps.jl 100.00% <100.00%> (ø)
src/blockmap.jl 99.27% <100.00%> (+0.21%) ⬆️
src/conversion.jl 100.00% <100.00%> (ø)
src/embeddedmap.jl 68.75% <100.00%> (+6.25%) ⬆️
src/fillmap.jl 100.00% <100.00%> (ø)
src/linearcombination.jl 100.00% <100.00%> (ø)
src/scaledmap.jl 100.00% <100.00%> (ø)
src/transpose.jl 100.00% <100.00%> (ø)
src/uniformscalingmap.jl 100.00% <100.00%> (ø)
src/wrappedmap.jl 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dkarrasch (Member)

Sounds like a good idea. Perhaps the 1 should be one(eltype(M)) (or true?). Could you maybe rebase onto the current master? The current function Base.Matrix{T}(ΣA::LinearCombination{<:Any, <:Tuple{Vararg{VecOrMatMap}}}) where {T} uses sum on the tuple/vector of matrices. Does that create intermediate arrays while summing up, or does it do fancy in-place operations? In other words, it would be interesting to see some benchmarks.
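For context on the `true` suggestion, a quick sketch of why `true` is often preferred as a multiplicative identity in generic Julia code:

```julia
# `true` is Julia's "strong one": multiplying by it is the identity for any
# numeric type and never widens the element type, whereas an Int literal
# like 1 can promote narrow element types in some contexts.
x = 2.5f0
y = true * x            # still a Float32
z = one(eltype([1.0]))  # one(T) gives the typed multiplicative identity
```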

@dkarrasch (Member)

Also, since this is a new feature, feel free to bump the minor version right away, and leave a comment in the history.md file to briefly mention the feature.

@krcools (Contributor, Author) commented Mar 31, 2022

I am getting some interesting results. The method at

function Base.Matrix{T}(ΣA::LinearCombination{<:Any, <:Tuple{Vararg{VecOrMatMap}}}) where {T}

is only called when all coefficients in the linear combination are unity and when the linear combination is written out by hand, i.e. when the storage type is a Tuple, not a Vector. This implementation beats my code path by about 12%. My guess is that it calls BLAS.axpy! under the hood...

However, the conditions for entering this code path are very brittle, and in all other cases the fallback implementation is roughly 300× slower. In terms of memory, my method is on par.

using LinearAlgebra
using LinearMaps
using BenchmarkTools

n = 1024
A = [rand(n,n) for i in 1:10]
L = [LinearMap(a) for a in A]
L2 = [2*LinearMap(a) for a in A]

LC1 = sum(L)
LC2 = sum(L2)

LC3 = L[1] + L[2] + L[3] + L[4] + L[5] + L[6] + L[7] + L[8] + L[9] + L[10]
LC4 = 2L[1] + L[2] + L[3] + L[4] + L[5] + L[6] + L[7] + L[8] + L[9] + L[10]

function convert_to_dense(A::LinearMaps.LinearCombination)
    M = zeros(eltype(A), axes(A))
    for map in A.maps
        mul!(M, 1, map, 1, 1)
    end
    return M
end

@which Matrix{Float64}(LC1)
@which Matrix{Float64}(LC2)
@which Matrix{Float64}(LC3)
@which Matrix{Float64}(LC4)

@btime B1 = Matrix{Float64}(LC1);
@btime B2 = Matrix{Float64}(LC2);
@btime B3 = Matrix{Float64}(LC3);
@btime B4 = Matrix{Float64}(LC4);

@btime C1 = convert_to_dense(LC1);
@btime C2 = convert_to_dense(LC2);
@btime C3 = convert_to_dense(LC3);
@btime C4 = convert_to_dense(LC4);
  2.618 s (10243 allocations: 8.32 MiB)
  2.644 s (11267 allocations: 16.45 MiB)
  7.000 ms (30 allocations: 16.00 MiB)
  2.614 s (1027 allocations: 16.13 MiB)
  8.642 ms (32 allocations: 8.00 MiB)
  8.547 ms (32 allocations: 8.00 MiB)
  8.608 ms (32 allocations: 8.00 MiB)
  8.467 ms (51 allocations: 8.00 MiB)

@dkarrasch (Member)

This is getting very close to what I imagined. I think we should make the most generic Matrix(::LinearMap) fallback use generic_mapnum_mul! and put exactly the previous code in there. In particular, don't overwrite the whole vector xi with zeros in each loop iteration, but only the current entry. Then we have a fully operational fallback even for the wildest custom map types. Given that, you may then be able to simplify (or even improve) the code in a few places. For instance, statements like Y .+= Matrix(L) "hurt my eyes", because we allocate the entire matrix on the right and then copy it into Y. I would guess/hope that we can avoid that if we make it write into Y columnwise directly via the generic... code?

@krcools (Contributor, Author) commented Apr 7, 2022

Fully agree. I came to the same conclusion as I progressed. I will implement these suggestions in the next couple of days.

@krcools (Contributor, Author) commented Apr 8, 2022

There is actually one thing I can't figure out:

If a user defines a custom LinearMap, say CustomMap, in an external package, the docs say that implementing mul! suffices. However, suppose this custom map is then scaled, giving rise to a ScaledMap{T, CustomMap}. When mul! is called on that scaled map, the following stack unfolds:

mul!(y, S::ScaledMap, x)
unsafe_mul!(y, S::ScaledMap, x)
unsafe_mul!(y, C::LinearMap, x)
__generic_mapvec_mul!(y, C::CustomMap, x)

In other words, when the custom map appears at a lower level of a more complicated LinearMap, the user-supplied mul!(y, C::CustomMap, x) is not called. I encountered this issue in the context of mul! by scalars, but it is not unique to that situation.

Am I missing something here?

Currently I am working around this by supplying not only mul! but also _unsafe_mul! in LiftedMaps.jl.

@dkarrasch (Member)

Can you provide a minimal example? Because of

_unsafe_mul!(y, A::MapOrVecOrMat, x) = mul!(y, A, x)
_unsafe_mul!(y, A::AbstractVecOrMat, x, α, β) = mul!(y, A, x, α, β)

a "missing" _unsafe_mul! should be redirected back to mul!. Even if it were not, the 3-arg mul! gets called inside _generic_mapvec_mul!. OTOH, _generic_mapvec_mul! is called only in the 5-arg case here:

function _unsafe_mul!(y::AbstractVecOrMat, A::LinearMap, x::AbstractVector, α, β)
return _generic_mapvec_mul!(y, A, x, α, β)
end

so, unless I am missing some path, it should really end up with the user-defined mul!.

@krcools (Contributor, Author) commented Apr 8, 2022

I see... I didn't bother drilling down deeper after encountering __generic_mapvec_mul!.

The problem still exists for mapmat and mapnum, I think. For example, __generic_mapmat_mul! is implemented in terms of mul!(x, L, y::AbstractVector), skipping user-supplied specialisations of this BLAS-3-level operation.

MWE:

using LinearAlgebra
using LinearMaps

struct CustomMap{T} <: LinearMap{T}
    A::Matrix{T}
end

LinearMaps.MulStyle(C::CustomMap) = LinearMaps.FiveArg()
Base.size(C::CustomMap) = size(C.A)
LinearAlgebra.mul!(y::AbstractVector, C::CustomMap, x::AbstractVector, a::Number, b::Number) =
    mul!(y, C.A, x, a, b)
function LinearAlgebra.mul!(Y::AbstractMatrix, C::CustomMap, X::AbstractMatrix, a::Number, b::Number)
    println("User supplied BLAS-3")
    mul!(Y, C.A, X, a, b)
end

A = rand(3, 3)
C = CustomMap(A)
S = 2C

X = rand(3, 4)
Y = zeros(3, 4)

mul!(Y, C, X, 1, 1);  # prints "User supplied BLAS-3"
mul!(Y, S, X, 1, 1);  # does not print: the ScaledMap path bypasses the user method

The same is true for mapnum, which is also written in terms of mul!(x, L, y::AbstractVector), skipping mul!(x, L, s::Number).

It seems to me that __generic means something different for mapvec on the one hand and for mapmat and mapnum on the other. For the former, it exploits special values of alpha and beta; for the latter, it is a true fallback that breaks the operation down into primitives that always need to be supplied.
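The mapmat fallback being described can be sketched generically (illustrative only; columnwise_mapmat_mul! and apply! are hypothetical names, not the LinearMaps.jl source): the matrix product is reduced to one map-vector product per column, so any user specialisation of the matrix-level method is never reached from this point on.

```julia
using LinearAlgebra

# Hypothetical sketch of a columnwise mapmat fallback: apply! stands in for
# the 3-arg map-vector mul!. Each column of X is mapped into the matching
# column of Y, bypassing any BLAS-3-style user specialisation.
function columnwise_mapmat_mul!(Y::AbstractMatrix, apply!, X::AbstractMatrix)
    for j in axes(X, 2)
        apply!(view(Y, :, j), view(X, :, j))
    end
    return Y
end

A = [1.0 2.0; 3.0 4.0]
Y = zeros(2, 2)
columnwise_mapmat_mul!(Y, (y, x) -> mul!(y, A, x), A)  # Y == A*A
```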

@dkarrasch (Member)

Awesome, thanks for finding and sharing that. There are two things that bite each other here. The _unsafe_mul! redirection was meant to save size checks for all "submaps": thanks to checks at construction, you already know that everything is consistent. If we replaced all those calls by mul!, all this effort would be in vain. Could we then solve the issue by asking users to define _unsafe_mul! for custom map types? That is, tell them to make the important checks at construction, leave size compatibility to LinearMaps.jl, and then only define the "multiplication kernel" _unsafe_mul!? As you say, the good thing is that this currently affects only the mapmat case (and mapnum in the future).

@dkarrasch (Member)

Currently I am working around this by not only supplying mul! but also _unsafe_mul! in LiftedMaps.jl.

Yeah, perhaps only _unsafe_mul! should be required, and all the "high-level" stuff, like allocating the output in A*x or checking sizes in mul!, is left to LinearMaps.jl?

@krcools (Contributor, Author) commented Apr 8, 2022

Yes, that seems to be the way out. Just a shame _unsafe_mul! is not a particularly inviting name to import and extend...
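The layering being agreed on here can be sketched in isolation (all names below are illustrative stand-ins, not the actual LinearMaps.jl internals): the public entry point checks sizes once, and custom types extend only the unchecked kernel, so the check is not repeated when the map is nested inside a composite map.

```julia
using LinearAlgebra

abstract type SketchMap end

struct DenseSketchMap <: SketchMap
    A::Matrix{Float64}
end

Base.size(M::DenseSketchMap) = size(M.A)

# Public entry point: performs the size check once, then calls the kernel.
function checked_mul!(y::AbstractVector, M::SketchMap, x::AbstractVector)
    (length(y), length(x)) == size(M) || throw(DimensionMismatch("bad sizes"))
    return unsafe_kernel_mul!(y, M, x)
end

# The only method a custom map type needs to supply: the unchecked kernel.
unsafe_kernel_mul!(y, M::DenseSketchMap, x) = mul!(y, M.A, x)

y = zeros(2)
checked_mul!(y, DenseSketchMap([1.0 0.0; 0.0 2.0]), [3.0, 4.0])  # y == [3.0, 8.0]
```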

@dkarrasch (Member) left a comment
I left a few remarks and suggestions. This looks pretty good, and we should get it done and out soon (edit: this is referring to myself and not putting pressure on you, of course!).

@dkarrasch (Member) left a comment
A few more comments; sorry for being so picky. I'm not 100% sure, but I believe we should provide targeted methods for 3-arg multiplication, because some map types may not come with 5-arg implementations. Calling 5-arg methods (even with default values) requires much care to ensure that no intermediate arrays are allocated.

@dkarrasch (Member)

I need to fix the blockmul case, but otherwise this is really close to being finished.

It is somewhat unfortunate that, even for matrices, the blockmul code falls back to _generic_matmatmul! because of the views and the non-stridedness. It would be interesting to have a real-world benchmark case to see whether it would be beneficial to compute the individual block muls "out of place" and then copyto! the result into the target. I should perhaps file an issue and reach out to potential users.
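The "out of place, then copyto!" idea could be sketched like this (hypothetical; block_into! is an illustrative name, not the merged code): each block product is computed into a fresh dense matrix so it dispatches to fast BLAS routines, and only then copied into the possibly non-strided destination view.

```julia
using LinearAlgebra

# Hypothetical block update: computing A*X out of place keeps the operands
# dense and contiguous (BLAS-friendly); the copyto! then pays one extra pass
# over the destination view instead of a slow generic matmul into it.
function block_into!(Y::AbstractMatrix, A::AbstractMatrix, X::AbstractMatrix,
                     rows::AbstractUnitRange, cols::AbstractUnitRange)
    copyto!(view(Y, rows, cols), A * X)
    return Y
end

Y = zeros(4, 4)
block_into!(Y, [1.0 0.0; 0.0 1.0], fill(2.0, 2, 2), 1:2, 3:4)
```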

@dkarrasch (Member)

For me, all your benchmarks from #173 (comment) are now on the order of milliseconds, with very small overhead for the linear combination of the doubled maps. I used this code for benchmarking:

using LinearAlgebra, LinearMaps, BenchmarkTools

n = 1024
A = [rand(n,n) for i in 1:10]

LC1 = sum((LinearMap(a) for a in A)) # creates a LinearMapTuple
LC2 = sum((2*LinearMap(a) for a in A)) # creates a LinearMapTuple
LC3 = sum([LinearMap(a) for a in A]) # creates a LinearMapVector
LC4 = sum([2*LinearMap(a) for a in A]) # creates a LinearMapVector

@btime B1 = Matrix{Float64}(LC1);
@btime B2 = Matrix{Float64}(LC2);
@btime B3 = Matrix{Float64}(LC3);
@btime B4 = Matrix{Float64}(LC4);

  12.059 ms (29 allocations: 8.00 MiB)
  12.252 ms (29 allocations: 8.00 MiB)
  12.147 ms (39 allocations: 8.00 MiB)
  12.884 ms (39 allocations: 8.00 MiB)

I agree that the slight regression for the one special case is acceptable, given that this approach "recurses" into the _unsafe_mul!(::Matrix, ::LinearMap, ::Number) hierarchy and makes use of possible optimizations at every step of the map hierarchy; hence the very small overhead for the linear combination of scaled matrices. As you say, this will further auto-optimize for sparse matrices etc. I also like that with this approach we first allocate the target and then map into it, rather than possibly creating intermediate arrays and then stacking them together as needed. Local tests suggested that this is also advantageous for BlockMaps. And finally, we had neglected to provide a specialized Matrix constructor for our new EmbeddedMaps, so relative to the naive implementation this will speed things up, since we avoid computing all-zero columns.

@dkarrasch dkarrasch merged commit 12c13fb into JuliaLinearAlgebra:master Jun 2, 2022
Successfully merging this pull request may close these issues.

Doc Instructions for Transpose Insufficient for Full Functionality