svd allocates even with small matrices #1255

Closed · ronisbr opened this issue May 16, 2024 · 16 comments · Fixed by #1259

@ronisbr (Contributor) commented May 16, 2024

Hi!

When computing the SVD of a static matrix, we see some allocations:

julia> using StaticArrays, BenchmarkTools

julia> S = @SMatrix rand(3, 3);

julia> @btime StaticArrays.svd($S)
  1.333 μs (8 allocations: 2.48 KiB)

This probably happens because the algorithm just converts the input to a Matrix and calls the svd in Base.

The problem is that svd is used to compute pinv, which is heavily used when inverting matrices for embedded Kalman filters.

I am working to embed a full satellite attitude control algorithm written entirely in Julia, and for that I must avoid all allocations. In this specific case, I have not yet found a workaround. Can anyone help me?
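
For context, here is a minimal sketch of how a pseudo-inverse can be built from an SVD (illustrative only; the name pinv_via_svd is mine and the tolerance handling is simplified compared to the real pinv):

using StaticArrays, LinearAlgebra

# Illustrative pseudo-inverse via the SVD: pinv(A) = V * Σ⁺ * U', where Σ⁺ inverts the
# singular values above a tolerance and zeroes the rest. The svd call below is exactly
# the allocating step discussed in this issue.
function pinv_via_svd(A::StaticMatrix; rtol = eps(eltype(A)) * maximum(size(A)))
    F    = svd(A)
    tol  = rtol * first(F.S)            # singular values are sorted in decreasing order
    Sinv = map(s -> s > tol ? inv(s) : zero(s), F.S)
    return F.Vt' * Diagonal(Sinv) * F.U'
end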

@mateuszbaran (Collaborator) commented:

Hi!

This is a complicated problem. So far no one has implemented the right algorithm in StaticArrays.jl. See, for example, this paper for the 3x3 case: https://pages.cs.wisc.edu/%7Esifakis/papers/SVD_TR1690.pdf . If you want to work on it, I could review your code, but I don't have time to work on it myself.

@ronisbr (Contributor, Author) commented May 17, 2024

Thanks @mateuszbaran !

Implementing a state-of-the-art algorithm is indeed very difficult (I was reading the BLAS code). However, since an SMatrix is supposed to be small, I wonder if we can implement a naive approach. Since it would not allocate, it might perform better than what we have now (converting to Matrix and calling BLAS). What do you think?

@ronisbr (Contributor, Author) commented May 17, 2024

Just for comparison, I implemented a QR-based SVD and it was 50% faster than the current svd for 4 × 4 matrices. However, it became slower once the dimension exceeded 7. I have no idea whether such an approach would be a good fit here. What do you think?
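
To make "naive approach" concrete, here is a hedged sketch of one such scheme: compute the SVD from the symmetric eigendecomposition of A'A. It squares the condition number, so it only illustrates the accuracy/simplicity trade-off and is not a proposal for the package (naive_svd is an illustrative name):

using StaticArrays, LinearAlgebra

# Naive SVD of a small full-rank square matrix via eigen(A'A): the eigenvectors give V,
# the square roots of the eigenvalues give the singular values, and U = A * V * Σ⁻¹.
# Forming A'A roughly doubles the loss of precision for ill-conditioned matrices.
function naive_svd(A::SMatrix{N, N, T}) where {N, T}
    λ, V = eigen(Symmetric(A' * A))      # eigenvalues come back in ascending order
    rev  = SVector{N}(N:-1:1)            # reorder so that σ₁ ≥ σ₂ ≥ … ≥ σₙ
    σ    = sqrt.(max.(λ[rev], zero(T)))
    Vr   = V[:, rev]
    U    = A * Vr * Diagonal(inv.(σ))    # assumes full rank (all σ > 0)
    return U, σ, Vr
end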

@ronisbr (Contributor, Author) commented May 18, 2024

Eigen uses the complete orthogonal decomposition to compute the pseudo-inverse, which seems a little better than computing the SVD for this application. However, it requires a pivoted QR, for which StaticArrays also falls back to the algorithm in Base by converting to Matrix :)

Now, let's see how difficult it is to implement pivoted QR here.
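
As a reference for the full-rank case, a hedged sketch of the math (pinv_via_pivoted_qr is an illustrative name, written against a plain Matrix and the LinearAlgebra API, so it still allocates; the rank-deficient case needs the extra orthogonal factor of the complete orthogonal decomposition):

using LinearAlgebra

# Pseudo-inverse of a full-column-rank A from a pivoted QR factorization:
# A[:, p] = Q * R  ⇒  pinv(A) = P * R⁻¹ * Qᵀ, where P scatters the pivoted columns back.
function pinv_via_pivoted_qr(A::AbstractMatrix)
    F = qr(A, ColumnNorm())
    X = F.R \ Matrix(F.Q)'        # R⁻¹ * Qᵀ, using the thin Q
    return X[invperm(F.p), :]     # undo the column permutation
end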

@ronisbr (Contributor, Author) commented May 19, 2024

On second thought, although having a Julia version of most BLAS functions would be awesome, it is really a Herculean task. However, Julia already has a very good interface to BLAS (svd, qr, etc. all use functions from it), so I was thinking: "why can't we have a direct interface here in StaticArrays that avoids allocations?"

Since some Julia version, the compiler has been smart enough to elide all allocations when using an MMatrix, provided nothing escapes the scope and you convert to an SMatrix at the end. Hence, I wrote a simple function to compute the SVD of static matrices using this approach:

# Needs: using StaticArrays, LinearAlgebra (for the BLAS submodule).
function blas_svd(A::StaticMatrix{N, M, T}) where {N, M, T}
    K = min(N, M)

    # Convert the input to an `MMatrix` and allocate the output buffers.
    Am  = MMatrix{N, M, T}(A)
    Um  = MMatrix{N, K, T}(undef)
    Sm  = MVector{K, T}(undef)
    Vtm = MMatrix{M, K, T}(undef)

    # Fixed-size work buffer; `lwork = -1` asks LAPACK for the optimal size first.
    work  = MVector{1_000, Float64}(undef)
    lwork = -1
    info  = Ref{BLAS.BlasInt}(0)
    # Call the function to obtain the optimum size of the work array.
    ccall(
        (BLAS.@blasfunc(dgesvd_), BLAS.libblas),
        Cvoid,
        (
            Ref{UInt8},
            Ref{UInt8},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
        ),
        'S',
        'S',
        N,
        M,
        Am,
        N,
        Sm,
        Um,
        N,
        Vtm,
        M,
        work,
        lwork,
        info
    )

    # Obtain the optimal work array size and compute the SVD.
    lwork = round(BLAS.BlasInt, work[1])
    ccall(
        (BLAS.@blasfunc(dgesvd_), BLAS.libblas),
        Cvoid,
        (
            Ref{UInt8},
            Ref{UInt8},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
        ),
        'S',
        'S',
        N,
        M,
        Am,
        N,
        Sm,
        Um,
        N,
        Vtm,
        M,
        work,
        lwork,
        info
    )

    # Convert the matrices to static arrays and return.
    U  = SMatrix{N, K, T}(Um)
    S  = SVector{K, T}(Sm)
    Vt = SMatrix{M, K, T}(Vtm)

    return U, S, Vt'
end
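
As a smaller, self-contained illustration of the MMatrix trick mentioned above (a toy example unrelated to SVD; scale_columns is just an illustrative name):

using StaticArrays, BenchmarkTools

# Mutate an MMatrix locally and convert it to an SMatrix before returning. Because the
# MMatrix never escapes the function, the compiler can keep it on the stack, so the
# function should not allocate.
function scale_columns(A::SMatrix{N, M, T}, s::SVector{M, T}) where {N, M, T}
    B = MMatrix{N, M, T}(A)
    @inbounds for j in 1:M, i in 1:N
        B[i, j] *= s[j]
    end
    return SMatrix{N, M, T}(B)
end

# julia> @btime scale_columns($(@SMatrix rand(3, 3)), $(@SVector rand(3)));   # expected: 0 allocations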

The result is the same as using svd for square matrices (I have not debugged non-square matrices yet):

julia> A = @SMatrix randn(6, 6)
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
 -0.981944   -1.12979    0.310915    0.179723   -0.667651    0.182524
  1.09949     0.522666  -0.0878819  -0.0383786  -1.17082    -0.219113
 -1.07534     0.961367  -1.12684    -0.23116    -0.0602622  -1.38193
  0.0982899  -0.678585  -2.14742     0.164939   -1.12818     1.33526
 -0.60989     1.2469    -0.238621   -0.237353    0.695344   -0.514536
 -1.39276     0.353875  -0.551829   -0.283986    0.207744   -1.20008

julia> U1, S1, V1 = svd(A);

julia> U2, S2, V2 = blas_svd(A);

julia> U1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.127537  0.115343    0.695209    0.470797   0.492761   -0.150506
  0.155633  0.0389598  -0.624687    0.696828   0.231023    0.212336
 -0.542549  0.463331   -0.138643    0.219714  -0.173881   -0.627084
  0.479565  0.825676   -0.0673256  -0.264451   0.0609473   0.100476
 -0.432199  0.0425817  -0.20474    -0.379913   0.784298    0.100079
 -0.498372  0.294871    0.246531    0.173895  -0.233973    0.720358

julia> U2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.127537  0.115343    0.695209    0.470797   0.492761   -0.150506
  0.155633  0.0389598  -0.624687    0.696828   0.231023    0.212336
 -0.542549  0.463331   -0.138643    0.219714  -0.173881   -0.627084
  0.479565  0.825676   -0.0673256  -0.264451   0.0609473   0.100476
 -0.432199  0.0425817  -0.20474    -0.379913   0.784298    0.100079
 -0.498372  0.294871    0.246531    0.173895  -0.233973    0.720358

julia> V1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.47655     -0.331218   -0.695862    0.0226213   -0.422302    0.0109475
 -0.47389     -0.0241441  -0.658351   -0.137571     0.559031   -0.0998872
 -0.00418107  -0.872921    0.233316    0.290965     0.312066    0.038845
  0.135771    -0.0163176   0.071462    0.00309199   0.0228163  -0.987756
 -0.344044    -0.355242    0.125355   -0.780332    -0.359093   -0.0430901
  0.641491     0.0354597   0.0842224  -0.535699     0.531311    0.104279

julia> V2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.47655     -0.331218   -0.695862    0.0226213   -0.422302    0.0109475
 -0.47389     -0.0241441  -0.658351   -0.137571     0.559031   -0.0998872
 -0.00418107  -0.872921    0.233316    0.290965     0.312066    0.038845
  0.135771    -0.0163176   0.071462    0.00309199   0.0228163  -0.987756
 -0.344044    -0.355242    0.125355   -0.780332    -0.359093   -0.0430901
  0.641491     0.0354597   0.0842224  -0.535699     0.531311    0.104279

julia> S1
6-element SVector{6, Float64} with indices SOneTo(6):
 3.429126570487767
 2.790190909977814
 2.077297239613754
 1.3752132074075802
 0.4483791256059501
 0.10325942415912782

julia> S2
6-element SVector{6, Float64} with indices SOneTo(6):
 3.429126570487767
 2.790190909977814
 2.077297239613754
 1.3752132074075802
 0.4483791256059501
 0.10325942415912782

Now, let's compare the allocations and execution time (in μs) of the two approaches for different matrix sizes:

| Rows | Columns | `svd` time [μs] | `svd` alloc. | `blas_svd` time [μs] | `blas_svd` alloc. | Gain [%] |
|------|---------|-----------------|--------------|----------------------|-------------------|----------|
| 2    | 2       | 0.954           | 8            | 0.64                 | 0                 | 32.914   |
| 3    | 3       | 1.488           | 8            | 1.175                | 0                 | 21.0349  |
| 4    | 4       | 2.296           | 8            | 1.908                | 0                 | 16.899   |
| 5    | 5       | 3.042           | 8            | 2.653                | 0                 | 12.7876  |
| 6    | 6       | 3.713           | 8            | 3.26                 | 0                 | 12.2004  |
| 7    | 7       | 5.312           | 8            | 4.774                | 0                 | 10.128   |
| 8    | 8       | 5.923           | 8            | 5.147                | 0                 | 13.1015  |
| 9    | 9       | 7.25            | 8            | 6.683                | 0                 | 7.82069  |
| 10   | 10      | 9.166           | 8            | 8.43                 | 0                 | 8.02967  |

Notice that we get a noticeable speedup for small matrices, with zero allocations (my use case).

@mateuszbaran If I can improve the algorithm to work for all matrix sizes, would this approach be useful for StaticArrays.jl?

For my project, I will implement this kind of function to compute pinv.

@mateuszbaran (Collaborator) commented:

> @mateuszbaran If I can improve the algorithm to work for all matrix sizes, would this approach be useful for StaticArrays.jl?

Hmm, that's a significant amount of additional code with relatively limited gains. I'd prefer to get a second opinion on whether it's worth it.

Directly calling BLAS would have to be done a bit more carefully than your blas_svd. First, I would limit it to blas_svd(A::StaticMatrix{N, M, Float64}).

> Call the function to obtain the optimum size of the work array.

Querying doesn't make sense if you don't allocate the requested amount of memory on the heap. You have a fixed size there (1000 elements) and that is what you have to provide. I'd suggest a different approach, though: use the formula from the LAPACK docs (MAX(3*MIN(M,N) + MAX(M,N), 5*MIN(M,N)) or something larger), see here: https://docs.oracle.com/cd/E19422-01/819-3691/dgesvd.html .

I'd also suggest using LinearAlgebra.LAPACK.chklapackerror to detect any possible errors.
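
A hedged sketch of what I mean (the helper name gesvd_lwork is just illustrative):

using LinearAlgebra

# Work-array size for dgesvd from the LAPACK docs:
# LWORK >= MAX(3*MIN(M,N) + MAX(M,N), 5*MIN(M,N)).
# With N and M being type parameters of the static matrix, this is a compile-time
# constant, so the work buffer can be a fixed-size MVector instead of relying on a
# workspace query.
gesvd_lwork(N::Integer, M::Integer) = max(3 * min(N, M) + max(N, M), 5 * min(N, M))

# And after each ccall, check the LAPACK return code instead of ignoring it:
# LinearAlgebra.LAPACK.chklapackerror(info[])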

@ronisbr (Contributor, Author) commented May 19, 2024

Hi @mateuszbaran!

Thanks for the tips.

> Hmm, that's a significant amount of additional code with relatively limited gains. I'd prefer to get a second opinion on whether it's worth it.

I see. Maybe I will create a package called StaticArrayBlasInterface to add those algorithms (which are important for my use case), and afterward we can decide whether to migrate them here.

@ronisbr (Contributor, Author) commented May 22, 2024

Hi @mateuszbaran!

I finished the initial code of the package (it still needs polishing and error checking):

https://github.com/ronisbr/StaticArraysBlasInterfaces.jl

It is working perfectly for Float64 and Float32 matrices. I overloaded only the svd function, but that also makes pinv allocation-free (computing the SVD was the only allocating part of pinv).

I also opened a thread in Discourse to discuss more about the implementation:

https://discourse.julialang.org/t/package-to-avoid-allocations-in-some-functions-when-using-staticarrays-jl/114539

Thanks!

@mateuszbaran (Collaborator) commented:

Hi!
Making a new package is a good solution for now. I'm not knowledgeable about every aspect of StaticArrays and linear algebra, and maintainer availability here is sometimes less than ideal, so working around it is sometimes necessary.

@fredrikekre (Member) commented:

I think it is fine to include it here. The BLAS interface file in LinearAlgebra has basically not changed since the start, and I expect the same for this new code, so it shouldn't require any maintenance. It is a strict improvement, and StaticArrays already depends on LinearAlgebra.

@mateuszbaran (Collaborator) commented:

OK, then let's put it in StaticArrays.jl.

@fredrikekre (Member) commented:

It would be good to at least measure the impact on using StaticArrays with and without the additional code, though.

@ronisbr (Contributor, Author) commented May 26, 2024

Perfect! I will submit the PR and perform all the analysis.

@ronisbr (Contributor, Author) commented May 29, 2024

Done, @mateuszbaran and @fredrikekre! I submitted the PR for review. If it gets merged, I can also implement the same approach for other functions that allocate because they call functions in Julia Base (like qr).

@ronisbr (Contributor, Author) commented Jun 6, 2024

Thank you very much @mateuszbaran and @fredrikekre for the help adding this feature to StaticArrays.jl :)

@mateuszbaran (Collaborator) commented:

Thank you for your contribution 🙂
