svd allocates even with small matrices #1255

Closed · ronisbr opened this issue May 16, 2024 · 16 comments · Fixed by #1259

@ronisbr (Contributor) commented May 16, 2024

Hi!

When computing the SVD of a static matrix, we see some allocations:

julia> using StaticArrays, BenchmarkTools

julia> S = @SMatrix rand(3, 3);

julia> @btime StaticArrays.svd($S)
  1.333 μs (8 allocations: 2.48 KiB)

This probably happens because the algorithm just converts the input to a Matrix and calls the svd in Base.

The problem is that svd is used to compute pinv, which is heavily used when inverting matrices for embedded Kalman filters.

I am working to embed a full satellite attitude control algorithm written entirely in Julia, and for that I must avoid all allocations. In this specific case, I have not yet found a workaround. Can anyone help me?
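
For context, here is a minimal sketch of how a pseudo-inverse can be built from an SVD (illustrative only; the name pinv_via_svd is mine and the tolerance handling is simplified compared to the real pinv):

using StaticArrays, LinearAlgebra

# Illustrative pseudo-inverse via the SVD: pinv(A) = V * Σ⁺ * U', where Σ⁺ inverts the
# singular values above a tolerance and zeroes the rest. The svd call below is exactly
# the allocating step discussed in this issue.
function pinv_via_svd(A::StaticMatrix; rtol = eps(eltype(A)) * maximum(size(A)))
    F    = svd(A)
    tol  = rtol * first(F.S)            # singular values are sorted in decreasing order
    Sinv = map(s -> s > tol ? inv(s) : zero(s), F.S)
    return F.Vt' * Diagonal(Sinv) * F.U'
end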

@mateuszbaran (Collaborator) commented:

Hi!

This is a complicated problem. So far no one has implemented the right algorithm in StaticArrays.jl. See, for example, this paper for the 3x3 case: https://pages.cs.wisc.edu/%7Esifakis/papers/SVD_TR1690.pdf . If you want to work on it, I could review your code, but I don't have time to work on it myself.

@ronisbr (Contributor, Author) commented May 17, 2024

Thanks @mateuszbaran !

Implementing a state-of-the-art algorithm is indeed very difficult (I was reading the BLAS code). However, since an SMatrix is supposed to be small, I wonder if we can implement a naive approach. Since it would not allocate, it might perform better than what we have now (converting to Matrix and calling BLAS). What do you think?

@ronisbr (Contributor, Author) commented May 17, 2024

Just for comparison, I implemented a QR-based SVD and it was 50% faster than the current svd for 4 × 4 matrices. However, it became slower once the dimension exceeded 7. I have no idea whether such an approach would be a good fit here. What do you think?
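
To make "naive approach" concrete, here is a hedged sketch of one such scheme: compute the SVD from the symmetric eigendecomposition of A'A. It squares the condition number, so it only illustrates the accuracy/simplicity trade-off and is not a proposal for the package (naive_svd is an illustrative name):

using StaticArrays, LinearAlgebra

# Naive SVD of a small full-rank square matrix via eigen(A'A): the eigenvectors give V,
# the square roots of the eigenvalues give the singular values, and U = A * V * Σ⁻¹.
# Forming A'A roughly doubles the loss of precision for ill-conditioned matrices.
function naive_svd(A::SMatrix{N, N, T}) where {N, T}
    λ, V = eigen(Symmetric(A' * A))      # eigenvalues come back in ascending order
    rev  = SVector{N}(N:-1:1)            # reorder so that σ₁ ≥ σ₂ ≥ … ≥ σₙ
    σ    = sqrt.(max.(λ[rev], zero(T)))
    Vr   = V[:, rev]
    U    = A * Vr * Diagonal(inv.(σ))    # assumes full rank (all σ > 0)
    return U, σ, Vr
end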

@ronisbr (Contributor, Author) commented May 18, 2024

Eigen uses the complete orthogonal decomposition to compute the pseudo-inverse, which seems a little better than computing the SVD for this application. However, it requires a pivoted QR, for which StaticArrays also falls back to the algorithm in Base by converting to Matrix :)

Now, let's see how difficult it is to implement pivoted QR here.
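
As a reference for the full-rank case, a hedged sketch of the math (pinv_via_pivoted_qr is an illustrative name, written against a plain Matrix and the LinearAlgebra API, so it still allocates; the rank-deficient case needs the extra orthogonal factor of the complete orthogonal decomposition):

using LinearAlgebra

# Pseudo-inverse of a full-column-rank A from a pivoted QR factorization:
# A[:, p] = Q * R  ⇒  pinv(A) = P * R⁻¹ * Qᵀ, where P scatters the pivoted columns back.
function pinv_via_pivoted_qr(A::AbstractMatrix)
    F = qr(A, ColumnNorm())
    X = F.R \ Matrix(F.Q)'        # R⁻¹ * Qᵀ, using the thin Q
    return X[invperm(F.p), :]     # undo the column permutation
end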

@ronisbr (Contributor, Author) commented May 19, 2024

On second thought, although having a Julia version of most BLAS functions would be awesome, it is really a Herculean task. However, Julia already has a very good interface to BLAS (svd, qr, etc. all use functions from it), so I was thinking: "why can't we have a direct interface here in StaticArrays that avoids allocations?"

Since some Julia version, the compiler has been smart enough to elide all allocations when using an MMatrix, provided nothing escapes the scope and you convert to an SMatrix at the end. Hence, I wrote a simple function to compute the SVD of static matrices using this approach:

# Needs: using StaticArrays, LinearAlgebra (for the BLAS submodule).
function blas_svd(A::StaticMatrix{N, M, T}) where {N, M, T}
    K = min(N, M)

    # Convert the input to an `MMatrix` and allocate the output buffers.
    Am  = MMatrix{N, M, T}(A)
    Um  = MMatrix{N, K, T}(undef)
    Sm  = MVector{K, T}(undef)
    Vtm = MMatrix{M, K, T}(undef)

    # Fixed-size work buffer; `lwork = -1` asks LAPACK for the optimal size first.
    work  = MVector{1_000, Float64}(undef)
    lwork = -1
    info  = Ref{BLAS.BlasInt}(0)
    # Call the function to obtain the optimum size of the work array.
    ccall(
        (BLAS.@blasfunc(dgesvd_), BLAS.libblas),
        Cvoid,
        (
            Ref{UInt8},
            Ref{UInt8},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
        ),
        'S',
        'S',
        N,
        M,
        Am,
        N,
        Sm,
        Um,
        N,
        Vtm,
        M,
        work,
        lwork,
        info
    )

    # Obtain the optimal work array size and compute the SVD.
    lwork = round(BLAS.BlasInt, work[1])
    ccall(
        (BLAS.@blasfunc(dgesvd_), BLAS.libblas),
        Cvoid,
        (
            Ref{UInt8},
            Ref{UInt8},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
        ),
        'S',
        'S',
        N,
        M,
        Am,
        N,
        Sm,
        Um,
        N,
        Vtm,
        M,
        work,
        lwork,
        info
    )

    # Convert the matrices to static arrays and return.
    U  = SMatrix{N, K, T}(Um)
    S  = SVector{K, T}(Sm)
    Vt = SMatrix{M, K, T}(Vtm)

    return U, S, Vt'
end
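
As a smaller, self-contained illustration of the MMatrix trick mentioned above (a toy example unrelated to SVD; scale_columns is just an illustrative name):

using StaticArrays, BenchmarkTools

# Mutate an MMatrix locally and convert it to an SMatrix before returning. Because the
# MMatrix never escapes the function, the compiler can keep it on the stack, so the
# function should not allocate.
function scale_columns(A::SMatrix{N, M, T}, s::SVector{M, T}) where {N, M, T}
    B = MMatrix{N, M, T}(A)
    @inbounds for j in 1:M, i in 1:N
        B[i, j] *= s[j]
    end
    return SMatrix{N, M, T}(B)
end

# julia> @btime scale_columns($(@SMatrix rand(3, 3)), $(@SVector rand(3)));   # expected: 0 allocations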

The result is the same as using svd for square matrices (I have not debugged non-square matrices yet):

julia> A = @SMatrix randn(6, 6)
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
 -0.981944   -1.12979    0.310915    0.179723   -0.667651    0.182524
  1.09949     0.522666  -0.0878819  -0.0383786  -1.17082    -0.219113
 -1.07534     0.961367  -1.12684    -0.23116    -0.0602622  -1.38193
  0.0982899  -0.678585  -2.14742     0.164939   -1.12818     1.33526
 -0.60989     1.2469    -0.238621   -0.237353    0.695344   -0.514536
 -1.39276     0.353875  -0.551829   -0.283986    0.207744   -1.20008

julia> U1, S1, V1 = svd(A);

julia> U2, S2, V2 = blas_svd(A);

julia> U1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.127537  0.115343    0.695209    0.470797   0.492761   -0.150506
  0.155633  0.0389598  -0.624687    0.696828   0.231023    0.212336
 -0.542549  0.463331   -0.138643    0.219714  -0.173881   -0.627084
  0.479565  0.825676   -0.0673256  -0.264451   0.0609473   0.100476
 -0.432199  0.0425817  -0.20474    -0.379913   0.784298    0.100079
 -0.498372  0.294871    0.246531    0.173895  -0.233973    0.720358

julia> U2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.127537  0.115343    0.695209    0.470797   0.492761   -0.150506
  0.155633  0.0389598  -0.624687    0.696828   0.231023    0.212336
 -0.542549  0.463331   -0.138643    0.219714  -0.173881   -0.627084
  0.479565  0.825676   -0.0673256  -0.264451   0.0609473   0.100476
 -0.432199  0.0425817  -0.20474    -0.379913   0.784298    0.100079
 -0.498372  0.294871    0.246531    0.173895  -0.233973    0.720358

julia> V1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.47655     -0.331218   -0.695862    0.0226213   -0.422302    0.0109475
 -0.47389     -0.0241441  -0.658351   -0.137571     0.559031   -0.0998872
 -0.00418107  -0.872921    0.233316    0.290965     0.312066    0.038845
  0.135771    -0.0163176   0.071462    0.00309199   0.0228163  -0.987756
 -0.344044    -0.355242    0.125355   -0.780332    -0.359093   -0.0430901
  0.641491     0.0354597   0.0842224  -0.535699     0.531311    0.104279

julia> V2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.47655     -0.331218   -0.695862    0.0226213   -0.422302    0.0109475
 -0.47389     -0.0241441  -0.658351   -0.137571     0.559031   -0.0998872
 -0.00418107  -0.872921    0.233316    0.290965     0.312066    0.038845
  0.135771    -0.0163176   0.071462    0.00309199   0.0228163  -0.987756
 -0.344044    -0.355242    0.125355   -0.780332    -0.359093   -0.0430901
  0.641491     0.0354597   0.0842224  -0.535699     0.531311    0.104279

julia> S1
6-element SVector{6, Float64} with indices SOneTo(6):
 3.429126570487767
 2.790190909977814
 2.077297239613754
 1.3752132074075802
 0.4483791256059501
 0.10325942415912782

julia> S2
6-element SVector{6, Float64} with indices SOneTo(6):
 3.429126570487767
 2.790190909977814
 2.077297239613754
 1.3752132074075802
 0.4483791256059501
 0.10325942415912782

Now, let's compare the allocations and execution time (in μs) of the two approaches for different matrix sizes:

| Rows | Columns | `svd` time [μs] | `svd` alloc. | `blas_svd` time [μs] | `blas_svd` alloc. | Gain [%] |
|------|---------|-----------------|--------------|----------------------|-------------------|----------|
| 2    | 2       | 0.954           | 8            | 0.64                 | 0                 | 32.914   |
| 3    | 3       | 1.488           | 8            | 1.175                | 0                 | 21.0349  |
| 4    | 4       | 2.296           | 8            | 1.908                | 0                 | 16.899   |
| 5    | 5       | 3.042           | 8            | 2.653                | 0                 | 12.7876  |
| 6    | 6       | 3.713           | 8            | 3.26                 | 0                 | 12.2004  |
| 7    | 7       | 5.312           | 8            | 4.774                | 0                 | 10.128   |
| 8    | 8       | 5.923           | 8            | 5.147                | 0                 | 13.1015  |
| 9    | 9       | 7.25            | 8            | 6.683                | 0                 | 7.82069  |
| 10   | 10      | 9.166           | 8            | 8.43                 | 0                 | 8.02967  |

Notice that we get a noticeable speedup for small matrices, with zero allocations (my use case).

@mateuszbaran If I can improve the algorithm to work for all matrix sizes, would this approach be useful for StaticArrays.jl?

For my project, I will implement this kind of function to compute pinv.

@mateuszbaran (Collaborator) commented:

> @mateuszbaran If I can improve the algorithm to work for all matrix sizes, would this approach be useful for StaticArrays.jl?

Hmm, that's a significant amount of additional code with relatively limited gains. I'd prefer to get a second opinion on whether it's worth it.

Directly calling BLAS would have to be done a bit more carefully than your blas_svd. First, I would limit it to blas_svd(A::StaticMatrix{N, M, Float64}).

> Call the function to obtain the optimum size of the work array.

Querying doesn't make sense if you don't allocate the requested amount of memory on the heap. You have a fixed size there (1000 elements) and that is what you have to provide. I'd suggest a different approach, though: use the formula from the LAPACK docs (MAX(3*MIN(M,N) + MAX(M,N), 5*MIN(M,N)) or something larger), see here: https://docs.oracle.com/cd/E19422-01/819-3691/dgesvd.html .

I'd also suggest using LinearAlgebra.LAPACK.chklapackerror to detect any possible errors.
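
A hedged sketch of what I mean (the helper name gesvd_lwork is just illustrative):

using LinearAlgebra

# Work-array size for dgesvd from the LAPACK docs:
# LWORK >= MAX(3*MIN(M,N) + MAX(M,N), 5*MIN(M,N)).
# With N and M being type parameters of the static matrix, this is a compile-time
# constant, so the work buffer can be a fixed-size MVector instead of relying on a
# workspace query.
gesvd_lwork(N::Integer, M::Integer) = max(3 * min(N, M) + max(N, M), 5 * min(N, M))

# And after each ccall, check the LAPACK return code instead of ignoring it:
# LinearAlgebra.LAPACK.chklapackerror(info[])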

@ronisbr (Contributor, Author) commented May 19, 2024

Hi @mateuszbaran!

Thanks for the tips.

> Hmm, that's a significant amount of additional code with relatively limited gains. I'd prefer to get a second opinion on whether it's worth it.

I see. Maybe I will create a package called StaticArrayBlasInterface to add those algorithms (which are important for my use case), and afterward we can decide whether to migrate them here.

@ronisbr (Contributor, Author) commented May 22, 2024

Hi @mateuszbaran!

I finished the initial code of the package (it still needs polishing and error checking):

https://github.com/ronisbr/StaticArraysBlasInterfaces.jl

It is working perfectly for Float64 and Float32 matrices. I overloaded only the svd function, but that also makes pinv allocation-free (computing the SVD was the only allocating part of pinv).

I also opened a thread in Discourse to discuss more about the implementation:

https://discourse.julialang.org/t/package-to-avoid-allocations-in-some-functions-when-using-staticarrays-jl/114539

Thanks!

@mateuszbaran (Collaborator) commented:

Hi!
Making a new package is a good solution for now. I'm not knowledgeable about every aspect of StaticArrays and linear algebra, and maintainer availability here is sometimes less than ideal, so working around it is sometimes necessary.

@fredrikekre (Member) commented:

I think it is fine to include it here. The BLAS interface file in LinearAlgebra has basically not changed since the start, and I expect the same for this new code, so it shouldn't require any maintenance. It is a strict improvement, and StaticArrays already depends on LinearAlgebra.

@mateuszbaran (Collaborator) commented:

OK, then let's put it in StaticArrays.jl.

@fredrikekre (Member) commented:

It would be good to at least measure the impact on using StaticArrays with and without the additional code, though.

@ronisbr (Contributor, Author) commented May 26, 2024

Perfect! I will submit the PR and perform all the analysis.

@ronisbr (Contributor, Author) commented May 29, 2024

Done, @mateuszbaran and @fredrikekre! I submitted the PR for review. If it gets merged, I can also implement the same approach for other functions that allocate because they call functions in Julia Base (like qr).

@ronisbr (Contributor, Author) commented Jun 6, 2024

Thank you very much @mateuszbaran and @fredrikekre for the help adding this feature to StaticArrays.jl :)

@mateuszbaran (Collaborator) commented:

Thank you for your contribution 🙂
