-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
svd allocates even with small matrices #1255
Comments
Hi! This is a complicated problem. So far no one has implemented the right algorithm in StaticArrays.jl. See here for the 3x3 case for example: https://pages.cs.wisc.edu/%7Esifakis/papers/SVD_TR1690.pdf . If you want to work on it, I could review your code but I don't have time to work on it myself. |
Thanks @mateuszbaran ! Implementing state-of-the-art algorithm is indeed very difficult (I was reading BLAS code). However, since |
Just for comparison, I implemented a QR-based SVD decomposition and it was 50% faster than the current |
Eigen uses the complete ortogonal decomposition to compute pseudo-inverse, which seems a little better than computing SVD for this application. However, it requires pivoted QR, which StaticArrays also uses the algorithm in Now, let's see how difficult it is to implement pivoted QR here. |
On second thoughts, although having a Julia version of most BLAS functions would be awesome, it is really an Herculean task. However, since Julia has such a good interface with BLAS ( Since some version of Julia, the compiler became smart enough to avoid all allocations using function blas_svd(A::StaticMatrix{N, M, T}) where {N, M, T}
K = min(N, M)
# Convert the input to a `MMatrix` and allocate the
Am = MMatrix{N, M, T}(A)
Um = MMatrix{N, K, T}(undef)
Sm = MVector{K, T}(undef)
Vtm = MMatrix{M, K, T}(undef)
work = MVector{1_000, Float64}(undef)
lwork = -1
iwork = MVector{8K, Int64}(undef)
info = Ref(1)
# Call the function to obtain the optimum size of the work array.
ccall(
(BLAS.@blasfunc(dgesvd_), BLAS.libblas),
Cvoid,
(
Ref{UInt8},
Ref{UInt8},
Ref{BLAS.BlasInt},
Ref{BLAS.BlasInt},
Ptr{T},
Ref{BLAS.BlasInt},
Ptr{T},
Ptr{T},
Ref{BLAS.BlasInt},
Ptr{T},
Ref{BLAS.BlasInt},
Ptr{T},
Ref{BLAS.BlasInt},
Ref{BLAS.BlasInt},
),
'S',
'S',
N,
M,
Am,
N,
Sm,
Um,
N,
Vtm,
M,
work,
lwork,
info
)
# Obtain the optimal work array size and compute the SVD.
lwork = round(BLAS.BlasInt, work[1])
ccall(
(BLAS.@blasfunc(dgesvd_), BLAS.libblas),
Cvoid,
(
Ref{UInt8},
Ref{UInt8},
Ref{BLAS.BlasInt},
Ref{BLAS.BlasInt},
Ptr{T},
Ref{BLAS.BlasInt},
Ptr{T},
Ptr{T},
Ref{BLAS.BlasInt},
Ptr{T},
Ref{BLAS.BlasInt},
Ptr{T},
Ref{BLAS.BlasInt},
Ref{BLAS.BlasInt},
),
'S',
'S',
N,
M,
Am,
N,
Sm,
Um,
N,
Vtm,
M,
work,
lwork,
info
)
# Convert the matrices to static arrays and return.
U = SMatrix{N, K, T}(Um)
S = SVector{K, T}(Sm)
Vt = SMatrix{M, K, T}(Vtm)
return U, S, Vt'
end The result is the same as using julia> A = @SMatrix randn(6, 6)
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
-0.981944 -1.12979 0.310915 0.179723 -0.667651 0.182524
1.09949 0.522666 -0.0878819 -0.0383786 -1.17082 -0.219113
-1.07534 0.961367 -1.12684 -0.23116 -0.0602622 -1.38193
0.0982899 -0.678585 -2.14742 0.164939 -1.12818 1.33526
-0.60989 1.2469 -0.238621 -0.237353 0.695344 -0.514536
-1.39276 0.353875 -0.551829 -0.283986 0.207744 -1.20008
julia> U1, S1, V1 = svd(A);
julia> U2, S2, V2 = blas_svd(A);
julia> U1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
0.127537 0.115343 0.695209 0.470797 0.492761 -0.150506
0.155633 0.0389598 -0.624687 0.696828 0.231023 0.212336
-0.542549 0.463331 -0.138643 0.219714 -0.173881 -0.627084
0.479565 0.825676 -0.0673256 -0.264451 0.0609473 0.100476
-0.432199 0.0425817 -0.20474 -0.379913 0.784298 0.100079
-0.498372 0.294871 0.246531 0.173895 -0.233973 0.720358
julia> U2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
0.127537 0.115343 0.695209 0.470797 0.492761 -0.150506
0.155633 0.0389598 -0.624687 0.696828 0.231023 0.212336
-0.542549 0.463331 -0.138643 0.219714 -0.173881 -0.627084
0.479565 0.825676 -0.0673256 -0.264451 0.0609473 0.100476
-0.432199 0.0425817 -0.20474 -0.379913 0.784298 0.100079
-0.498372 0.294871 0.246531 0.173895 -0.233973 0.720358
julia> V1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
0.47655 -0.331218 -0.695862 0.0226213 -0.422302 0.0109475
-0.47389 -0.0241441 -0.658351 -0.137571 0.559031 -0.0998872
-0.00418107 -0.872921 0.233316 0.290965 0.312066 0.038845
0.135771 -0.0163176 0.071462 0.00309199 0.0228163 -0.987756
-0.344044 -0.355242 0.125355 -0.780332 -0.359093 -0.0430901
0.641491 0.0354597 0.0842224 -0.535699 0.531311 0.104279
julia> V2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
0.47655 -0.331218 -0.695862 0.0226213 -0.422302 0.0109475
-0.47389 -0.0241441 -0.658351 -0.137571 0.559031 -0.0998872
-0.00418107 -0.872921 0.233316 0.290965 0.312066 0.038845
0.135771 -0.0163176 0.071462 0.00309199 0.0228163 -0.987756
-0.344044 -0.355242 0.125355 -0.780332 -0.359093 -0.0430901
0.641491 0.0354597 0.0842224 -0.535699 0.531311 0.104279
julia> S1
6-element SVector{6, Float64} with indices SOneTo(6):
3.429126570487767
2.790190909977814
2.077297239613754
1.3752132074075802
0.4483791256059501
0.10325942415912782
julia> S2
6-element SVector{6, Float64} with indices SOneTo(6):
3.429126570487767
2.790190909977814
2.077297239613754
1.3752132074075802
0.4483791256059501
0.10325942415912782 Now, let's analyze the allocation and execution time for different matrix sizes between the two approaches:
Notice that we have a noticeable gain with small matrices without any allocation (my use case). @mateuszbaran If I can improve the algorithm to work for all matrix sizes, would this approach be useful for StaticArrays.jl? For my project, I will implement this type of function to compute the |
Hmm, that's a significant amount of additional code with relatively limited gains. I'd prefer to have a second opinion if it's worth it. Directly calling BLAS would have to be done a bit more carefully than your
Querying doesn't make sense if you don't allocate the requested amount of memory on the heap. You have a fixed size there (1000 elements) and that what you have to provide. I'd suggest, though, a different approach: using a formula from LAPACK docs (MAX(3MIN(M,N)+MAX(M,N),5MIN(M,N)) or something larger), see here: https://docs.oracle.com/cd/E19422-01/819-3691/dgesvd.html . I'd also suggest using |
Hi @mateuszbaran! Thanks for the tips.
I see. Maybe I will create a package called StaticArrayBlasInterface to add those algorithms (important to my use) and after we can decide if we migrate them to here. |
Hi @mateuszbaran! I finished the initial code of the package (still needs polishing and error verification): https://github.com/ronisbr/StaticArraysBlasInterfaces.jl It is working perfectly for I also opened a thread in Discourse to discuss more about the implementation: Thanks! |
Hi! |
I think it is fine to include it here. The BLAS interface file in LinearAlgebra have basically not changed since the start and I expect the same for this new code so shouldn't require any maintenance. It is a strict improvement and StaticArrays already depend on LinearAlgebra too. |
OK, then let's put it in StaticArrays.jl. |
Would be good to at least measure the impact on |
Perfect! I will submit the PR and perform all the analysis. |
Done @mateuszbaran and @fredrikekre ! I submitted the PR for review. If it got merged, I can also implement the same approach for other functions that allocates due to call functions in Julia Base (like |
Thank you very much @mateuszbaran and @fredrikekre for the help adding this feature to StaticArrays.jl :) |
Thank you for your contribution 🙂 |
Hi!
When computing the SVD of a static matrix, we see some allocations:
which is probably caused because the algorithm just converts the input to
Matrix
and call thesvd
inBase
.The problem here is that
svd
is used to computepinv
that is somewhat very used when inverting matrices for embedded Kalman filters.I am pursing to embed a full satellite attitude control algorithm written entirely in Julia. However, I must avoid all allocations. In this specific case, I have not yet found a workaround to avoid allocations. Can anyone help me?
The text was updated successfully, but these errors were encountered: