Avoid underflow and overflow in norm() #975
Conversation
src/linalg.jl (Outdated)
```julia
zero_a = _init_zero(a)
aₘ = maxabs_nested(a)
if iszero(aₘ)
    return zero_a
else
    @inbounds return aₘ * sqrt($expr)
```
This always takes the max, and then for nonzero norm always computes with the division. How does this compare to the speed of the existing code?
The alternative strategy would be just to compute the norm, and only if that is zero or infinite, worry about scaling. More work in the worst case but hopefully most cases in real use will be neither huge nor tiny.
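A minimal sketch of that alternative (names invented here; this is not code from the PR):

```julia
# Fast path first: compute the naive norm, and rescale only when the
# result suggests underflow or overflow.
function norm_fastpath(x)
    n = sqrt(sum(abs2, x))
    if iszero(n) || isinf(n)
        m = maximum(abs, x)                         # scale factor
        (iszero(m) || isinf(m)) && return float(m)  # truly zero, or contains Inf
        return m * sqrt(sum(t -> abs2(t / m), x))
    end
    return n
end
```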
Always taking the max and dividing by it is what is done in LinearAlgebra to avoid overflow and underflow; see JuliaLang/julia#1861.
Yes, the PR I linked wants to change that.
I timed things:
```julia
julia> @btime norm($(@SVector rand(10)));  # tagged version
  1.833 ns (0 allocations: 0 bytes)

julia> @btime sqrt(sum(abs2, $(@SVector rand(10))));
  1.833 ns (0 allocations: 0 bytes)

julia> @btime norm_gen($(@SVector rand(10)));  # this PR
  23.971 ns (0 allocations: 0 bytes)  # (fixed, maybe)

julia> @btime mapreduce(abs, max, $(@SVector rand(10)));
  13.652 ns (0 allocations: 0 bytes)

julia> @btime @fastmath mapreduce(abs, max, $(@SVector rand(10)));
  2.708 ns (0 allocations: 0 bytes)
```
Here `@fastmath` will get `Inf`, `NaN` and `-0.0` wrong, but if it can be safely used, it helps a lot.
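For instance, a hedged illustration of the `NaN` case (`Base.FastMath` lowers `max` to a plain comparison):

```julia
julia> max(NaN, 1.0)             # Base's max propagates NaN
NaN

julia> @fastmath max(NaN, 1.0)   # fast-math max is ifelse(x > y, x, y), so NaN is dropped
1.0
```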
In the current Julia master, the relevant code is here.
Yes. And it is quite slow, which JuliaLang/julia#43256 might fix.
```julia
julia> @btime LinearAlgebra.generic_norm2($(randn(10)));
  20.269 ns (0 allocations: 0 bytes)

julia> @btime sqrt(sum(abs2, $(randn(10))));
  5.208 ns (0 allocations: 0 bytes)

# longer
julia> @btime LinearAlgebra.generic_norm2($(randn(10^4)));
  15.750 μs (0 allocations: 0 bytes)

julia> @btime sqrt(sum(abs2, $(randn(10^4))));
  2.083 μs (0 allocations: 0 bytes)

julia> @btime norm($(randn(10^4)));  # BLAS
  12.416 μs (0 allocations: 0 bytes)
```
But I understand your concern. It turns out that `norm(::StaticArray)` is slower than `norm(::AbstractArray)` for the same array size:
```julia
julia> v = rand(10); @btime norm($v);
  20.800 ns (0 allocations: 0 bytes)

julia> sv = @SVector rand(10); @btime norm($sv);
  28.727 ns (0 allocations: 0 bytes)
```
I will experiment with your suggestion.
Sorry, I didn't know that the PR you Xref'ed was `julia`'s; I thought it was `StaticArrays`'s PR. Now what you wrote makes much more sense to me.
Just pushed a change that performs scaling only when the norm is `0` or `Inf`. Now the benchmark is much better:
```julia
julia> v = rand(10); @btime norm($v);
  21.335 ns (0 allocations: 0 bytes)

julia> sv = @SVector rand(10); @btime norm($sv);
  5.730 ns (0 allocations: 0 bytes)
```
src/linalg.jl (Outdated)
```julia
return m
end

@generated function maxabs_nested(a::StaticArray)
```
Do you see speedups from this generated function? Or perhaps it serves another purpose?
```julia
julia> for N in [3, 10, 30]
           @show N
           x = @SVector randn(N)
           @btime maximum(abs, $x)
           @btime maxabs_nested($x)    # PR, without generated path
           @btime maxabs_nested_g($x)  # with
       end
N = 3
  2.458 ns (0 allocations: 0 bytes)
  2.417 ns (0 allocations: 0 bytes)
  2.416 ns (0 allocations: 0 bytes)
N = 10
  13.652 ns (0 allocations: 0 bytes)
  13.096 ns (0 allocations: 0 bytes)
  13.652 ns (0 allocations: 0 bytes)
N = 30
  77.830 ns (0 allocations: 0 bytes)
  73.826 ns (0 allocations: 0 bytes)
  77.786 ns (0 allocations: 0 bytes)
```
Sorry, I didn't do a performance comparison on this; I merely thought that a `@generated` version would always be better for `StaticArrays`. Now that I think of it, the `@generated` version of `maxabs_nested()` wouldn't be necessary, as the function is non-allocating even without `@generated`.

If the current push passes all CI tests, I will remove the `@generated` version of `maxabs_nested()`. Thanks!
Might even be simpler just to write `mapreduce(abs, max, x)`. Or `normInf`, I think. Unless you have a compelling reason to write it out for this case.
`mapreduce(abs, max, x)` was my initial approach, but it didn't pass the unit tests for nested arrays `x`.

`normInf()` sounds great. I will use it and remove `maxabs_nested()`.
Just pushed a change that uses `normInf()` instead of `maxabs_nested()`. Please let me know if you have any other suggestions!

The PR is failing tests on Julia nightly (1.8):

```julia
julia> @code_warntype LinearAlgebra.generic_normInf(a)
MethodInstance for LinearAlgebra.generic_normInf(::SVector{2, SVector{2, Int64}})
  from generic_normInf(x) in LinearAlgebra at /Users/wsshin/packages/julia/julia-master/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/generic.jl:454
Arguments
  #self#::Core.Const(LinearAlgebra.generic_normInf)
  x::SVector{2, SVector{2, Int64}}
Body::Any
1 ─ %1 = LinearAlgebra.mapreduce(LinearAlgebra.norm, LinearAlgebra.max, x)::Any
│   %2 = LinearAlgebra.float(%1)::Any
└── return %2
```

I guess this problem will eventually be fixed in the official Julia 1.8 release, but if this is a concern, maybe we should roll back to using `maxabs_nested()`. @mcabbott, please advise.
Re `normInf`, I just wondered how much code could be shared; not sure that it can be, though. However, it might be possible to do better.
As I wrote earlier, `mapreduce(abs, max, x)` doesn't pass the unit tests for nested arrays. To solve this problem, I think you are suggesting to use `mapreduce(norm, max, x)`, but that is type-unstable on nightly:

```julia
julia> VERSION
v"1.8.0-DEV.1202"

julia> a = SA[SA[1,2], SA[3,4]]; @inferred mapreduce(norm, max, a)
ERROR: return type Float64 does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[11]:1
```

I tried a few workarounds without success. This is why I think defining our own `maxabs_nested()` is necessary.
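For reference, a hedged sketch of what such a nested reduction could look like (the PR's actual `maxabs_nested()` may differ):

```julia
using StaticArrays

# Recurse into nested static arrays; take abs only at the scalar leaves.
maxabs_nested(a::Number) = abs(a)
maxabs_nested(a::StaticArray) = mapreduce(maxabs_nested, max, a)
```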
I don't see this inference failure, BTW.
I think the result is different because the versions we ran differ; I've just pushed the version I am currently using, so please verify whether you get the instability.

By the way, the behavior is very strange: the instability depends on the order of executing `norm(a)` and `mapreduce(norm, @fastmath(max), a)`. Start a fresh Julia session and execute `norm(a)` first; both calls fail to infer:

```julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-DEV.1202 (2022-01-02)
 _/ |\__'_|_|_|\__'_|  |  Commit fa63a00672 (0 days old master)
|__/                   |

julia> using StaticArrays, LinearAlgebra, Test
[ Info: Precompiling StaticArrays [90137ffa-7385-5640-81b9-e52037218182]

julia> a = SA[SA[1,2], SA[3,4]]
2-element SVector{2, SVector{2, Int64}} with indices SOneTo(2):
 [1, 2]
 [3, 4]

julia> @inferred norm(a)  # error
ERROR: return type Float64 does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[3]:1

julia> @inferred mapreduce(norm, @fastmath(max), a)  # error
ERROR: return type Float64 does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[4]:1
```

On the other hand, restart Julia and execute `mapreduce(norm, @fastmath(max), a)` first; then both calls infer fine:

```julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-DEV.1202 (2022-01-02)
 _/ |\__'_|_|_|\__'_|  |  Commit fa63a00672 (0 days old master)
|__/                   |

julia> using StaticArrays, LinearAlgebra, Test

julia> a = SA[SA[1,2], SA[3,4]]
2-element SVector{2, SVector{2, Int64}} with indices SOneTo(2):
 [1, 2]
 [3, 4]

julia> @inferred mapreduce(norm, @fastmath(max), a)  # no error
5.0

julia> @inferred norm(a)  # no error
5.477225575051661
```

I don't understand this behavior. Any insights?
@mcabbott, if you don't have any other solution to this instability, I would like to re-introduce `maxabs_nested()`.
Haven't had a chance to look; generated functions are always a pain, but the load-order dependence is pretty surprising. But the type instability involves `mapreduce`.
I just wanted to echo the point that was also made in #913 (comment) by @andyferris, namely that accuracy should not be sacrificed lightly for speed. I.e., personally, I think it is more important that `norm` is correct in these edge cases than that it is maximally fast.
I think the timing suggested by that comment is like this:

```julia
julia> fastnorm(x) = sqrt(sum(abs2, x));

julia> safenorm(x, y=fastnorm(x)) = iszero(y) ? zero(maximum(abs, x)) : y;
       # not correct, just a branch and a slow function?

julia> xs = [@SVector(rand(10)) for _ in 1:100];

julia> @btime fastnorm.($xs);
  210.360 ns (1 allocation: 896 bytes)

julia> @btime safenorm.($xs);
  236.366 ns (1 allocation: 896 bytes)

julia> xs = [@SVector(rand(3)) for _ in 1:1000]; ys = rand(1000);

julia> @btime $ys .= fastnorm.($xs);
  396.348 ns (0 allocations: 0 bytes)

julia> @btime $ys .= safenorm.($xs);
  472.367 ns (0 allocations: 0 bytes)

julia> @btime $ys .= maximum.(abs, $xs);
  1.333 μs (0 allocations: 0 bytes)
```

So in the best case, with all numbers of order 1, the cost of adding a branch to check for underflow can be 20%. Actually taking the maximum (which is what LinearAlgebra currently does, always) is much slower.
@mcabbott, thanks for the benchmarks. Is an extra cost of 20% for branching too much, or is it something we can tolerate?
That's above my pay grade; I just thought this question ought to have actual numbers attached. Not certain these are the right ones. Would be good to check whether they differ on other computers, too.
The extra 20% of computation time could be significant to some users, so we should probably keep the current fast implementation. However, there are applications where a more accurate norm is desirable in spite of the extra computation time (e.g., automatic differentiation of a function involving `norm`). I wonder if we could define some macro for this purpose. I am not familiar with metaprogramming, but is it easy to define such a macro?
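Such a macro might look roughly like this (every name below is invented for illustration; this is not code from the PR):

```julia
# Scaling-safe norm used as the rewrite target.
accurate_norm(x) = (m = maximum(abs, x);
                    iszero(m) ? float(m) : m * sqrt(sum(t -> abs2(t / m), x)))

# Walk an expression tree and swap `norm` for the accurate variant.
replace_norm(ex) = ex
replace_norm(s::Symbol) = s === :norm ? :accurate_norm : s
replace_norm(ex::Expr) = Expr(ex.head, map(replace_norm, ex.args)...)

macro accuratenorm(ex)
    esc(replace_norm(ex))
end

# usage: @accuratenorm norm(SA[0, 1e-180])  # calls accurate_norm instead
```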
20% could be a reasonable price to pay, though, at least in my view: my original comment was based mainly on some of the earlier benchmarking from this thread.
Note that I don't think the AD issue is underflow; it's just that the slow path designed to handle underflow happens to produce different results with ForwardDiff, as I think I tried to say here: JuliaDiff/ForwardDiff.jl#547 (comment). That may change; I guess you could test it.
I think it would be better to just have a single `norm` implementation rather than two variants. I think the crucial question is how much a good implementation of an under/overflow-checked `norm` actually costs.
I am not able to find an easy way to avoid the type instability mentioned in #975 (comment), so I am re-introducing `maxabs_nested()`. Here is the benchmark result:

```julia
julia> x = @SVector rand(3)
3-element SVector{3, Float64} with indices SOneTo(3):
 0.2000072904640665
 0.1841830247591485
 0.029486864461234608

julia> @btime norm($x)  # with present PR
  2.851 ns (0 allocations: 0 bytes)
0.27348816797799824

julia> @btime norm($x)  # with master
  2.328 ns (0 allocations: 0 bytes)
0.27348816797799824

julia> x = @SVector rand(10);

julia> @btime norm($x)  # with present PR
  4.874 ns (0 allocations: 0 bytes)
2.337172993625254

julia> @btime norm($x)  # with master
  4.347 ns (0 allocations: 0 bytes)
2.337172993625254
```

The present PR's `norm` is about 0.5 ns slower than master's in these cases.
Can this PR be approved? The only failing tests are for Julia nightly on ubuntu-latest (x86), but they don't seem to be caused by the changes made in the present PR.
Bump.
LGTM! Could you bump the version to v1.5.13?

I will merge this PR in a few days if there are no objections.
Well, this is really bad for performance. I guess I have to copy the old code of `norm` into my project to avoid this performance penalty.
Maybe you could include some benchmark comparisons here if you think it ought to be rolled back?
See #1132. You can just check out https://github.com/ufechner7/KiteModels.jl and run `Pkg.test()`, then add `StaticArrays#master` and run the tests again... I am not saying it should be rolled back; I am only saying that if you introduce such a breaking change, there should be a documented way to achieve the old behavior without having to install an old version of the package.
This is not a breaking change in any semver sense, though. Additionally, I don't think 12-15% drops for methods that should only very rarely be compute bottlenecks - like `norm` - are a big deal. The PR above sat unmerged for ~1 year without anyone raising serious objections to that level of performance regression. As you can see from the comments, consideration was given to allowing both the old and new versions to live side by side, e.g., via a macro.
Well, in the past I told people: if you want high performance, use StaticArrays.jl. Now I have to tell them: if you want high performance, write your own linear algebra library... Not nice.
@ufechner7, could you point out which unit tests of your KiteModels regress a lot due to this merge? That will help us have a more productive discussion.
Well, you can use a very simple benchmark:

```julia
using StaticArrays, LinearAlgebra, BenchmarkTools

const KVec3 = MVector{3, Float64}
norm2(vec3) = sqrt(vec3[1]^2 + vec3[2]^2 + vec3[3]^2)

@benchmark norm2(vec3) setup=(vec3 = KVec3(10*rand(3)))
# 2.668 ns ± 0.573 ns   StaticArrays 1.5.2
# 3.020 ns ± 0.579 ns   StaticArrays#master
```

Or you can run https://github.com/ufechner7/KiteModels.jl/blob/main/test/bench3.jl or https://github.com/ufechner7/KiteModels.jl/blob/main/test/bench4.jl, but they are pretty large...
As far as I know, StaticArrays.jl has never been about "best performance possible" - more like a middle ground between execution speed, accuracy, and ease of use, skewed towards execution speed. https://github.com/JuliaSIMD/LoopVectorization.jl can often produce faster code even for small arrays. Different people prefer different trade-offs in this space. By the way, I think we could have a faster, non-checking variant as well.
@ufechner7, if you would like to recover the performance of the old behavior while sacrificing accuracy, I think one way is to use `√sum(abs2, v)` directly:

```julia
julia> v = @SVector rand(3);

julia> @btime norm($v);
  4.299 ns (0 allocations: 0 bytes)

julia> @btime √sum(abs2, $v);
  4.111 ns (0 allocations: 0 bytes)
```

Hope this is a reasonable solution for you. But as others have pointed out, I don't think the new `norm` should be a bottleneck in most real applications.
This PR fixes #913 by avoiding underflow in `norm()`. In the referenced issue, `norm(SA[0, 1e-180])` generates `0.0` instead of the accurate value `1e-180`, because squaring `1e-180` inside `norm()` produces `1e-360`, whose exponent is less than the smallest exponent allowed by the IEEE double-precision floating-point standard: `2^-1022 ≈ 1e-308`.

The standard practice to avoid this underflow is to pre-normalize the vector by its largest-magnitude entry, calculate the norm, and then undo the normalization outside `norm()`:
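(A sketch of the idea; the PR's actual code differs in details.)

```julia
using StaticArrays

# With m = maximum(abs, x), ‖x‖₂ = m * sqrt(Σᵢ (xᵢ/m)²),
# so no intermediate square can underflow (or overflow).
function scaled_norm(x)
    m = maximum(abs, x)
    iszero(m) && return float(m)  # all-zero vector
    return m * sqrt(sum(t -> abs2(t / m), x))
end

scaled_norm(SA[0, 1e-180])  # ≈ 1e-180, where the naive sqrt(sum(abs2, x)) gives 0.0
```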
This PR implements the above procedure for `StaticArrays`. This PR also fixes JuliaDiff/ForwardDiff.jl#547.