Rounding errors in Normed to Float conversions #129

Closed · kimikage opened this issue Oct 12, 2019 · 26 comments

While I was reviewing PR #123, I found a minor bug.

julia> float(3N3f5) # this is ok
3.0f0

julia> float(3N4f12)
3.0000002f0

julia> float(3N8f8)
3.0000002f0

julia> float(3N10f6)
3.0000002f0

julia> float(3N12f4)
3.0000002f0

julia> float(3N13f3)
3.0000002f0

julia> float(3N2f6)
3.0000002f0

julia> float(3N4f4)
3.0000002f0

julia> float(3N5f3)
3.0000002f0

I think this is a problem with the constant-division optimization, which replaces the division by a multiplication with the rounded reciprocal.

function (::Type{T})(x::Normed) where {T <: AbstractFloat}
y = reinterpret(x)*(one(rawtype(x))/convert(T, rawone(x)))
convert(T, y) # needed for types like Float16 which promote arithmetic to Float32
end

The rounding errors can occur anywhere in the range, but the errors at integer values are especially critical.

julia> isinteger(float(3N5f3))
false
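
To see the root cause (my illustration, using N5f3, whose rawone is 2^3 - 1 == 7): the reciprocal 1/7 is not exactly representable in Float32, so multiplying by the rounded reciprocal accumulates two rounding errors, while a direct division rounds only once.

julia> r = reinterpret(3N5f3) # raw value: 3 * rawone(N5f3)
0x15

julia> r * (1.0f0 / 7) # the optimized path above
3.0000002f0

julia> r / Float32(7) # a single correctly rounded division
3.0f0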
@kimikage (Collaborator, Author) commented Oct 13, 2019

I was underestimating this issue. It is not easy to fix without a performance regression.

WIP:

function (::Type{T})(x::Normed) where {T <: AbstractFloat}
    # The following optimization for constant division may cause rounding errors.
    # y = reinterpret(x)*(one(rawtype(x))/convert(T, rawone(x)))
    # Therefore, we use a simple form here.
    y = reinterpret(x) / convert(promote_type(T, floattype(x)), rawone(x))
    convert(T, y)  # needed for types like Float16 which promote arithmetic to Float32
end
function (::Type{T})(x::Normed{Ti,f}) where {T <: Union{Float32, Float16}, Ti <: Union{UInt8, UInt16}, f}
    f == 1 && return convert(T, reinterpret(x))
    r = Int64(reinterpret(x))
    s = UInt64(1)<<48
    k = Int64(s ÷ rawone(Normed{Ti,f})) # since `rawone()>=3>2`, `r*k` is always less than 2^63.
    y = Float32(r * k) * (1.0f0 / s)
    convert(T, y)
end
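
A quick sanity check (mine, assuming the WIP methods above are in effect): for N8f8, k == (2^48 - 1) ÷ 255, so reinterpret(3N8f8) * k == 3 * (2^48 - 1), which rounds to exactly 3 * 2^48 in Float32, and the final multiplication by 1.0f0/2^48 is exact.

@assert Float32(3N8f8) === 3.0f0
@assert isinteger(float(3N5f3))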

The significand of Float32 is 23 (+1) bits long, so using Int64 for the intermediate variable may seem excessive. However, I think the default rounding mode (round half to even) may require some bit hacks, and as a result, some conditional branches may be generated in the native code.

julia> Float32(0x01010101) == 0x1.010100p24 # I would like to round half up (`0x1.010102p24`), though.
true

Fortunately, all x86_64 CPUs have CVTSI2SS xmm, r64, which converts one Int64 value to one Float32 value and is not particularly slow. Most modern x86_64 CPUs have the AVX version VCVTSI2SS. FYI, VCVTUSI2SS xmm, xmm, r64, which converts one UInt64 value, is in the AVX-512F instruction set.
However, on other CPUs I cannot predict what the performance will be. In particular, these methods will be inlined, so the performance will depend greatly on SIMD optimization.

Of course, for Normed{UInt8} alone, lookup tables may be attractive.
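
For instance, a minimal sketch of a table-based N0f8 -> Float32 conversion (hypothetical; the table and its name are my invention, not the package's code):

const N0F8_TO_FLOAT32 = Float32[i / 255f0 for i in 0:255]  # 256 correctly rounded quotients

Base.Float32(x::N0f8) = @inbounds N0F8_TO_FLOAT32[Int(reinterpret(x)) + 1]

(As it turns out later in this thread, a const table interacts badly with the inlining cost model, so this is not a free lunch.)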

kimikage changed the title from "Rounding errors in Normed to Float32 conversions" to "Rounding errors in Normed to Float conversions" on Oct 14, 2019

@kimikage (Collaborator, Author) commented Oct 14, 2019

Although the conversions from integer-valued Normed numbers to Float64 are OK, in some fractional cases the rounding errors can be noticeable.

julia> Float64(7.4N4f4)
7.3999999999999995

julia> Float64(reinterpret(7.4N4f4) / 15)
7.4

The results of the following method agree with the ideal values, but I think the method itself is not so good. (see Edit2)

function Base.Float64(x::Normed{Ti,f}) where {Ti <: Union{UInt8, UInt16}, f}
    f == 1 && return Float64(reinterpret(x))
    r = Int64(reinterpret(x))
    s1 = UInt64(1)<<((53 - 16 + f - 1)÷f * f)
    s2 = UInt64(1)<<(((53 - 16)÷f + 1) * f)
    k = Int64(s1 ÷ rawone(Normed{Ti,f}))
    f64 = Float64(r * k)
    f64 * (1.0 / s1) + f64 * (1.0 / s1 / s2)
end

Edit:
Specialization for Normed{UInt8} -> Float64. 🎃

function Base.Float64(x::Normed{UInt8,f}) where f
    f == 1 && return Float64(reinterpret(x))
    r = Int64(reinterpret(x))
    s = UInt64(1)<<(64 - 8 - 2 + f)
    k = Int64(s ÷ rawone(Normed{UInt8,f}))
    i64 = r * k
    i32 = unsafe_trunc(UInt32, i64) >> (f*2)
    Float64((i64>>12<<12) | i32) * (1.0 / s)
end

Edit2:
I don't know why, but I misunderstood. The FMA part in the original method can be replaced with a single multiply. 👻

function Base.Float64(x::Normed{Ti,f}) where {Ti <: Union{UInt8, UInt16}, f}
    f == 1 && return Float64(reinterpret(x))
    r = Int64(reinterpret(x))
    s = UInt64(1)<<(((53 - 16)÷f + 1) * f)
    k = Int64(s ÷ rawone(Normed{Ti,f}))
    Float64(r * k) * (1.0 / s + 1.0 / s / s)
end
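
Why the combined constant works (my derivation, not from the PR): s is a power of two whose exponent is a multiple of f, so s - 1 is divisible by rawone == 2^f - 1 and k * rawone == s - 1 exactly. Hence r/rawone == r*k/(s - 1) == r*k * (1/s + 1/s^2 + 1/s^3 + ...), and truncating the series after the 1/s^2 term leaves a relative error of about 1/s^2 <= 2^-76, far below Float64's 2^-53.

let f = 4, s = UInt64(1) << (((53 - 16) ÷ f + 1) * f)
    k = s ÷ UInt64(2^f - 1)
    @assert k * UInt64(2^f - 1) == s - 1  # k * rawone == s - 1, exactly
end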

In the case of Normed{UInt32,f} -> Float64 where f <= 8, a similar technique (with 1.0/s + 1.0/s/s + 1.0/s/s/s) may be available. (not tested)
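
An untested sketch of that three-term variant (my transcription of the pattern above onto UInt32; the shift width is my assumption):

function Base.Float64(x::Normed{UInt32,f}) where f  # assuming 2 <= f <= 8
    r = Int64(reinterpret(x))
    s = UInt64(1)<<(((53 - 32)÷f + 1) * f)
    k = Int64(s ÷ rawone(Normed{UInt32,f}))  # r * k stays below 2^53
    Float64(r * k) * (1.0 / s + 1.0 / s / s + 1.0 / s / s / s)
end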

@kimikage (Collaborator, Author) commented Oct 16, 2019

In the case of Normed{UInt8,f} -> Float32 where f < 8 (i.e. unlike the case of Normed{UInt16}), the length of the intermediate variable can be reduced to 32 bits as follows:

function Base.Float32(x::Normed{UInt8,2})
    r = Int32(reinterpret(x))
    s = 0b0100_0000_0000_0000 
    k = Int32(s ÷ rawone(Normed{UInt8,2}))
    Float32(r * k) * (1.0f0 / s + 1.0f0 / s / s)
end

However, on x86_64 CPUs with the AVX/AVX2 extensions, the Int32 version is slightly slower. In other words, although the modified version has an additional integer multiply, the Int64 version is slightly faster than the current low-accuracy version, which also internally uses Int32.

I do not understand what is happening (what causes the stall). My knowledge of SIMD optimization mostly ends at the Pentium III. 😝 The trouble is that the explicit conversion Int64(reinterpret(x)) is removed by the optimizer, and that results in a stall. I think this is a primitive (or compiler-related) issue.

Edit:
It appears that the cause was the benchmark script. Actually, the net latency values are almost the same and, as mentioned above, depend strongly on the optimization. Simple loops in particular may be optimized drastically, so it is difficult to judge superiority or inferiority from simple @benchmark results. On the other hand, suitability for optimization is also an important indicator of code quality. So, I will run more practical benchmarks.

Edit2:
ugly...

function (::Type{T})(x::Normed{UInt8,f}) where {T <: Union{Float32, Float16}, f}
    f == 1 && return T(reinterpret(x))
    if f == 8
        k = Int32(0xF0F1) # 0xF0F1 * 0xFF < 2^24
        return T(Float32(Int32(reinterpret(x)) * k) * Float32(0x1.1111p-24))
    end
    s = UInt32(1)<<((f * 3÷2 + 11)÷f * f)
    k = Int32(s ÷ rawone(Normed{UInt8,f}))
    T(Float32(Int32(reinterpret(x)) * k) * (1.0f0 / s + 1.0f0 / s / s))
end

@kimikage (Collaborator, Author) commented Oct 19, 2019

Implementation status

FYI, the results on Julia v1.3.0-rc4 seem to be the same as on Julia v1.2.0.

Julia v1.2.0 x86_64

| from \ to | Float16 | Float32 | Float64 | BigFloat |
|---|---|---|---|---|
| Normed{UInt8} | ✔️ | ✔️ | ✔️ | |
| Normed{UInt16} | ✔️ | ✔️ | ✔️ | |
| Normed{UInt32} | ✔️ | ✔️ | ✔️ | |
| Normed{UInt64} | ✔️ | ✔️ | ✔️ | |
| Normed{UInt128} | ✔️ | ✔️ | ✔️ | |

Edit:
I would never use Normed{UInt128}s for purposes other than software testing, but using BigFloat would be foul play under my game rules. 😄 Although I will also rarely use Normed{UInt64}s, they can have more significand bits than Float64, so they might be useful as extended-precision formats.

Tolerance

The rounding mode is assumed to be RoundNearest. If the global rounding mode is changed, the conversion methods may return incorrect (inaccurate) values.

| from \ to | Float16 | Float32 | Float64 | BigFloat |
|---|---|---|---|---|
| Normed{UInt8} | 0 | 0 | 0 | N/A |
| Normed{UInt16} | 0 | 0 | 0 | N/A |
| Normed{UInt32} | ≈0 | 1 lsb | 1 lsb | N/A |
| Normed{UInt64} | ≈0 | ≈0 | 1 lsb | N/A |
| Normed{UInt128} | ≈0 | ≈0 | 1 lsb | N/A |

"1 lsb" means that prevfloat(expected) <= return_value <= nextfloat(expected).
"≈0" means that the tolerance should be zero, but may be "1 lsb" for some particular bit sequences.

@kimikage (Collaborator, Author) commented Oct 19, 2019

Do you have any ideas for simplifying the following method, and for calculating it quickly and perfectly, where f in (17, 24, 25, 26, 29, 31)?

function Base.Float64(x::Normed{UInt32,f}) where f
    f == 1 && return Float64(reinterpret(x))
    r = Int64(reinterpret(x))
    f ==  2 && return Float64(r * 0x040001) * 0x15555000015555p-72
    f ==  3 && return Float64(r * 0x108421) * 0x11b6db76924929p-75
    f ==  4 && return Float64(r * 0x010101) * 0x11000011000011p-72
    f ==  5 && return Float64(r * 0x108421) * 0x04000002000001p-75
    f ==  6 && return Float64(r * 0x09dfb1) * 0x1a56b8e38e6d91p-78
    f ==  7 && return Float64(r * 0x000899) * 0x0f01480001e029p-70
    f ==  8 && return Float64(r * 0x0a5a5b) * 0x18d300000018d3p-80
    f ==  9 && return Float64(r * 0x001001) * 0x080381c8e3f201p-72
    f == 10 && return Float64(r * 0x100001) * 0x04010000000401p-80
    f == 11 && return Float64(r * 0x000009) * 0x0e3aaae3955639p-66
    f == 12 && return Float64(r * 0x0a8055) * 0x186246e46e4cfdp-84
    f == 13 && return Float64(r * 0x002001) * 0x10000004000001p-78
    f == 14 && return Float64(r * 0x03400d) * 0x13b13b14ec4ec5p-84
    f == 15 && return Float64(r * 0x000259) * 0x06d0c5a4f3a5e9p-75
    f == 16 && return Float64(r * 0x011111) * 0x00f000ff00fff1p-80
    f == 18 && return Float64(r * 0x0b06d1) * 0x17377445dd1231p-90
    f == 19 && return Float64(r * 0x080001) * 0x00004000000001p-76
    f == 20 && return Float64(r * 0x000101) * 0x0ff010ef10ff01p-80
    f == 21 && return Float64(r * 0x004001) * 0x01fff8101fc001p-84
    f == 22 && return Float64(r * 0x002945) * 0x18d0000000018dp-88
    f == 23 && return Float64(r * 0x044819) * 0x07794a23729429p-92
    f == 27 && return Float64(r * 0x000a21) * 0x0006518c7df9e1p-81
    f == 28 && return Float64(r * 0x00000d) * 0x13b13b14ec4ec5p-84
    f == 30 && return Float64(r * 0x001041) * 0x00fc003f03ffc1p-90
    f == 32 && return Float64(r * 0x010101) * 0x00ff0000ffff01p-96

    f == 17 && return Float64(r) * 0x8000400020001p-68 # imperfect
    f == 24 && return Float64(r) * 0x1000001000001p-72 # imperfect
    f == 25 && return Float64(r) * 0x4000002000001p-75 # imperfect
    f == 26 && return Float64(r) * 0x10000004000001p-78 # imperfect
    f == 29 && return Float64(r) * 0x20000001p-58 # imperfect
    f == 31 && return Float64(r) * 0x80000001p-62 # imperfect
end

Edit:
FYI, these magic numbers come from:

using Primes
b = ones(Int128, 32)
#         _____________________________________________________--------------------
b[ 2] = 0b10101010101010101010101010101010101010101010101010101010101010101010101
b[ 3] = 0b1001001001001001001001001001001001001001001001001001001001001001001001001
b[ 4] = 0b100010001000100010001000100010001000100010001000100010001000100010001
b[ 5] = 0b10000100001000010000100001000010000100001000010000100001000010000100001
b[ 6] = 0b1000001000001000001000001000001000001000001000001000001000001000001000001
b[ 7] = 0b1000000100000010000001000000100000010000001000000100000010000001
b[ 8] = 0b1000000010000000100000001000000010000000100000001000000010000000100000001
b[ 9] = 0b1000000001000000001000000001000000001000000001000000001000000001
b[10] = 0b10000000001000000000100000000010000000001000000000100000000010000000001
b[11] = 0b10000000000100000000001000000000010000000000100000000001
b[12] = 0b1000000000001000000000001000000000001000000000001000000000001000000000001
b[13] = 0b100000000000010000000000001000000000000100000000000010000000000001
b[14] = 0b10000000000000100000000000001000000000000010000000000000100000000000001
b[15] = 0b1000000000000001000000000000001000000000000001000000000000001
b[16] = 0b10000000000000001000000000000000100000000000000010000000000000001
#         _____________________________________________________--------------------
b[17] = 0b1000000000000000010000000000000000100000000000000001 #imperfect
b[18] = 0b1000000000000000001000000000000000001000000000000000001000000000000000001
b[19] = 0b1000000000000000000100000000000000000010000000000000000001
b[20] = 0b1000000000000000000010000000000000000000100000000000000000001
b[21] = 0b1000000000000000000001000000000000000000001000000000000000000001
b[22] = 0b1000000000000000000000100000000000000000000010000000000000000000001
b[23] = 0b1000000000000000000000010000000000000000000000100000000000000000000001
b[24] = 0b1000000000000000000000001000000000000000000000001 # imperfect
b[25] = 0b100000000000000000000000010000000000000000000000001 # imperfect
b[26] = 0b10000000000000000000000000100000000000000000000000001 # imperfect
b[27] = 0b1000000000000000000000000001000000000000000000000000001
b[28] = 0b100000000000000000000000000010000000000000000000000000001
b[29] = 0b100000000000000000000000000001 # imperfect
b[30] = 0b1000000000000000000000000000001000000000000000000000000000001
b[31] = 0b10000000000000000000000000000001 # imperfect
b[32] = 0b10000000000000000000000000000000100000000000000000000000000000001
factors = [factor(Vector,b[i]) for i=1:32];

@show f = factors[31] # select
k1 = []
k2 = []
for i=1:2^length(f)
    ka = UInt64(1)
    kb = UInt64(1)
    for k = 0:length(f)-1
       if (i >> k) & 1 != 0
            ka *= f[k+1]
        else
            kb *= f[k+1]
        end
    end
    ka * 0xFFFFFFFF >= 2^53 && continue
    kb >= 2^53 && continue
    if !(ka in k1)
        push!(k1, ka)
        push!(k2, kb)
    end
end
sort!(k1)
reverse!(sort!(k2))
for i =1:length(k1)
    bc = length(replace(bitstring(k1[i]),"0"=>""))
    #bc > 3 && continue
    println(bc, ", ", string(k1[i], base=16),", ", string(k2[i], base=16))
end

Edit2:
My original plan was to use Int64 variables because the MSB of a UInt32 may be 1, and (V)CVTDQ2PS would convert such values to negative floats. However, as commented below, using Int64 seems to make the compiler abandon SIMD vectorization. So, I changed the code into a compiler-friendly (dull) form:

Base.Float64(x::Normed{UInt32, 2}) = (Float64(reinterpret(x)) * 0x040001) * 0x15555000015555p-72
Base.Float64(x::Normed{UInt32, 3}) = (Float64(reinterpret(x)) * 0x108421) * 0x11b6db76924929p-75
Base.Float64(x::Normed{UInt32, 4}) = (Float64(reinterpret(x)) * 0x010101) * 0x11000011000011p-72
...

FYI, to convert a UInt32 value to Float64 on x86_64, the compiler seems to generate code that converts the upper 16-bit word and the lower 16-bit word separately and adds them.
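
A hand-written analogue of that split (my sketch of what the generated code reportedly does, not the compiler's actual output; both halves fit in 16 bits, so every operation is exact):

u32_to_f64_split(x::UInt32) = Float64(x >> 16) * 0x1p16 + Float64(x & 0xFFFF)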

Edit3:
FWIW, there is another way to convert a UInt32 value to Float64.

function u32_to_f64(x::UInt32)
    i32 = unsafe_trunc(Int32, x)
    # Branch-free form of: Float64(i32) + (i32 < 0 ? 0x1p32 : 0.0)
    # (the mask is kept as Int64 so that the `&` with the sign extension,
    # which may be -1, does not throw on signed/unsigned promotion)
    Float64(i32) + reinterpret(Float64, Int64(i32>>31) & reinterpret(Int64, 0x1p32))
end

@kimikage (Collaborator, Author) commented Oct 20, 2019

Benchmark

Script

This script gives us a glimpse of the problems behind #125. The core conversion code should be almost OS-independent, but broadcasting and memory allocation are OS-dependent. In particular, there seems to be a problem with broadcasting on Windows.

using BenchmarkTools
using FixedPointNumbers
struct Vec3{T <: Real}
    x::T; y::T; z::T
end
struct Vec4{T <: Real}
    x::T; y::T; z::T; w::T
end
Vec3{T}(v::Vec3{T}) where {T} = v
Vec3{T}(v::Vec3{U}) where {T, U} = Vec3{T}(v.x, v.y, v.z) 
Vec4{T}(v::Vec4{T}) where {T} = v
Vec4{T}(v::Vec4{U}) where {T, U} = Vec4{T}(v.x, v.y, v.z, v.w) 
Base.rand(::Type{Vec3{T}}) where {T} = Vec3{T}(rand(T), rand(T), rand(T))
Base.rand(::Type{Vec4{T}}) where {T} = Vec4{T}(rand(T), rand(T), rand(T), rand(T))
function Base.rand(::Type{T}, sz::Dims) where {T <: Union{Vec3, Vec4}}
    A = Array{T}(undef, sz)
    for i in eachindex(A); A[i] = rand(T); end
    return A
end
const N0f128 = Normed{UInt128,128}
const N125f3 = Normed{UInt128,3}
Ts = (N0f8, N5f3, N0f16, N13f3, N0f32, N8f24, N29f3, N0f64, N61f3, N0f128, N125f3)
# The impact of the difference in random numbers should be less than that of the CPU throttling or cache.
vec1s = [rand(T,  64 * 64 * 4) for T in Ts];
mat3s = [rand(Vec3{T}, 64, 64) for T in Ts];
mat4s = [rand(Vec4{T}, 64, 64) for T in Ts];
# vec1s = [[rand(T) for i = 1:64 * 64 * 4] for T in Ts]; # workaround for #125

for vec in vec1s
    println(eltype(vec), "-> Float32")
    @btime Float32.(view($vec,:))
end

for vec in vec1s
    println(eltype(vec), "-> Float64")
    @btime Float64.(view($vec,:))
end

for mat in mat3s
    println(eltype(mat), "-> Float32")
    @btime Vec3{Float32}.(view($mat,:,:))
end

for mat in mat3s
    println(eltype(mat), "-> Float64")
    @btime Vec3{Float64}.(view($mat,:,:))
end

for mat in mat4s
    println(eltype(mat), "-> Float32")
    @btime Vec4{Float32}.(view($mat,:,:))
end

for mat in mat4s
    println(eltype(mat), "-> Float64")
    @btime Vec4{Float64}.(view($mat,:,:))
end

The values in the tables below follow the display of @btime, but the number of significant digits is actually about 2.

Julia v1.2.0 x86_64-w64-mingw32

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Vector (unit: μs)

| w64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 40.599 | 37.900 | 39.500 | 40.200 | 37.599 | 40.199 |
| N5f3 | 39.299 | 40.599 | 38.000 | 41.200 | 39.301 | 38.000 |
| N0f16 | 117.300 | 132.499 | 144.900 | 117.399 | 132.099 | 154.099 |
| N13f3 | 117.500 | 132.499 | 144.800 | 120.100 | 132.099 | 154.099 |
| N0f32 | 131.900 | 150.100 | 183.200 | 128.399 | 143.100 | 165.199 |
| N8f24 | 131.900 | 186.900 | 183.299 | 128.400 | 164.899 | 165.100 |
| N29f3 | 131.899 | 186.400 | 183.400 | 128.499 | 143.199 | 165.100 |
| N0f64 | 108.501 | 47.900 | 7.809e3 | 43.800 | 45.600 | 5.868e3 |
| N61f3 | 104.299 | 45.100 | 8.122e3 | 43.301 | 41.800 | 6.158e3 |
| N0f128 | 94.099 | 74.699 | 7.952e3 | 94.199 | 71.600 | 5.673e3 |
| N125f3 | 92.701 | 81.499 | 8.844e3 | 94.300 | 75.700 | 5.870e3 |

The current Normed{UInt64}->Float32 conversions are much slower than Normed{UInt64}->Float64.
The high-accuracy Normed{UInt8}->Float64 conversions are slightly slower than the conversions with simple divisions.

Matrix of Vec3 (unit: μs)

| w64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 2.812 | 3.237 | 3.914 | 4.920 | 5.500 | 8.367 |
| N5f3 | 2.812 | 3.288 | 3.943 | 4.880 | 5.400 | 8.099 |
| N0f16 | 5.033 | 3.657 | 4.243 | 5.120 | 5.640 | 8.300 |
| N13f3 | 4.983 | 3.614 | 4.286 | 5.040 | 5.450 | 8.099 |
| N0f32 | 6.180 | 6.300 | 11.500 | 6.125 | 6.825 | 8.199 |
| N8f24 | 6.200 | 7.867 | 11.200 | 5.975 | 8.100 | 8.501 |
| N29f3 | 6.160 | 7.833 | 11.300 | 6.100 | 6.967 | 8.333 |
| N0f64 | 30.600 | 17.399 | 5.868e3 | 9.301 | 17.299 | 5.950e3 |
| N61f3 | 25.799 | 15.900 | 6.158e3 | 9.600 | 8.200 | 5.971e3 |
| N0f128 | 54.999 | 39.899 | 5.673e3 | 56.499 | 38.200 | 5.963e3 |
| N125f3 | 55.999 | 40.099 | 5.870e3 | 53.800 | 38.300 | 5.870e3 |

The current Normed{UInt16}->Float32 conversions are slower than the conversions with simple divisions.

Matrix of Vec4 (unit: μs)

| w64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 3.300 | 3.814 | 3.914 | 5.400 | 4.499 | 8.299 |
| N5f3 | 3.300 | 3.786 | 4.186 | 4.501 | 5.400 | 8.299 |
| N0f16 | 3.286 | 4.000 | 4.016 | 4.799 | 5.100 | 8.300 |
| N13f3 | 3.372 | 3.800 | 4.028 | 4.400 | 4.800 | 8.200 |
| N0f32 | 4.243 | 4.583 | 8.100 | 5.200 | 5.599 | 8.500 |
| N8f24 | 4.228 | 5.033 | 8.400 | 5.867 | 7.800 | 8.399 |
| N29f3 | 4.129 | 4.933 | 8.367 | 6.000 | 6.600 | 8.299 |
| N0f64 | 41.500 | 13.399 | 7.831e3 | 12.399 | 12.600 | 7.830e3 |
| N61f3 | 41.599 | 13.200 | 8.289e3 | 12.199 | 11.400 | 8.017e3 |
| N0f128 | 66.599 | 38.800 | 7.691e3 | 69.100 | 35.600 | 7.593e3 |
| N125f3 | 66.799 | 44.099 | 8.197e3 | 69.700 | 38.500 | 8.092e3 |

The high-accuracy Normed{UInt8} and Normed{UInt16}->Float32 conversions are slower than the conversions with simple divisions.


Julia v1.2.0 x86_64-pc-linux-gnu (Debian on WSL)

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Vector (unit: μs)

| linux | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 14.100 | 15.800 | 14.600 | 14.500 | 16.000 | 17.600 |
| N5f3 | 14.100 | 16.000 | 14.600 | 14.500 | 16.100 | 16.600 |
| N0f16 | 14.300 | 16.000 | 14.600 | 15.300 | 16.100 | 17.700 |
| N13f3 | 14.200 | 16.000 | 14.600 | 15.600 | 16.100 | 17.500 |
| N0f32 | 15.800 | 21.100 | 18.000 | 15.400 | 16.200 | 17.400 |
| N8f24 | 15.900 | 68.600 | 18.300 | 14.500 | 16.200 | 16.300 |
| N29f3 | 16.200 | 69.400 | 18.100 | 15.000 | 16.200 | 16.300 |
| N0f64 | 74.600 | 30.000 | 2.945e3 | 18.600 | 23.900 | 3.047e3 |
| N61f3 | 74.800 | 24.000 | 3.045e3 | 18.900 | 18.800 | 3.048e3 |
| N0f128 | 59.500 | 58.200 | 6.017e3 | 60.100 | 45.100 | 6.078e3 |
| N125f3 | 59.700 | 53.000 | 6.171e3 | 61.300 | 45.200 | 6.191e3 |

In the cases of Normed{UInt32,f} -> Float32 where 2 <= f <= 24, LLVM (not the Julia front-end) generates a destructive conditional branch.

Matrix of Vec3 (unit: μs)

| linux | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 3.175 | 3.675 | 3.914 | 5.800 | 5.660 | 8.667 |
| N5f3 | 3.300 | 3.413 | 3.914 | 5.650 | 5.675 | 8.633 |
| N0f16 | 5.467 | 3.786 | 4.371 | 5.120 | 6.020 | 8.500 |
| N13f3 | 5.467 | 3.825 | 4.286 | 5.040 | 5.900 | 8.767 |
| N0f32 | 6.740 | 6.420 | 12.500 | 6.125 | 7.075 | 8.267 |
| N8f24 | 6.900 | 7.975 | 12.500 | 5.975 | 8.100 | 9.067 |
| N29f3 | 6.620 | 7.975 | 12.500 | 6.100 | 7.025 | 9.200 |
| N0f64 | 31.300 | 18.000 | 2.272e3 | 9.301 | 17.900 | 2.281e3 |
| N61f3 | 30.500 | 18.200 | 2.384e3 | 9.600 | 8.633 | 2.386e3 |
| N0f128 | 53.300 | 43.900 | 4.616e3 | 55.900 | 37.100 | 4.483e3 |
| N125f3 | 53.800 | 39.300 | 4.585e3 | 53.800 | 37.300 | 4.702e3 |

Matrix of Vec4 (unit: μs)

| linux | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 3.529 | 3.629 | 4.000 | 6.867 | 7.267 | 8.300 |
| N5f3 | 3.471 | 3.857 | 3.957 | 6.500 | 7.500 | 8.300 |
| N0f16 | 3.786 | 3.857 | 4.014 | 6.900 | 7.700 | 8.400 |
| N13f3 | 3.614 | 3.843 | 4.171 | 6.267 | 7.133 | 8.400 |
| N0f32 | 4.333 | 4.350 | 8.867 | 7.167 | 7.333 | 8.600 |
| N8f24 | 4.557 | 4.533 | 7.825 | 7.200 | 8.600 | 8.600 |
| N29f3 | 4.367 | 4.700 | 7.800 | 6.800 | 7.600 | 8.700 |
| N0f64 | 42.800 | 14.200 | 2.977e3 | 12.900 | 13.400 | 2.955e3 |
| N61f3 | 42.900 | 13.100 | 3.086e3 | 13.400 | 12.500 | 3.035e3 |
| N0f128 | 66.500 | 38.200 | 6.313e3 | 68.400 | 36.600 | 5.791e3 |
| N125f3 | 66.799 | 39.900 | 6.281e3 | 68.600 | 38.000 | 6.367e3 |

Linux 32-bit (ARMv7)

see #129 (comment)


Windows 32-bit (i686) (WoW64)

see #129 (comment)

@kimikage (Collaborator, Author) commented Oct 21, 2019

> Fortunately, all x86_64 CPUs have CVTSI2SS xmm, r64, which converts one Int64 value to one Float32 value and is not particularly slow. Most modern x86_64 CPUs have the AVX version VCVTSI2SS.

Although the conversion from Int64 to Float32 (with VCVTSI2SS, which is a scalar operation) is itself not so slow, it seems to degrade the overall performance. The compiler seems to abandon SIMD vectorization when the Int64 -> Float32 conversion is used, which can result in a long sequence of scalar operations. Unfortunately, the long sequences can interfere with inlining and lead to function calls, which have significant overhead. @inline would be of no help, since SIMD vectorization is a matter outside of a single conversion method.

So, I'll try to avoid Int64 -> Float32 whenever possible, even if it is slower within a single conversion.


Edit:
In the case of Normed{UInt16,f} -> Float32, where f < 8 || f == 16, the conversion method no longer uses an Int64 variable and is now suitable for SIMD vectorization.

function (::Type{T})(x::Normed{UInt16,f}) where {T <: Union{Float32, Float16}, f}
    f == 1 && return T(reinterpret(x))
    f32 = Float32(Int32(reinterpret(x)))
    f ==  2 && return T(f32 * Float32(0x55p-8)  + f32 * Float32(0x555555p-32))
    f ==  3 && return T(f32 * Float32(0x49p-9)  + f32 * Float32(0x249249p-33))
    f ==  4 && return T(f32 * Float32(0x11p-8)  + f32 * Float32(0x111111p-32))
    f ==  5 && return T(f32 * Float32(0x21p-10) + f32 * Float32(0x108421p-35))
    f ==  6 && return T(f32 * Float32(0x41p-12) + f32 * Float32(0x041041p-36))
    f ==  7 && return T(f32 * Float32(0x81p-14) + f32 * Float32(0x204081p-42))
    f == 16 && return T(f32 * Float32(0x1p-16)  + f32 * Float32(0x010001p-48))
    s = UInt64(1)<<(64 - 16)
    k = Int64(s ÷ rawone(Normed{UInt16,f})) # since `rawone()>=3>2`, `x.i*k` is always less than 2^63.
    T(Float32(Int64(reinterpret(x)) * k) * (1.0f0 / s))
end
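
A quick check of the two-term constants (my verification, not from the PR): for f == 2, the high and low parts concatenate to 0x55555555/2^32, and 3 * 0x55555555 == 2^32 - 1, so the pair represents 1/3 with a relative error of 2^-32, well below Float32's 2^-24 resolution.

@assert 0x55 * 2^24 + 0x555555 == 0x55555555  # the two constants, concatenated
@assert 3 * 0x55555555 == 2^32 - 1            # so 0x55555555/2^32 == (1 - 2^-32)/3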

@timholy (Member) commented Oct 31, 2019

Wow, I've been focused on other tasks and missed all of this. Truly awesome analysis, @kimikage!

> However, on other CPUs I cannot predict what the performance will be.

I may own one of those older machines myself, but I think we should target the future more than the present.

> Do you have any ideas for simplifying the following method, and for calculating it quickly and perfectly, where f in (17, 24, 25, 26, 29, 31)?

Not off the top of my head. Your lookup table idea is genius. I'm sure you know that the method will be statically optimized. What if we use a slow generic method for the troublesome cases and adopt your magic-number approach for the rest? At least for the common cases this should be a huge improvement.

This actually seems worth writing up as a paper. Have you considered that? I am a fine one to talk because I never get around to doing that for any of my Julia work, though I am trying to change that now.

@kimikage (Collaborator, Author) commented Nov 1, 2019

Thank you for your favorable feedback. My concern was (and is) that the benefit is out of balance with the amount of code. (I see this as a puzzle game, so I think it is worth the time I spent.)

> I'm sure you know that the method will be statically optimized.

Yes, I know that, and I have just learned that Julia may give up the optimization over quite small things. Although the compile times and the debugging of C++ templates really disgust me, I'd like something like the constexpr keyword.

> What if we use a slow generic method for the troublesome cases and adopt your magic-number approach for the rest?

Yes, that is the plan. I'm going to finish this, but I will do my best for N8f8, N6f10, N4f12, N2f14 and N8f24.

> This actually seems worth writing up as a paper. Have you considered that?

When I started this, I searched GitHub for some of the distinctive magic numbers, but I could not find a similar technique. (Well, the search engine of GitHub is somewhat poor, though.)
However, I think what I am actually doing is reinventing the wheel. I also investigated what code the C++ compilers generate for division by a constant. As the benchmark results above show, in short, the C++ compilers' answer is that simple division with SIMD vectorization is best! 🎃
My job is setting up the grave-post for the girl(?) who is not good at SIMD optimization. I believe this will not be necessary in the near future.
Although I think my method has no academic value, its background issues might be a little instructive to Julia programmers.

@timholy (Member) commented Nov 1, 2019

> Yes, I know that, and I have just learned that Julia may give up the optimization over quite small things.

You could always write those methods as

i64(x) = Int64(reinterpret(x))
Base.Float64(x::Normed{UInt32,1}) = Float64(reinterpret(x))
Base.Float64(x::Normed{UInt32,2}) = Float64(i64(x) * 0x040001) * 0x15555000015555p-72

and then there's no risk they won't be optimized. Not really any longer, either.

@kimikage (Collaborator, Author) commented Nov 1, 2019

Well, the optimization for eliminating dead branches is quite excellent. I have no complaints or inconvenience about it. My complaint is more about inlining and compile-time computation of constants.
For example, we know empirically that i64(x) will be inlined. However, is it really obvious? The problem is that we cannot be sure of it in general just by looking inside the method, i.e. it may depend on the caller, or the caller's caller, and so on. As you know, and as I mentioned above, @inline cannot control the inlining of its caller.
I came across an interesting issue about inlining and compile-time calculation, but I can't recall it. (I will show it when I can recall it.)

This is not the place for a feature request, but I hope exp2(::Integer), ldexp(x, ::Integer), etc. are computed into constants at compile time.

@timholy (Member) commented Nov 1, 2019

Inlining is definitely a tricky & sensitive optimization. If you haven't found it, there's a bit of documentation about the internals: https://docs.julialang.org/en/latest/devdocs/inference/#The-inlining-algorithm-(inline_worthy)-1.

@kimikage (Collaborator, Author) commented Nov 4, 2019

I'm going to add a simple macro @f32.

macro f32(x::Float64) # for hexadecimal floating-point literals
    :(Float32($x))
end

Accepting expressions as well as literals would make the macro more convenient, but I'm not going to write a compiler. 😄
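
Usage (my example): hexadecimal floating-point literals are always Float64 in Julia, so @f32 turns them into Float32 constants at macro-expansion time.

julia> @f32(0x55p-8) # same as Float32(0x55p-8)
0.33203125f0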

> In the case of Normed{UInt16,f} -> Float32, where f < 8 || f == 16, the conversion method no longer uses an Int64 variable and is now suitable for SIMD vectorization.

For N3f13, N2f14 and N1f15 (a fragment extending the method above, with f32 == Float32(i32) as before):

    i32 = Int32(reinterpret(x))
    f32 = Float32(i32)
    f == 13 && return f32 * @f32(0x01p-13) + Float32(i32 * 0x07) * @f32(0x924db7p-52)
    f == 14 && return f32 * @f32(0x01p-14) + Float32(i32 * 0x15) * @f32(0xc30f3dp-56)
    f == 15 && return f32 * @f32(0x01p-15) + Float32(i32 * 0x49) * @f32(0xe071f9p-60)

Edit:
I had a bad feeling from the start, but the above method seems to be slower than the simple division.

@kimikage (Collaborator, Author) commented Nov 13, 2019

Benchmark 2

Julia v1.0.3 arm-linux-gnueabihf (Raspbian Buster on RPi2 v1.2)

julia> versioninfo()
Julia Version 1.0.3
Platform Info:
  OS: Linux (arm-linux-gnueabihf)
  CPU: ARMv7 Processor rev 4 (v7l)
  WORD_SIZE: 32
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, cortex-a53)

The NEON instruction set has native support for both sitofp and uitofp (i.e. VCVT.F32.S32 and VCVT.F32.U32) with the same cost. This means that the optimization techniques for x86_64 processors may have a negative effect here. However, this might not be the bottleneck, and some problems can be solved by the LLVM back-end.

Matrix of Vec4 (unit: μs)

| ARMv7 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 249.843 | 100.312 | 251.199 | 289.270 | 110.313 | 418.751 |
| N5f3 | 249.895 | 100.313 | 250.938 | 290.260 | 110.365 | 419.219 |
| N0f16 | 242.031 | 91.250 | 251.094 | 263.333 | 110.834 | 418.907 |
| N13f3 | 241.822 | 104.791 | 250.729 | 263.125 | 110.833 | 418.750 |
| N0f32 | 229.479 | 123.437 | 419.792 | 246.719 | 107.500 | 410.208 |
| N8f24 | 234.010 | 223.647 | 419.949 | 246.875 | 410.155 | 410.157 |
| N29f3 | 233.958 | 223.543 | 420.000 | 245.937 | 107.447 | 410.313 |
| N0f64 | 705.728 | 909.691 | 410.5e3 | 998.072 | 914.534 | 411.3e3 |
| N61f3 | 705.624 | 1.094e3 | 328.7e3 | 998.957 | 1.092e3 | 331.8e3 |
| N0f128 | 1.347e3 | 2.223e3 | 851.8e3 | 1.584e3 | 2.158e3 | 866.8e3 |
| N125f3 | 1.347e3 | 2.525e3 | 474.5e3 | 1.584e3 | 2.440e3 | 475.2e3 |

Julia 1.0.x doesn't seem to apply constant propagation before inlining.

@timholy (Member) commented Nov 13, 2019

Wow, those are all over the map. Is this mostly due to the architecture, or to the Julia version?

@kimikage (Collaborator, Author) commented Nov 13, 2019

I think the inlining problem is mostly due to the Julia version. However, I don't know which commit changed the behavior. (JuliaLang/julia#24362 has been merged into the Julia v1.0.x branch as well.)
I'm going to try some workarounds (e.g. use of @inline, pre-calculated literals, LUTs with the const keyword, macros for generating bit patterns...).
BTW, I think SIMD vectorization is so beneficial that the inlining cost model should grant a bonus for it.

@kimikage (Collaborator, Author) commented Nov 14, 2019

Note to Self: Workarounds for the inlining problem

  • @inline
    • This is not so helpful, since it has no effect on the caller.
    • It may be helpful for separated sub-functions (i.e. callees), because the optimization techniques (e.g. constant propagation, dead-branch elimination) are applied when the sub-functions are inlined.
      • Actually, the sub-functions may not need the explicit @inline, though.
  • Pre-calculated literals
    • This is very effective.
    • However, it introduces more magic numbers and verbose method definitions.
  • LUTs with the const keyword
    • This is never good due to inline_nonleaf_penalty, even though the actual costs are very low.
  • Macros for generating bit patterns
    • This may help remove expensive code such as /.
      • div_float does not seem to be folded by the Julia v1.0.x optimizer around the inlining stage, although it is eventually folded.
    • This brings difficulties in readability and maintainability.

@kimikage (Collaborator, Author) commented Nov 15, 2019

> This is not the place for a feature request, but I hope exp2(::Integer), ldexp(x, ::Integer), etc. are computed into constants at compile time.

Where Do We Come From? What Are We? Where Are We Going? 😵

macro exp2(n)
    :(_exp2(Val($(esc(n)))))
end
@generated _exp2(::Val{N}) where {N} = (x = exp2(N); :($x))

# for Julia v1.0, which does not fold `div_float` before inlining
@generated inv_rawone(::T) where {T <: Normed} = (x = 1.0 / rawone(T); :($x))
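
A usage sketch (my example): both helpers fold to constants because the @generated bodies run at code-generation time, so neither exp2 nor the division appears in the compiled code.

y = @exp2(-13)  # 0.0001220703125, computed during code generation
scale_to_f64(x::T) where {T <: Normed} = reinterpret(x) * inv_rawone(x)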

Edit: see: #138 (comment)

@kimikage (Collaborator, Author) commented Nov 17, 2019

Benchmark 3

Julia v1.0.5 i686-w64-mingw32 (WoW64)

julia> versioninfo()
Julia Version 1.0.5
Commit 3af96bcefc (2019-09-09 19:06 UTC)
Platform Info:
  OS: Windows (i686-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 32
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

Vector (unit: μs)

| wow64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 60.600 | 60.600 | 60.700 | 60.900 | 62.400 | 61.600 |
| N5f3 | 60.700 | 61.900 | 60.700 | 60.900 | 61.100 | 61.600 |
| N0f16 | 138.000 | 152.500 | 163.500 | 136.100 | 150.400 | 172.300 |
| N13f3 | 138.000 | 152.500 | 163.500 | 136.100 | 150.600 | 172.400 |
| N0f32 | 65.300 | 69.400 | 66.700 | 64.800 | 64.900 | 65.500 |
| N8f24 | 65.200 | 122.700 | 66.700 | 64.800 | 65.100 | 71.600 |
| N29f3 | 65.200 | 122.200 | 66.800 | 64.800 | 66.300 | 69.900 |
| N0f64 | 183.100 | 91.300 | 40.11e3 | 161.100 | 87.700 | 41.25e3 |
| N61f3 | 183.200 | 215.800 | 28.38e3 | 161.100 | 197.600 | 28.25e3 |
| N0f128 | 312.100 | 115.700 | 83.83e3 | 328.600 | 112.400 | 83.61e3 |
| N125f3 | 310.000 | 126.500 | 45.05e3 | 328.800 | 120.300 | 45.17e3 |

Normed{UInt32}s are faster than in the native 64-bit environment. Normed{UInt64}s and Normed{UInt128}s are slower.

Matrix of Vec4 (unit: μs)

| wow64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 18.300 | 3.263 | 3.350 | 18.600 | 4.000 | 7.900 |
| N5f3 | 18.300 | 3.288 | 3.362 | 18.600 | 3.671 | 7.900 |
| N0f16 | 19.300 | 3.200 | 3.400 | 19.600 | 3.957 | 7.933 |
| N13f3 | 19.200 | 3.250 | 3.500 | 19.600 | 3.733 | 7.900 |
| N0f32 | 18.800 | 4.043 | 7.900 | 18.500 | 5.050 | 7.933 |
| N8f24 | 18.900 | 4.171 | 7.833 | 19.000 | 7.633 | 7.967 |
| N29f3 | 18.700 | 4.514 | 8.650 | 22.500 | 5.217 | 8.033 |
| N0f64 | 37.600 | 20.900 | 42.44e3 | 41.000 | 18.500 | 42.44e3 |
| N61f3 | 40.800 | 18.800 | 29.09e3 | 39.800 | 17.300 | 28.05e3 |
| N0f128 | 177.100 | 70.500 | 83.82e3 | 220.000 | 66.400 | 83.91e3 |
| N125f3 | 176.200 | 76.000 | 46.27e3 | 218.900 | 68.200 | 45.70e3 |

Normed{UInt64}s and Normed{UInt128}s are slower than in the native 64-bit environment. The others are faster or about the same.

@kimikage (Collaborator, Author) commented Nov 17, 2019

BTW, I think there is an interoperability problem with specializations that use integer-literal type parameters.

julia> Normed{UInt32,Int64(1)} === Normed{UInt32,Int32(1)}
false

julia> Normed{UInt32,Int64(1)} == Normed{UInt32,Int32(1)}
false

I don't know how serializers (e.g. JLD2) do (de-)serialization.

@kimikage (Collaborator, Author) commented:

Well, you might wonder, "What about the Fixed?"

(::Type{TF})(x::Fixed{T,f}) where {TF <: AbstractFloat,T,f} =
    TF(x.i>>f) + TF(x.i&(one(widen1(T))<<f - 1))/TF(one(widen1(T))<<f)

I could not believe my eyes. It far exceeds my understanding, so I'm not going to touch it. 🙈

@timholy (Member) commented Nov 19, 2019

Definitely not beyond your understanding, and indeed the Fixed cases are easier: they are scaled by 2^f rather than 2^f - 1, so the conversion operations are just bit-shifts. Here are the pieces (a worked example follows the list):

  • TF(x.i>>f): this just converts the integer portion. x.i>>f truncates everything after the "virtual decimal point"
  • y = x.i&(one(widen1(T))<<f - 1): this zeros the bits corresponding to the integer portion, leaving only the fractional portion (the last f bits) represented as an integer
  • TF(y)/TF(one(widen1(T))<<f): this computes the fractional part as a floating-point number
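
A worked example of the three pieces (my illustration), using Fixed{Int8,6} with value 1.5, i.e. raw value 96; widen1(Int8) == Int16 (one-step widening) is my reading of FixedPointNumbers:

i = Int8(96)                          # 0b0110_0000, the raw value of 1.5
Float64(i >> 6)                       # 1.0: the integer portion
y = i & (Int16(1) << 6 - 1)           # 32: the low f bits as an integer
Float64(y) / Float64(Int16(1) << 6)   # 0.5: the fractional portion; total 1.5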

@kimikage (Collaborator, Author) commented Nov 19, 2019

I apologize; I probably misled you. What I mean is, I don't know the reason for not doing the following specialization.

Base.Float32(x::Fixed{T,f}) where {T <: Union{Int8, Int16}, f} = Float32(x.i) * (1.0f0 / (1<<f))
Base.Float64(x::Fixed{T,f}) where {T <: Union{Int8, Int16, Int32}, f} = Float64(x.i) * (1.0 / (Int64(1)<<f))

Oops! 🙊
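
(Note: unlike Normed's rawone == 2^f - 1, the divisor 1<<f here is a power of two, so 1.0f0 / (1<<f) is exact and the multiplication merely rescales the exponent; for the types listed, Float32(x.i) or Float64(x.i) is also exact, so the whole specialized conversion is exact.)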

@timholy (Member) commented Nov 19, 2019

I also wondered if it was more complicated than it needed to be. Assuming it works, seems worth changing. I've not really paid much attention (as you can tell) to Fixed.

@kimikage (Collaborator, Author) commented:

Fixed by PR #138.
The related issues found in this issue report and PR #138 will be considered separately.

@kimikage (Collaborator, Author) commented Dec 25, 2019

FYI, the rounding-error problem also occurs in the BigFloat conversion. I am not going to fix this. Edit: fixed (improved) by PR #186

julia> 5.0N32f32 == 5.0
false

julia> Float64(5.0N32f32) == 5.0
true

julia> promote_type(N32f32, Float64)
BigFloat

julia> BigFloat(5.0N32f32)
5.000000000000000000000000000000000000000000000000000000000000000000000000000069
