Rounding errors in Normed to Float conversions #129

Closed · kimikage opened this issue Oct 12, 2019 · 26 comments

While I was reviewing PR #123, I found a minor bug.

julia> float(3N3f5) # this is ok
3.0f0

julia> float(3N4f12)
3.0000002f0

julia> float(3N8f8)
3.0000002f0

julia> float(3N10f6)
3.0000002f0

julia> float(3N12f4)
3.0000002f0

julia> float(3N13f3)
3.0000002f0

julia> float(3N2f6)
3.0000002f0

julia> float(3N4f4)
3.0000002f0

julia> float(3N5f3)
3.0000002f0

I think this is a problem with the constant-division optimization, which replaces the division by a multiplication with the rounded reciprocal.

function (::Type{T})(x::Normed) where {T <: AbstractFloat}
y = reinterpret(x)*(one(rawtype(x))/convert(T, rawone(x)))
convert(T, y) # needed for types like Float16 which promote arithmetic to Float32
end

The rounding errors can occur anywhere in the range, but the errors at integer values are especially critical.

julia> isinteger(float(3N5f3))
false
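
To see the root cause (my illustration, using N5f3, whose rawone is 2^3 - 1 == 7): the reciprocal 1/7 is not exactly representable in Float32, so multiplying by the rounded reciprocal accumulates two rounding errors, while a direct division rounds only once.

julia> r = reinterpret(3N5f3) # raw value: 3 * rawone(N5f3)
0x15

julia> r * (1.0f0 / 7) # the optimized path above
3.0000002f0

julia> r / Float32(7) # a single correctly rounded division
3.0f0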
@kimikage (Collaborator, Author) commented Oct 13, 2019

I was underestimating this issue. It is not easy to fix without a performance regression.

WIP:

function (::Type{T})(x::Normed) where {T <: AbstractFloat}
    # The following optimization for constant division may cause rounding errors.
    # y = reinterpret(x)*(one(rawtype(x))/convert(T, rawone(x)))
    # Therefore, we use a simple form here.
    y = reinterpret(x) / convert(promote_type(T, floattype(x)), rawone(x))
    convert(T, y)  # needed for types like Float16 which promote arithmetic to Float32
end
function (::Type{T})(x::Normed{Ti,f}) where {T <: Union{Float32, Float16}, Ti <: Union{UInt8, UInt16}, f}
    f == 1 && return convert(T, reinterpret(x))
    r = Int64(reinterpret(x))
    s = UInt64(1)<<48
    k = Int64(s ÷ rawone(Normed{Ti,f})) # since `rawone()>=3>2`, `r*k` is always less than 2^63.
    y = Float32(r * k) * (1.0f0 / s)
    convert(T, y)
end
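
A quick sanity check (mine, assuming the WIP methods above are in effect): for N8f8, k == (2^48 - 1) ÷ 255, so reinterpret(3N8f8) * k == 3 * (2^48 - 1), which rounds to exactly 3 * 2^48 in Float32, and the final multiplication by 1.0f0/2^48 is exact.

@assert Float32(3N8f8) === 3.0f0
@assert isinteger(float(3N5f3))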

The significand of Float32 is 23 (+1) bits long, so using Int64 for the intermediate variable may seem excessive. However, I think the default rounding mode (round half to even) may require some bit hacks, and as a result, some conditional branches may be generated in the native code.

julia> Float32(0x01010101) == 0x1.010100p24 # I would like to round half up (`0x1.010102p24`), though.
true

Fortunately, all x86_64 CPUs have CVTSI2SS xmm, r64, which converts one Int64 value to one Float32 value and is not particularly slow. Most modern x86_64 CPUs have the AVX version VCVTSI2SS. FYI, VCVTUSI2SS xmm, xmm, r64, which converts one UInt64 value, is in the AVX-512F instruction set.
However, on other CPUs I cannot predict what the performance will be. In particular, these methods will be inlined, so the performance will depend greatly on SIMD optimization.

Of course, for Normed{UInt8} alone, lookup tables may be attractive.
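
For instance, a minimal sketch of a table-based N0f8 -> Float32 conversion (hypothetical; the table and its name are my invention, not the package's code):

const N0F8_TO_FLOAT32 = Float32[i / 255f0 for i in 0:255]  # 256 correctly rounded quotients

Base.Float32(x::N0f8) = @inbounds N0F8_TO_FLOAT32[Int(reinterpret(x)) + 1]

(As it turns out later in this thread, a const table interacts badly with the inlining cost model, so this is not a free lunch.)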

kimikage changed the title from "Rounding errors in Normed to Float32 conversions" to "Rounding errors in Normed to Float conversions" on Oct 14, 2019

@kimikage (Collaborator, Author) commented Oct 14, 2019

Although the conversions from integer-valued Normed numbers to Float64 are OK, in some fractional cases the rounding errors can be noticeable.

julia> Float64(7.4N4f4)
7.3999999999999995

julia> Float64(reinterpret(7.4N4f4) / 15)
7.4

The results of the following method agree with the ideal values, but I think the method itself is not so good. (see Edit2)

function Base.Float64(x::Normed{Ti,f}) where {Ti <: Union{UInt8, UInt16}, f}
    f == 1 && return Float64(reinterpret(x))
    r = Int64(reinterpret(x))
    s1 = UInt64(1)<<((53 - 16 + f - 1)÷f * f)
    s2 = UInt64(1)<<(((53 - 16)÷f + 1) * f)
    k = Int64(s1 ÷ rawone(Normed{Ti,f}))
    f64 = Float64(r * k)
    f64 * (1.0 / s1) + f64 * (1.0 / s1 / s2)
end

Edit:
Specialization for Normed{UInt8} -> Float64. 🎃

function Base.Float64(x::Normed{UInt8,f}) where f
    f == 1 && return Float64(reinterpret(x))
    r = Int64(reinterpret(x))
    s = UInt64(1)<<(64 - 8 - 2 + f)
    k = Int64(s ÷ rawone(Normed{UInt8,f}))
    i64 = r * k
    i32 = unsafe_trunc(UInt32, i64) >> (f*2)
    Float64((i64>>12<<12) | i32) * (1.0 / s)
end

Edit2:
I don't know why, but I misunderstood. The FMA part in the original method can be replaced with a single multiply. 👻

function Base.Float64(x::Normed{Ti,f}) where {Ti <: Union{UInt8, UInt16}, f}
    f == 1 && return Float64(reinterpret(x))
    r = Int64(reinterpret(x))
    s = UInt64(1)<<(((53 - 16)÷f + 1) * f)
    k = Int64(s ÷ rawone(Normed{Ti,f}))
    Float64(r * k) * (1.0 / s + 1.0 / s / s)
end
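
Why the combined constant works (my derivation, not from the PR): s is a power of two whose exponent is a multiple of f, so s - 1 is divisible by rawone == 2^f - 1 and k * rawone == s - 1 exactly. Hence r/rawone == r*k/(s - 1) == r*k * (1/s + 1/s^2 + 1/s^3 + ...), and truncating the series after the 1/s^2 term leaves a relative error of about 1/s^2 <= 2^-76, far below Float64's 2^-53.

let f = 4, s = UInt64(1) << (((53 - 16) ÷ f + 1) * f)
    k = s ÷ UInt64(2^f - 1)
    @assert k * UInt64(2^f - 1) == s - 1  # k * rawone == s - 1, exactly
end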

In the case of Normed{UInt32,f} -> Float64 where f <= 8, a similar technique (with 1.0/s + 1.0/s/s + 1.0/s/s/s) may be available. (not tested)
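
An untested sketch of that three-term variant (my transcription of the pattern above onto UInt32; the shift width is my assumption):

function Base.Float64(x::Normed{UInt32,f}) where f  # assuming 2 <= f <= 8
    r = Int64(reinterpret(x))
    s = UInt64(1)<<(((53 - 32)÷f + 1) * f)
    k = Int64(s ÷ rawone(Normed{UInt32,f}))  # r * k stays below 2^53
    Float64(r * k) * (1.0 / s + 1.0 / s / s + 1.0 / s / s / s)
end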

@kimikage (Collaborator, Author) commented Oct 16, 2019

In the case of Normed{UInt8,f} -> Float32 where f < 8 (i.e. unlike the case of Normed{UInt16}), the length of the intermediate variable can be reduced to 32 bits as follows:

function Base.Float32(x::Normed{UInt8,2})
    r = Int32(reinterpret(x))
    s = 0b0100_0000_0000_0000 
    k = Int32(s ÷ rawone(Normed{UInt8,2}))
    Float32(r * k) * (1.0f0 / s + 1.0f0 / s / s)
end

However, on x86_64 CPUs with the AVX/AVX2 extensions, the Int32 version is slightly slower. In other words, although the modified version has an additional integer multiply, the Int64 version is slightly faster than the current low-accuracy version, which also internally uses Int32.

I do not understand what is happening (what causes the stall). My knowledge of SIMD optimization mostly ends at the Pentium III. 😝 The trouble is that the explicit conversion Int64(reinterpret(x)) is removed by the optimizer, and that results in a stall. I think this is a primitive (or compiler-related) issue.

Edit:
It appears that the cause was the benchmark script. Actually, the net latency values are almost the same and, as mentioned above, depend strongly on the optimization. Simple loops in particular may be optimized drastically, so it is difficult to judge superiority or inferiority from simple @benchmark results. On the other hand, suitability for optimization is also an important indicator of code quality. So, I will run more practical benchmarks.

Edit2:
ugly...

function (::Type{T})(x::Normed{UInt8,f}) where {T <: Union{Float32, Float16}, f}
    f == 1 && return T(reinterpret(x))
    if f == 8
        k = Int32(0xF0F1) # 0xF0F1 * 0xFF < 2^24
        return T(Float32(Int32(reinterpret(x)) * k) * Float32(0x1.1111p-24))
    end
    s = UInt32(1)<<((f * 3÷2 + 11)÷f * f)
    k = Int32(s ÷ rawone(Normed{UInt8,f}))
    T(Float32(Int32(reinterpret(x)) * k) * (1.0f0 / s + 1.0f0 / s / s))
end

@kimikage (Collaborator, Author) commented Oct 19, 2019

Implementation status

FYI, the results on Julia v1.3.0-rc4 seem to be the same as on Julia v1.2.0.

Julia v1.2.0 x86_64

| from \ to | Float16 | Float32 | Float64 | BigFloat |
|---|---|---|---|---|
| Normed{UInt8} | ✔️ | ✔️ | ✔️ | |
| Normed{UInt16} | ✔️ | ✔️ | ✔️ | |
| Normed{UInt32} | ✔️ | ✔️ | ✔️ | |
| Normed{UInt64} | ✔️ | ✔️ | ✔️ | |
| Normed{UInt128} | ✔️ | ✔️ | ✔️ | |

Edit:
I would never use Normed{UInt128}s for purposes other than software testing, but using BigFloat would be foul play under my game rules. 😄 Although I will also rarely use Normed{UInt64}s, they can have more significand bits than Float64, so they might be useful as extended-precision formats.

Tolerance

The rounding mode is assumed to be RoundNearest. If the global rounding mode is changed, the conversion methods may return incorrect (inaccurate) values.

| from \ to | Float16 | Float32 | Float64 | BigFloat |
|---|---|---|---|---|
| Normed{UInt8} | 0 | 0 | 0 | N/A |
| Normed{UInt16} | 0 | 0 | 0 | N/A |
| Normed{UInt32} | ≈0 | 1 lsb | 1 lsb | N/A |
| Normed{UInt64} | ≈0 | ≈0 | 1 lsb | N/A |
| Normed{UInt128} | ≈0 | ≈0 | 1 lsb | N/A |

"1 lsb" means that prevfloat(expected) <= return_value <= nextfloat(expected).
"≈0" means that the tolerance should be zero, but may be "1 lsb" for some particular bit sequences.

@kimikage (Collaborator, Author) commented Oct 19, 2019

Do you have any ideas for simplifying the following method, and for calculating it quickly and perfectly, where f in (17, 24, 25, 26, 29, 31)?

function Base.Float64(x::Normed{UInt32,f}) where f
    f == 1 && return Float64(reinterpret(x))
    r = Int64(reinterpret(x))
    f ==  2 && return Float64(r * 0x040001) * 0x15555000015555p-72
    f ==  3 && return Float64(r * 0x108421) * 0x11b6db76924929p-75
    f ==  4 && return Float64(r * 0x010101) * 0x11000011000011p-72
    f ==  5 && return Float64(r * 0x108421) * 0x04000002000001p-75
    f ==  6 && return Float64(r * 0x09dfb1) * 0x1a56b8e38e6d91p-78
    f ==  7 && return Float64(r * 0x000899) * 0x0f01480001e029p-70
    f ==  8 && return Float64(r * 0x0a5a5b) * 0x18d300000018d3p-80
    f ==  9 && return Float64(r * 0x001001) * 0x080381c8e3f201p-72
    f == 10 && return Float64(r * 0x100001) * 0x04010000000401p-80
    f == 11 && return Float64(r * 0x000009) * 0x0e3aaae3955639p-66
    f == 12 && return Float64(r * 0x0a8055) * 0x186246e46e4cfdp-84
    f == 13 && return Float64(r * 0x002001) * 0x10000004000001p-78
    f == 14 && return Float64(r * 0x03400d) * 0x13b13b14ec4ec5p-84
    f == 15 && return Float64(r * 0x000259) * 0x06d0c5a4f3a5e9p-75
    f == 16 && return Float64(r * 0x011111) * 0x00f000ff00fff1p-80
    f == 18 && return Float64(r * 0x0b06d1) * 0x17377445dd1231p-90
    f == 19 && return Float64(r * 0x080001) * 0x00004000000001p-76
    f == 20 && return Float64(r * 0x000101) * 0x0ff010ef10ff01p-80
    f == 21 && return Float64(r * 0x004001) * 0x01fff8101fc001p-84
    f == 22 && return Float64(r * 0x002945) * 0x18d0000000018dp-88
    f == 23 && return Float64(r * 0x044819) * 0x07794a23729429p-92
    f == 27 && return Float64(r * 0x000a21) * 0x0006518c7df9e1p-81
    f == 28 && return Float64(r * 0x00000d) * 0x13b13b14ec4ec5p-84
    f == 30 && return Float64(r * 0x001041) * 0x00fc003f03ffc1p-90
    f == 32 && return Float64(r * 0x010101) * 0x00ff0000ffff01p-96

    f == 17 && return Float64(r) * 0x8000400020001p-68 # imperfect
    f == 24 && return Float64(r) * 0x1000001000001p-72 # imperfect
    f == 25 && return Float64(r) * 0x4000002000001p-75 # imperfect
    f == 26 && return Float64(r) * 0x10000004000001p-78 # imperfect
    f == 29 && return Float64(r) * 0x20000001p-58 # imperfect
    f == 31 && return Float64(r) * 0x80000001p-62 # imperfect
end

Edit:
FYI, these magic numbers come from:

using Primes
b = ones(Int128, 32)
#         _____________________________________________________--------------------
b[ 2] = 0b10101010101010101010101010101010101010101010101010101010101010101010101
b[ 3] = 0b1001001001001001001001001001001001001001001001001001001001001001001001001
b[ 4] = 0b100010001000100010001000100010001000100010001000100010001000100010001
b[ 5] = 0b10000100001000010000100001000010000100001000010000100001000010000100001
b[ 6] = 0b1000001000001000001000001000001000001000001000001000001000001000001000001
b[ 7] = 0b1000000100000010000001000000100000010000001000000100000010000001
b[ 8] = 0b1000000010000000100000001000000010000000100000001000000010000000100000001
b[ 9] = 0b1000000001000000001000000001000000001000000001000000001000000001
b[10] = 0b10000000001000000000100000000010000000001000000000100000000010000000001
b[11] = 0b10000000000100000000001000000000010000000000100000000001
b[12] = 0b1000000000001000000000001000000000001000000000001000000000001000000000001
b[13] = 0b100000000000010000000000001000000000000100000000000010000000000001
b[14] = 0b10000000000000100000000000001000000000000010000000000000100000000000001
b[15] = 0b1000000000000001000000000000001000000000000001000000000000001
b[16] = 0b10000000000000001000000000000000100000000000000010000000000000001
#         _____________________________________________________--------------------
b[17] = 0b1000000000000000010000000000000000100000000000000001 #imperfect
b[18] = 0b1000000000000000001000000000000000001000000000000000001000000000000000001
b[19] = 0b1000000000000000000100000000000000000010000000000000000001
b[20] = 0b1000000000000000000010000000000000000000100000000000000000001
b[21] = 0b1000000000000000000001000000000000000000001000000000000000000001
b[22] = 0b1000000000000000000000100000000000000000000010000000000000000000001
b[23] = 0b1000000000000000000000010000000000000000000000100000000000000000000001
b[24] = 0b1000000000000000000000001000000000000000000000001 # imperfect
b[25] = 0b100000000000000000000000010000000000000000000000001 # imperfect
b[26] = 0b10000000000000000000000000100000000000000000000000001 # imperfect
b[27] = 0b1000000000000000000000000001000000000000000000000000001
b[28] = 0b100000000000000000000000000010000000000000000000000000001
b[29] = 0b100000000000000000000000000001 # imperfect
b[30] = 0b1000000000000000000000000000001000000000000000000000000000001
b[31] = 0b10000000000000000000000000000001 # imperfect
b[32] = 0b10000000000000000000000000000000100000000000000000000000000000001
factors = [factor(Vector,b[i]) for i=1:32];

@show f = factors[31] # select
k1 = []
k2 = []
for i=1:2^length(f)
    ka = UInt64(1)
    kb = UInt64(1)
    for k = 0:length(f)-1
       if (i >> k) & 1 != 0
            ka *= f[k+1]
        else
            kb *= f[k+1]
        end
    end
    ka * 0xFFFFFFFF >= 2^53 && continue
    kb >= 2^53 && continue
    if !(ka in k1)
        push!(k1, ka)
        push!(k2, kb)
    end
end
sort!(k1)
reverse!(sort!(k2))
for i =1:length(k1)
    bc = length(replace(bitstring(k1[i]),"0"=>""))
    #bc > 3 && continue
    println(bc, ", ", string(k1[i], base=16),", ", string(k2[i], base=16))
end

Edit2:
My original plan was to use Int64 variables because the MSB of a UInt32 may be 1, and (V)CVTDQ2PS would convert such values to negative floats. However, as commented below, using Int64 seems to make the compiler abandon SIMD vectorization. So, I changed the code into a compiler-friendly (dull) form:

Base.Float64(x::Normed{UInt32, 2}) = (Float64(reinterpret(x)) * 0x040001) * 0x15555000015555p-72
Base.Float64(x::Normed{UInt32, 3}) = (Float64(reinterpret(x)) * 0x108421) * 0x11b6db76924929p-75
Base.Float64(x::Normed{UInt32, 4}) = (Float64(reinterpret(x)) * 0x010101) * 0x11000011000011p-72
...

FYI, to convert a UInt32 value to Float64 on x86_64, the compiler seems to generate code that converts the upper 16-bit word and the lower 16-bit word separately and adds them.
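
A hand-written analogue of that split (my sketch of what the generated code reportedly does, not the compiler's actual output; both halves fit in 16 bits, so every operation is exact):

u32_to_f64_split(x::UInt32) = Float64(x >> 16) * 0x1p16 + Float64(x & 0xFFFF)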

Edit3:
FWIW, there is another way to convert a UInt32 value to Float64.

function u32_to_f64(x::UInt32)
    i32 = unsafe_trunc(Int32, x)
    # Branch-free form of: Float64(i32) + (i32 < 0 ? 0x1p32 : 0.0)
    # (the mask is kept as Int64 so that the `&` with the sign extension,
    # which may be -1, does not throw on signed/unsigned promotion)
    Float64(i32) + reinterpret(Float64, Int64(i32>>31) & reinterpret(Int64, 0x1p32))
end

@kimikage (Collaborator, Author) commented Oct 20, 2019

Benchmark

Script

This script gives us a glimpse of the problems behind #125. The core conversion code should be almost OS-independent, but broadcasting and memory allocation are OS-dependent. In particular, there seems to be a problem with broadcasting on Windows.

using BenchmarkTools
using FixedPointNumbers
struct Vec3{T <: Real}
    x::T; y::T; z::T
end
struct Vec4{T <: Real}
    x::T; y::T; z::T; w::T
end
Vec3{T}(v::Vec3{T}) where {T} = v
Vec3{T}(v::Vec3{U}) where {T, U} = Vec3{T}(v.x, v.y, v.z) 
Vec4{T}(v::Vec4{T}) where {T} = v
Vec4{T}(v::Vec4{U}) where {T, U} = Vec4{T}(v.x, v.y, v.z, v.w) 
Base.rand(::Type{Vec3{T}}) where {T} = Vec3{T}(rand(T), rand(T), rand(T))
Base.rand(::Type{Vec4{T}}) where {T} = Vec4{T}(rand(T), rand(T), rand(T), rand(T))
function Base.rand(::Type{T}, sz::Dims) where {T <: Union{Vec3, Vec4}}
    A = Array{T}(undef, sz)
    for i in eachindex(A); A[i] = rand(T); end
    return A
end
const N0f128 = Normed{UInt128,128}
const N125f3 = Normed{UInt128,3}
Ts = (N0f8, N5f3, N0f16, N13f3, N0f32, N8f24, N29f3, N0f64, N61f3, N0f128, N125f3)
# The impact of the difference in random numbers should be less than that of the CPU throttling or cache.
vec1s = [rand(T,  64 * 64 * 4) for T in Ts];
mat3s = [rand(Vec3{T}, 64, 64) for T in Ts];
mat4s = [rand(Vec4{T}, 64, 64) for T in Ts];
# vec1s = [[rand(T) for i = 1:64 * 64 * 4] for T in Ts]; # workaround for #125

for vec in vec1s
    println(eltype(vec), "-> Float32")
    @btime Float32.(view($vec,:))
end

for vec in vec1s
    println(eltype(vec), "-> Float64")
    @btime Float64.(view($vec,:))
end

for mat in mat3s
    println(eltype(mat), "-> Float32")
    @btime Vec3{Float32}.(view($mat,:,:))
end

for mat in mat3s
    println(eltype(mat), "-> Float64")
    @btime Vec3{Float64}.(view($mat,:,:))
end

for mat in mat4s
    println(eltype(mat), "-> Float32")
    @btime Vec4{Float32}.(view($mat,:,:))
end

for mat in mat4s
    println(eltype(mat), "-> Float64")
    @btime Vec4{Float64}.(view($mat,:,:))
end

The values in the tables below follow the display of @btime, but the number of significant digits is actually about 2.

Julia v1.2.0 x86_64-w64-mingw32

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Vector (unit: μs)

| w64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 40.599 | 37.900 | 39.500 | 40.200 | 37.599 | 40.199 |
| N5f3 | 39.299 | 40.599 | 38.000 | 41.200 | 39.301 | 38.000 |
| N0f16 | 117.300 | 132.499 | 144.900 | 117.399 | 132.099 | 154.099 |
| N13f3 | 117.500 | 132.499 | 144.800 | 120.100 | 132.099 | 154.099 |
| N0f32 | 131.900 | 150.100 | 183.200 | 128.399 | 143.100 | 165.199 |
| N8f24 | 131.900 | 186.900 | 183.299 | 128.400 | 164.899 | 165.100 |
| N29f3 | 131.899 | 186.400 | 183.400 | 128.499 | 143.199 | 165.100 |
| N0f64 | 108.501 | 47.900 | 7.809e3 | 43.800 | 45.600 | 5.868e3 |
| N61f3 | 104.299 | 45.100 | 8.122e3 | 43.301 | 41.800 | 6.158e3 |
| N0f128 | 94.099 | 74.699 | 7.952e3 | 94.199 | 71.600 | 5.673e3 |
| N125f3 | 92.701 | 81.499 | 8.844e3 | 94.300 | 75.700 | 5.870e3 |

The current Normed{UInt64}->Float32 conversions are much slower than Normed{UInt64}->Float64.
The high-accuracy Normed{UInt8}->Float64 conversions are slightly slower than the conversions with simple divisions.

Matrix of Vec3 (unit: μs)

| w64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 2.812 | 3.237 | 3.914 | 4.920 | 5.500 | 8.367 |
| N5f3 | 2.812 | 3.288 | 3.943 | 4.880 | 5.400 | 8.099 |
| N0f16 | 5.033 | 3.657 | 4.243 | 5.120 | 5.640 | 8.300 |
| N13f3 | 4.983 | 3.614 | 4.286 | 5.040 | 5.450 | 8.099 |
| N0f32 | 6.180 | 6.300 | 11.500 | 6.125 | 6.825 | 8.199 |
| N8f24 | 6.200 | 7.867 | 11.200 | 5.975 | 8.100 | 8.501 |
| N29f3 | 6.160 | 7.833 | 11.300 | 6.100 | 6.967 | 8.333 |
| N0f64 | 30.600 | 17.399 | 5.868e3 | 9.301 | 17.299 | 5.950e3 |
| N61f3 | 25.799 | 15.900 | 6.158e3 | 9.600 | 8.200 | 5.971e3 |
| N0f128 | 54.999 | 39.899 | 5.673e3 | 56.499 | 38.200 | 5.963e3 |
| N125f3 | 55.999 | 40.099 | 5.870e3 | 53.800 | 38.300 | 5.870e3 |

The current Normed{UInt16}->Float32 conversions are slower than the conversions with simple divisions.

Matrix of Vec4 (unit: μs)

| w64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 3.300 | 3.814 | 3.914 | 5.400 | 4.499 | 8.299 |
| N5f3 | 3.300 | 3.786 | 4.186 | 4.501 | 5.400 | 8.299 |
| N0f16 | 3.286 | 4.000 | 4.016 | 4.799 | 5.100 | 8.300 |
| N13f3 | 3.372 | 3.800 | 4.028 | 4.400 | 4.800 | 8.200 |
| N0f32 | 4.243 | 4.583 | 8.100 | 5.200 | 5.599 | 8.500 |
| N8f24 | 4.228 | 5.033 | 8.400 | 5.867 | 7.800 | 8.399 |
| N29f3 | 4.129 | 4.933 | 8.367 | 6.000 | 6.600 | 8.299 |
| N0f64 | 41.500 | 13.399 | 7.831e3 | 12.399 | 12.600 | 7.830e3 |
| N61f3 | 41.599 | 13.200 | 8.289e3 | 12.199 | 11.400 | 8.017e3 |
| N0f128 | 66.599 | 38.800 | 7.691e3 | 69.100 | 35.600 | 7.593e3 |
| N125f3 | 66.799 | 44.099 | 8.197e3 | 69.700 | 38.500 | 8.092e3 |

The high-accuracy Normed{UInt8} and Normed{UInt16}->Float32 conversions are slower than the conversions with simple divisions.


Julia v1.2.0 x86_64-pc-linux-gnu (Debian on WSL)

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Vector (unit: μs)

| linux | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 14.100 | 15.800 | 14.600 | 14.500 | 16.000 | 17.600 |
| N5f3 | 14.100 | 16.000 | 14.600 | 14.500 | 16.100 | 16.600 |
| N0f16 | 14.300 | 16.000 | 14.600 | 15.300 | 16.100 | 17.700 |
| N13f3 | 14.200 | 16.000 | 14.600 | 15.600 | 16.100 | 17.500 |
| N0f32 | 15.800 | 21.100 | 18.000 | 15.400 | 16.200 | 17.400 |
| N8f24 | 15.900 | 68.600 | 18.300 | 14.500 | 16.200 | 16.300 |
| N29f3 | 16.200 | 69.400 | 18.100 | 15.000 | 16.200 | 16.300 |
| N0f64 | 74.600 | 30.000 | 2.945e3 | 18.600 | 23.900 | 3.047e3 |
| N61f3 | 74.800 | 24.000 | 3.045e3 | 18.900 | 18.800 | 3.048e3 |
| N0f128 | 59.500 | 58.200 | 6.017e3 | 60.100 | 45.100 | 6.078e3 |
| N125f3 | 59.700 | 53.000 | 6.171e3 | 61.300 | 45.200 | 6.191e3 |

In the cases of Normed{UInt32,f} -> Float32 where 2 <= f <= 24, LLVM (not the Julia front-end) generates a destructive conditional branch.

Matrix of Vec3 (unit: μs)

| linux | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 3.175 | 3.675 | 3.914 | 5.800 | 5.660 | 8.667 |
| N5f3 | 3.300 | 3.413 | 3.914 | 5.650 | 5.675 | 8.633 |
| N0f16 | 5.467 | 3.786 | 4.371 | 5.120 | 6.020 | 8.500 |
| N13f3 | 5.467 | 3.825 | 4.286 | 5.040 | 5.900 | 8.767 |
| N0f32 | 6.740 | 6.420 | 12.500 | 6.125 | 7.075 | 8.267 |
| N8f24 | 6.900 | 7.975 | 12.500 | 5.975 | 8.100 | 9.067 |
| N29f3 | 6.620 | 7.975 | 12.500 | 6.100 | 7.025 | 9.200 |
| N0f64 | 31.300 | 18.000 | 2.272e3 | 9.301 | 17.900 | 2.281e3 |
| N61f3 | 30.500 | 18.200 | 2.384e3 | 9.600 | 8.633 | 2.386e3 |
| N0f128 | 53.300 | 43.900 | 4.616e3 | 55.900 | 37.100 | 4.483e3 |
| N125f3 | 53.800 | 39.300 | 4.585e3 | 53.800 | 37.300 | 4.702e3 |

Matrix of Vec4 (unit: μs)

| linux | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 3.529 | 3.629 | 4.000 | 6.867 | 7.267 | 8.300 |
| N5f3 | 3.471 | 3.857 | 3.957 | 6.500 | 7.500 | 8.300 |
| N0f16 | 3.786 | 3.857 | 4.014 | 6.900 | 7.700 | 8.400 |
| N13f3 | 3.614 | 3.843 | 4.171 | 6.267 | 7.133 | 8.400 |
| N0f32 | 4.333 | 4.350 | 8.867 | 7.167 | 7.333 | 8.600 |
| N8f24 | 4.557 | 4.533 | 7.825 | 7.200 | 8.600 | 8.600 |
| N29f3 | 4.367 | 4.700 | 7.800 | 6.800 | 7.600 | 8.700 |
| N0f64 | 42.800 | 14.200 | 2.977e3 | 12.900 | 13.400 | 2.955e3 |
| N61f3 | 42.900 | 13.100 | 3.086e3 | 13.400 | 12.500 | 3.035e3 |
| N0f128 | 66.500 | 38.200 | 6.313e3 | 68.400 | 36.600 | 5.791e3 |
| N125f3 | 66.799 | 39.900 | 6.281e3 | 68.600 | 38.000 | 6.367e3 |

Linux 32-bit (ARMv7)

see #129 (comment)


Windows 32-bit (i686) (WoW64)

see #129 (comment)

@kimikage (Collaborator, Author) commented Oct 21, 2019

> Fortunately, all x86_64 CPUs have CVTSI2SS xmm, r64, which converts one Int64 value to one Float32 value and is not particularly slow. Most modern x86_64 CPUs have the AVX version VCVTSI2SS.

Although the conversion from Int64 to Float32 (with VCVTSI2SS, which is a scalar operation) is itself not so slow, it seems to degrade the overall performance. The compiler seems to abandon SIMD vectorization when the Int64 -> Float32 conversion is used, which can result in a long sequence of scalar operations. Unfortunately, the long sequences can interfere with inlining and lead to function calls, which have significant overhead. @inline would be of no help, since SIMD vectorization is a matter outside of a single conversion method.

So, I'll try to avoid Int64 -> Float32 whenever possible, even if it is slower within a single conversion.


Edit:
In the case of Normed{UInt16,f} -> Float32, where f < 8 || f == 16, the conversion method no longer uses an Int64 variable and is now suitable for SIMD vectorization.

function (::Type{T})(x::Normed{UInt16,f}) where {T <: Union{Float32, Float16}, f}
    f == 1 && return T(reinterpret(x))
    f32 = Float32(Int32(reinterpret(x)))
    f ==  2 && return T(f32 * Float32(0x55p-8)  + f32 * Float32(0x555555p-32))
    f ==  3 && return T(f32 * Float32(0x49p-9)  + f32 * Float32(0x249249p-33))
    f ==  4 && return T(f32 * Float32(0x11p-8)  + f32 * Float32(0x111111p-32))
    f ==  5 && return T(f32 * Float32(0x21p-10) + f32 * Float32(0x108421p-35))
    f ==  6 && return T(f32 * Float32(0x41p-12) + f32 * Float32(0x041041p-36))
    f ==  7 && return T(f32 * Float32(0x81p-14) + f32 * Float32(0x204081p-42))
    f == 16 && return T(f32 * Float32(0x1p-16)  + f32 * Float32(0x010001p-48))
    s = UInt64(1)<<(64 - 16)
    k = Int64(s ÷ rawone(Normed{UInt16,f})) # since `rawone()>=3>2`, `x.i*k` is always less than 2^63.
    T(Float32(Int64(reinterpret(x)) * k) * (1.0f0 / s))
end
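
A quick check of the two-term constants (my verification, not from the PR): for f == 2, the high and low parts concatenate to 0x55555555/2^32, and 3 * 0x55555555 == 2^32 - 1, so the pair represents 1/3 with a relative error of 2^-32, well below Float32's 2^-24 resolution.

@assert 0x55 * 2^24 + 0x555555 == 0x55555555  # the two constants, concatenated
@assert 3 * 0x55555555 == 2^32 - 1            # so 0x55555555/2^32 == (1 - 2^-32)/3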

@timholy (Member) commented Oct 31, 2019

Wow, I've been focused on other tasks and missed all of this. Truly awesome analysis, @kimikage!

> However, on other CPUs I cannot predict what the performance will be.

I may own one of those older machines myself, but I think we should target the future more than the present.

> Do you have any ideas for simplifying the following method, and for calculating it quickly and perfectly, where f in (17, 24, 25, 26, 29, 31)?

Not off the top of my head. Your lookup table idea is genius. I'm sure you know that the method will be statically optimized. What if we use a slow generic method for the troublesome cases and adopt your magic-number approach for the rest? At least for the common cases this should be a huge improvement.

This actually seems worth writing up as a paper. Have you considered that? I am a fine one to talk because I never get around to doing that for any of my Julia work, though I am trying to change that now.

@kimikage (Collaborator, Author) commented Nov 1, 2019

Thank you for your favorable feedback. My concern was (and is) that the benefit is out of balance with the amount of code. (I see this as a puzzle game, so I think it is worth the time I spent.)

> I'm sure you know that the method will be statically optimized.

Yes, I know that, and I have just learned that Julia may give up the optimization over quite small things. Although the compile times and the debugging of C++ templates really disgust me, I'd like something like the constexpr keyword.

> What if we use a slow generic method for the troublesome cases and adopt your magic-number approach for the rest?

Yes, that is the plan. I'm going to finish this, but I will do my best for N8f8, N6f10, N4f12, N2f14 and N8f24.

> This actually seems worth writing up as a paper. Have you considered that?

When I started this, I searched GitHub for some of the distinctive magic numbers, but I could not find a similar technique. (Well, the search engine of GitHub is somewhat poor, though.)
However, I think what I am actually doing is reinventing the wheel. I also investigated what code the C++ compilers generate for division by a constant. As the benchmark results above show, in short, the C++ compilers' answer is that simple division with SIMD vectorization is best! 🎃
My job is setting up the grave-post for the girl(?) who is not good at SIMD optimization. I believe this will not be necessary in the near future.
Although I think my method has no academic value, its background issues might be a little instructive to Julia programmers.

@timholy (Member) commented Nov 1, 2019

> Yes, I know that, and I have just learned that Julia may give up the optimization over quite small things.

You could always write those methods as

i64(x) = Int64(reinterpret(x))
Base.Float64(x::Normed{UInt32,1}) = Float64(reinterpret(x))
Base.Float64(x::Normed{UInt32,2}) = Float64(i64(x) * 0x040001) * 0x15555000015555p-72

and then there's no risk they won't be optimized. Not really any longer, either.

@kimikage (Collaborator, Author) commented Nov 1, 2019

Well, the optimization for eliminating dead branches is quite excellent. I have no complaints or inconvenience about it. My complaint is more about inlining and compile-time computation of constants.
For example, we know empirically that i64(x) will be inlined. However, is it really obvious? The problem is that we cannot be sure of it in general just by looking inside the method, i.e. it may depend on the caller, or the caller's caller, and so on. As you know, and as I mentioned above, @inline cannot control the inlining of its caller.
I came across an interesting issue about inlining and compile-time calculation, but I can't recall it. (I will show it when I can recall it.)

This is not the place for a feature request, but I hope exp2(::Integer), ldexp(x, ::Integer), etc. are computed into constants at compile time.

@timholy (Member) commented Nov 1, 2019

Inlining is definitely a tricky & sensitive optimization. If you haven't found it, there's a bit of documentation about the internals: https://docs.julialang.org/en/latest/devdocs/inference/#The-inlining-algorithm-(inline_worthy)-1.

@kimikage (Collaborator, Author) commented Nov 4, 2019

I'm going to add a simple macro @f32.

macro f32(x::Float64) # for hexadecimal floating-point literals
    :(Float32($x))
end

Accepting expressions as well as literals would make the macro more convenient, but I'm not going to write a compiler. 😄
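
Usage (my example): hexadecimal floating-point literals are always Float64 in Julia, so @f32 turns them into Float32 constants at macro-expansion time.

julia> @f32(0x55p-8) # same as Float32(0x55p-8)
0.33203125f0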

> In the case of Normed{UInt16,f} -> Float32, where f < 8 || f == 16, the conversion method no longer uses an Int64 variable and is now suitable for SIMD vectorization.

For N3f13, N2f14 and N1f15 (a fragment extending the method above, with f32 == Float32(i32) as before):

    i32 = Int32(reinterpret(x))
    f32 = Float32(i32)
    f == 13 && return f32 * @f32(0x01p-13) + Float32(i32 * 0x07) * @f32(0x924db7p-52)
    f == 14 && return f32 * @f32(0x01p-14) + Float32(i32 * 0x15) * @f32(0xc30f3dp-56)
    f == 15 && return f32 * @f32(0x01p-15) + Float32(i32 * 0x49) * @f32(0xe071f9p-60)

Edit:
I had a bad feeling from the start, but the above method seems to be slower than the simple division.

@kimikage (Collaborator, Author) commented Nov 13, 2019

Benchmark 2

Julia v1.0.3 arm-linux-gnueabihf (Raspbian Buster on RPi2 v1.2)

julia> versioninfo()
Julia Version 1.0.3
Platform Info:
  OS: Linux (arm-linux-gnueabihf)
  CPU: ARMv7 Processor rev 4 (v7l)
  WORD_SIZE: 32
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, cortex-a53)

The NEON instruction set has native support for both sitofp and uitofp (i.e. VCVT.F32.S32 and VCVT.F32.U32) with the same cost. This means that the optimization techniques for x86_64 processors may have a negative effect here. However, this might not be the bottleneck, and some problems can be solved by the LLVM back-end.

Matrix of Vec4 (unit: μs)

| ARMv7 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 249.843 | 100.312 | 251.199 | 289.270 | 110.313 | 418.751 |
| N5f3 | 249.895 | 100.313 | 250.938 | 290.260 | 110.365 | 419.219 |
| N0f16 | 242.031 | 91.250 | 251.094 | 263.333 | 110.834 | 418.907 |
| N13f3 | 241.822 | 104.791 | 250.729 | 263.125 | 110.833 | 418.750 |
| N0f32 | 229.479 | 123.437 | 419.792 | 246.719 | 107.500 | 410.208 |
| N8f24 | 234.010 | 223.647 | 419.949 | 246.875 | 410.155 | 410.157 |
| N29f3 | 233.958 | 223.543 | 420.000 | 245.937 | 107.447 | 410.313 |
| N0f64 | 705.728 | 909.691 | 410.5e3 | 998.072 | 914.534 | 411.3e3 |
| N61f3 | 705.624 | 1.094e3 | 328.7e3 | 998.957 | 1.092e3 | 331.8e3 |
| N0f128 | 1.347e3 | 2.223e3 | 851.8e3 | 1.584e3 | 2.158e3 | 866.8e3 |
| N125f3 | 1.347e3 | 2.525e3 | 474.5e3 | 1.584e3 | 2.440e3 | 475.2e3 |

Julia 1.0.x doesn't seem to apply constant propagation before inlining.

@timholy (Member) commented Nov 13, 2019

Wow, those are all over the map. Is this mostly due to the architecture, or to the Julia version?

@kimikage (Collaborator, Author) commented Nov 13, 2019

I think the inlining problem is mostly due to the Julia version. However, I don't know which commit changed the behavior. (JuliaLang/julia#24362 has been merged into the Julia v1.0.x branch as well.)
I'm going to try some workarounds (e.g. use of @inline, pre-calculated literals, LUTs with the const keyword, macros for generating bit patterns...).
BTW, I think SIMD vectorization is so beneficial that the inlining cost model should grant a bonus for it.

@kimikage (Collaborator, Author) commented Nov 14, 2019

Note to Self: Workarounds for the inlining problem

  • @inline
    • This is not so helpful, since it has no effect on the caller.
    • It may be helpful for separated sub-functions (i.e. callees), because the optimization techniques (e.g. constant propagation, dead-branch elimination) are applied when the sub-functions are inlined.
      • Actually, the sub-functions may not need the explicit @inline, though.
  • Pre-calculated literals
    • This is very effective.
    • However, it introduces more magic numbers and verbose method definitions.
  • LUTs with the const keyword
    • This is never good due to inline_nonleaf_penalty, even though the actual costs are very low.
  • Macros for generating bit patterns
    • This may help remove expensive code such as /.
      • div_float does not seem to be folded by the Julia v1.0.x optimizer around the inlining stage, although it is eventually folded.
    • This brings difficulties in readability and maintainability.

@kimikage (Collaborator, Author) commented Nov 15, 2019

> This is not the place for a feature request, but I hope exp2(::Integer), ldexp(x, ::Integer), etc. are computed into constants at compile time.

Where Do We Come From? What Are We? Where Are We Going? 😵

macro exp2(n)
    :(_exp2(Val($(esc(n)))))
end
@generated _exp2(::Val{N}) where {N} = (x = exp2(N); :($x))

# for Julia v1.0, which does not fold `div_float` before inlining
@generated inv_rawone(::T) where {T <: Normed} = (x = 1.0 / rawone(T); :($x))
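
A usage sketch (my example): both helpers fold to constants because the @generated bodies run at code-generation time, so neither exp2 nor the division appears in the compiled code.

y = @exp2(-13)  # 0.0001220703125, computed during code generation
scale_to_f64(x::T) where {T <: Normed} = reinterpret(x) * inv_rawone(x)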

Edit: see: #138 (comment)

@kimikage (Collaborator, Author) commented Nov 17, 2019

Benchmark 3

Julia v1.0.5 i686-w64-mingw32 (WoW64)

julia> versioninfo()
Julia Version 1.0.5
Commit 3af96bcefc (2019-09-09 19:06 UTC)
Platform Info:
  OS: Windows (i686-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 32
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

Vector (unit: μs)

| wow64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 60.600 | 60.600 | 60.700 | 60.900 | 62.400 | 61.600 |
| N5f3 | 60.700 | 61.900 | 60.700 | 60.900 | 61.100 | 61.600 |
| N0f16 | 138.000 | 152.500 | 163.500 | 136.100 | 150.400 | 172.300 |
| N13f3 | 138.000 | 152.500 | 163.500 | 136.100 | 150.600 | 172.400 |
| N0f32 | 65.300 | 69.400 | 66.700 | 64.800 | 64.900 | 65.500 |
| N8f24 | 65.200 | 122.700 | 66.700 | 64.800 | 65.100 | 71.600 |
| N29f3 | 65.200 | 122.200 | 66.800 | 64.800 | 66.300 | 69.900 |
| N0f64 | 183.100 | 91.300 | 40.11e3 | 161.100 | 87.700 | 41.25e3 |
| N61f3 | 183.200 | 215.800 | 28.38e3 | 161.100 | 197.600 | 28.25e3 |
| N0f128 | 312.100 | 115.700 | 83.83e3 | 328.600 | 112.400 | 83.61e3 |
| N125f3 | 310.000 | 126.500 | 45.05e3 | 328.800 | 120.300 | 45.17e3 |

Normed{UInt32}s are faster than in the native 64-bit environment. Normed{UInt64}s and Normed{UInt128}s are slower.

Matrix of Vec4 (unit: μs)

| wow64 | Float32 master | Float32 high accuracy | Float32 simple div | Float64 master | Float64 high accuracy | Float64 simple div |
|---|---|---|---|---|---|---|
| N0f8 | 18.300 | 3.263 | 3.350 | 18.600 | 4.000 | 7.900 |
| N5f3 | 18.300 | 3.288 | 3.362 | 18.600 | 3.671 | 7.900 |
| N0f16 | 19.300 | 3.200 | 3.400 | 19.600 | 3.957 | 7.933 |
| N13f3 | 19.200 | 3.250 | 3.500 | 19.600 | 3.733 | 7.900 |
| N0f32 | 18.800 | 4.043 | 7.900 | 18.500 | 5.050 | 7.933 |
| N8f24 | 18.900 | 4.171 | 7.833 | 19.000 | 7.633 | 7.967 |
| N29f3 | 18.700 | 4.514 | 8.650 | 22.500 | 5.217 | 8.033 |
| N0f64 | 37.600 | 20.900 | 42.44e3 | 41.000 | 18.500 | 42.44e3 |
| N61f3 | 40.800 | 18.800 | 29.09e3 | 39.800 | 17.300 | 28.05e3 |
| N0f128 | 177.100 | 70.500 | 83.82e3 | 220.000 | 66.400 | 83.91e3 |
| N125f3 | 176.200 | 76.000 | 46.27e3 | 218.900 | 68.200 | 45.70e3 |

Normed{UInt64}s and Normed{UInt128}s are slower than in the native 64-bit environment. The others are faster or about the same.

@kimikage (Collaborator, Author) commented Nov 17, 2019

BTW, I think there is an interoperability problem with specializations that use integer-literal type parameters.

julia> Normed{UInt32,Int64(1)} === Normed{UInt32,Int32(1)}
false

julia> Normed{UInt32,Int64(1)} == Normed{UInt32,Int32(1)}
false

I don't know how serializers (e.g. JLD2) do (de-)serialization.

@kimikage (Collaborator, Author) commented:

Well, you might wonder, "What about the Fixed?"

(::Type{TF})(x::Fixed{T,f}) where {TF <: AbstractFloat,T,f} =
    TF(x.i>>f) + TF(x.i&(one(widen1(T))<<f - 1))/TF(one(widen1(T))<<f)

I could not believe my eyes. It far exceeds my understanding, so I'm not going to touch it. 🙈

@timholy (Member) commented Nov 19, 2019

Definitely not beyond your understanding, and indeed the Fixed cases are easier: they are scaled by 2^f rather than 2^f - 1, so the conversion operations are just bit-shifts. Here are the pieces (a worked example follows the list):

  • TF(x.i>>f): this just converts the integer portion. x.i>>f truncates everything after the "virtual decimal point"
  • y = x.i&(one(widen1(T))<<f - 1): this zeros the bits corresponding to the integer portion, leaving only the fractional portion (the last f bits) represented as an integer
  • TF(y)/TF(one(widen1(T))<<f): this computes the fractional part as a floating-point number
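
A worked example of the three pieces (my illustration), using Fixed{Int8,6} with value 1.5, i.e. raw value 96; widen1(Int8) == Int16 (one-step widening) is my reading of FixedPointNumbers:

i = Int8(96)                          # 0b0110_0000, the raw value of 1.5
Float64(i >> 6)                       # 1.0: the integer portion
y = i & (Int16(1) << 6 - 1)           # 32: the low f bits as an integer
Float64(y) / Float64(Int16(1) << 6)   # 0.5: the fractional portion; total 1.5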

@kimikage (Collaborator, Author) commented Nov 19, 2019

I apologize; I probably misled you. What I mean is, I don't know the reason for not doing the following specialization.

Base.Float32(x::Fixed{T,f}) where {T <: Union{Int8, Int16}, f} = Float32(x.i) * (1.0f0 / (1<<f))
Base.Float64(x::Fixed{T,f}) where {T <: Union{Int8, Int16, Int32}, f} = Float64(x.i) * (1.0 / (Int64(1)<<f))

Oops! 🙊
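
(Note: unlike Normed's rawone == 2^f - 1, the divisor 1<<f here is a power of two, so 1.0f0 / (1<<f) is exact and the multiplication merely rescales the exponent; for the types listed, Float32(x.i) or Float64(x.i) is also exact, so the whole specialized conversion is exact.)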

@timholy (Member) commented Nov 19, 2019

I also wondered if it was more complicated than it needed to be. Assuming it works, seems worth changing. I've not really paid much attention (as you can tell) to Fixed.

@kimikage (Collaborator, Author) commented:

Fixed by PR #138.
The related issues found in this issue report and PR #138 will be considered separately.

@kimikage (Collaborator, Author) commented Dec 25, 2019

FYI, the rounding-error problem also occurs in the BigFloat conversion. I am not going to fix this. Edit: fixed (improved) by PR #186

julia> 5.0N32f32 == 5.0
false

julia> Float64(5.0N32f32) == 5.0
true

julia> promote_type(N32f32, Float64)
BigFloat

julia> BigFloat(5.0N32f32)
5.000000000000000000000000000000000000000000000000000000000000000000000000000069
