Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

conversion #12

Open
milankl opened this issue Feb 19, 2020 · 10 comments
Open

conversion #12

milankl opened this issue Feb 19, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@milankl
Copy link
Member

milankl commented Feb 19, 2020

Just found this

julia> Float8(0.015)
Float8(0.0625)
julia> Float8(0.02)
Float8(0.015625)

although 0.015 <= 0.02 that's not true after conversion. Problem only occurs for subnormals. It's not the conversion back to Float32 (triggered by show):

 julia> bitstring(Float8(0.015))
"00000100"
julia> bitstring(Float8(0.02))
"00000001"
@milankl
Copy link
Member Author

milankl commented Feb 26, 2020

@JeffreySarnoff do you have an idea? There's something wrong with the rounding for subnormals I assume...

https://github.com/milankl/Float8s.jl/blob/9345556ac3d11f61bef2c70f49947113b8949066/src/float8.jl#L164-L193

@milankl milankl added the bug Something isn't working label Feb 27, 2020
@JeffreySarnoff
Copy link
Member

I'll look at it tonight.

@JeffreySarnoff
Copy link
Member

JeffreySarnoff commented Feb 28, 2020

still looking, meanwhile

Base.Float32(x::Float8) = Float8toFloat32[reinterpret(UInt8,x) + 0x01]

const Float8toFloat32 = Float32[0.0, 0.015625, 0.03125, 0.046875, 0.0625, 0.078125, 0.09375, 0.109375, 0.125, 0.140625, 0.15625, 0.171875, 0.1875, 0.203125, 0.21875, 0.234375, 0.25, 0.265625, 0.28125, 0.296875, 0.3125, 0.328125, 0.34375, 0.359375, 0.375, 0.390625, 0.40625, 0.421875, 0.4375, 0.453125, 0.46875, 0.484375, 0.5, 0.53125, 0.5625, 0.59375, 0.625, 0.65625, 0.6875, 0.71875, 0.75, 0.78125, 0.8125, 0.84375, 0.875, 0.90625, 0.9375, 0.96875, 1.0, 1.0625, 1.125, 1.1875, 1.25, 1.3125, 1.375, 1.4375, 1.5, 1.5625, 1.625, 1.6875, 1.75, 1.8125, 1.875, 1.9375, 2.0, 2.125, 2.25, 2.375, 2.5, 2.625, 2.75, 2.875, 3.0, 3.125, 3.25, 3.375, 3.5, 3.625, 3.75, 3.875, 4.0, 4.25, 4.5, 4.75, 5.0, 5.25, 5.5, 5.75, 6.0, 6.25, 6.5, 6.75, 7.0, 7.25, 7.5, 7.75, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5, 15.0, 15.5, Inf32, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, -0.0, -0.015625, -0.03125, -0.046875, -0.0625, -0.078125, -0.09375, -0.109375, -0.125, -0.140625, -0.15625, -0.171875, -0.1875, -0.203125, -0.21875, -0.234375, -0.25, -0.265625, -0.28125, -0.296875, -0.3125, -0.328125, -0.34375, -0.359375, -0.375, -0.390625, -0.40625, -0.421875, -0.4375, -0.453125, -0.46875, -0.484375, -0.5, -0.53125, -0.5625, -0.59375, -0.625, -0.65625, -0.6875, -0.71875, -0.75, -0.78125, -0.8125, -0.84375, -0.875, -0.90625, -0.9375, -0.96875, -1.0, -1.0625, -1.125, -1.1875, -1.25, -1.3125, -1.375, -1.4375, -1.5, -1.5625, -1.625, -1.6875, -1.75, -1.8125, -1.875, -1.9375, -2.0, -2.125, -2.25, -2.375, -2.5, -2.625, -2.75, -2.875, -3.0, -3.125, -3.25, -3.375, -3.5, -3.625, -3.75, -3.875, -4.0, -4.25, -4.5, -4.75, -5.0, -5.25, -5.5, -5.75, -6.0, -6.25, -6.5, -6.75, -7.0, -7.25, -7.5, -7.75, -8.0, -8.5, -9.0, -9.5, -10.0, -10.5, -11.0, -11.5, -12.0, -12.5, -13.0, -13.5, -14.0, -14.5, -15.0, -15.5, -Inf32, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN];

@milankl
Copy link
Member Author

milankl commented Feb 28, 2020

Yeah great, I always forget how small 8bit actually is. Will check how much faster that is!

@milankl
Copy link
Member Author

milankl commented Feb 28, 2020

Yep, a lookup table made it about 2x faster. I've added an @inbounds which gave another 2x speed-up.

Base.Float32(x::Float8) = @inbounds Float8toFloat32[reinterpret(UInt8,x) + 0x01]

@JeffreySarnoff
Copy link
Member

JeffreySarnoff commented Feb 28, 2020

The table method does not appear to round properly.
I put this together. The only function to call directly
is Float8(x::Float32) which accepts a Float32 value
and yields the corresponding (round nearest) Float8.

# the constants
# ---------------

#=
   One table is split into subsections of 16 entries each.
   This keeps the scan time minimal, even after accounting
   for the conditionals that select the proper subsection.

   These are Float8 values offset by 1/2 way to the next value,
   this lets `findfirst` return an appropriately rounded value.

   The `F8offsetN` tuples are used with normal values,
   values that are finite and are not subnormal. Values
   that round to zero are handled before these are used.
=#

const F8offset1 = (Float32[
   0.2578125, 0.2734375, 0.2890625, 0.3046875, 0.3203125, 0.3359375, 
   0.3515625, 0.3671875, 0.3828125, 0.3984375, 0.4140625, 0.4296875,
   0.4453125, 0.4609375, 0.4765625, 0.4921875]...,)

const F8offset2 = (Float32[
    0.515625, 0.546875, 0.578125, 0.609375, 0.640625, 0.671875, 
    0.703125, 0.734375, 0.765625, 0.796875, 0.828125, 0.859375,
    0.890625, 0.921875, 0.953125, 0.984375]...,);

const F8offset3 = (Float32[
    1.03125, 1.09375, 1.15625, 1.21875, 1.28125, 1.34375, 1.40625,
    1.46875, 1.53125, 1.59375, 1.65625, 1.71875, 1.78125, 1.84375, 
    1.90625, 1.96875]...,);

const F8offset4 = (Float32[
    2.0625, 2.1875, 2.3125, 2.4375, 2.5625, 2.6875, 2.8125,
    2.9375, 3.0625, 3.1875, 3.3125, 3.4375, 3.5625, 3.6875,
    3.8125, 3.9375]...,);

const F8offset5 = (Float32[
    4.125, 4.375, 4.625, 4.875, 5.125, 5.375, 5.625, 5.875, 6.125,
    6.375,  6.625, 6.875, 7.125, 7.375, 7.625, 7.875]...,);

const F8offset6 = (Float32[
    8.25, 8.75, 9.25, 9.75, 10.25, 10.75, 11.25, 11.75, 12.25, 12.75,
    13.25, 13.75, 14.25, 14.75, 15.25, 15.75]...,);
#=
     There is one table used with subnormal values.
      It is derived from the actual values of each
      subnormal quantity, shifted up halfway to the
      next subnormal.  This lets scanning also round.
      An initial (anchor) value is prepended, that value
      is half of the smallest subnormal.

     A corresponding table of UInt8 values also is used.
=#

const F8offset_subnormal = (
    0.0078125f0, 0.0234375f0, 0.0390625f0, 0.0546875f0, 0.0703125f0,
    0.0859375f0, 0.1015625f0, 0.1171875f0, 0.1328125f0, 0.1484375f0,
    0.1640625f0, 0.1796825f0, 0.1953125f0, 0.2109375f0, 0.2265625f0,
    0.2421875f0)

const U8subnormal = (collect(UInt8.(0:15))...,)

# some named constants to clarify the source text

const roundsto_floatmax8 = 15.25f0
const roundsto_zero8 = 0.0078125f0
const roundsto_subnormal = 0.2421875f0
const floatmaxplus8 = 15.75f0 # floatmax(Float8) + floatmin(Float8)

const UNaN8 = 0x78
const UInf8 = 0x70
const UFloatmax8 = 0x6f
const UFloatmin8 = 0x10
const UZero8 = 0x00
#  the functions
# ----------------

function Float8(x::Float32)
    s, absx = signbit(x), abs(x)
    ui =toUInt8(x)
    return reinterpret(Float8, ui)
end

function toUInt8(x::Float32)
    s, absx = signbit(x), abs(x)
    isnan(absx) && return s ? UNaN8|0x80 : UNaN8
    if absx >= roundsto_floatmax8
        if absx > floatmaxplus8
            return s ? UInf8|0x80 : UInf8
        else
            return s ? UFloatmax8|0x80 : UFloatmax8
        end
    end
    if absx < roundsto_zero8
        return s ? UZero8|0x80 : UZero8
    elseif absx < roundsto_subnormal
        return subnormal8(s, absx)
    end
    absx = min(15.5f0, max(0.25f0, absx))
    return normal8(s, absx)
end

@inline function subnormal8(s::Bool, absx::Float32)
    idx = findfirst(a->absx <= a, F8offset_subnormal)
    return s ? U8subnormal[idx] | 0x80 : U8subnormal[idx]
end

function normal8(s::Bool, absx::Float32)
   if absx <= 0.4921875f0
        idx = UInt8(15+findfirst(a->absx <= a, F8offset1))
        return s ? idx|0x80 : idx
   elseif absx <= 0.984375f0
        idx = UInt8(15+16+findfirst(a->absx <= a, F8offset2))
        return s ? idx|0x80 : idx
   elseif absx <= 1.96875f0
       idx = UInt8(15+32+findfirst(a->absx <= a, F8offset3))
       return s ? idx|0x80 : idx
   elseif absx <= 3.9375f0
       idx = UInt8(15+32+16+findfirst(a->absx <= a, F8offset4))
       return s ? idx|0x80 : idx
   elseif absx <= 7.875f0
       idx = UInt8(15+64+findfirst(a->absx <= a, F8offset5))
       return s ? idx|0x80 : idx
   elseif absx <= 15.75f0
       idx = UInt8(15+64+16+findfirst(a->absx <= a, F8offset6))
       return s ? idx|0x80 : idx
   else
       throw(DomainError(absx,"not normal for Float8"))
   end
end

@milankl
Copy link
Member Author

milankl commented Mar 3, 2020

Amazing Jeffrey, I'll implement that in the next days, thanks so much!

@JeffreySarnoff
Copy link
Member

JeffreySarnoff commented Mar 4, 2020

note correction in toUInt8
absx = min(15.5f0, max(0.25f0, absx))

This is yet faster. It uses the fact that the original discrimination ensures any input to normal8 will be found in firstof16lte and splits the search in normal8.

function normal8(s::Bool, absx::Float32)
   if absx <= 1.96875f0
     if absx <= 0.4921875f0
          idx = UInt8(15+firstof16lte(absx, F8offset1))
          return s ? idx|0x80 : idx
     elseif absx <= 0.984375f0
          idx = UInt8(15+16+firstof16lte(absx, F8offset2))
          return s ? idx|0x80 : idx
     else
         idx = UInt8(15+32+firstof16lte(absx, F8offset3))
         return s ? idx|0x80 : idx
     end    
   else  
     if absx <= 3.9375f0
         idx = UInt8(15+32+16+firstof16lte(absx, F8offset4))
         return s ? idx|0x80 : idx
     elseif absx <= 7.875f0
         idx = UInt8(15+64+firstof16lte(absx, F8offset5))
         return s ? idx|0x80 : idx
     else
         idx = UInt8(15+64+16+firstof16lte(absx, F8offset6))
         return s ? idx|0x80 : idx
     end
   end
end

function firstof16lte(needle, haystack)
    for idx = 1:16
        if needle <= haystack[idx]
            return idx
        end
    end
    error("should not be reached")
end

@milankl
Copy link
Member Author

milankl commented Mar 16, 2020

Great, yeah, master now contains your version, as mine contained a bug. However, I get considerably slower timings

julia> A = rand(Float32,300,300) .+ 1f0;  # no subnormals
julia> @btime Float8.($A); # Jeffrey's version
  2.815 ms (2 allocations: 88.02 KiB)

julia> @btime Float8_old.($A); # Milan's version (only correct for normals)
  1.652 ms (2 allocations: 88.02 KiB)

julia> @btime Float16.($A);
  1.065 ms (2 allocations: 175.89 KiB)

Any idea where this comes from?

@JeffreySarnoff
Copy link
Member

You could use a hybrid, yours instead of my normal8 iff isnormal, keeping mine for issubnormal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants