Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement Base.hash(x::DecFP.DecimalFloatingPoint) #97

Closed
Jasha10 opened this issue Jul 25, 2019 · 16 comments · Fixed by #125
Closed

implement Base.hash(x::DecFP.DecimalFloatingPoint) #97

Jasha10 opened this issue Jul 25, 2019 · 16 comments · Fixed by #125

Comments

@Jasha10
Copy link

Jasha10 commented Jul 25, 2019

We could use Base.hash(x::DecFP.DecimalFloatingPoint, h::UInt) = hash(x.x, h). But I have thought of a problem:
If x and y are of type Dec64, and if x == y, then we must be able to guarantee hash(x) == hash(y). Since the definition of Base.:(==)(x::Dec64, y::Dec64) makes a ccall to the Intel Decimal Floating-Point Math Library, I doubt that x == y will imply hash(x.x) == hash(y.x).
Any ideas about how we can provide a good hash function for use in Julia's Dict, Set, etc?

@Jasha10
Copy link
Author

Jasha10 commented Jul 28, 2019

For example, we do have d64"-0.0" == d64"0.0" but we do not have hash(d64"-0.0".x) == hash(d64"0.0".x)

@Jasha10

This comment has been minimized.

@Jasha10

This comment has been minimized.

@jmkuhn
Copy link
Contributor

jmkuhn commented Jul 28, 2019

For hashing we want isequal(x, y) to imply hash(x) == hash(y). See isequal for the differences vs == for floating point numbers. In particular:

julia> -0.0 == 0.0
true

julia> isequal(-0.0, 0.0)
false

julia> hash(-0.0) == hash(0.0)
false

So for -0.0/0.0 a simple implementation for hash() would already do the right thing. Where we would have problems is with:

julia> x = d64"1"
1.0

julia> y = d64"1.0"
1.0

julia> isequal(x, y)
true

julia> reinterpret(UInt64, x)
0x31c0000000000001

julia> reinterpret(UInt64, y)
0x31a000000000000a

In this case x is stored as 1e0 and y is stored as 10e-1. They compare as equal but would have different hashes. I'll look at the library to see if they offer any function to normalize these to the same representation for hashing.

@stevengj
Copy link
Member

stevengj commented Jul 29, 2019

For hashing we want isequal(x, y) to imply hash(x) == hash(y)

This is tricky because it should ideally work for arguments of different numeric types. For example, Base gives:

julia> hash(3.0) == hash(3.0f0) == hash(3) == hash(3+0im)
true

Accomplishing this in a reasonably efficient way was only possible after a lot of effort by @StefanKarpinski (in JuliaLang/julia#6624).

Doing hash(x.x) for DecFP types, i.e. hashing the raw binary representation, would definitely break this.

@stevengj
Copy link
Member

stevengj commented Jul 29, 2019

Right now the only thing I can think of is something like:

function Base.hash(x::Dec64, h::UInt)
    xf = Float64(x)
    isequal(xf, x) && return(xf, h)
    # ...convert to Rational{BigInt} and call hash(::Real, h) ...
end

Converting to Rational{BigInt} is, of course, crazy inefficient, but in principle I guess we could implement some version of the rational hashing algorithm that does this implicitly (one 64-bit word at a time).

@StefanKarpinski
Copy link

StefanKarpinski commented Jul 29, 2019

Here is what you need to implement:

https://github.com/JuliaLang/julia/blob/442d1599e62b1a3a453bd/base/hashing2.jl#L28-L94

If you implement Base.decompose that follows the rules described there, then you will automatically get correct hashing. Namely:

decompose(x): non-canonical decomposition of rational values as num*2^pow/den.

The decompose function is the point where rational-valued numeric types that support
hashing hook into the hashing protocol. decompose(x) should return three integer
values num, pow, den, such that the value of x is mathematically equal to

num*2^pow/den

The decomposition need not be canonical in the sense that it just needs to be some
way to express x in this form, not any particular way – with the restriction that
num and den may not share any odd common factors. They may, however, have powers
of two in common – the generic hashing code will normalize those as necessary.

Special values:

  • x is zero: num should be zero and den should have the same sign as x
  • x is infinite: den should be zero and num should have the same sign as x
  • x is not a number: num and den should both be zero

I don't really understand the details of the DecFP type, but I presume that it represents rational values of the form n*10^p where n and p are both integers. You want the following return value for Base.decompose, assuming that n and p have been extracted:

p  0 ? (n*5^p, p, 1) : (n, p, 5^-p)

You need to be careful that the return values can't overflow, of course. If n*5^p could overflow Int then you'll need a bigger integer type; similarly for 5^p (if p can have a larger negative size than positive, this could be a concern without the other happening). You'll also want to be careful of type stability between these two branches.

It may be possible to get even greater efficiency by implementing a more specialized version of Base.hash that short-circuits the use of decompose but it would have to behave as if the decompose that I've described has been used in order to be correct.

@StefanKarpinski
Copy link

Oh, you also may need to move powers of 5 from n in the negative p case.

@Jasha10
Copy link
Author

Jasha10 commented Aug 2, 2019

Thanks so much for your input, Stefan. After some thinking, I have come up with the following routine for extracting integer numbers n and p such that x==n*10^p:

using DecFP
precision(::Dec32)=7
precision(::Dec64)=16
precision(::Dec128)=34
function extract_n_and_p(x::DecFP.DecimalFloatingPoint)
    e = exponent10(x)
    p = e + 1 - precision(x)
    n = ldexp10(x, -p)
    return (n,p)
end

The precision function encodes the number of digits in the significand of the decimal floating-point number. These precision numbers are taken from Table 3.6 of the IEEE Standard 754-2008 (which is what the Intel library wrapped by DecFP.jl implements). According to the standard, computing n = ldexp10(x,-p) guarantees that n will be an integer (so long as x is finite and not NaN).

Regarding the possibility of integer overflow: we are guaranteed that n satisfies
-10^precision(x) <= n <= 10^precision(x),
where x is the DecFP number from which n is extracted. So if p is positive, we have to handle integers in the range
-10^precision(x) * 5^p <= n * 5^p <= 10^precision(x) * 5^p.
In general, we have pmin(x) <= p <= pmax(x) where pmax and pmin are given below:

pmax(::Dec32 ) =  90
pmin(::Dec32 ) = -87
pmax(::Dec64 ) =  369
pmin(::Dec64 ) = -366
pmax(::Dec128) =  6111
pmin(::Dec128) = -6108

So, for example, if x is of type Dec64 and the extracted exponent p is positive, we might have to deal with integer numbers in the range
-10^16 * 5^369 <= n*5^p <= 10^16 * 5^360.
Is it ok for the numerator returned from decompose be a BigInt?

@stevengj
Copy link
Member

stevengj commented Aug 2, 2019

You need to be careful that the return values can't overflow, of course. If n*5^p could overflow Int then you'll need a bigger integer type; similarly for 5^p

That's the basic problem here — realmax(Dec64) is 9.999999999999999e384, which means that 5^p could be as big as 5^384 ≈ 2.54e268. Hence we would have to use BigInt to use your decompose function, which will be very slow.

@StefanKarpinski
Copy link

StefanKarpinski commented Aug 2, 2019

All may not be lost. This might be avoidable with some additional cleverness. See hash_integer:

https://github.com/JuliaLang/julia/blob/442d1599e62b1a3a453bd/base/hashing2.jl#L5-L26

This implements hashing of the integer values num, pow and den. The first method is a generic one which is generally fairly efficient but not specialized while the second method is a fast method specialized for the in-memory representation of BigInts—it doesn't do any intermediate BigInt computations, which require additional allocations. You can't quite do the same thing here, but you don't need to go to BigInt either. Instead, you could compute 5^p % UInt directly and get blocks of binary digits incrementally. It's a tricky but it seems plausible with a bit of modular arithmetic.

@StefanKarpinski
Copy link

In particular, UInt(5)^p already computes the last block of binary digits, i.e. big(5)^p % UInt. The hard part is figuring out what the quotient of that division is so that we can repeat this.

There is always the "nuclear option" of changing the way Base hashing works. I doubt many packages are hooking into it at this point and we can easily figure out the ones that do so.

@StefanKarpinski
Copy link

Did this get fixed?

@stevengj
Copy link
Member

stevengj commented May 14, 2020

Yes, although the solution involved BigInt (#114). It would be nice to have faster hashing and rational comparisons in the future.

@jmkuhn
Copy link
Contributor

jmkuhn commented May 14, 2020

Yes, a working version is currently in DecFP#master. PR #114 implemented Base.decompose() and #125 added tests for hash().

@StefanKarpinski
Copy link

Ok, glad it could be fixed, that is a shame about the necessity for bigints, but hopefully it can become more optimized in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants