hashing BigInts is slow #8727

timholy · 2014-10-18T19:04:03Z

As reported here, using BigInts as keys to a Dict is slow. The culprit is hashing, specifically this line, which gets called from here. I don't know enough about BigInts to propose a better way to hash them, but I strongly suspect that the current approach is not The Answer.
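A rough way to measure the effect described above, as a sketch only: the helper name `fill_and_sum` and the key count are made up for illustration, the syntax is current Julia rather than the 2014 release, and the timings will depend on the Julia and GMP versions in use.

```julia
# Build a Dict from the given keys, then look every key up again.
# Both steps hash each key, so hashing cost dominates for cheap values.
function fill_and_sum(ks)
    d = Dict(k => i for (i, k) in enumerate(ks))
    s = 0
    for k in ks
        s += d[k]
    end
    return s
end

int_keys = collect(1:10^5)
big_keys = big.(int_keys)          # the same values as BigInt

# Run once on small inputs so compilation time is not measured.
fill_and_sum(int_keys[1:10]); fill_and_sum(big_keys[1:10])

t_int = @elapsed fill_and_sum(int_keys)
t_big = @elapsed fill_and_sum(big_keys)
println("Int keys: ", t_int, " s; BigInt keys: ", t_big, " s")
```

Comparing the two timings gives a feel for how much of the Dict cost comes from hashing the BigInt keys.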

StefanKarpinski · 2014-10-18T19:21:05Z

Yup. I knew that was going to be trouble some day. The rest of BigInt hashing should be ok, actually.

StefanKarpinski · 2014-10-18T19:24:11Z

This particular problem can be solved without fixing ndigits0z since we just need to know how many bits the value is, but it would be good to actually just fix that function.
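For illustration, the bit length of a BigInt can be obtained exactly without any correction step. This is a minimal sketch in current Julia syntax; `nbits` is a hypothetical helper name, not a function from Base, and it is not the code that hashing actually uses.

```julia
# Hypothetical helper (not from Base): bit length of a BigInt's magnitude.
# ndigits with base=2 is exact; zero is treated here as having no bits.
nbits(x::BigInt) = iszero(x) ? 0 : ndigits(x, base=2)

x = big(2)^100
println(nbits(x))        # 2^100 is a 1 followed by 100 zero bits
println(nbits(x - 1))    # 2^100 - 1 is 100 one bits
```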

JeffBezanson · 2014-10-18T20:13:39Z

Why a library that contains the world's most advanced big integer algorithms cannot accurately compute the number of digits in a number is simply beyond me.

StefanKarpinski · 2014-10-18T21:14:23Z

That one was too hard.

simonbyrne · 2014-10-18T21:43:42Z

It says that base 2 is exact, so perhaps it's worth having a specific nbits0z method that can drop the extra check? I could also use this in #8463.

StefanKarpinski · 2014-10-18T22:02:24Z

Just checking for bases that are powers of two inside of ndigits0z might be good enough.
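A sketch of that idea in current Julia syntax, under some assumptions: `ndigits_sketch` is a hypothetical name, not the implementation in Base, and `Base.GMP.MPZ.sizeinbase` is an internal, undocumented wrapper around GMP's `mpz_sizeinbase` that may change between releases. It relies on the GMP documentation's statement that the result is exact for power-of-two bases and otherwise exact or one too large.

```julia
# Hypothetical helper, not Base's implementation: digit count of a BigInt,
# skipping the correction step when the base is a power of two.
function ndigits_sketch(x::BigInt, b::Integer)
    2 <= b <= 62 || throw(DomainError(b, "base must be in 2:62 for GMP"))
    iszero(x) && return 0
    # Internal wrapper around mpz_sizeinbase (assumed available in Julia 1.x).
    est = Base.GMP.MPZ.sizeinbase(x, b)
    ispow2(b) && return est          # exact for power-of-two bases
    # Otherwise est is exact or one too large: if b^(est-1) already exceeds
    # |x|, then x has only est-1 digits.
    return abs(x) < big(b)^(est - 1) ? est - 1 : est
end

println(ndigits_sketch(big(2)^100, 16))
println(ndigits_sketch(big(10)^20, 10))
```

The correction for other bases still costs a big-integer power and a comparison, which is the slow part the thread is complaining about; the power-of-two branch avoids it entirely.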

rfourquet · 2014-10-18T22:11:04Z

Oh sorry, I cooked an easy PR for this without seeing the updated thread here.

StefanKarpinski · 2014-10-20T15:41:49Z

Here's some investigation of when GMP is off by one:

julia> using StatsBase

julia> nd(x::BigInt, b::Integer=10) =
       int(ccall((:__gmpz_sizeinbase,:libgmp), Culong, (Ptr{BigInt}, Int32), &x, b))
nd (generic function with 2 methods)

julia> map(x->factor(x+1), cumsum(rle([ nd(big(n),7)-ndigits(big(n),7) for n=1:2^17-1 ])[2]))
13-element Array{Dict{Int64,Int64},1}:
 Dict(2=>2)
 Dict(7=>1)
 Dict(2=>5)
 Dict(7=>2)
 Dict(2=>8)
 Dict(7=>3)
 Dict(2=>11)
 Dict(7=>4)
 Dict(2=>14)
 Dict(7=>5)
 Dict(2=>16)
 Dict(7=>6)
 Dict(2=>17)

julia> map(x->factor(x+1), cumsum(rle([ nd(big(n),6)-ndigits(big(n),6) for n=1:2^17-1 ])[2]))
13-element Array{Dict{Int64,Int64},1}:
 Dict(2=>2)
 Dict(2=>1,3=>1)
 Dict(2=>5)
 Dict(2=>2,3=>2)
 Dict(2=>7)
 Dict(2=>3,3=>3)
 Dict(2=>10)
 Dict(2=>4,3=>4)
 Dict(2=>12)
 Dict(2=>5,3=>5)
 Dict(2=>15)
 Dict(2=>6,3=>6)
 Dict(2=>17)

This is largely for my own record since it's probably not super-clear to anyone else what this is showing, but in short, it looks like the answer is off by one between each power of two and the next power of the base, which makes a lot of sense. The question is how to figure out when this is the case efficiently.
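The same investigation can be sketched in current Julia syntax without StatsBase or `factor`. The names `gmp_size` and `overestimated_ranges` are made up for this example, and `Base.GMP.MPZ.sizeinbase` is an internal wrapper around `mpz_sizeinbase` that is assumed to be available. Instead of factoring the run boundaries, it prints the ranges of n where GMP's estimate exceeds the exact digit count, so the endpoints can be compared directly with the powers of two and powers of the base in the output above.

```julia
# Internal, undocumented wrapper around GMP's mpz_sizeinbase; assumed present
# in Julia 1.x and liable to change between releases.
gmp_size(n::Integer, b::Integer) = Base.GMP.MPZ.sizeinbase(big(n), b)

# Collect maximal ranges of n in 1:nmax where GMP's estimate exceeds the
# exact digit count returned by ndigits.
function overestimated_ranges(b::Integer, nmax::Int)
    ranges = UnitRange{Int}[]
    start = 0                        # 0 means "not inside a run"
    for n in 1:nmax
        off = gmp_size(n, b) > ndigits(big(n), base=b)
        if off && start == 0
            start = n                # a run of overestimates begins
        elseif !off && start != 0
            push!(ranges, start:n-1) # the run ended at n-1
            start = 0
        end
    end
    start != 0 && push!(ranges, start:nmax)
    return ranges
end

for r in overestimated_ranges(7, 2^17 - 1)
    println(first(r), " to ", last(r))
end
```

Calling `overestimated_ranges(2, 1000)` should return an empty list, since the estimate is documented as exact for power-of-two bases.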

timholy · 2014-10-20T18:52:29Z

@JuliaBackports

Fixes #8727, but it's kind of a bandaid since for non-power-of-two bases ndigits0z is still slow due to GMP's inexcusable laziness. At least for power-of-two bases, however, BigInt hashing is now about 3x faster and allocates no memory. (cherry picked from commit 19eb2fa)

ivarne · 2014-10-22T19:06:27Z

Backported in b1fc473

StefanKarpinski · 2014-10-22T19:12:31Z

Thanks, @ivarne.

JeffBezanson added the performance (Must go faster) label Oct 18, 2014

rfourquet added a commit to rfourquet/julia that referenced this issue Oct 18, 2014

faster BigInt hashing (fix JuliaLang#8727)

2e167fe

`sizeinbase` from gmp is exact for powers of two, so the checks are not needed.

StefanKarpinski closed this as completed in 19eb2fa Oct 20, 2014

StefanKarpinski mentioned this issue Oct 20, 2014

ndigits0z(n::BigInt, b::Integer) is slow for b != 2^k #8743

Closed

ivarne added the backport pending label Oct 20, 2014

ivarne removed the backport pending label Oct 22, 2014
