Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hashing BigInts is slow #8727

Closed
timholy opened this issue Oct 18, 2014 · 11 comments
Closed

hashing BigInts is slow #8727

timholy opened this issue Oct 18, 2014 · 11 comments
Labels
performance Must go faster

Comments

@timholy
Copy link
Member

timholy commented Oct 18, 2014

As reported here, using BigInts as keys to a Dict is slow. The culprit is hashing, specifically this line, which gets called from here. I don't know enough about BigInts to propose a better way to hash them, but I strongly suspect that the current approach is not The Answer.

@StefanKarpinski
Copy link
Member

Yup. I knew that was going to be trouble some day. The rest of BigInt hashing should be ok, actually.

@StefanKarpinski
Copy link
Member

This particular problem can be solved without fixing ndigits0z since we just need to know how many bits the value is, but it would be good to actually just fix that function.

@JeffBezanson
Copy link
Member

Why a library that contains the world's most advanced big integer algorithms cannot accurately compute the number of digits in a number is simply beyond me.

@StefanKarpinski
Copy link
Member

That one was too hard.

@JeffBezanson JeffBezanson added the performance Must go faster label Oct 18, 2014
@simonbyrne
Copy link
Contributor

It says that base 2 is exact, so perhaps it's worth having a specific nbits0z method that can drop the extra check? I could also use this in #8463.

@StefanKarpinski
Copy link
Member

Just checking for bases that are powers inside of ndigits0z might be good enough.

rfourquet added a commit to rfourquet/julia that referenced this issue Oct 18, 2014
`sizeinbase` from gmp is exact for powers of two, so the checks are not needed.
@rfourquet
Copy link
Member

Oh sorry, I cooked an easy PR for this without seeing the updated thread here.

@StefanKarpinski
Copy link
Member

Here's some investigation of when GMP is off by one:

julia> using StatsBase

julia> nd(x::BigInt, b::Integer=10) =
       int(ccall((:__gmpz_sizeinbase,:libgmp), Culong, (Ptr{BigInt}, Int32), &x, b))
nd (generic function with 2 methods)

julia> map(x->factor(x+1), cumsum(rle([ nd(big(n),7)-ndigits(big(n),7) for n=1:2^17-1 ])[2]))
13-element Array{Dict{Int64,Int64},1}:
 Dict(2=>2)
 Dict(7=>1)
 Dict(2=>5)
 Dict(7=>2)
 Dict(2=>8)
 Dict(7=>3)
 Dict(2=>11)
 Dict(7=>4)
 Dict(2=>14)
 Dict(7=>5)
 Dict(2=>16)
 Dict(7=>6)
 Dict(2=>17)

julia> map(x->factor(x+1), cumsum(rle([ nd(big(n),6)-ndigits(big(n),6) for n=1:2^17-1 ])[2]))
13-element Array{Dict{Int64,Int64},1}:
 Dict(2=>2)
 Dict(2=>1,3=>1)
 Dict(2=>5)
 Dict(2=>2,3=>2)
 Dict(2=>7)
 Dict(2=>3,3=>3)
 Dict(2=>10)
 Dict(2=>4,3=>4)
 Dict(2=>12)
 Dict(2=>5,3=>5)
 Dict(2=>15)
 Dict(2=>6,3=>6)
 Dict(2=>17)

This is largely for my own record since it's probably not super-clear to anyone else what this is showing, but in short, it looks like the answer is off by one between powers of the base and the next power of two – which makes a lot of sense. The question is how to figure out when this is the case efficiently.

@timholy
Copy link
Member Author

timholy commented Oct 20, 2014

@JuliaBackports

StefanKarpinski added a commit that referenced this issue Oct 22, 2014
Fixes #8727, but it's kind of a bandaid since for non-power-of-two
bases ndigits0z is still slow due to GMP's inexcusable laziness. At
least for power-of-two bases, however, BigInt hashing is now about
3x faster and allocates no memory.

(cherry picked from commit 19eb2fa)
@ivarne
Copy link
Member

ivarne commented Oct 22, 2014

Backported in b1fc473

@StefanKarpinski
Copy link
Member

Thanks, @ivarne.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

6 participants