Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve isprime performance for 32 bit integers #9

Merged
merged 4 commits into from
Jun 7, 2016
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 37 additions & 11 deletions src/Primes.jl
Original file line number Diff line number Diff line change
Expand Up @@ -133,12 +133,16 @@ true
function isprime(n::Integer)
# Small precomputed primes + Miller-Rabin for primality testing:
# https://en.wikipedia.org/wiki/Miller–Rabin_primality_test
#
(n < 3 || iseven(n)) && return n == 2
n <= 2^16 && return PRIMES[searchsortedlast(PRIMES,n)] == n
# https://github.com/JuliaLang/julia/issues/11594
iseven(n) && return n == 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply include 2 in the tuple below?

Copy link
Member Author

@pabloferz pabloferz Jun 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess because I tried to preserve to some extent the structure of what we already have, but I can change this also it is a little faster in average to discard all even numbers first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but in this case if you don't take the history into account, it's difficult to understand why the case for 2 is treated separately, maybe add a comment? (I didn't expect it would make a difference)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a list of random 32-bit integers have a fast exist for half of the numbers. So if you are checking primality inside a big loop (in the sense of iterations), this would help a little.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, but I under-estimated the cost of checking whether n is in the tuple.

n in (3,5,7,11,13,17,19,23) && return true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you tune this choice of 23 on specific benchmarks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Member

@mschauer mschauer Jun 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. prod(1-1./primes(1,i)) is the "probability" that a big number is not divisible by the first primes. In the limit prod(1-1./primes(1,i))/sum(1./(1:i)) ≈ 1/exp(eulergamma), this normalization shows e.g 29 (because of the gap) would perform under average.

[(i,prod(1-1./primes(1,i))*sum(1./(1:i))) for i in primes(1,100)]`` 
 (2,0.75)     
 (3,0.611111) 
 (5,0.608889) 
 (7,0.592653) 
 (11,0.627507)
 (13,0.609976)
 (17,0.620926)
 (19,0.606749)
 (23,0.610886)
 (29,0.625732)
 (31,0.615573)
 (37,0.624864)
 (41,0.624328)
 (43,0.616479)
 (47,0.615564)
 (53,0.620137)
 (59,0.623846)
 (61,0.617969)
 (67,0.620812)
 (71,0.619425)
 (73,0.614417)
 (79,0.616406)
 (83,0.615015)
 (89,0.61654) 
 (97,0.620485)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the binary search slower than the 1-base MR? Otherwise maybe move it to the branch n>= 841 below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For primes smaller that 2^16 the binary seach is faster, but that is a small fraction of all the 32-bit primes (1 / 2^16). For the rest of the primes it is too much carrying around the list of the primes up to 2^16 instead of just checking the primes in the tuple and doing the 1-base MR test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand "it is too much carrying around"? If it makes the test faster for n < 2^16, and no slower beyond, I think you should keep the binary search. Those numbers are a tiny fraction, but they may be the ones used the most. (but I agree it's not a big deal).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I leave the binary search and test whether of not each prime on a list of 10^8 random UInt32s, I get

Current PR:          12.102434 seconds (26.53 M allocations: 404.740 MB, 0.40% gc time)
With binary search:  27.110936 seconds (81.06 M allocations: 1.208 GB, 0.21% gc time)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an idea how this can happen? If n<2^16, you say it's slighlty faster to have a binary search, otherwise the cost of including the binary search should be only the cost of the comparison n<2^16 (which fails when n>2^16), which is quite insignificant compared to the rest of the routine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bigger difference can have to do with cache misses, where the cash runs full in 0.4.5 and cannot simulationsly hold allocations and prime-table (explaining the bigger difference under 0.4). How old is your master, there where some performance regressions which may have been fixed the last weeks, eg JuliaLang/julia#16128?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot reproduce your benchmarks. It seems that I have 32kB L1 cache, which would be smaller than the full PRIMES table (52kB). I don't find significant differences whether or not I use BS, and on 0.4.5 this is twice as slow as on master.
I'm using this code for timing, does it look fine?

function t32(n)
    c = 0
    M = MersenneTwister(0)
    for i in 1:n
        c += isprime(rand(M, UInt32))
    end
    c
end

But anyway, removing the BS into PRIMES is not a big deal. Could be worth storing PRIMES as an UInt16 or Int32 array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache misses, as you mention, seem like a reasonable cause of the difference in performance between the PR and the BS versions.

I'm running 0.5.0-dev+4438 (today's at the writing of this), so it would seem that the performance regression seen here hasn't been fixed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The machine I'm using for testing has 128kB L1 cache and I see the differences mentioned. The code you are using for timing is practically the same as the one I was using.

I don't know if it is worth storing PRIMES as an UInt16, but I don't think it would affect anything so there's no harm in changing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that is a pity, the last time I checked (and gave the recommendation to add the type hint) the code ran on master without any allocations, but not anymore.

for m in (3,5,7,11,13,17,19,23)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this tuple should be pulled out as a global const / made a set ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you reason about this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems unnecessary to me because it's a small tuple and it's only constructed twice. @hayd Do you imagine that it would provide a significant performance improvement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's hard to say which is better without profiling.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see by profiling and benchmarking there is no difference. I'd leave like this unless someone else has a strong objection.

n % m == 0 && return false
end
n < 841 && return n > 1
s = trailing_zeros(n-1)
d = (n-1) >>> s
for a in witnesses(n)
for a in witnesses(n)::Tuple{Vararg{Int}}
x = powermod(a,d,n)
x == 1 && continue
t = s
Expand Down Expand Up @@ -166,17 +170,39 @@ true
isprime(x::BigInt, reps=25) = ccall((:__gmpz_probab_prime_p,:libgmp), Cint, (Ptr{BigInt}, Cint), &x, reps) > 0



# Miller-Rabin witness choices based on:
# http://mathoverflow.net/questions/101922/smallest-collection-of-bases-for-prime-testing-of-64-bit-numbers
# http://primes.utm.edu/prove/merged.html
# http://miller-rabin.appspot.com
#
witnesses(n::Union{UInt8,Int8,UInt16,Int16}) = (2,3)
witnesses(n::Union{UInt32,Int32}) = n < 1373653 ? (2,3) : (2,7,61)
witnesses(n::Union{UInt64,Int64}) =
n < 1373653 ? (2,3) :
n < 4759123141 ? (2,7,61) :
# https://github.com/JuliaLang/julia/issues/11594
# Forišek and Jančina, "Fast Primality Testing for Integers That Fit into a Machine Word", 2015
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this FJ32_256? that seem to be the only one described in the paper (though it's a little different?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just took the hash and bases from the FJ32_256 there, since we were already had a Miller-Rabin test implemented.

That was the only thing we really needed, since everything else was already here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks! Would be good to note that this is the algorithm used :) Is there any value is the 64-bit version (for 64 bit types)? the paper seems to suggest they weren't sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll mention the algorithm used. The 64-bit version needs a way larger look-up table and is probably better to try one of the approaches discussed here JuliaLang/julia#11594

# (in particular, see function FJ32_256, from which the hash and bases were taken)
const bases = UInt16[15591,2018,166,7429,8064,16045,10503,4399,1949,1295,2776,
3620,560,3128,5212,2657,2300,2021,4652,1471,9336,4018,2398,20462,10277,8028,
2213,6219,620,3763,4852,5012,3185,1333,6227,5298,1074,2391,5113,7061,803,1269,
3875,422,751,580,4729,10239,746,2951,556,2206,3778,481,1522,3476,481,2487,3266,
5633,488,3373,6441,3344,17,15105,1490,4154,2036,1882,1813,467,3307,14042,6371,
658,1005,903,737,1887,7447,1888,2848,1784,7559,3400,951,13969,4304,177,41,
19875,3110,13221,8726,571,7043,6943,1199,352,6435,165,1169,3315,978,233,3003,
2562,2994,10587,10030,2377,1902,5354,4447,1555,263,27027,2283,305,669,1912,601,
6186,429,1930,14873,1784,1661,524,3577,236,2360,6146,2850,55637,1753,4178,8466,
222,2579,2743,2031,2226,2276,374,2132,813,23788,1610,4422,5159,1725,3597,3366,
14336,579,165,1375,10018,12616,9816,1371,536,1867,10864,857,2206,5788,434,8085,
17618,727,3639,1595,4944,2129,2029,8195,8344,6232,9183,8126,1870,3296,7455,
8947,25017,541,19115,368,566,5674,411,522,1027,8215,2050,6544,10049,614,774,
2333,3007,35201,4706,1152,1785,1028,1540,3743,493,4474,2521,26845,8354,864,
18915,5465,2447,42,4511,1660,166,1249,6259,2553,304,272,7286,73,6554,899,2816,
5197,13330,7054,2818,3199,811,922,350,7514,4452,3449,2663,4708,418,1621,1171,
3471,88,11345,412,1559,194]

function _witnesses(n::UInt64)
i = ((n >> 16) $ n) * 0x45d9f3b
i = ((i >> 16) $ i) * 0x45d9f3b
i = ((i >> 16) $ i) & 255 + 1
@inbounds return (Int(bases[i]),)
end
witnesses(n::Integer) =
n < 4294967296 ? _witnesses(UInt64(n)) :
n < 2152302898747 ? (2,3,5,7,11) :
n < 3474749660383 ? (2,3,5,7,11,13) :
(2,325,9375,28178,450775,9780504,1795265022)
Expand Down