Better default RNG in the future? #27614
Too late for this release, but the default RNG stream is explicitly documented as not stable, so we can change it in 1.x versions. Another contender is the PCG family of RNGs.
Ok, I thought it was the last chance to change it (people assume it's stable). It's 7 lines of C code.
I'm on a smartphone; at least I can't locate "(un)stable" at
MT, Julia's current default, is "between 1/4 and 1/5 the speed of xoroshiro128+," and MT has "one statistical failure (dab_filltree2)." "PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we'd need 128-bit operations, which would need to be implemented in software."
It looks like the timings are based on his own implementation of MT. The one we use is probably quite a bit faster. Update: the Dieharder test suite is also considered outdated.
Also intriguing: just use AES:
Extensive testing by others (Daniel Lemire, John D. Cook) favors PCG and notes a few problems with the xor__s. One or the other or another; it will require Julia-centric testing and benchmarking.
Random is a
Yes, it could have breaking changes before Julia 2.0.
Just for fun, I coded up a prototype implementation:

# A quick test to test a new RNG for Julia
using BenchmarkTools
function xorshift1024(N, s)
x = Array{UInt64}(undef, N)
p = 0
for i = 1:N
# One-based indexing strikes again
s0 = s[p + 1]
p = (p + 1)%16
s1 = s[p + 1]
# Make sure you use xor() instead of ^
s1 = xor(s1, s1 << 31)
s1 = xor(s1, s1 >> 11)
s0 = xor(s0, s0 >> 30)
s[p + 1] = xor(s0, s1)
# I love magic constants
x[i] = s[p + 1] * 1181783497276652981
end
return x
end
function xoroshiro128plus(N, s)
x = Array{UInt64}(undef, N)
@inbounds for i = 1:N
# copy state to temp variables
s1, s2 = (s[1], s[2])
# Calculate result immediately
x[i] = s1 + s2
# Mash up state
s2 = xor(s2, s1)
s1 = xor(((s1 << 55) | (s1 >> 9)), s2, (s2 << 14))
s2 = (s2 << 36) | (s2 >> 28)
# Save new state
s[1], s[2] = (s1, s2)
end
return x
end
# Collect some performance data
for N in round.(Int64, 2 .^ range(12, stop=20, length=3))
println("Testing generation of $N 64-bit numbers")
local s = rand(UInt64, 16)
# compare xorshift against rand()
@btime xorshift1024($N, $s)
@btime xoroshiro128plus($N, $s)
@btime rand(UInt64, $N)
end

This gives surprisingly good timings, considering MT has been optimized for many a year:
I note that
I don't think it's all that surprising: MT is pretty complex. One of the huge advantages of xoroshiro and PCG is that they're much simpler, and it's easier to generate really good code for them.
they put the suit in pseutdorandom
The package RandomNumbers.jl implements many different RNGs, including
Is Float rand more important than Int rand? It seems it may matter. @JeffreySarnoff thanks for the links; still, "favors PCG and notes a few problems with the xor__s" seems neither conclusive nor a comparison against the latest variants. Java's SplitMix64 is also interesting. Also Google's
https://github.com/sunoru/RandomNumbers.jl/issues/1#issuecomment-233533191 @sunoru "Yes, I have considered accelerating some RNGs with GPU" I guess AES is not available on GPUs, so that may be a factor NOT to choose AES as the default on CPUs? I.e., would we want the same default on GPU and CPU?
In general, it's fine (and often good) to drop some extra output bits. In PCG, IIRC, you split the internal state into external output bits and permutation bits in a way that can be adjusted, raising the possibility that we could use different output steps based on the same state to generate 64 bits of integer output versus 53 bits of floating-point output. Another approach would be to use different RNG streams for different bit sizes, which is appealing for being able to generate random 128-bit values, for example. (A pair of 64-bit streams may not be as good as a single 128-bit stream unless the two streams have coprime periods, which is hard to arrange, so it can be quite helpful to have different-sized streams that take advantage of their increased size to produce better streams.)
"The clear winner is xoroshiro128+" (seems outdated) @staticfloat thanks for coding it. Actually "xoshiro256**" seems to be their best; note dropped letters, I first thought a typo. |
I'm sorry that I haven't updated that package for long. But maybe this is worth a look: http://sunoru.github.io/RandomNumbers.jl/latest/man/benchmark/ My personal choice is |
PCG not passing big crush is indeed an issue. That seems like a strange disconnect. |
@sunoru I was just looking at your |
The real issue with PCG is that it is not being well-tended. I don't know anything about why it got crushed -- if there were more effort behind that family, a remedy modification may have been made but there is none. So, going with new information: I still like PCG .. to use it for what it does best .. and with the failure, that's not this. |
If AES-CTR turns out to be fast enough on most architectures, then I think this would be absolutely fantastic as the default RNG. It eradicates an entire class of security bugs and relieves people from the mental burden of considering the quality of the random stream and how to use it. And people who need something faster can still switch it out. The only big issue is architectures without hardware support for AES (there it would almost surely be too slow). So the bitter pill to swallow is that the random stream would become architecture-dependent (e.g. older Raspberry Pis don't have AES acceleration and should fall back on something else).
I would be ok with having a different one on different architectures.
After looking at http://www.cs.tufts.edu/comp/150FP/archive/john-salmon/parallel-random.pdf, I am more convinced that AES would be a really good default on supporting archs: fast enough (faster than MT on a lot of CPUs, and does not eat as much L1), practically perfect quality, and no possibility of people abusing
@ViralBShah (on xoroshiro128**, or do you mean their older RNG?) "I thought those RNG don't pass the RNG testsuite." I googled, and assume you mean https://github.com/andreasnoackjensen/RNGTest.jl/, which I found here: #1391. I think the discussion should continue here, not in the startup-time issue. I'm not sure the default RNG needs to pass all tests (and I'm not sure it does NOT), if you can easily opt into a better RNG. At http://xoshiro.di.unimi.it/ there are both e.g. xoroshiro128** and xoroshiro128+ to look into, and I only see a minor flaw in the latter:
An architecture-dependent default PRNG might be a little awkward, since one of the primary benefits of a PRNG is reproducibility.
This is kind of true: the reinterpret itself is a no-op, but in order to do most integer operations on floating-point values in hardware, it's necessary to move the value from a floating-point register to an integer register (and back if you want to continue using it as floating-point), which does take a cycle in each direction, so two cycles to go back and forth. So it's not free in practice.
I don't get how you got that the speed of xoshiro256++ and dSFMT is even comparable. With proper SIMD/AVX2, xoshiro256++ is almost 4x faster than dSFMT for integer bulk generation on my setup. No way in hell is the int-to-float conversion going to turn that into a loss. On my machine I get down to ~0.14 cpb:
with beautiful code (almost no branches, everything 256 bits wide; if I had an AVX512 machine, this would be even better). Cf. https://gist.github.com/chethega/344a8fe464a4c1cade20ae101c49360a Can anyone link the xoshiro code you benchmarked? Maybe I just suck at C?
Wow, now you're making me shiver. 0.14 cycles/byte? Is the vectorization performed by the compiler, or did you handcraft it (sorry, totally not proficient with Julia)?
I'm running an
The vectorization was by hand; the code is in the gist, but small enough to post here:
Note that this uses 8 independent RNGs for bulk generation: advance 8 states, generate 8 outputs, write 8 outputs to the destination, repeat. This is no issue, just needs some extra seeding logic (seed in bulk). In the Julia code, this looks like one RNG whose state consists of four
PS: I also implemented the 5 lines of xoshiro256** in the gist. It is much slower, presumably due to the large latency of the multiplication. The generated inner loop for 4 x xoshiro256++ is a beauty:
So, after consideration: the proposed setup, if we wanted to switch to xoshiro256++, would be to reserve 5 cache lines of RNG state (each thread has its own completely independent generator). One cache line is for sequential generation, and laid out like

Then, there are 4 cache lines for bulk generation, laid out like

These 4 cache lines aren't hot for code that doesn't use bulk generation, so nobody cares about state size. The sequential generator uses

The sequential generator only exists because (1) sequential generation should only fill a single cache line of state, and (2) bulk generation wants to use wide vector loads/stores for the state. Code that uses both sequential and bulk generation will therefore fill 5 instead of 4 lines; doesn't matter, who cares. The sequential generator should be properly inlined (verify that, and possibly add

There is no reason to implement

How do we tell the allocator to give us 64-byte-aligned

In total, I think this would be a good move, both for speed and in order to cut out the dependency. One would need to port a variant of my code over to Base Julia (i.e. without

On the other hand, if we change the default RNG at all, I would still prefer a cryptographic choice; if that is rejected and if scientific consensus says that xoshiro gives "good enough" streams, then that's good enough for me.
Hey, I just naïvely ported @vigna's code to Julia; I wanted an updated performance comparison, compared to last year and with xoshiro256++ specifically. It was good enough for me, and I currently don't have the skills to write SIMD-optimized code! :D What you did is amazing: with your benchmark above, this gets 2.6 times faster than my array generation routine.

using Random
mutable struct Xoropp <: AbstractRNG
s0::UInt64
s1::UInt64
s2::UInt64
s3::UInt64
end
rotl(x::UInt64, k::UInt) = (x << k) | (x >> ((64 % UInt) - k))
function Base.rand(rng::Xoropp, ::Random.SamplerType{UInt64})
s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3
result = rotl(s0 + s3, UInt(23)) + s0
t = s1 << 17
s2 ⊻= s0
s3 ⊻= s1
s1 ⊻= s2
s0 ⊻= s3
s2 ⊻= t
s3 = rotl(s3, UInt(45))
rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3
result
end
# array generation code: nothing magic, but somehow quite a bit faster than the default
# code for array, which does `rand(rng, UInt64)` in a loop
function Random.rand!(rng::Xoropp, A::AbstractArray{UInt64},
::Random.SamplerType{UInt64})
s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3
@inbounds for i = eachindex(A)
A[i] = rotl(s0 + s3, UInt(23)) + s0
t = s1 << 17
s2 ⊻= s0
s3 ⊻= s1
s1 ⊻= s2
s0 ⊻= s3
s2 ⊻= t
s3 = rotl(s3, UInt(45))
end
rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3
A
end
# for Float64 generation
Random.rng_native_52(::Xoropp) = UInt64

I can't really comment on your "proposed setup" above, as I don't have much idea about cache lines and such.
@chethega, if I'm following your description, it sounds like there would be two separate generator states in that situation: sequential and array. Does that mean that calling
That is correct! In even more detail, each thread would have 10 independent xoshiro instances. 1-64 bit sequential user requests would advance rng_1, 65+ bit sequential user requests would advance rng_1 and rng_2, and all bulk user requests would advance rng_3-rng_10. rng_1 and rng_2 would be laid out together in an
Sorry, I didn't want to impugn your code. You do awesome work in keeping julia's random infrastructure together, and your APIs rock!
That's too much praise ;) I just read the assembly, noticed that it sucks, had the audacity to propose 10 interleaved xoshiro instances instead of a single one, and passed a PS:
That is because you hoisted the reading and writing of the state out of the loop. This is a super important optimization, and is the main property of my
I assumed so, but was surprised that the compiler doesn't do this seemingly easy optimization by itself. But your point about
Also, I realized that in my benchmarks above, I compared xoshiro++ to a local instance of
So, please help me: why won't this be vectorized by gcc or clang? Here XOSHIRO256_UNROLL=8 and I'm using
Try running

Reference: https://llvm.org/docs/Vectorizers.html#diagnostics
I got this from clang:
But this is the external benchmarking loop. There's never a warning saying it's not optimizing the loops above. :( From what I've read, the problem would be that I'm using memory and not registers (i.e., the variables aren't local).
Have you tried with
@vigna I don't know why your code didn't vectorize, but I am quite happy with clang's output for the following. The generated assembly is even nicer to read than Julia's. Feel free to add this to your reference implementation, no attribution required.
Edit: removed memory corruption if
Ah OK! Just switching to s[4][XOSHIRO256_UNROLL] instead of s0[XOSHIRO256_UNROLL], etc. made my code vectorize with gcc 💪🏻. Now I have to understand why clang does yours and not mine, but we're getting there... |
Please note that the >> A should be << A. 🤦🏻♂️ |
And, most fundamentally, it will not happen with
For me, GCC 8.3 generates fast(er) code with

With

But with
Are you using gcc? Most often when I get vectorization with

In VectorizedRNG.jl I have a vectorized PCG implementation. It is a lot slower than xoshiro. With AVX2, it is also about 50% slower than the default MersenneTwister, but it is faster with AVX512.
@chriselrod, despite me proposing a new PRNG: at least for some applications, e.g. neural networks (there a QRNG seems to do best in some cases, and strangely makes others worse?), NO PRNG may be good enough (if accuracy, not just speed, is your concern). It's at least good to keep in mind. On the effects of pseudorandom and quantum-random number generators in soft computing
The paper is about QRNGs (which could be a drop-in replacement in Julia), but is also an interesting overview of Quantum Machine Learning in general, with e.g.:
Off-topic, but intriguing (and maybe related, if a QRNG is used):
I find this hard to believe; a cryptographically secure PRNG has to be indistinguishable from true randomness. It would be quite a statement to claim that no such PRNG exists.
fixed by #40546 |
EDIT: "August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256**. It has essentially the same performance, but better statistical properties. xoshiro256 is now my preferred PRNG**." I previously quoted:
"The clear winner is xoroshiro128+"https://nullprogram.com/blog/2017/09/21/