Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Tasklocal xoshiro #34852

Closed
wants to merge 3 commits into from
Closed

Conversation

chethega
Copy link
Contributor

@chethega chethega commented Feb 23, 2020

This introduces a tasklocal random number generator. In view of #27614, I have chosen xoshiro256 as the initial proposal.

Task Locality

Task locality of the RNG state has considerable upside: First, it is much easier to get single requests fast (less pointer chasing, no synchronization issues / threadsafety issues). Second, we obtain an extremely nice reproducibility: When using a task-local RNG, then results of multi-threaded computations are independent of scheduler decisions, as long as the user code has no race conditions. In fact, as long as task creation is independent of the number of threads (e.g. recursive divide-and-conquer), randomized simulations are also independent of the number of threads.

There are two downsides of that: First, tasks become 32 bytes heavier, and task creation becomes a handful of cycles slower (should be <100 cycles). Second, it is not possible to separate a task's RNG instance from the owning task.

The way this is implemented is that task creation consumes 32 random bytes from the parent's task-local stream, in order to seed the new task's RNG.

Internals

As you see, the speed of this relies on some codegen trickery to accelerate current_task(). I'd appreciate any comments on that; if desired, we can split that change off into a separate commit/PR.

It is, unfortunately, fundamentally impossible for packages to provide fast task-local random: There can be only 0-1 task-local RNGs that are handled by the julia runtime.

Performance: Cache Pressure

The tasklocal RNG has 32 contiguous bytes of state. No cache of generated random numbers, no giant DSFMT array -- hence, no eviction of precious cache-lines filled with valuable user data.

Speed: Single numbers

Thanks to the newly accelerated current_task(), single random generation can be inlined into the callsite, and the access to the random state can be LICMd. The smaller function-call overhead more than compensates for the simplified approach (always consume 64-128 random bits, even for rand(Bool)).

Speed: Bulk generation

For bulk generation of large arrays, we use the task-local RNG to seed an ephemeral RNG instance (4 x xoshiro), and use that to fill the array. The ephemeral RNG has 128 bytes of state, which is imho too expensive to maintain. By choosing an ephemeral instance, we can just keep the state in register until the array is full and discard it afterwards, no need to save state to either stack or heap.

I had some trouble getting the large bulk generation reliably vectorized, and resorted to effectively writing it in LLVM-IR. The tests in #27614 suggest that 256 or 512 byte of ephemeral state would be even faster on powerful hardware; but this variant is conservative (changing the size is a simple search-and-replace), I don't want to break rasberry pi perf.

Making this the default

I think something like this should become the default. It appears like a strict (pareto) improvement over DSFMT, and has much simpler reproducibility / thread-safety semantics.

Warts

The names need quite a bit of bikeshedding.

We probably should use some better seeding function. I just used the murmur-hash that was lying around, but we really want a fast pseudo-random function on 32 byte (e.g. sha2-256) that has no chance in hell of interacting with xoshiro. This is only called on seeding / task creation, and at 4 cycles/byte we could get real crypto (and 128 extra cycles for task creation is not a lot).

@rfourquet , please help me clean up my misuse of the Random / Sampler API. It is too complicated for me.

The bulk generation of floats currently traverses the array twice: First to do all the bitwise ops, then to do all floating point ops. This is due to x86 issues: There is afaiu some split between how registers are used. Hence we must commit the intermediate values to "memory", and cannot fully interleave the bitwise and floating point (x-> x-1) operations. Main memory is slow, so the correct way would be to write ~25% of L2d with the RNG, then run over the data again for the floating point ops, etc. Still faster than DSFMT on my machine, but might be bad for some users (if they are bandwidth-bound, instead of CPU-bound, and we perturb other threads/cores by hogging whatever shared resources, e.g. shared L3).

Reproducibility

With task-local RNG, the following becomes reproducible:

julia> using Random

julia> function reprocheck(r, i)
           if i == 0
               return UInt(0)
           end
           r1 = rand(r, UInt64)*hash(i)
           t1 = Threads.@spawn reprocheck(r, i-1)
           t2 = Threads.@spawn reprocheck(r, i-1)
           r2 = rand(r, UInt64)
           return r1 + r2 + fetch(t1) + fetch(t2)
       end;

julia> for i=1:4
           Random.seed!(Random.TaskLocal(), 23)
           println()
           @show i, Threads.nthreads(), Threads.threadid(), rand(Random.TaskLocal(), UInt64)
           @show  reprocheck(Random.TaskLocal(), 10)
       end

(i, Threads.nthreads(), Threads.threadid(), rand(Random.TaskLocal(), UInt64)) = (1, 2, 1, 0x4be68ed82672ea07)
reprocheck(Random.TaskLocal(), 10) = 0xd44e32dcf7cfefc5

(i, Threads.nthreads(), Threads.threadid(), rand(Random.TaskLocal(), UInt64)) = (2, 2, 1, 0x4be68ed82672ea07)
reprocheck(Random.TaskLocal(), 10) = 0xd44e32dcf7cfefc5

(i, Threads.nthreads(), Threads.threadid(), rand(Random.TaskLocal(), UInt64)) = (3, 2, 1, 0x4be68ed82672ea07)
reprocheck(Random.TaskLocal(), 10) = 0xd44e32dcf7cfefc5

(i, Threads.nthreads(), Threads.threadid(), rand(Random.TaskLocal(), UInt64)) = (4, 2, 1, 0x4be68ed82672ea07)
reprocheck(Random.TaskLocal(), 10) = 0xd44e32dcf7cfefc5

Benchmark: Single generation speed

using Random, BenchmarkTools
@show Threads.nthreads()
function bulk_single(arr, rng)
    @inbounds for idx = eachindex(arr)
        arr[idx] = rand(rng, eltype(arr))
    end
nothing
end

for T in [Float32, Float64, Bool, Int8, Int16, Int32, Int64, Int128]
    println("\n--------------")
    @show T
    one_def_t =  @belapsed rand($T)
    one_tl_t =  @belapsed rand(Random.TaskLocal(), $T)
    one_def_gbps = 1e-9 * sizeof(T)/one_def_t
    one_tl_gbps = 1e-9 * sizeof(T)/one_tl_t

    @show one_def_t, one_tl_t
    @show one_def_gbps, one_tl_gbps

    arr=zeros(T, 1024)
    loop_def_gbps = 1e-9 * sizeof(T) * 1024 / @belapsed bulk_single($arr, $Random.default_rng())
    loop_tl_gbps = 1e-9 * sizeof(T) * 1024 / @belapsed bulk_single($arr, Random.TaskLocal())
    @show loop_def_gbps, loop_tl_gbps
end

gives

--------------
T = Float32
(one_def_t, one_tl_t) = (1.051851851851852e-8, 4.932000000000001e-9)
(one_def_gbps, one_tl_gbps) = (0.38028169014084506, 0.8110300081103)
(loop_def_gbps, loop_tl_gbps) = (1.3967339824953586, 1.6855967078189302)

--------------
T = Float64
(one_def_t, one_tl_t) = (8.33933933933934e-9, 4.948000000000001e-9)
(one_def_gbps, one_tl_gbps) = (0.9593086064097947, 1.6168148746968471)
(loop_def_gbps, loop_tl_gbps) = (3.266637128932211, 3.300711823431975)

--------------
T = Bool
(one_def_t, one_tl_t) = (9.63963963963964e-9, 4.572e-9)
(one_def_gbps, one_tl_gbps) = (0.10373831775700935, 0.21872265966754156)
(loop_def_gbps, loop_tl_gbps) = (0.41353315983128425, 0.4316829828094993)

--------------
T = Int8
(one_def_t, one_tl_t) = (7.478478478478479e-9, 4.495e-9)
(one_def_gbps, one_tl_gbps) = (0.1337170392183108, 0.2224694104560623)
(loop_def_gbps, loop_tl_gbps) = (0.3939471659399846, 0.4021819768710451)

--------------
T = Int16
(one_def_t, one_tl_t) = (7.498498498498499e-9, 4.495e-9)
(one_def_gbps, one_tl_gbps) = (0.26672006407689225, 0.4449388209121246)
(loop_def_gbps, loop_tl_gbps) = (0.8281439547108775, 0.803872824807013)

--------------
T = Int32
(one_def_t, one_tl_t) = (7.484484484484484e-9, 4.703e-9)
(one_def_gbps, one_tl_gbps) = (0.5344389461013777, 0.850520944078248)
(loop_def_gbps, loop_tl_gbps) = (1.6494697749340017, 1.5724279133253711)

--------------
T = Int64
(one_def_t, one_tl_t) = (1.054154154154154e-8, 4.497e-9)
(one_def_gbps, one_tl_gbps) = (0.7589022884816258, 1.7789637536135203)
(loop_def_gbps, loop_tl_gbps) = (1.7739829853054911, 3.4854630548858316)

--------------
T = Int128
(one_def_t, one_tl_t) = (1.2656656656656656e-8, 6.472000000000001e-9)
(one_def_gbps, one_tl_gbps) = (1.2641569123695036, 2.472187886279357)
(loop_def_gbps, loop_tl_gbps) = (2.408490871137506, 5.165195460277428)

Benchmark: Bulk generation speed

for T in [Float32, Float64, UInt64]
    println("\n--------------")
    @show T
    for bulksize in [1<<9, 1<<11, 1<<13, 1<<15, 1<<20, 1<<28]
        arr = zeros(T, Int(bulksize/sizeof(T)))
        bulk_def_gbps = 1e-9 * bulksize / (@belapsed rand!($arr))
        bulk_tl_gbps = 1e-9 * bulksize / ( @belapsed rand!(Random.TaskLocal(), $arr))
        @show  bulksize, bulk_def_gbps, bulk_tl_gbps
    end
end

gives

--------------
T = Float32
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (512, 2.711167668823767, 3.1823428680154757)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (2048, 3.3883202286153566, 5.181155680859902)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (8192, 3.934678194044188, 7.846743295019157)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (32768, 3.921024291013522, 8.729113249642037)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (1048576, 3.3197492559994934, 6.489556192326974)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (268435456, 2.534223198963009, 4.451675165164107)

--------------
T = Float64
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (512, 4.629889742671698, 3.231287800913199)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (2048, 5.795674637173197, 5.120640080010002)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (8192, 6.260603744745892, 7.795964979063571)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (32768, 5.092943736400374, 8.734930525473994)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (1048576, 4.747437429823608, 6.790811535447605)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (268435456, 4.008831346294357, 4.401819528247619)

--------------
T = UInt64
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (512, 2.048916057850007, 3.7731246188249647)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (2048, 2.947154021563462, 6.8398851217008225)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (8192, 3.859700554915716, 10.23188600437825)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (32768, 4.098732488325551, 11.603399433427763)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (1048576, 3.9243994999887724, 12.208928114011599)
(bulksize, bulk_def_gbps, bulk_tl_gbps) = (268435456, 3.1267273542058613, 8.98779508819599)

@chethega chethega changed the title Tasklocal xoshiro WIP: Tasklocal xoshiro Feb 23, 2020
@tkf
Copy link
Member

tkf commented Feb 23, 2020

Thanks for doing this! I'd be awesome to have reproducible multi-threaded program out-of-the-box.

we really want a fast pseudo-random function on 32 byte (e.g. sha2-256) that has no chance in hell of interacting with xoshiro

I'm no RNG expert, but isn't "counter-based PRNG" a more rigorous approach? Repeating what I already wrote in #34386 (comment):


Quoting what I mentioned in the discourse thread:

I fond what Guy Steele mentioned in the last half of this talk interesting:

https://www.youtube.com/watch?v=0hlBkQ5DjaY

IIRC there is a type of PRNG that can be split in such a way that the distribution of resulting streams combined is equivalent to the original stream (or something like that). It would be nice to look into (probably some follow up studies of):

which were mentioned in the talk.

There is a Julia implementation of counter-based PRNG: https://github.com/sunoru/Random123.jl

@chethega
Copy link
Contributor Author

I'm no RNG expert, but isn't "counter-based PRNG" a more rigorous approach?

No, not really. Mathematically, the construction I use here uses any RNG that accepts arbitrary seeds, plus a random function. From that you get a splittable RNG: Use the old state to generate K pseudo-random bits, use them to seed the child RNG; put a random function into the middle because almost no RNGs make any guarantees about correlations when they get seeded with the output of the same RNG. We could afford something like sha2-256 that is indistinguishable from random oracle, and the resulting construction (i.e. state-splittable RNG) would be just as good as the original non-splittable RNG (i.e. xoshiro), by reduction proof.

The competely unfounded non-rigorous no-good temporary thing I did here was to conjecture that murmur3-internals and xoshiro-internals don't correlate. That is obviously (a) kinda incorrect, because both use similar constructions, (b) a leap of faith, and (c) @vigna might comment more.

I have seen no suitably fast (< 4 cpb) crypto function lying around in the julia repo (needs to be written in C, used during task creation).

Counter-based is super cool because it vectorizes trivially / is embarassingly parallel. However, I had no fast variant lying around that doesn't use AES hardware acceleration. Related #34386 #34216

@tkf
Copy link
Member

tkf commented Feb 24, 2020

Thanks for the explanation. I guess I can see that you can split off an independent child RNG by putting CSPRNG in the middle. I just didn't have a good intuition if you can make it lightweight enough to do this for every @spawn/@async.

src/ccall.cpp Outdated Show resolved Hide resolved
@StefanKarpinski
Copy link
Member

Really impressive work!

@StefanKarpinski StefanKarpinski added multithreading Base.Threads and related functionality randomness Random number generation and the Random stdlib triage This should be discussed on a triage call labels Feb 24, 2020
@rfourquet
Copy link
Member

That's amazing!! I was secretly hoping you would work on that since the last discussion, and you did :D
Of course I can help with the Sampler API, but it's only a tiny part of this PR. I wish I was able to understand how the rest works (maybe with perseverance...)

@StefanKarpinski
Copy link
Member

I guess one immediate (and possibly naive) question I have is whether the task overhead could be reduced to just a pointer to the RNG state which could be lazily initialized.

@chethega
Copy link
Contributor Author

I guess one immediate (and possibly naive) question I have is whether the task overhead could be reduced to just a pointer to the RNG state which could be lazily initialized.

Lazy initialization is fundamentally impossible, and pointer indirection is too slow.

A pointer is 8 bytes. I currently eat 32 bytes. If 32 are too much, then it is probably possible to cook up a counter-based RNG: All tasks share some global key object, and each task just has a counter. Now 4 byte are not enough (would overflow), but there are already 7 juicy bytes of padding after uint8_t sticky;, so there would be zero space overhead on 64 bit, and 4 byte on 32 bit. With this patch, my 2ghz broadwell laptop measures

julia> h()=1;
@btime Task(h);
julia> @btime Task(h);
  112.238 ns (5 allocations: 720 bytes)

julia> @btime schedule(Task(h))
  561.425 ns (5 allocations: 720 bytes)
Task (done) @0x00007fbb959ddfc0

The overhead from this patch is definitely <= 32 byte and < 60ns. Imo the correct thing to compare to is time of task creation + scheduling, because tasks that never get scheduled are not very interesting.

I'm not sure how to measure the additional task creation overhead, with whatever gcc emits. The equivalent julia code is

julia> h() = Base.hash_uint64(rand(Random.TaskLocal(), UInt64))+Base.hash_uint64(rand(Random.TaskLocal(), UInt64))+Base.hash_uint64(rand(Random.TaskLocal(), UInt64))+Base.hash_uint64(rand(Random.TaskLocal(), UInt64));

julia> @btime h()
  17.712 ns (0 allocations: 0 bytes)
0xd0554f2a7c13e966

But if we're lucky the overhead is smaller, because the superscalar CPU can fit the arithmetic into something while waiting on memory / other long dependency chains.

This part will probably get slower as this PR progresses: I don't know whether murmur3 is good enough for the job, waiting on @vigna to comment or on myself to write something cryptographic.

@StefanKarpinski
Copy link
Member

Would be great if @JeffBezanson or @vtjnash could chime in with some advice on how to measure the task overhead this introduces.

@tkf
Copy link
Member

tkf commented Feb 24, 2020

Lazy initialization is fundamentally impossible

Maybe I'm too naive but can't you just take the snapshot of the parent RNG state and store it in the child task with a flag somewhere indicating that it's not initialized yet? Or is it fundamental that the parent task's RNG state is advanced when the RNG is split? Or the initialization (including calling a "practically random function") is very cheap compared to the entire task creation?

@chethega
Copy link
Contributor Author

chethega commented Feb 24, 2020

Or is it fundamental that the parent task's RNG state is advanced when the RNG is split? Or the initialization (including calling a "practically random function") is very cheap compared to the entire task creation?

Both of that.

So, I locally reverted to master, to compare:

julia> @btime Task(h)
  105.717 ns (4 allocations: 704 bytes)
Task (runnable) @0x00007f082d00c010
julia> @btime schedule(Task(h))
  519.275 ns (4 allocations: 704 bytes)
Task (done) @0x00007f0808259fc0

The random state initialization with murmur3 adds a whopping 7 nanoseconds to task creation time, and a whopping 16 bytes to the counted allocated memory. With immediate scheduling that presumably has some memory barriers I get the more realistic 40 nanoseconds (probably something with superscalar execution).

PS, for comparison, on my machine:

julia> function foofun(A)
       s=0
       for x in A
       s+= x
       end
       end

julia> @btime foofun(collect(Any, 1:1000))
  36.710 μs (1459 allocations: 30.72 KiB)

A single dynamic dispatch with tiny allocation appears to cost ~30 ns on my machine.

@JeffBezanson
Copy link
Member

This is really cool, impressive work! These are my quick initial thoughts:

  • It would be nice to use less llvmcall. If the code is written/unrolled such that shift amounts are constant, they should all be fully optimized. If we really need the "unsafe" shifts and rotates we should add intrinsics.
  • I have a PR for codegen of current_task; I might as well merge that first. I believe the tbaa_const is correct.
  • Adding 32 bytes to a Task object doesn't seem so bad to me; it's much cheaper than allocating an extra object.
  • Creating and switching tasks is currently way too slow. We need to do a lot of optimization work. After that the costs might look different. But growing Task a bit should not be a problem.
  • The old principle "don't pay for what you don't use" (popular in e.g. the C++ world) comes to mind. But it might be a mistaken first impression --- the writeup here is pretty persuasive that without built-in support this kind of functionality is simply not possible. It feels strange to me to make RNGs this low-level, but it might be necessary.

@chethega
Copy link
Contributor Author

chethega commented Feb 26, 2020

It would be nice to use less llvmcall. If the code is written/unrolled such that shift amounts are constant, they should all be fully optimized. If we really need the "unsafe" shifts and rotates we should add intrinsics.

We don't really need unsafe shifts. It's just that we're in stdlib, so no Simd.jl. The obvious alternative -- ntuple(i->VecElement(x[i].value [...] ), 4) -- generates very long SSA-IR, confused the inliner and was pretty hit-and-miss with respect to vectorization. Writing it in llvmcall is no longer and imo no less readable. I would instead be tempted to go the other round, and implement the whole thing in IR (including loop and store). Simd.jl is moving towards @generated IR away from ntuple as well.

The main reason that bulk generation is in julia instead of C (dsfmt style) is that either I suck at C, or statically compiled-for-distribution C for tight kernel loops cannot compete, performance and convenience wise, with handwritten llvm-IR or julia that is "late AOT / early JIT" compiled and knows CPU capabilities beyond "generic". (rant: please clang, give me an analogue to llvmcall, maybe __asm__llvm).

The single-number case must be in julia, for performance reasons.

I have a PR for codegen of current_task; I might as well merge that first. I believe the tbaa_const is correct.

Yay, #32812. I should have searched better, would have saved me some time ;) I'll rebase once that merges.

I'm taking the general feedback as "go ahead and clean up the mess"? Cleaning up the mess means fixing code formatting, typos, bugs, writing tests, etc? Or would that be too early?

@JeffBezanson
Copy link
Member

If you'd like to clean it up and add tests that's fine, but I think I need to do some more background reading here and maybe find some others to help review this. This will take a bunch of thought, but it does seem very interesting so far.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Feb 27, 2020
@chethega
Copy link
Contributor Author

Then let's wait with the cleanup. In the meantime, I opened a discourse thread to solicit some user input, at https://discourse.julialang.org/t/reproducible-multithreaded-monte-carlo-task-local-random/35269

I'll continue to work on this as my schedule permits, but feel free to push on this branch or fork from it if I become too unresponsive.

@c42f
Copy link
Member

c42f commented Mar 22, 2020

It's wonderful to see someone attacking this problem, and very impressive progress so far!

  • The old principle "don't pay for what you don't use" (popular in e.g. the C++ world) comes to mind.

This is my main concern, though I'd point out that logging also gets special attention because it copies the current logger. In that case it's simply copying a pointer rather than running code, but nevertheless we carry along some special state during task spawning which other libraries couldn't emulate.

I feel like certain things like loggers and RNG state are naturally dynamically scoped for composability. But in general that involves passing user defined state from parent to child task, and in some cases running code to update that state during spawning. It's really irksome that carrying the state and running such code might be for nothing if callees don't use the facilities it provides. It also seems non-sustainable to be hard coding this kind of stuff into the runtime if there's other dynamically scoped facilities people would like to behave in a similar way.

That doesn't mean I'm against this change. In fact I'm cautiously positive if we can pay such a small cost for a decent benefit which is very relevant to the community.

I do wish there was a generalization in sight though which would allow a normal user package to act in a similar way. What would you need for such a package? I would guess:

  • Super fast access to the task-local RNG state. Perhaps a few indirections, but which can be LICM'd?
  • Some hook allowing user-defined code to run on the parent task to populate child task state.

Currently I guess the task local state access in this patch is one load relative to one of the segment registers to get the task pointer (LICM'd as mentioned?) + another to get the RNG state? In the past I've wondered whether it would be possible to use types as unique keys into a task_local_storage-like thing, and to get the compiler to generate good code for loading state relative to these keys. Vaguely in the spirit of pthread_key_create etc.

Copy link
Member

@c42f c42f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a quick browse of the code changes. It all looks fairly reasonable at surface level — here are some superficial comments of a few minor things I happened to notice.




#needed because julia semintics and x86 semantics differ on shl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This triggered a memory; I wondered whether the following would help with this:
https://discourse.julialang.org/t/how-to-shift-bits-faster/19405/2

stdlib/Random/src/xoroSimd.jl Outdated Show resolved Hide resolved
stdlib/Random/src/xoroSimd.jl Outdated Show resolved Hide resolved
src/task.c Outdated Show resolved Hide resolved
@chethega
Copy link
Contributor Author

chethega commented Aug 12, 2020

Ok, I did a big rebase / refactor / clean-up of the code. The batch of llvmcall is now restricted to the submodule responsible for bulk generation. Reliable good compilation results still elude me if I try to go without llvmcall.

We now also get a free-standing xoshiro -- i.e. one that uses conventional state management instead being baked into the runtime.

I am somewhat hyped to get this feature into 1.6, if possible. The way I see it, we should proceed along:

  1. Decisions:
    1.1. Is task-local RNG something we are willing to pay for? Minor costs are a handful of nanos per task creation and 32 bytes per task. Major cost is that this sets a precedent: Not every possible feature can get dedicated runtime support. I think I argued as best as I can that reproducible random stream merkle-tree for multithreaded simulations that use our nice composable constructs is fundamentally difficult to impossible without such runtime support. I think I argued as best as I can that the runtime costs are very moderate to low, as evidenced by this PoC. Are there still questions regarding these two points?
    That still leaves us with a trade-off -- we can't have every imaginable feature, and sometimes the core team needs to say no.
    1.2. Should task-local become the default RNG? I think this is a no-brainer: If we decide to pay for task-local random being possible, then we better make it the default!
    1.3. If we decide against task-local: Then we could decide to make the more conventional free-standing xoshiro the new default. We would not gain the nice advantage that the random-tree is independent of scheduler decisions, but we could at least grab the speed-ups. (free-standing has similar perf to task-local. I kinda hoped to use the perf gains as a bribe to pay for the reproducibility...)

  2. Tuning. If it looks like this moves towards acceptance, then we need to fine-tune magic constants. The newly rebased version has two tunables: The vector width, and the threshold for switching between simd and ordinary operation. Most interesting tunable is the vector width, because its optimal setting depends very much on the host CPU. Benchmarks show that too long widths are not extremely deady to performance (i.e. the register spills don't kill us). I think N=8 is an appropriate compromise: Optimal for avx2, wide enough to utilize avx512 and hopefully tolerable for weaker cpus.

Extra long benchmark of vector widths on broadwell (avx2)

julia> for N in [1,2,4,8,16,32]
         println("============== Vector width: $(N)  ==================")
         @eval Random.XoshiroSimd xoshiroWidth()=Val($N)
         for T in [UInt64, Float32, Float64, Bool]
             println("\n")
             @show T
             for bulksize in [1<<9, (1<<11) - 8, (1<<11) + 8, 1<<13, 1<<15, 1<<20, 1<<28]
                 arr = zeros(T, Int(bulksize/sizeof(T)))
                 throughput_default = 1e-9 * bulksize / (@belapsed rand!($arr))
                 throughput_tasklocal = 1e-9 * bulksize / ( @belapsed rand!(Random.TaskLocal(), $arr))
                 @show  bulksize, throughput_default, throughput_tasklocal
             end
         end
       end
============== Vector width: 1  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.117372127025688, 3.5395870109610414)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.1476117453285624, 4.277049794084612)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.1677423597606236, 4.286539078370577)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.083522569925228, 4.498874183096271)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.32381078049746, 4.557916333414473)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.0100962200364085, 4.586826242530817)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.1580543796899265, 4.387440573682439)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.706401526781898, 2.7585232291114647)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.2736901227633792, 2.9915197823507276)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.3318692217774792, 3.604027932588338)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 3.929645027182603, 3.7784041408291906)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 3.87572938022394, 3.823421881684882)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 3.3344760148188195, 3.309930333936243)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 2.531206469105317, 2.7118997871611596)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.415148298579738, 2.7674338051087073)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 5.450650696763791, 3.0350773456425126)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 5.520589982181747, 3.5971056109690123)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 6.255345143555285, 3.7824748614816337)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 5.179073810652759, 3.8219353835387433)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.723357868088902, 3.284107138382902)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.009693153649714, 2.69424843298056)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.8854183287274306, 4.612206572769953)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.601908001774885, 3.969032467269264)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.5702731063173836, 3.910584325227995)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.215943595285883, 11.871878542909252)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.209120102761722, 12.07814227792112)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.186483648544519, 12.221023064999244)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.6208831753875486, 8.955187010582272)
============== Vector width: 2  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.0764316472781026, 3.540022992281163)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.1449288835177094, 4.331914063769623)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.1699406121516676, 5.437038189313216)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.07314513010331, 5.951325826371232)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.325237592397044, 6.057678087256594)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.009452215076131, 6.114502303341303)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.1501046890073865, 5.786876076771355)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.795596367510527, 2.7618940429294963)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.272126595006105, 2.9918661533562974)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.350313507386022, 4.694875964432758)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 3.929645027182603, 5.209870262019842)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 3.9004880371384365, 5.3272638595350355)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 3.3332146580881417, 4.476636511507772)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 2.518532019247337, 3.379211858127549)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.4057646564992305, 2.766549965629147)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 5.446239913066227, 3.044370936466294)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 5.554803848350394, 4.7301712718737665)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 6.264912817375345, 5.210864448826411)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 5.6913590968302215, 5.32103536747751)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.712510505193049, 4.350414265503321)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.012874423188122, 3.370384578263321)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.986739133274977, 4.594366553971765)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.6052228353340414, 3.966339581371328)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.5865453634921964, 3.8781284436458225)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.181298489179256, 20.517701276094826)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.189477721664643, 21.80753360841209)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.186868122214946, 16.93135909318435)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.6129104433789383, 8.80515396914024)
============== Vector width: 4  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.1385620066332303, 3.5395870109610414)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.172285329386132, 4.277873474060363)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.1451956079454573, 8.81617308727789)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.113482299774041, 11.262592175978003)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.32609413162585, 11.900730398289014)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.005424215685151, 12.215186039467861)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.1571175729592422, 9.018107102753033)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.38906379793049, 2.486868461095398)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.252528099785067, 3.0090801118731085)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.325627632687448, 7.307755182919279)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 3.9384615384615387, 9.226091037071798)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 3.9119742130606077, 9.716234247590808)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 3.3011147735035875, 6.90702377266769)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 2.529232625708447, 4.50712283202278)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.316562415012239, 2.7864335261963373)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 5.406055718856488, 3.0466297585754236)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 5.578439738070814, 7.20807352315128)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 6.244378382498666, 9.178711484593837)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 5.709705523610386, 9.712994182815223)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.721762663628627, 7.0148247257158145)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.004385924020011, 4.481474505531103)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.884594414490822, 4.543678094467249)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.6203945998315845, 3.9588429117224924)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.6205028175119214, 3.9083951231649667)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.209877177655583, 35.185234014502306)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.192425793244627, 35.82541646287079)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.179575177075985, 16.6912228200312)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.622819457709379, 9.011759329715723)
============== Vector width: 8  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.129800432256435, 3.539732326135151)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.182995270110201, 4.2770040433861745)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.1449695227941694, 8.973142345568487)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.126952141057934, 12.66424315106997)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.321387359467212, 14.022061620387984)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.00562312817065, 14.645120741909803)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.1496910495491726, 9.03170672605721)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.619869554250297, 2.7228783831622314)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.27501244827527, 2.985375497901786)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.3528585568176466, 7.48121194653601)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 3.9359385009609227, 10.072700548363391)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 3.8664306784660774, 11.020627802690584)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 3.334200342776105, 7.569251647645653)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 2.52411837627272, 4.4183456027730115)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.359575038251744, 2.7840306611306294)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 5.450553384064201, 3.0449787513191295)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 5.572739238531663, 7.482829447587475)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 6.255822833142422, 10.097572591150394)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 5.176292177429547, 10.983687150837989)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.801633856735309, 7.562591505412794)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.9887125429272317, 4.3979290783757845)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 3.031506345644264, 4.630971452452264)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.580314273243125, 3.9690726872916304)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.586707199115656, 3.838473240215738)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.210093534792887, 32.739836699613235)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.186934994409839, 35.204125483455094)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.185714912998527, 17.214604675597585)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.6259548904611205, 8.960925809762575)
============== Vector width: 16  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.119882535843842, 3.5385991841023072)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.13520103716805, 4.273118808177747)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.180580601817475, 6.472250874067328)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.100715823196676, 10.551617873651772)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.3275224511357635, 11.308409064764755)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.016732298545883, 7.48817047653734)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.1574920135354767, 5.767567313918249)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.6582630780847305, 2.7017922134768617)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.2943620738502215, 3.0062123765952324)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.352798365596915, 5.533018929546554)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 3.942884646237767, 8.512053200332504)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 3.8635434680081744, 9.10411891366257)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 3.302913661133336, 5.247629104339428)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 2.530089846006591, 3.4919580252879854)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 3.9554246189317284, 2.757925326525861)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 5.521222149536306, 3.0345853477129046)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 5.5005075564241475, 5.552822718095289)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 6.260603744745892, 8.5231233418301)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 5.32103536747751, 9.297205277344306)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.726338468750282, 5.14307856053286)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.981548878099816, 3.445165604780735)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 3.05298909573789, 4.630640362161579)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.5848256861290664, 3.8965168955068648)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.599209411764706, 3.8735820394957488)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.196076422680941, 24.864405193277086)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.174799337495222, 20.24966011617847)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.185664787877821, 12.101560336072385)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.6197155071741607, 7.958164085333574)
============== Vector width: 32  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.1199301517579077, 3.5401683432560054)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.138134649435555, 4.3317263420183085)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.166501764145837, 4.115369058607565)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.048986764786644, 8.297312037397742)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.317828435894057, 9.54396184512324)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.020875593884571, 8.58390909983955)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.156305940161569, 6.313756632361514)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.709154569416578, 2.761827434019599)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.2657813352737857, 3.0095124965274116)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.346507443153935, 3.6677664390393625)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 3.8759331300599307, 6.763540290620873)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 3.9069989269106955, 7.128348561128722)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 3.301935987706415, 5.490990401281924)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 2.5349197001676744, 3.624287892467378)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.389513630191833, 2.7736889344787796)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 5.477228415413814, 3.0351641853678073)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 5.599417744699834, 3.703433993160912)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 6.252957789481719, 6.740167845976633)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 5.760900140646977, 7.241318348276298)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.720083546400662, 5.463552901699649)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.984108534391688, 3.559141923610882)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 2.9997122904297338, 4.613605455488265)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 3.6169534137601085, 3.805932777565028)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 3.599481390336354, 3.877858958068616)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 4.217680070020079, 16.077193556300596)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 4.205878577846232, 17.481860862142554)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 4.185013270539402, 11.789833481374876)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 3.623137881909648, 7.7450906823400745)

  1. Code quality and review. Nitpicks and comments are very welcome!

@chriselrod
Copy link
Contributor

chriselrod commented Aug 13, 2020

Results on a system with AVX512:

============== Vector width: 1  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.2757499063473645, 8.092668090338446)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.740279144346051, 8.738858784163927)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.637414186135379, 8.529749168751069)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.674097191261703, 9.02351710084265)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.523505049771126, 9.136802481614444)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.424131577764797, 9.154351166364018)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.377260254346873, 9.141597065131236)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.271612882807803, 5.082979147645584)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.0474539024139, 5.3814252558688125)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.471307362466929, 5.121870606516539)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.702143663031216, 5.394442249440274)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.872867684915052, 5.3051268213707505)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.913763887515248, 4.829009721793674)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.3880977552123195, 4.268776609837788)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.734398178493809, 4.999401333581058)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.029538861457446, 5.602426104046543)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.535898892559318, 6.63076576914423)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.711104842501177, 7.199859377746529)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.791664294254968, 6.998077920493029)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.54792067609405, 6.219385757838172)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.58361271616484, 5.155057751489517)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 6.032262780837273, 9.182015345641439)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.631386861313869, 7.9850927293977545)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.667888092726898, 7.848128221102113)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.895816033216224, 23.462726518375764)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.164912771387618, 24.124272988294194)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.167717285817954, 24.322701862633668)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.452277194366627, 15.73074306516174)
============== Vector width: 2  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.346622175928913, 8.025181473630115)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.739678988579901, 8.770834560339242)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.671546617364073, 12.905818496558085)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.58100558659218, 14.272864024982388)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.528351264902588, 14.550621669626997)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.43252433297951, 14.568006890994473)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.372379314652598, 14.057143748877058)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.250021340162186, 5.084366438176417)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.066490443269622, 5.385796788728819)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.4777499252581485, 11.312610006581266)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.724658180103725, 13.124479314480324)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.855073456388482, 13.513815699033131)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.765923125068558, 10.711669101347418)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.374775663462648, 7.352087499319734)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.655567331031419, 5.081539300851904)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.004049807592128, 5.604249667994687)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.57331426608624, 11.329777123762225)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.832355438166209, 13.12327789016944)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.05312721136574, 13.500824024903864)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.477431828788585, 11.029979172364463)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.516939488314454, 7.335982738121835)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 6.004337133372909, 9.155484480628045)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.656194236901787, 7.978994617915586)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.667665682673459, 7.850236392017702)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.87948760059123, 51.55054700731216)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.162350145049107, 56.113305013387325)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.136404429767621, 41.78092999163247)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.468780640716049, 16.174476167430235)
============== Vector width: 4  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.3484053810338805, 7.921876981610653)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.751810606650572, 8.783163440839076)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.719912841597386, 19.529354087510725)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.652398688087166, 23.003851091142494)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.526273711752308, 23.983019834589772)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.4493547689424, 23.87957459406527)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.389122469218704, 15.81244534264236)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.276654096674666, 5.096778945817681)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.04977664326739, 5.2161548631726)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.4870976822745625, 17.295658115869767)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.713747645951036, 23.219460265076915)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.853190906600932, 24.96419320432729)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.934430240786176, 17.878838513870654)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.3963456911488015, 7.57705647430128)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.095575618595552, 4.939951093240023)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 10.994517731711495, 5.608338762214984)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.584927373517651, 17.1107618225594)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.7782872606318, 23.264039251239936)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.06012089527855, 24.994660564454616)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.587360158333475, 17.18441796817385)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.46564994707003, 7.557070899103401)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 6.0239691442893815, 9.146010710719796)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.652161794934328, 7.9597426310755575)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.6335474465855775, 7.896615607067011)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.88134944676509, 83.01245446437207)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.165553651970212, 87.47577872695072)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.1508360386777, 52.21471964943731)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.419267118873478, 16.120388745515385)
============== Vector width: 8  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.302717729251253, 7.727496022121638)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.74069946743032, 8.77066237386627)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.630145712238236, 22.1360595048059)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.601528320597328, 40.367821608233996)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.531469294258809, 50.071838213099916)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.489032071237242, 46.05683664953661)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.388097109830302, 15.113085786961788)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.276201902476648, 5.1053746666526125)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.029760621091222, 5.214919007893457)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.514445953256275, 16.589052672781136)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.6855239703536915, 29.33284283069075)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.822658754518792, 35.480657545033964)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.834765151416392, 21.444149044950716)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.397779860130736, 7.263325298281252)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.791416368623912, 4.99074703760948)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.031460448286259, 5.302311496921369)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.489921285927252, 17.37287127240277)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.770523824728135, 29.155937410242643)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.844413982179578, 35.301754995655955)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.464041837380503, 23.715927081919755)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.598782026182912, 7.441702103877209)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.979805787719213, 9.159812370459257)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.233953172298758, 7.972393372667697)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.582404583251187, 7.866126267270644)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.89042101567238, 84.5365215114757)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.161709712368506, 109.16038064741049)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.093933480768397, 48.599184278828325)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.480986839902164, 15.265156457422892)
============== Vector width: 16  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.343714766279135, 7.740410058027079)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.734773453949933, 8.781868574685982)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.718601538098113, 19.064076659931587)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.601237842617152, 39.11714770797963)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.537017499181433, 53.09068432340298)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.383734249713632, 41.860992454788615)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.271051712398194, 14.998236710755695)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.2225489051404805, 4.990541348115483)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 5.98783982109162, 5.156197472452219)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.507907797507445, 15.37113984756359)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.632535171899749, 28.23175136069994)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.827196724108514, 35.5389964592172)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.800280162909544, 20.781164532878833)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.438002776328215, 7.432791437004642)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.737412430548366, 5.0068949833052505)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.01719268691473, 5.604032498412657)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.542699038933092, 15.350222275314216)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.817366313309776, 28.092347309511368)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 15.88794310957871, 35.71660409349643)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.563204529085642, 22.384902760284355)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.453872394145193, 7.363418360529808)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 6.01423677753213, 9.200124175964648)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.596463573435679, 7.989319092122829)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.620724404341331, 7.841606371641708)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.878518376844857, 70.24313843605286)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.164912771387618, 88.07452832650505)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.155630064263761, 44.3278799408159)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.372652454236797, 15.299578849792344)
============== Vector width: 32  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.360980582974188, 8.161761343220201)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.727334516482272, 8.783077104549207)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.727231782113951, 10.925464969934277)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.64807200800089, 28.35036459064963)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.526619907693426, 46.913112164297004)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.4096701245558, 42.7658550511848)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.418603840911265, 15.285752633340612)


T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.2559562515323694, 5.003136866435529)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.050730434568569, 5.210435822680721)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.452562182210946, 9.437308920013812)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.751703255109766, 21.83713557370049)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.829868578255676, 32.673247582012166)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.851378017066764, 21.441956526184487)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.3849451783547515, 7.32115340500625)


T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.938736641625225, 5.082392444910808)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.001286675525744, 5.607535076405987)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.579621476356568, 9.457378475819487)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.803059255424294, 21.877206466657114)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.040902909980964, 32.443564356435644)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.53794176186255, 21.865832551350227)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.58626854642585, 7.348995046790955)


T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.919039689480432, 9.089249309928016)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.639563599262617, 7.9882198952879575)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.6475235210797265, 7.858708727276817)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.884465709728868, 41.48898455305141)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.164271980423003, 59.74024771385577)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.041396852769994, 43.853289281084024)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.453375269282844, 15.045824345939197)

A vector width of 8 performs best.

FWIW, this is what I get with VectorizedRNG.jl:

julia> for N in [2VectorizedRNG.W64]
                println("============== Vector width: $(N)  ==================")
                @eval Random.XoshiroSimd xoshiroWidth()=Val($N)
                for T in [UInt64, Float32, Float64]
                    println("\n")
                    @show T
                    for bulksize in [1<<9, (1<<11) - 8, (1<<11) + 8, 1<<13, 1<<15, 1<<20, 1<<28]
                        arr = zeros(T, Int(bulksize/sizeof(T)))
                        throughput_default = 1e-9 * bulksize / (@belapsed rand!($arr))
                        throughput_tasklocal = 1e-9 * bulksize / ( @belapsed rand!(Random.TaskLocal(), $arr))
                        throughput_vrng = 1e-9 * bulksize / ( @belapsed rand!(local_rng(), $arr))
                        @show  bulksize, throughput_default, throughput_tasklocal, throughput_vrng
                    end
                end
              end
============== Vector width: 16  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 4.317000627597937, 8.158099675784877, 31.84444721425901)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 6.731527190352106, 8.782055652411051, 45.01070854899161)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 6.637851514113275, 19.02396800286542, 46.393909693889185)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 8.58050625545534, 39.28219040789349, 51.66561021835035)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 9.53354911444885, 52.93798384460943, 53.7593280083999)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 9.432863748403232, 42.98147237252009, 42.106412881982095)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 5.438526861203929, 15.268634801450354, 15.285453211337668)


T = Float32
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 5.1034855894287805, 5.068721523505575, 31.081265206812652)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 6.0572077507275175, 5.252449133383572, 42.308192867188644)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 6.509435920858681, 15.706879955195411, 41.65001436722631)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 7.594326504125336, 28.297063903281522, 44.78717885276912)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 7.860995921724528, 35.79616179509891, 44.99023183666625)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 6.933192277175351, 23.920976388730466, 43.4174982402385)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 4.403767046491716, 7.357364196926021, 15.373862891045325)


T = Float64
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 8.725876140182429, 5.13528618026116, 31.80282566751728)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 11.019134943891538, 5.633072906442621, 41.93670543684068)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 11.421664212384934, 15.460772265144941, 41.58770365816171)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 13.821794497176954, 28.223673762135302, 44.90401465987021)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 16.047885944387005, 35.65614798694233, 44.99966454209997)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 17.519188678930046, 23.172950276243093, 41.199795685827674)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 13.703102286017746, 7.409493045077719, 15.307004960776135)

It doesn't guarantee reproducibility across architectures, so it defaults to 2x or 4x SIMD vector width, depending on the sampler (here, UInt64 uses 4x, while the floats use 2x).

I would set the SIMD threshold to be much smaller. That seems like it would be highly beneficial on Broadwell as well as Skylake-X/Cascadelake-X.
Yet the bulk_simd function still leaves a lot of performance on the table for "smaller" arrays like 2056 and 8192, that aren't bottlenecked by memory bandwidth. The bulk_simd's loop looks good, so this difference is presumably just the result of a large constant overhead.

%lshifted = shl <$N x i64> %0, %lshiftOp
%rshifted = lshr <$N x i64> %0, %rshiftOp
%res = or <$N x i64> %lshifted, %rshifted
ret <$N x i64> %res
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use the funnel shift intrinsic.
Although LLVM compiles either approach into to

vprolq  $45, %zmm0, %zmm0

given AVX512.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afaiu fshr was introduced in llvm 7.0.0 (release late '18) -- do we still support older llvm versions? Regardless, either compiles ok at this moment.

@vigna
Copy link

vigna commented Aug 13, 2020

How are you initializing the parallel instances? This is something that should be done with some care (e.g., using long jumps).

oneBits(::Type{Float32}) = Base.exponent_one(Float32) + UInt64(Base.exponent_one(Float32))<<32


function forkRand(rng::Union{TaskLocal, Xoshiro}, ::Val{N}) where N
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are you initializing the parallel instances? This is something that should be done with some care (e.g., using long jumps).

Good point. It seems to be using forkRand.
FWIW, VectorizedRNG.jl does initialize them with jumps.



#Vector-width. Influences random stream. We may want to tune this before merging.
xoshiroWidth() = Val(8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The capitalization in this file seems a bit off (camelCase). Is that intentional?

@chriselrod
Copy link
Contributor

chriselrod commented Aug 13, 2020

The tests in #27614 suggest that 256 or 512 byte of ephemeral state would be even faster on powerful hardware; but this variant is conservative (changing the size is a simple search-and-replace), I don't want to break rasberry pi perf.

I don't oppose throwing old architectures under the performance bus, but still probably worth pointing out that it may also regress CPUs lacking AVX2, such as sandybridge. SIMD bitshifts were introduced there.

Maybe I shouldn't set MARCH=native in my Make.user so that my system image would be compatible.

@chethega
Copy link
Contributor Author

chethega commented Aug 13, 2020

@vigna

How are you initializing the parallel instances? This is something that should be done with some care (e.g., using long jumps).

The fundamental model is not that of a random stream, but instead of a random tree: I.e. we don't have a map rng(initial_seed, k::Int)::UInt64, where we care about correlations in subsequent outputs. Instead, we have a map rng(initial_seed, splits::Vector{Int}) ::UInt64, where the each entry in splits tells us how many numbers were extracted from the parent RNG before splitting off the next (the original stream is found at rng(initial_seed, [k])). You can write a paper about a bigcrush version for that model ;)

I.e. given an RNG parent-state, we need to (1) advance the parent at least once, and (2) initialize the child RNG, such that future output of the child will be uncorrelated (in the sense of e.g. bigcrush) to future and past output of the parent. They will obviously not be independent (we are writing a PRNG after all) and their dependence will be discoverable by clever tricks (we are not writing a CSPRNG after all). These two operations fit together like bread and butter! I mean, we must harvest pseudo-random numbers from the parent (in order to avoid collisions if the parent forks twice without generating randomness in between), and we are in need of pseudo-random numbers to seed the child. Gee, what are we to do, I wonder?

The code is:

void rng_split(jl_task_t *from, jl_task_t *to) {
  /* Fixme: consider a less ad-hoc construction
  Ideally we could just use the output of the random stream to seed the initial
  state of the child. Out of an overabundance of caution we multiply with 
  effectively random coefficients, to break possible self-interactions.
  
  It is not the goal to mix bits -- we work under the assumption that the
  source is well-seeded, and its output looks effectively random.
  However, xoshiro has never been studied in the mode where we seed the
  initial state with the output of another xoshiro instance. 
  Constants have nothing up their sleeve:
  0x02011ce34bce797f == hash(UInt(1))|0x01
  0x5a94851fb48a6e05 == hash(UInt(2))|0x01
  0x3688cf5d48899fa7 == hash(UInt(3))|0x01
  0x867b4bb4c42e5661 == hash(UInt(4))|0x01
  */
  to -> rngState0 = 0x02011ce34bce797f * jl_tasklocal_genrandom(from);
  to -> rngState1 = 0x5a94851fb48a6e05 * jl_tasklocal_genrandom(from);
  to -> rngState2 = 0x3688cf5d48899fa7 * jl_tasklocal_genrandom(from);
  to -> rngState3 = 0x867b4bb4c42e5661 * jl_tasklocal_genrandom(from);
}

and

function forkRand(rng::Union{TaskLocal, Xoshiro}, ::Val{N}) where N
    #  constants have nothing up their sleeve. For more discussion, cf rng_split in task.c
    #  0x02011ce34bce797f == hash(UInt(1))|0x01
    #  0x5a94851fb48a6e05 == hash(UInt(2))|0x01
    #  0x3688cf5d48899fa7 == hash(UInt(3))|0x01
    #  0x867b4bb4c42e5661 == hash(UInt(4))|0x01
    s0 = ntuple(i->VecElement(0x02011ce34bce797f * rand(rng, UInt64)), Val(N))
    s1 = ntuple(i->VecElement(0x5a94851fb48a6e05 * rand(rng, UInt64)), Val(N))
    s2 = ntuple(i->VecElement(0x3688cf5d48899fa7 * rand(rng, UInt64)), Val(N))
    s3 = ntuple(i->VecElement(0x867b4bb4c42e5661 * rand(rng, UInt64)), Val(N))
    (s0, s1, s2, s3)
end

@chethega
Copy link
Contributor Author

Maybe I shouldn't set MARCH=native in my Make.user so that my system image would be compatible.

This is a very good point: We don't want to distribute fully compiled code for generic arch in the sysimage for this function. Due to all the @noinline, we need to compile at most 3-6 variants of the very small kernel loop, so the JIT cost should be negligible.

@chriselrod

I must admit that I don't understand https://github.com/chriselrod/VectorizedRNG.jl/blob/865757a23905686abff1202e83d5fafccb3ae295/src/api.jl#L129

The way I generate bulk Float32 / Float64 is that I first use xoshiro and masking to generate random floats in [1,2), and then subtract 1.0 in a second loop. For some reason (you probably know why?) I got pretty bad perf by running the floating-point subtract on the register that holds the xoshiro result -- so instead, I round-trip through memory.

A more advanced design would chunk that: Fill 8kb of data, spill vector-rng state to the stack and subtract 1, go on. That way, we could avoid the very annoying doubling of memory traffic.

An even better design would figure out how to solve this very annoying problem, and then try to use non-temporal stores (no need to trash the entire L1d).

Both of these optimizations could be done without changing the random stream -- i.e. this PR doesn't lock us into a suboptimal design.

Yet the bulk_simd function still leaves a lot of performance on the table for "smaller" arrays [...] so this difference is presumably just the result of a large constant overhead.

Presumably. Unfortunately we can't afford to keep the vectorized rng alive :(
Thanks for the comments and benchmarks! You're right that we probably should decrease the threshold.

@chriselrod
Copy link
Contributor

The bulk generation of floats currently traverses the array twice: First to do all the bitwise ops, then to do all floating point ops. This is due to x86 issues: There is afaiu some split between how registers are used. Hence we must commit the intermediate values to "memory", and cannot fully interleave the bitwise and floating point (x-> x-1) operations. Main memory is slow, so the correct way would be to write ~25% of L2d with the RNG, then run over the data again for the floating point ops, etc. Still faster than DSFMT on my machine, but might be bad for some users (if they are bandwidth-bound, instead of CPU-bound, and we perturb other threads/cores by hogging whatever shared resources, e.g. shared L3).

Okay, testing on Haswell (after a change on master to make UInt64 sampling use 2x vector width instead of 4x to stop spilling registers on 16-register systems [no performance benefit with 32 registers anyway]):

============== Vector width: 8  ==================


T = UInt64
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 1.380471877908609, 2.8312320283123205, 7.50946615382313)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 2.363368885619713, 3.228238710938591, 9.692474583112592)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 2.3086086248982918, 6.589561517420497, 9.670169491525424)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 3.12406779661017, 9.355157974876287, 10.399911256825213)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 3.4140445926234633, 10.365110118223875, 10.667968908965126)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 3.2505820243597734, 10.829040586595063, 10.83956334766788)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 2.003070157690122, 4.785000072104206, 4.7845748285930565)


T = Float32
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 2.0273678237932264, 1.9461093378148229, 6.6193994153600855)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 2.5065701901088775, 2.2042883673813947, 8.113502565770833)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 2.5714078989529394, 5.485592315901815, 8.047094762298734)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 3.002565668906536, 7.4209620436633745, 8.650826530197799)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 2.994699323706818, 8.15559111111111, 8.871801813997564)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 2.59226406791561, 5.656787113062806, 8.998798530774776)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 1.5789669106483593, 2.382474101715775, 4.803865090209645)


T = Float64
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 2.771631519756662, 1.818756448263703, 6.484977870346265)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 4.4685190180739145, 2.2339163498098857, 8.150624670068625)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 4.470342668290138, 5.523998330550919, 8.102293212593805)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 5.293699515347335, 7.417602318000724, 8.648367711547833)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 4.752567152056622, 8.131593874078277, 8.880517632711136)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 3.9372488941957484, 5.727200729707735, 9.000884142939304)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 3.3682848702750454, 2.364016911760853, 4.7888480960911375)

As before, we really see both the cost of passing over the arrays twice when they are large, and the overhead of forking when they are small.

But

The way I generate bulk Float32 / Float64 is that I first use xoshiro and masking to generate random floats in [1,2), and then subtract 1.0 in a second loop. For some reason (you probably know why?) I got pretty bad perf by running the floating-point subtract on the register that holds the xoshiro result -- so instead, I round-trip through memory.

No, I don't know why, and I also don't think I'm seeing this. I tried this on the Haswell laptop because Haswell (and Broadwell) only have a single execution unit for floating point vector addition/subtraction (Skylake and beyond have 2).
Also, addition/subtraction use port 1, shifts use port 5, and xor uses 0, 1 or 5. Skylake added port 0 as an option for add/sub. I thought maybe the inflexibility caused problems from Broadwell/Haswell, but the pattern in relative performance between this PR and VectorizedRNG look similar between Haswell and Cascadelake-X.

Broadwell/Haswell do have 2 execution units & ports available for fma, so you could try fma(x, 1, -1) instead of x - 1 (although the multiplicative 1 can't be a compile-time constant, otherwise the compiler will get rid of it).

As an aside, given the large dip in performance for large arrays, I think I'll probably add a version that aligns the destination pointer and then uses nontemporal stores. Then a size check (e.g., vs L2 cache size) can pick which version.

A more advanced design would chunk that: Fill 8kb of data, spill vector-rng state to the stack and subtract 1, go on. That way, we could avoid the very annoying doubling of memory traffic.

An even better design would figure out how to solve this very annoying problem, and then try to use non-temporal stores (no need to trash the entire L1d).

I think the former approach would likely be an improvement over what I'm currently doing for the random normal generator. It has to unroll 4x (instead of 2x) because of the high (compared to bit ops) latency of all the arithmetic operations. That leads to a lot more spilling (extra unrolling, plus needing registers to hold polynomial coefficients) than from the batched double-pass approach.

However, I'm wondering if the problem you observed is Broadwell specific, or if you can provide an example to reproduce it. I don't seem to see it on either Cascadelake-X or the very-similar-to-Broadwell Haswell.

Presumably. Unfortunately we can't afford to keep the vectorized rng alive :(
Thanks for the comments and benchmarks! You're right that we probably should decrease the threshold.

Why not? That probably accounts for almost all the performance difference.
The task would only need to hold a pointer, and that's still a lot less state than a MersenneTwister.

For avx512, I'm also using vpternlogq instead of and + or to mask bits into [1,2), but cutting out a single and/or is pretty negligible in comparison.

I must admit that I don't understand https://github.com/chriselrod/VectorizedRNG.jl/blob/865757a23905686abff1202e83d5fafccb3ae295/src/api.jl#L129

The function getstate loads the state from the rng-pointer. The function f returns an advanced state, as well as the generated random numbers (dependent on what we're generating, e.g. masked and trasnslated for uniform floats). Much of the other code normally gets compiled away.
In the typical case, out of z₁ * γ₁ + β₁, only z₁ doesn't get deleted (α defaults to Static{0}(), so dispatch normally hits this version).
All the loads are to allow optionally passing in vectors to load from or scalars to provide centering and scaling type transforms to the generated numbers in a single pass.

@chethega
Copy link
Contributor Author

chethega commented Aug 13, 2020

The task would only need to hold a pointer, and that's still a lot less state than a MersenneTwister.

Unfortunately not: Each task needs to hold a private (task-local) entire instance with all its state. Whether that memory is inside the task-struct or malloced somewhere on the heap doesn't make a difference for that. Thread-local RNG can afford a large internal state (you won't see more than 128 threads, typically) but affords no reproducibility. Task-local gives reproducibility for our composable multi-threading, but there can be thousands or millions of tasks. This PoC has similar or better perf than the current design for sequential generation, and for large bulks has the potential to reach almost parity with your (super impressive!) implementation (I.e. I fully intend to steal your tricks, or talk you into pushing onto this branch, until we get close). For small bulks, well, it's awkward, because we need to pay the cost for bootstrapping a xoshiro-simd instance from our tiny task-local xoshiro-sequential instance.

edit: Unless the core people are willing to pay an extra 8 x 8 x 4 = 256 bytes per task.... I don't think that would be a good tradeoff for the language. An extra 32 bytes is probably ok, though.

@vigna
Copy link

vigna commented Aug 13, 2020 via email

@chethega
Copy link
Contributor Author

So, after triage call today, it looks like this can go forward -- modulo implementation details reaching sufficient quality. Yay!

@vigna
About the seeding / splitting scheme: Is there a particular reason why you reject the proposed output-multiply variant?

The reason why I am so reluctant about your proposed scheme (that you use in dsiutils) is the following: I come more from the crypto-engineering world, where we think about how to compose primitives into modes of operation / protocols (not the symmetric cryptography world -- these people would be qualified to comment on the internals). There, people think in a pretty modular fashion; accessing internal state of an RNG is a big fat no-no. If you want to seed a new RNG from an old one, then you only access its output stream, and never its internal state.

The reason is that this is a composable action with provable security: If both stream-output RNGs are "secure", then the resulting pair of RNGs is "secure" (I am deliberately vague here). The point is not so much whether it is insecure / leads to bad randomness to access the internal state of the parent, but rather that it makes security proofs more difficult. The security proof is the following: If it is possible to distinguish the parent/child pair from true random without knowing the parent seed/key, then one can use this distinguisher to break the single stream-random, simply by performing the split, i.e. a proof-by-reduction. You see how this breaks once we start accessing internal state of the parent -- the attacker can't access that, and our proof breaks down.

Now xoshiro is a very specific RNG, and it is not intended to be cryptographically secure. Nevertheless, it hurts me to access internal bits -- it deprives people of a lot of tools to reason about the resulting construction.

Can we find something that receives your blessing and only uses the output stream of the parent?

Your blessing would be super nice, because I am not really qualified to analyse the internals of xoshiro. What we need to do here is only to break self-interactions -- i.e. bad effects that come from the fact that the 4 seed values are verbatim subsequent output-values of a xoshiro instance. The reason I chose multiplication with effectively random constants is that "integer multiply" is not used in the xoshiro iteration, and hence has a lower chance of bad interaction.

We have a handful of cycles to burn here. If you tell me that it is better to apply one round of spooky to 32 output bytes to get the 32 seed bytes, I'll happily write that (but spooky uses the same kind of primitives as xoshiro...). But I just don't want to read out the internal state :(

@vigna
Copy link

vigna commented Aug 14, 2020

For me the main problem is that you're almost using bits of state to seed the new state (multiplication is a weak scrambler). I'd use at least a MurmurHash mixing instead of a simple multiplication. For example, the lowest bit you're using for the state is exactly the lowest output bit, which is not good IMHO.


Apart from the high speed, Xoshiro has a small memory footprint, making it suitable for applications where many different random states need to be held for long time.

Julia's Xoshiro implementation has a bulk-generation mode; this seeds new virtual PRNGs from the parent, and uses SIMD to generate in parallel (i.e. the bulk stream consists of multiple interleaved xoshiro instances). The virtual PRNGs are discarded once the bulk request has been serviced (and should cause no heap allocations).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please insert newlines (here and elsewhere), this one doesn't even fit on a big screen! This is will make reviewing on Github easier.

end


@inline function rand(rng::Xoshiro, ::Type{UInt64})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@inline function rand(rng::Xoshiro, ::Type{UInt64})
@inline function rand(rng::Xoshiro, ::SamplerType{UInt64})

end


@inline function rand(::TaskLocal, ::Type{UInt64})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@inline function rand(::TaskLocal, ::Type{UInt64})
@inline function rand(::TaskLocal, ::SamplerType{UInt64})

end

#Shared implementation between Xoshiro and TaskLocal -- seeding
function seed!(rng::Union{TaskLocal, Xoshiro}, seed::Union{Int128, UInt128, BigInt})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I would prefer being more conservative on valid seeds here (like with MersenneTwister): with a big BigInt, high bits will simply be discarded, e.g. two seeds which differ by 2^130 lead to the same initialization state. Similarly, assuming s::Int128 < 0, then s and s % UInt128 lead to the same state, while they are represent distinct numbers.

So I suggest defining this method only for UInt128 (like MT does for Vector{UInt32}), and define other integer methods with something like seed!(rng, seed::Integer) = seed!(rng, UInt128(seed)) (two equal integers should lead to the same state).

rand(RandomDevice(), UInt64),
rand(RandomDevice(), UInt64))

function seed!(rng::Union{TaskLocal, Xoshiro}, seed::AbstractVector{UInt64})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An error should be thrown if length(seed) > 4.

@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{Int128}) = rand(rng, UInt128) % Int128
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::SamplerType{Int128}) = rand(rng, UInt128) % Int128

@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64}} = rand(rng, UInt64) % T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64}} = rand(rng, UInt64) % T

@inline rand(rng::Union{TaskLocal, Xoshiro}, ::SamplerType{Int128}) = rand(rng, UInt128) % Int128

@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64}} = rand(rng, UInt64) % T
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::SamplerType{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64, UInt64}} = rand(rng, UInt64) % T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the UInt64 from the Union as rand(rng, ::SamplerType{UInt64}) should be defined elsewhere as the "native" method.

@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64}} = rand(rng, UInt64) % T
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::SamplerType{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64, UInt64}} = rand(rng, UInt64) % T

Sampler(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where T<:Union{Float64, Float32} = SamplerType{T}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary, this definition is already the default.




function rand!(rng::Union{TaskLocal, Xoshiro}, dst::Union{Array{Float32}, UnsafeView{Float32}}, ::SamplerType{Float32} = SamplerType{Float32}())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ::SamplerTrivial{CloseOpen12{Float32}} instead of SamplerType, and the default value shouldn't be necessary, it should already be handled by the Random API.

end

function seed!(rng::Union{TaskLocal, Xoshiro}, seed::AbstractVector{UInt32})
#can only process an even length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't the last entry of an odd-lengthed vector be cast to an UInt64 ? I guess that could be implemented later, but no bits of the input should be discarded (i.e. better to error out on odd length).

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Sep 30, 2020

We discussed this on triage last month and decided to go ahead with it. The reasoning was the following:

  • Sharing a single global PRNG in the presence of non-deterministic parallelism means that the predictability is pretty much useless since another task may steal values from the sequence at any time. Or put another way, the PRNG sequence is predictable but since it's unpredictably split between different tasks, that's not really useful.
  • The only thing that PR really commits to in terms of API is that each task gets its own PRNG sequence that forks off from the parent: instead of sharing a single globals PRNG stream, each task gets its own PRNG stream.
  • This restores useful predictability in the presence of parallelism: as long as the sequence of sampling PRNG values and spawning child tasks is the same, each task will get the same sequence of PRNG values, which is enough to do predictable simulations.
  • This PR is proof that such a guarantee can be implemented efficiently—more efficiently, in fact, than having a shared PRNG since no locking is required to use task-local PRNGs. The main burden is keeping the PRNG state in the task, but since there are good, small modern PRNGs, that's not such a problem. If we had to do this with Mersenne-Twister, which has a huge state, it might be problematic.
  • We are still free to change PRNGs in the future or change how the branching of PRNGs happens during task forking. In particular we are not locked into xoshiro—it just happens to be a good, fast, modern small-state PRNG.
  • We may, in the future, want to have a general mechanism for saving user-defined state in the task which can be accessed this efficiently, at which point, this could be changed to use that mechanism, but for now, we can just hard-code it into the task implementation. Random number generation is sufficiently important that it's worth doing.

@StefanKarpinski
Copy link
Member

@chethega, are you willing to update this? If not, perhaps we can have someone else fix it up.

@oscardssmith oscardssmith removed the triage This should be discussed on a triage call label Jan 7, 2021
@tkf
Copy link
Member

tkf commented Jan 10, 2021

a general mechanism for saving user-defined state in the task which can be accessed this efficiently

I think there's a way to integrate something like this to context variable #35833 for supporting this. Cilk devs investigated splitter "hyperobject" which is quite similar.

@vtjnash
Copy link
Member

vtjnash commented Apr 21, 2021

Updated in #40546

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality randomness Random number generation and the Random stdlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.