From 77c06727559c3dbb87e5fde46cec87ce59ba61c8 Mon Sep 17 00:00:00 2001 From: Stefan Karpinski Date: Wed, 28 Feb 2024 18:06:12 -0500 Subject: [PATCH] task splitting: change additive accumulation to multiplicative (#53408) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Issue raised [here](https://discourse.julialang.org/t/linear-relationship-between-xoshiro-tasks/110454). Given this definition, look at output values: ```jl using .Threads macro fetch(ex) :(fetch(@spawn($(esc(ex))))) end function taskCorrelatedXoshiros(+ = +) r11, r10 = @fetch (@fetch(rand(UInt64)), rand(UInt64)) r01, r00 = (@fetch(rand(UInt64)), rand(UInt64)) r01 + r10 - (r00 + r11) end ``` Before: ```jl julia> sort!(unique(taskCorrelatedXoshiros() for _ = 1:1000)) 9-element Vector{UInt64}: 0x0000000000000000 0x0000000000000001 0x00000000007fffff 0x0000000000800000 0x0000000000800001 0xffffffffff7fffff 0xffffffffff800000 0xffffffffff800001 0xffffffffffffffff ``` After: ```jl julia> sort!(unique(taskCorrelatedXoshiros() for _ = 1:1000)) 1000-element Vector{UInt64}: 0x000420647a085960 0x0038c5b889b585c6 0x007b29fae8107ff7 0x00e73b9e883ac1c8 ⋮ 0xfe68be9c0dde1e88 0xfeca042354218c35 0xfeeb8203e470c96b 0xfff5dbb8771307b9 julia> sort!(unique(taskCorrelatedXoshiros(*) for _ = 1:1000)) 1000-element Vector{UInt64}: 0x00126f951c1e56dc 0x0025a82477ffac08 0x002dd82c9986457a 0x00a713c4d56a3dbc ⋮ 0xfe2e40a5b345095e 0xfe77b90881967436 0xfea2559be63f1701 0xff88b5b28cefac5f ``` --- src/task.c | 341 ++++++++++++++++++++++------------------------------- 1 file changed, 143 insertions(+), 198 deletions(-) diff --git a/src/task.c b/src/task.c index a1e8dabb89f9d..7c8770530c326 100644 --- a/src/task.c +++ b/src/task.c @@ -856,202 +856,143 @@ guaranteed to avoid collisions between the RNG streams of all tasks. The main RNG is the xoshiro256++ RNG whose state is stored in rngState[0..3]. There is also a small internal RNG used for task forking stored in rngState[4]. This state is used to iterate a linear congruential generator (LCG), which is then -put through four different variations of the strongest PCG output function, -referred to as PCG-RXS-M-XS-64 [1]. This output function is invertible: it maps -a 64-bit state to 64-bit output. This is one of the reasons it's not recommended -for general purpose RNGs unless space is at an absolute premium, but in our -usage invertibility is actually a benefit (as is explained below) and adding as -little additional memory overhead to each task object as possible is preferred. +combined with xoshiro256's state and put through four different variations of +the strongest PCG output function, referred to as PCG-RXS-M-XS-64 [1]. The goal of jl_rng_split is to perturb the state of each child task's RNG in -such a way each that for an entire tree of tasks spawned starting with a given -state in a root task, no two tasks have the same RNG state. Moreover, we want to -do this in a way that is deterministic and repeatable based on (1) the root -task's seed, (2) how many random numbers are generated, and (3) the task tree -structure. The RNG state of a parent task is allowed to affect the initial RNG -state of a child task, but the mere fact that a child was spawned should not -alter the RNG output of the parent. This second requirement rules out using the -main RNG to seed children: if we use the main RNG, we either advance it, which -affects the parent's RNG stream or, if we don't advance it, then every child -would have an identical RNG stream. Therefore some separate state must be -maintained and changed upon forking a child task while leaving the main RNG -state unchanged. - -The basic approach is that used by the DotMix [2] and SplitMix [3] RNG systems: -each task is uniquely identified by a sequence of "pedigree" numbers, indicating -where in the task tree it was spawned. This vector of pedigree coordinates is -then reduced to a single value by computing a dot product with a shared vector -of random weights. The weights are common but each pedigree of each task is -distinct, so the dot product of each task is unlikely to be the same. The DotMix -paper provides a proof that this dot product hash value (referred to as a -"compression function") is collision resistant in the sense that the pairwise -collision probability of two distinct tasks is 1/N where N is the number of -possible weight values. Both DotMix and SplitMix use a prime value of N because -the proof requires that the difference between two distinct pedigree coordinates -have a multiplicative inverse, which is guaranteed by N being prime since all -values are invertible then. We take a somewhat different approach: instead of -assigning n-ary pedigree coordinates, we assign binary tree coordinates to -tasks, which means that our pedigree vectors have only 0/1 and differences -between them can only be -1, 0 or 1. Since the only possible non-zero coordinate -differences are ±1 which are invertible regardless of the modulus, we can use a -modulus of 2^64, which is far easier and more efficient then using a prime -modulus. It also means that when accumulating the dot product incrementally, as -described in SplitMix, we don't need to multiply weights by anything, we simply -add the random weight for the current task tree depth to the parent's dot -product to derive the child's dot product. - -we instead limit pedigree coordinates to being binary, guaranteeing -invertibility regardless of modulus. When a task spawns a child, the parent and -child share the parent's previous pedigree prefix and the parent appends a zero -to its coordinates, which doesn't affect the task's dot product value, while the -child appends a one, which does produce a new dot product. In this manner a -binary pedigree vector uniquely identifies each task and since the coordinates -are binary, the difference between coordinates is always invertible: 1 and -1 -are their own multiplicative inverses regardless of the modulus. - -How does our assignment of pedigree coordinates to tasks differ from DotMix and -SplitMix? In DotMix and SplitMix, each task has a fixed pedigree vector that -never changes. The root tasks's pedigree is `()`, its first child's pedigree is -`(0,)`, its second child's pedigree is `(2,)` and so on. The length of a task's -pedigree tuple corresponds to how many ancestors tasks it has. Our approach -instead appends 0 to the parent's pedigree when it forks a child and appends 1 -to the child's pedigree at the same time. The root task starts with a pedigree -of `()` as before, but when it spawns a child, we update its pedigree to `(0,)` -and give its child a pedigree of `(1,)`. When the root task then spawns a second -child, we update its pedigree to `(0,0)` and give it's second child a pedigree -of `(0,1)`. If the first child spawns a grandchild, the child's pedigree is -changed from `(1,)` to `(1,0)` and the grandchild is assigned a pedigree of -`(1,1)`. In other words, DotMix and SplitMix build an n-ary tree where every -node is a task: parent nodes are higher up the tree and child tasks are children -in the pedigree tree. Our approach is to build a binary tree where only leaves -are tasks and each task spawn replaces a leaf in the tree with two leaves: the -parent moves to the left/zero leaf while the child is the right/one leaf. Since -the tree is binary, the pedigree coordinates are binary. - -It may seem odd for a task's pedigree coordinates to change, but note that we -only ever append zeros to a task's pedigree, which does not change its dot -product. So while the pedigree changes, the dot product is fixed. What purpose -does appending zeros like this serve if the task's dot product doesn't change? -Changing the pedigree length (which is also the binary tree depth) ensures that -the next child spawned by that task will have new and different dot product from -the previous child since it will have a different pseudo-random weight added to -the parent's dot product value. Whereas the pedigree length in DotMix and -SplitMix is unchanging and corresponds to how many ancestors a task has, in our -scheme the pedigree length corresponds to the number of ancestors *plus* -children a task has, which increases every time it spawns another child. - -We use the LCG in rngState[4] to generate pseudorandom weights for the dot -product. Each time a child is forked, we update the LCG in both parent and child -tasks. In the parent, that's all we have to do -- the main RNG state remains -unchanged. (Recall that spawning a child should *not* affect subsequent RNG -draws in the parent). The next time the parent forks a child, the dot product -weight used will be different, corresponding to being a level deeper in the -pedigree tree. In the child, we use the LCG state to generate four pseudorandom -64-bit weights (more below) and add each weight to one of the xoshiro256 state -registers, rngState[0..3]. If we assume the main RNG remains unused in all -tasks, then each register rngState[0..3] accumulates a different dot product -hash as additional child tasks are spawned. Each one is collision resistant with -a pairwise collision chance of only 1/2^64. Assuming that the four pseudorandom -64-bit weight streams are sufficiently independent, the pairwise collision -probability for distinct tasks is 1/2^256. If we somehow managed to spawn a -trillion tasks, the probability of a collision would be on the order of 1/10^54. -In other words, practically impossible. Put another way, this is the same as the -probability of two SHA256 hash values accidentally colliding, which we generally -consider so unlikely as not to be worth worrying about. - -What about the random "junk" that's in the xoshiro256 state registers from -normal use of the RNG? For a tree of tasks spawned with no intervening samples -taken from the main RNG, all tasks start with the same junk which doesn't affect -the chance of collision. The Dot/SplitMix papers even suggest adding a random -base value to the dot product, so we can consider whatever happens to be in the -xoshiro256 registers to be that. What if the main RNG gets used between task -forks? In that case, the initial state registers will be different. The DotMix -collision resistance proof doesn't apply without modification, but we can -generalize the setup by adding a different base constant to each compression -function and observe that we still have a 1/N chance of the weight value -matching that exact difference. This proves collision resistance even between -tasks whose dot product hashes are computed with arbitrary offsets. We can -conclude that this scheme provides collision resistance even in the face of -different starting states of the main RNG. Does this seem too good to be true? -Perhaps another way of thinking about it will help. Suppose we seeded each task -completely randomly. Then there would also be a 1/2^256 chance of collision, -just as the DotMix proof gives. Essentially what the proof is telling us is that -if the weights are chosen uniformly and uncorrelated with the rest of the -compression function, then the dot product construction is a good enough way to -pseudorandomly seed each task based on its parent's RNG state and where in the -task tree it lives. From that perspective, all we need to believe is that the -dot product construction is random enough (assuming the weights are), and it -becomes easier to believe that adding an arbitrary constant to each dot product -value doesn't make its randomness any worse. - -This leaves us with the question of how to generate four pseudorandom weights to -add to the rngState[0..3] registers at each depth of the task tree. The scheme -used here is that a single 64-bit LCG state is iterated in both parent and child -at each task fork, and four different variations of the PCG-RXS-M-XS-64 output -function are applied to that state to generate four different pseudorandom -weights. Another obvious way to generate four weights would be to iterate the -LCG four times per task split. There are two main reasons we've chosen to use -four output variants instead: - -1. Advancing four times per fork reduces the set of possible weights that each - register can be perturbed by from 2^64 to 2^60. Since collision resistance is - proportional to the number of possible weight values, that would reduce - collision resistance. While it would still be strong engough, why reduce it? - -2. It's easier to compute four PCG output variants in parallel. Iterating the - LCG is inherently sequential. PCG variants can be computed independently. All - four can even be computed at once with SIMD vector instructions. The C - compiler doesn't currently choose to do that transformation, but it could. - -A key question is whether the approach of using four variations of PCG-RXS-M-XS -is sufficiently random both within and between streams to provide the collision -resistance we expect. We obviously can't test that with 256 bits, but we have -tested it with a reduced state analogue using four PCG-RXS-M-XS-8 output -variations applied to a common 8-bit LCG. Test results do indicate sufficient -independence: a single register has collisions at 2^5 while four registers only -start having collisions at 2^20. This is actually better scaling of collision -resistance than we theoretically expect. In theory, with one byte of resistance -we have a 50% chance of some collision at 20 tasks, which matches what we see, -but four bytes should give a 50% chance of collision at 2^17 tasks and our -reduced size analogue construction remains collision free at 2^19 tasks. This -may be due to the next observation, which is that the way we generate -pseudorandom weights actually guarantees collision avoidance in many common -situations rather than merely providing collision resistance and thus is better -than true randomness. - -In the specific case where a parent task spawns a sequence of child tasks with -no intervening usage of its main RNG, the parent and child tasks are actually -_guaranteed_ to have different RNG states. This is true because the four PCG -streams each produce every possible 2^64 bit output exactly once in the full -2^64 period of the LCG generator. This is considered a weakness of PCG-RXS-M-XS -when used as a general purpose RNG, but is quite beneficial in this application. -Since each of up to 2^64 children will be perturbed by different weights, they -cannot have hash collisions. What about parent colliding with child? That can -only happen if all four main RNG registers are perturbed by exactly zero. This -seems unlikely, but could it occur? Consider the core of the output function: - - p ^= p >> ((p >> 59) + 5); - p *= m[i]; - p ^= p >> 43 - -It's easy to check that this maps zero to zero. An unchanged parent RNG can only -happen if all four `p` values are zero at the end of this, which implies that -they were all zero at the beginning. However, that is impossible since the four -`p` values differ from `x` by different additive constants, so they cannot all -be zero. Stated more generally, this non-collision property: assuming the main -RNG isn't used between task forks, sibling and parent tasks cannot have RNG -collisions. If the task tree structure is more deeply nested or if there are -intervening uses of the main RNG, we're back to relying on "merely" 256 bits of -collision resistance, but it's nice to know that in what is likely the most -common case, RNG collisions are actually impossible. This fact may also explain -better-than-theoretical collision resistance observed in our experiment with a -reduced size analogue of our hashing system. +such a way that for an entire tree of tasks spawned starting with a given root +task state, no two tasks have the same RNG state. Moreover, we want to do this +in a way that is deterministic and repeatable based on (1) the root task's seed, +(2) how many random numbers are generated, and (3) the task tree structure. The +RNG state of a parent task is allowed to affect the initial RNG state of a child +task, but the mere fact that a child was spawned should not alter the RNG output +of the parent. This second requirement rules out using the main RNG to seed +children: if we use the main RNG, we either advance it, which affects the +parent's RNG stream or, if we don't advance it, then every child would have an +identical RNG stream. Therefore some separate state must be maintained and +changed upon forking a child task while leaving the main RNG state unchanged. + +The basic approach is a generalization and simplification of that used in the +DotMix [2] and SplitMix [3] RNG systems: each task is uniquely identified by a +sequence of "pedigree" numbers, indicating where in the task tree it was +spawned. This vector of pedigree coordinates is then reduced to a single value +by computing a "dot product" with a shared vector of random weights. I write +"dot product" in quotes because what we use is not an actual dot product. The +linear dot product construction used in both DotMix and SplitMix was found by +@foobar_iv2 [4] to allow easy construction of linear relationships between the +main RNG states of tasks, which was in turn reflected in observable linear +relationships between the outputs of their RNGs. This relationship was between a +minimum of four tasks, so doesn't constitute a collision, per se, but is clearly +undesirable and highlights a hazard of the plain dot product construction. + +As in DotMix and SplitMix, each task is assigned unique task "pedigree" +coordinates. Our pedigree construction is a bit different and uses only binary +coordinates rather than arbitrary integers. Each pedigree is an infinite +sequence of ones and zeros with only finitely many ones. Each task has a "fork +index": the root task has index 0; the fork index of a task the jth child task +of a parent task with fork index i is i+j. The root task's coordinates are all +zeros; each child task's coordinates are the same as its parents except at its +fork index, where the parent has a zero while the child has a one. A task's +coordinates after its fork index are all zeros. The coordinates of a tasks +ancestors are all prefixes of its own coordinates, padded with zeros. + +Also as in DotMix and SplitMix, we generate a sequence of pseudorandom weights +to combine with the coordinates of each task. This sequence is common across all +tasks, and different mix values for each task derive from their coordinates +being different. In DotMix and SplitMix, this is a literal dot product: the +pseudorandom weights are multiplied by corresponding task coordinate and added +up. While this does provably make collisions as unlikely as randomly assigned +task seeds, this linear construction can be used to create linearly correlated +states between tasks. However, it turns out that the compression construction +need not be linear, commutative, associative, etc. which allows us to avoid any +linear or other obvious correlations between related sets of tasks. + +To generalize SplitMix's optimized dot product construction, we similarly +compute each task's compression function value incrementally by combining the +parent's compression value with pseudorandom weight corresponding with the +child's fork index. Formally, if the parent's compression value is c then we can +compute the child's compression value as c′ = f(c, wᵢ) where w is the vector of +pseudorandom weights. What is f? It can be any function that is bijective in +each argument for all values of the other argument: + + * For all c: w ↦ f(c, w) is bijective + * For all w: c ↦ f(c, w) is bijective + +The proof that these requirements are sufficient to ensure collision resistance +is in the linked discussion [4]. DotMix/SplitMix are a special case where f is +just addition. Instead we use a much less simple mixing function: + + 1. We use (2c+1)(2w+1)÷2 % 2^64 to mix the bits of c and w + 2. We then apply the PCG-RXS-M-XS-64 output function + +The first step thoroughly mixes the bits of the previous compression value and +the pseudorandom weight value using multiplication, which is non-commutative +with xoshiro's operations (xor, shift, rotate). This mixing function is a +bijection on each argument witnessed by these inverses: + + * c′ ↦ (2c′+1)(2w+1)⁻¹÷2 % 2^64 + * w′ ↦ (2c+1)⁻¹(2w′+1)÷2 % 2^64 + +The second PCG output step is a bijection and designed to be significantly +non-linear -- non-linear enough to mask the linearity of the LCG that drives the +PCG-RXS-M-XS-64 RNG and allows it to pass statistical RNG test suites despite +having the same size state and output. In particular, since this mixing function +is highly non-associative and non-linear, we (hopefully) don't have any +discernible relationship between these values: + + * c₀₀ = c + * c₁₀ = f(c, wᵢ) + * c₀₁ = f(c, wⱼ) + * c₁₁ = f(f(c, wᵢ), wⱼ) + +When f is simply `+` then these have a very obvious relationship: + + c₀₀ + c₁₁ == c₁₀ + c₀₁ + +This relationship holds regardless of what wᵢ and wⱼ are and is precisely what +allows easy creation of correlated tasks with the DotMix/SplitMix construction +that we previously used. Expressing any relationship between these values with +our mixing function would require inverting the PCG output function (doable but +non-trivial), knowing the weights wᵢ and wⱼ, and then applying the inversion +functions for those weights appropriately. Since the weights are pseudo-randomly +generated and not directly observable, this is infeasible. + +We maintain an LCG in rngState[4] to generate pseudorandom weights. An LCG by +itself is a very bad RNG, but we combine this one with xoshiro256 state +registers in a non-trivial way and then apply the PCG-RXS-M-XS-64 output +function to that. Even if the xoshiro256 states are all zeros, which they should +never be, the output would be the same as PCG-RXS-M-XS-64, which is a solid +statistical RNG. + +Each time a child is forked, we update the LCG in both parent and child tasks, +corresponding to increasing the fork index. In the parent, that's all we have to +do -- the main RNG state remains unchanged. Recall that spawning a child should +*not* affect subsequent RNG draws in the parent. The next time the parent forks +a child, the mixing weight used will be different. In the child, we use the LCG +state to perturb the child's main RNG state registers, rngState[0..3]. + +Since we want these registers to behave independently, we use four different +variations on f to mix the LCG state with each of the four main RNG registers. +Each variation first xors the LCG state with a different random constant before +combining that value above with the old register state via multiplication; the +PCG-RXS-M-XS-64 output function is then applied to that mixed state, with a +different multiplier constant for each variation / register index. Xor is used +in the first step since we multiply the result with the state immediately after +and multiplication distributes over `+` and commutes with `*`, which makes both +options suspect; multiplication doesn't distribute over or commute with xor. We +also use a different odd multiplier in PCG-RXS-M-XS-64 for each RNG register. +These three sources of variation (different xor constants, different xoshiro256 +state, different PCG multipliers) are sufficient for each of the four outputs to +behave statistically independently. [1]: https://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf [2]: http://supertech.csail.mit.edu/papers/dprng.pdf [3]: https://gee.cs.oswego.edu/dl/papers/oopsla14.pdf + +[4]: +https://discourse.julialang.org/t/linear-relationship-between-xoshiro-tasks/110454 */ void jl_rng_split(uint64_t dst[JL_RNG_SIZE], uint64_t src[JL_RNG_SIZE]) JL_NOTSAFEPOINT { @@ -1060,26 +1001,30 @@ void jl_rng_split(uint64_t dst[JL_RNG_SIZE], uint64_t src[JL_RNG_SIZE]) JL_NOTSA src[4] = dst[4] = x * 0xd1342543de82ef95 + 1; // high spectrum multiplier from https://arxiv.org/abs/2001.05304 + // random xor constants static const uint64_t a[4] = { - 0xe5f8fa077b92a8a8, // random additive offsets... - 0x7a0cd918958c124d, - 0x86222f7d388588d4, - 0xd30cbd35f2b64f52 + 0x214c146c88e47cb7, + 0xa66d8cc21285aafa, + 0x68c7ef2d7b1a54d4, + 0xb053a7d7aa238c61 }; + // random odd multipliers static const uint64_t m[4] = { 0xaef17502108ef2d9, // standard PCG multiplier - 0xf34026eeb86766af, // random odd multipliers... + 0xf34026eeb86766af, 0x38fd70ad58dd9fbb, 0x6677f9b93ab0c04d }; // PCG-RXS-M-XS-64 output with four variants for (int i = 0; i < 4; i++) { - uint64_t p = x + a[i]; - p ^= p >> ((p >> 59) + 5); - p *= m[i]; - p ^= p >> 43; - dst[i] = src[i] + p; // SplitMix dot product + uint64_t c = src[i]; + uint64_t w = x ^ a[i]; + c += w*(2*c + 1); // c = (2c+1)(2w+1)÷2 % 2^64 (double bijection) + c ^= c >> ((c >> 59) + 5); + c *= m[i]; + c ^= c >> 43; + dst[i] = c; } }