Consider Fast Loaded Dice Roller for `WeightedIndex` #1014

vks · 2020-08-05T21:31:13Z

It seems to outperform the alias method (and others), being close to the theoretical optimum of entropy use. According to the paper, it is faster than the alias method (if the latter is modified to produce exact samples), while using several orders of magnitudes less preprocessing time. Maybe it can replace our current algorithms in Rand?

There is a reference implementation in C and Python.

vks · 2020-08-06T22:57:26Z

I played a bit with the algorithm, but I could not make it perform as well as our alias-method implementation.

The algorithm consumes the random numbers bit-by-bit. On the one hand, this avoids throwing away random data, on the other hand, the approach does not buy much, because generating random bits is cheap with modern generators. I suspect the results in the paper are with a slow RNG like the Mersenne Twister.

There is also a similar Fast Dice Roller algorithm for uniform sampling, which would also save a lot of random bits for small ranges. However, in practice it seems to be much slower with a fast RNG.

peteroupc · 2020-08-07T01:37:04Z

Vks wrote:

I played a bit with the algorithm, but I could not make it perform as well as our alias-method implementation.

The algorithm consumes the random numbers bit-by-bit. On the one hand, this avoids throwing away random data, on the other hand, the approach does not buy much, because generating random bits is cheap with modern generators. I suspect the results in the paper are with a slow RNG like the Mersenne Twister.

There is also a similar Fast Dice Roller algorithm for uniform sampling, which would also save a lot of random bits for small ranges. However, in practice it seems to be much slower with a fast RNG.

@fsaad, can you please comment?

fsaad · 2020-08-07T16:23:46Z

Thanks for trying out FLDR.

I played a bit with the algorithm, but I could not make it perform as well as our alias-method implementation.

@vks Does your alias implementation use exact sampling or floating-point approximations of real numbers? The runtime comparison, both theoretical and practical, between FLDR and alias sampling is with respect to error-free sampling. If the alias implementation uses floating-point approximations of real numbers then there is not much that can analyzed in terms of performance with FLDR, since an inexact alias implementation typically trade-offs error with runtime, whereas FLDR is designed to always be error free.

If you have an error-free alias sampler, I would be really interested to explore its performance as compared to FLDR, and I would be quite surprised if FLDR is slower (especially on low to medium entropy distributions), both for preprocessing and sampling.

The algorithm consumes the random numbers bit-by-bit. On the one hand, this avoids throwing away random data, on the other hand, the approach does not buy much, because generating random bits is cheap with modern generators.

While the FLDR paper uses a simple (naive) serial implementation of the sampler (Alg 5), there are many other ways to implement that leverage alternate data structures, which include linear encodings of the DDG sampling tree (Algs 4--7 Saad et al. 2020), bit-slicing for constant time generation (Sec 3, Karmakar et al. 2018), and batch generating with bit recycling (Alg 3, Devroye & Gravel 2018), among many others.

The choice of implementation depends on the constraints of the application. For example, lazy DDG sampling is typically far more efficient as compared to floating-point arithmetic when implementing the sampler directly on hardware (e.g., an FPGA). While FLDR is characterized by its theoretical near-optimal entropy use, many of these techniques can be used in practice to push down the constant factors of runtime (or memory).

There is also a similar Fast Dice Roller algorithm for uniform sampling, which would also save a lot of random bits for small ranges. However, in practice it seems to be much slower with a fast RNG.

The FDR of Lumbroso 2013 is an exact uniform sampler. Unlike FLDR, the FDR is theoretically optimal (for uniform sampling), so it should be superior to any other exact uniform sampler, though I agree that non-exact samplers can be faster in practice.

Overall, I think it is very valuable to compare the performance of exact and approximate sampling, as your ticket may point toward. The status quo of most existing software libraries is inexact sampling with floating-point. FLDR is an initial step toward showing that exact sampling can be practical, both theoretically and for applications, for general discrete distributions. There is of course a lot more work to do to explore the trade-offs with inexact samplers.

dhardy · 2020-08-07T16:49:24Z

The code for our weighted alias implementation is here. IIRC it uses exact sampling, with the requirement that the sum of weights can be represented by the weight type. Weight type may be an integer or floating-point or a user-defined type; the latter implies that rounding can occur, depending on the user's choice of types.

In many benchmarks however I found the alias method a worse choice than the binary search method due to the set-up time, hence the latter is our default (and also exact if used with integer types).

vks · 2020-08-08T15:00:23Z

Does your alias implementation use exact sampling or floating-point approximations of real numbers?

I don't think we use floating-point when the weights are integers. I did not look into the details, but I doubt we use the alias method documented in the paper:

(vi) exact alias sampler (Walker, 1977), using entropy-optimal uniform and Bernoulli sampling (Lumbroso, 2013) and the one-table implementation (Vose, 1991).

In particular, we are not using the FDR for uniform sampling, which would be twice as slow.

Here are the benchmark results:

weighted_u32              :      12,472 ns/iter (+/- 308)
weighted_large_set        :      72,129 ns/iter (+/- 2,319)
weighted_new              :      39,283 ns/iter (+/- 876)

weighted_alias_u32        :      11,029 ns/iter (+/- 327)
weighted_alias_large_set  :      12,318 ns/iter (+/- 366)
weighted_alias_new        :     261,343 ns/iter (+/- 7,985)

weighted_fldr_i32         :      17,420 ns/iter (+/- 365)
weighted_fldr_large_set   :      29,980 ns/iter (+/- 671)
weighted_fldr_new         :     935,315 ns/iter (+/- 26,667)

I also implemented the FDR uniform sampling. It's a bit awkward to implement with our API, so I'm not sure it's optimized as well as it could be.

uniform     :       2,032 ns/iter (+/- 119) = 492 MB/s
uniform_fdr :       4,373 ns/iter (+/- 212) = 228 MB/s

The code is here.

peteroupc · 2020-08-08T15:30:59Z

For your information there are numerous data structures and algorithms for weighted sampling with replacement, besides the Fast Loaded Dice Roller, binary search, and the alias method, and I list a great number of them. Perhaps it would be a good idea to compare their performance.

fsaad · 2020-08-09T00:06:45Z

@vks Thanks a lot for the code. I'm not an expert in Rust, but it seems that there are many ways that this FLDR implementation can be optimized. One immediate thought is that, in your current implementation, the PRNG is always called once per sample, i.e.,

rand/rand_distr/src/weighted_fldr.rs

Line 85 in e2b4b12

let r = rng.next_u64();

In our reference implementation, we separately store a "global" buffer of 31 random bits (in C, rand() produces a positive 32 bit signed integer, so only 31 of the bits are random), which is the function "flip()" in the following file:

https://github.com/probcomp/fast-loaded-dice-roller/blob/master/src/c/flip.c

The main FLDR sampler then calls "flip()" to obtain a fresh bit:

https://github.com/probcomp/fast-loaded-dice-roller/blob/2fb32e94a04c744c86aa4cb54b3ee0259894e777/src/c/fldr.c#L93

which may or not may not involve a call to the PRNG, depending on whether the buffer is empty. This approach ensures that we do not spuriously generate random bits. If FLDR is implemented to use a separate buffer that is shared across calls to "sample", instead of calling the PRNG once per sample, then I do expect the runtime should improve in many regimes.

vks · 2020-08-10T13:05:29Z

@fsaad Thanks for having a look!

I'm not an expert in Rust, but it seems that there are many ways that this FLDR implementation can be optimized. One immediate thought is that, in your current implementation, the PRNG is always called once per sample, i.e.,

Fair enough, that is a very good point! Unfortunately, implementing this is a bit tricky, because our API requires sample to not modify the distribution object. There are the following solutions:

Add some API for sampling with modifying the distribution object.
Use Cell. This means that the distribution object can no longer be shared by threads.

Going with 2. seems to have an adverse effects on the performance for large sets, while improving it for small sets:

weighted_fldr_i32         :      12,733 ns/iter (+/- 53)
weighted_fldr_large_set   :      36,303 ns/iter (+/- 536)
weighted_fldr_new         :     911,047 ns/iter (+/- 17,077)

newpavlov · 2020-08-10T17:19:04Z

Personally I don't like option 2. Can we implement Distribution<T> only for &mut reference of a type?

dhardy · 2020-08-10T17:41:18Z

Distribution::sample is immutable with respect to the distribution for good reasons: (1) API simplicity, (2) basically all distributions have fast (enough) immutable implementations and (3) because it places fewer restrictions on usage.

It's going to take quite a good reason to convince me that we should change that. Especially since the benchmarks so far have the alias-method winning everywhere (especially in set-up, where I already thought the alias method was slow).

fsaad · 2020-08-10T18:32:52Z

Distribution::sample is immutable with respect to the distribution for good reasons

I agree. Does your API For Rng support efficient and lazy production of random bits one at a time, for example by storing an internal buffer (using a similar method to the "global" buffer approach)? That way, sample can leverage mutability with respect to rng.

the benchmarks so far have the alias-method winning everywhere (especially in set-up, where I already thought the alias method was slow).

Do the benchmarks use test distributions with varying entropy, i.e., for a distribution with n outcomes, entropies from 0 to log_2(n)?

The average runtime of sampling in FLDR is a direct function of the entropy. The runtime of preprocessing also varies with the values of probabilities in the test distribution. The point about set-up is surprising. Do you have a pointer to the benchmarks and the measurements of setup runtime?

dhardy · 2020-08-10T20:17:42Z

Does your API For Rng support efficient and lazy production of random bits one at a time

Good question. No. Theoretically it could, even using different approaches depending on the RNG, but we chose not to do that. For simpler PRNGs we use no buffer at all and for block-based PRNGs we use a block which we read off one word (32-bit) at a time.

Another factor to consider (for both mutable distribution and buffered RNGs) is reproducible behaviour. If we allowed out-of-order usage of random bits/bytes it becomes much harder to reason about and especially document the behaviour, thus making reproducibility hard. If we don't allow this, then every read has to either be compatible with misaligned buffers (slow) or check and clear up to the next alignment point (also slow). At this point we're no longer optimising one RNG+distribution pair but expected usage of the RNG across a number of distributions.

I suppose we could side-step that problem by having a special RNG optimised for bit- or byte-sized reads, though that's another step in complexity.

@vks which RNG did you use for these benchmarks?

vks · 2020-08-10T20:34:14Z

Another factor to consider (for both mutable distribution and buffered RNGs) is reproducible behaviour. If we allowed out-of-order usage of random bits/bytes it becomes much harder to reason about and especially document the behaviour, thus making reproducibility hard.

Aren't we having this problem already when mixing fill_bytes and next_u64?

I suppose we could side-step that problem by having a special RNG optimised for bit- or byte-sized reads, though that's another step in complexity.

In principle, we could have a wrapper type implementing the buffering. It can then be used as needed.

For example, we could just add a bit: u8 field to BlockRng and a gen_bool method that uses it.

@vks which RNG did you use for these benchmarks?

It's just our default benchmarks, they use Pcg64Mcg.

Do you have a pointer to the benchmarks and the measurements of setup runtime?

See here.

newpavlov · 2020-08-11T05:23:04Z

For example, we could just add a bit: u8 field to BlockRng and a gen_bool method that uses it.

I think we can use a single counter for used bits, which would be incremented by 8 for read byte, 32 for u32, etc. We would need to perform additional computations to correctly align read point (e.g. if index is equal to 9, we would have to align it to 16 for reading byte and to 64 for reading u32). It would degrade performance a bit, but I am not sure if it will be noticeable.

As an advantage this approach will be less wasteful (and thus more performant) for users using primarily fill_bytes. But it would change behavior of BlockRng between releases, which may create problems for users who rely on reproducibility.

riking · 2021-07-15T19:09:58Z

Re compromising performance on pure u64: I think you can reduce that cost to a single branch.

The algorithm consumes the random numbers bit-by-bit.
[...] reproducible behavior

First, we discard this as a design constraint when the application is making RNG dependent range requests. I do not believe any application should expect seeded RNG stability under a reordering of range request sizes.

Deterministic behavior for a seeded RNG under a deterministic request stream should be preserved.

(Note: it is theoretically possible for this to weaken the stream somehow, when considered as a complete system with the application. Probably a fun research paper or game speedrun investigation for someone.)

The "one branch" algorithm is then simple: for any "large enough" requests, bypass or discard the buffer.

Any requests for over half the source RNG bit length (>32.0 bits) should bypass the buffer, as they are unlikely to be followed by small enough requests. Note that 63 and 53 bits are common request lengths; there is little to no profit to be had in attempting to merge leftover bits there.

Any requests for more bits than are contained in the buffer (e.g. buffer has 2 bits remaining, request wants 6) should discard the remaining buffer instead of attempting to merge the leftovers.

elidupree · 2021-07-15T21:35:16Z

Hang on, why would we need to compromise performance at all for existing use cases? Requests for a full u64 would usually know at compile time that they want a full u64, so they could just make a method call that gets a random u64 from the backing RNG rather than asking for bits from the "spare bits buffer" at all.

I'm currently fiddling with some code that keeps just such a "spare bits buffer", and is built entirely as a wrapper around a source Rng, without having to change the Rng interface at all. For generating small uniform ranges, my wrapper around ChaChaRng is able to outperform using ChaChaRng directly. (On the other hand, it's hard to beat Pcg64Mcg. On the third hand, I haven't put a lot of work into optimizing it.)

dhardy · 2021-07-16T09:18:51Z

@elidupree see notes about reproducibility in this comment.

@riking I'll reply in #1147.

elidupree · 2021-07-16T12:09:11Z

I saw that comment, but I don't understand it. Let me ask some questions to clarify my understanding: If an Rng implementor is deterministic and platform-independent, and my code that keeps a bit buffer is a deterministic, platform-independent wrapper around the Rng, then how could the combination possibly be less reproducible than the Rng we started with? From the wording "out-of-order usage of random bits/bytes", my only guess about what you mean is that you're talking about getting different results if you switch the order between a request for bits and a request for a full u64, but that can't be right, because an Rng always gives different results if you switch the order between any two requests for numbers...

dhardy · 2021-07-16T14:06:23Z

Besides non-determinicity there are a few possible sources of non-portable results:

LE vs BE architectures when casting between [u8] and u32 etc.
library versions: an update may change results (we disallow this in patch releases)
std trig & exp functions use libm or equivalent, which is not identical across platforms (though this only affects some distributions, not RNGs)

But this is also about documenting behaviour...

because an Rng always gives different results if you switch the order between any two requests for numbers

Is this actually the case (for two different requests) given an RNG using a buffer for partially-consumed generated values? There is nothing in the RngCore doc requiring this, or whether inserting an extra request will influence future output.

elidupree · 2021-07-16T14:30:27Z

Besides non-determinicity there are a few possible sources of non-portable results:

Those are possible sources of non-portable results in general, but I still don't see how any of them would cause difficulties for implementing a gen_bits API. Endianness could be an issue if you implement it using casts, but my quick-and-dirty implementation isn't endianness-dependent (it just uses bit ops to extract the desired number of bits); I don't see how this is different from any other situation where we accept a small performance cost to guarantee platform-independence.

Is this actually the case (for two different requests) given an RNG using a buffer for partially-consumed generated values? There is nothing in the RngCore doc requiring this, or whether inserting an extra request will influence future output.

Sorry, my wording was imprecise. Of course it's not guaranteed that the values will be different (for one thing, they could be the same due to luck!) What I'm saying is that the Rng traits already do not provide any guarantee that results will be identical in any situation other than an identical sequence of requests (with identical order). I suppose implementors are permitted to provide additional guarantees (I've never heard of a trait that forbids implementors from providing additional guarantees), but is there any specific implementor that does?

dhardy · 2021-07-16T14:46:24Z

Sorry, what I wrote was besides the point. We have discussed in the past whether RNGs may discard unused bits/bytes (to which the conclusion was yes, though some attempts have been made not to do this, most recently #1031). I believe we also discussed whether an RNG may return bits out-of-order from when they are generated (using a buffer for some outputs but not others), but I don't recall and can't find the exact conclusion here. I thought we had decided against this (i.e. forcing clearing of buffered bits whenever generating fresh output), but do not see that documentation. (It's also not exactly "portability" and more "how to reproduce output given an RNG algorithm and sequence of requests".)

newpavlov · 2021-07-16T15:01:57Z

We have discussed in the past whether RNGs may discard unused bits/bytes (to which the conclusion was yes, though some attempts have been made not to do this, most recently #1031

Note that code in that PR also discards bits for performance reasons (we do not want to copy u64s with bit-level granularity). It will not discard bits only if you continuously use the gen_bool method. But if user calls next_u64 with bit-level counter equal to 65, it will get rounded to 128 (to be exact, to 2 with u64-level counter).

dhardy · 2024-07-20T12:00:45Z

We still do not support single-bit random value generation (see #1261 for related proposals), and I don't think it's very likely that we will. We also do not really have an interest in supporting mutable "distribution" objects. Therefore I think it's unlikely we will pursue the suggested FLDR algorithm.

This was referenced Aug 11, 2020

Bump MSRV to 1.36 #1011

Merged

Unify buffering logic between chacha20, salsa20, and ctr RustCrypto/stream-ciphers#95

Closed

vks mentioned this issue Aug 26, 2020

Tracker: 0.8 release #889

Closed

19 tasks

newpavlov mentioned this issue Aug 29, 2020

Add next_bool method to RngCore and counter levels to BlockRng #1031

Closed

vks mentioned this issue Oct 8, 2020

Partially-sampled random numbers #1055

Closed

vks mentioned this issue Oct 27, 2020

Add geometric and hypergeometric distributions #1062

Merged

vks mentioned this issue Jul 15, 2021

Rng::gen_range() is slowest with power-of-2 input sizes? #1145

Closed

dhardy mentioned this issue Jul 16, 2021

Meta: add bit / bit-range sampling to RngCore #1147

Closed

dhardy added A-new Propose usage of a new algorithm X-stale Outdated or abandoned work labels Jul 20, 2024

dhardy closed this as completed Jul 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider Fast Loaded Dice Roller for `WeightedIndex` #1014

Consider Fast Loaded Dice Roller for `WeightedIndex` #1014

vks commented Aug 5, 2020 •

edited

Loading

vks commented Aug 6, 2020

peteroupc commented Aug 7, 2020

fsaad commented Aug 7, 2020

dhardy commented Aug 7, 2020

vks commented Aug 8, 2020

peteroupc commented Aug 8, 2020 •

edited

Loading

fsaad commented Aug 9, 2020

vks commented Aug 10, 2020 •

edited

Loading

newpavlov commented Aug 10, 2020

dhardy commented Aug 10, 2020

fsaad commented Aug 10, 2020 •

edited

Loading

dhardy commented Aug 10, 2020

vks commented Aug 10, 2020 •

edited

Loading

newpavlov commented Aug 11, 2020

riking commented Jul 15, 2021

elidupree commented Jul 15, 2021

dhardy commented Jul 16, 2021

elidupree commented Jul 16, 2021

dhardy commented Jul 16, 2021

elidupree commented Jul 16, 2021

dhardy commented Jul 16, 2021

newpavlov commented Jul 16, 2021

dhardy commented Jul 20, 2024

Consider Fast Loaded Dice Roller for WeightedIndex #1014

Consider Fast Loaded Dice Roller for WeightedIndex #1014

Comments

vks commented Aug 5, 2020 • edited Loading

vks commented Aug 6, 2020

peteroupc commented Aug 7, 2020

fsaad commented Aug 7, 2020

dhardy commented Aug 7, 2020

vks commented Aug 8, 2020

peteroupc commented Aug 8, 2020 • edited Loading

fsaad commented Aug 9, 2020

vks commented Aug 10, 2020 • edited Loading

newpavlov commented Aug 10, 2020

dhardy commented Aug 10, 2020

fsaad commented Aug 10, 2020 • edited Loading

dhardy commented Aug 10, 2020

vks commented Aug 10, 2020 • edited Loading

newpavlov commented Aug 11, 2020

riking commented Jul 15, 2021

elidupree commented Jul 15, 2021

dhardy commented Jul 16, 2021

elidupree commented Jul 16, 2021

dhardy commented Jul 16, 2021

elidupree commented Jul 16, 2021

dhardy commented Jul 16, 2021

newpavlov commented Jul 16, 2021

dhardy commented Jul 20, 2024

Consider Fast Loaded Dice Roller for `WeightedIndex` #1014

Consider Fast Loaded Dice Roller for `WeightedIndex` #1014

vks commented Aug 5, 2020 •

edited

Loading

peteroupc commented Aug 8, 2020 •

edited

Loading

vks commented Aug 10, 2020 •

edited

Loading

fsaad commented Aug 10, 2020 •

edited

Loading

vks commented Aug 10, 2020 •

edited

Loading