Add SliceRandom::choose_multiple_weighted, implementing weighted sampling without replacement #976
Conversation
@@ -23,7 +23,7 @@ appveyor = { repository = "rust-random/rand" }
 [features]
 # Meta-features:
 default = ["std", "std_rng"]
-nightly = ["simd_support"] # enables all features requiring nightly rust
+nightly = ["simd_support", "partition_at_index"] # enables all features requiring nightly rust
A simpler approach might be to just use the nightly feature instead of introducing a new one. This would have the advantage that we don't have to worry about dropping the feature in the future without breaking code.
My reasoning here is that this allows leaving the feature flag in place even once "slice_partition_at_index" stabilizes, which would allow supporting older rustc versions. I also see your point about dropping the feature being a breaking change, though...
Fair enough. I think we will likely just raise the minimum Rust version. @dhardy What do you think?
In general, it is not possible to reliably support old nightlies, and not very useful either. So removing a feature flag only usable on nightly compilers once that feature has stabilised is not an issue.
The implementation looks good to me. I think the general docs in
@@ -50,6 +50,7 @@
 #![doc(test(attr(allow(unused_variables), deny(warnings))))]
 #![cfg_attr(not(feature = "std"), no_std)]
 #![cfg_attr(all(feature = "simd_support", feature = "nightly"), feature(stdsimd))]
+#![cfg_attr(all(feature = "partition_at_index", feature = "nightly"), feature(slice_partition_at_index))]
This however makes separate use of partition_at_index pointless — I think unlike simd_support we should not depend on nightly here. As a bonus, this makes it impossible to use partition_at_index on stable compilers.
I still don't understand what the advantage of a separate partition_at_index feature is.
-#[cfg(feature = "alloc")] use core::slice;
+#[cfg(feature = "alloc")]
+use core::slice;

 #[cfg(all(feature = "alloc", not(feature = "std")))]
 use crate::alloc::vec::{self, Vec};
-#[cfg(feature = "std")] use std::vec;
+#[cfg(feature = "std")]
+use std::vec;
 // BTreeMap is not as fast in tests, but better than nothing.
 #[cfg(all(feature = "alloc", not(feature = "std")))]
 use crate::alloc::collections::BTreeSet;
-#[cfg(feature = "std")] use std::collections::HashSet;
+#[cfg(feature = "std")]
+use std::collections::HashSet;
Leave the reformatting out please.
Thanks for the PR! The paper (the full one, not what you linked) seems a good source (though I don't follow the integral{CDF × PDF} step).
if length == 0 {
    return Ok(IndexVec::USize(Vec::new()));
}
Better, if amount == 0. And this should be pushed up to the top of the method.
fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {
    self.key
        .partial_cmp(&other.key)
        .or(Some(core::cmp::Ordering::Less))
}
Why include this line? To avoid panic in Ord::cmp? Does it have some benefit? Because it's technically wrong, and means NaN values could have arbitrary sort position (some sorting algs might not even terminate). Better just to allow a panic IMO.
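The objection can be demonstrated with a hypothetical standalone sketch (not rand's actual code): falling back to Less for an incomparable pair means a NaN key compares as "less" from both directions, so the ordering is not antisymmetric and a sort is free to place NaN keys anywhere.

```rust
use std::cmp::Ordering;

// Hypothetical reproduction of the fallback under discussion: treat an
// incomparable pair (i.e. a NaN key) as Ordering::Less.
fn cmp_with_fallback(a: f64, b: f64) -> Ordering {
    a.partial_cmp(&b).unwrap_or(Ordering::Less)
}

fn main() {
    let nan = f64::NAN;
    // Both directions claim "Less": antisymmetry is violated, so a
    // sorting algorithm may put a NaN key in an arbitrary position.
    assert_eq!(cmp_with_fallback(nan, 1.0), Ordering::Less);
    assert_eq!(cmp_with_fallback(1.0, nan), Ordering::Less);
    println!("NaN compares Less in both directions");
}
```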
let mut candidates = Vec::with_capacity(length);
for index in 0..length {
    let weight = weight(index).into();
    if weight < 0.0 || weight.is_nan() {
I prefer !(weight >= 0.0), though it is equivalent.
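The equivalence is easy to verify in isolation. This standalone sketch (illustrative only, not the PR's code) checks that !(w >= 0.0) rejects exactly the weights that w < 0.0 || w.is_nan() rejects, since every comparison involving NaN evaluates to false:

```rust
// Illustrative check: `!(w >= 0.0)` is true exactly when
// `w < 0.0 || w.is_nan()` is true, because all comparisons with NaN
// return false.
fn invalid_verbose(w: f64) -> bool {
    w < 0.0 || w.is_nan()
}

fn invalid_terse(w: f64) -> bool {
    !(w >= 0.0)
}

fn main() {
    let cases = [0.0, 1.5, -1.0, -0.0, f64::NAN, f64::INFINITY, f64::NEG_INFINITY];
    for &w in &cases {
        assert_eq!(invalid_verbose(w), invalid_terse(w));
    }
    println!("both forms agree on all cases");
}
```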
pub fn sample_weighted<R, F, X>(
    rng: &mut R, length: usize, weight: F, amount: usize,
) -> Result<IndexVec, WeightedError>
where
    R: Rng + ?Sized,
    F: Fn(usize) -> X,
    X: Into<f64>,
{
This implementation looks correct, but it may use a lot of memory without good cause. Suggestions:

- If we know length <= u32::MAX, then use u32 instead of usize. I guess in this case we can't since a million entries can be evaluated in under a tenth of a second. It may or may not be worth testing and implementing multiple paths. (IndexVec is like it is for performance reasons.)
- An implementation of the A-Res variant should compare well on performance I think (though still O(n)). An implementation of the A-ExpJ variant should be the best option for length >> amount. It may be nice having both and selecting which to use with a heuristic, or it may be that A-ExpJ is fast enough to always be a good choice.
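For reference, the A-Res variant mentioned above can be sketched as a reservoir holding the amount items with the largest keys u^(1/w). This is a hypothetical standalone sketch, not rand's API: rand01 is a stand-in closure for an Rng, and the linear minimum scan keeps it short (a binary heap would be the idiomatic choice for large amount).

```rust
// Sketch of A-Res (Efraimidis & Spirakis): keep the `amount` items with
// the largest keys u^(1/w), using O(amount) extra memory in one pass.
// `rand01` is a hypothetical stand-in for an Rng yielding uniforms in [0, 1).
fn sample_weighted_a_res(
    rand01: &mut dyn FnMut() -> f64,
    weights: &[f64],
    amount: usize,
) -> Vec<usize> {
    if amount == 0 {
        return Vec::new();
    }
    let mut reservoir: Vec<(f64, usize)> = Vec::with_capacity(amount); // (key, index)
    for (index, &w) in weights.iter().enumerate() {
        let key = rand01().powf(1.0 / w);
        if reservoir.len() < amount {
            reservoir.push((key, index));
        } else {
            // Find the current minimum key; replace it if the new key is larger.
            let mut min_pos = 0;
            for i in 1..reservoir.len() {
                if reservoir[i].0 < reservoir[min_pos].0 {
                    min_pos = i;
                }
            }
            if key > reservoir[min_pos].0 {
                reservoir[min_pos] = (key, index);
            }
        }
    }
    reservoir.into_iter().map(|(_, index)| index).collect()
}

fn main() {
    // Deterministic LCG as the uniform source, purely for illustration.
    let mut state: u64 = 0x9E3779B97F4A7C15;
    let mut rand01 = move || {
        state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (state >> 11) as f64 / (1u64 << 53) as f64
    };
    let picked = sample_weighted_a_res(&mut rand01, &[1.0, 2.0, 3.0, 4.0, 5.0], 3);
    assert_eq!(picked.len(), 3);
    assert!(picked.iter().all(|&i| i < 5));
    println!("sampled indices: {:?}", picked);
}
```

A-ExpJ improves on this by drawing an exponential "jump" to skip over runs of items that cannot enter the reservoir, which is what makes it attractive when length >> amount.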
This introduces two new methods, intended to close #596:

- SliceRandom::choose_multiple_weighted - similar to SliceRandom::choose_multiple, but without replacement and with weights.
- index::sample_weighted - an analogue to index::sample which implements the actual logic.

This notably includes two different implementations of sample_weighted - one which uses an unstable, nightly-only feature of the standard library, "slice_partition_at_index", and one which uses only stable functionality. By my measurements, the performance improvement of using the unstable feature with very large slices is extremely significant (see below). I'm not sure what the right thing to do here is - the options seem to be:

- ... rustc version requirements.
- ... rand, and it involves lots of use of unsafe.

Both implementations are based on Algorithm A described by Efraimidis and Spirakis in their paper on Weighted Random Sampling, available here. They differ in the way they implement step 2, "select the items with the largest keys".
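As an illustration of what step 2 looks like on the stable path (this is a hypothetical sketch, not the PR's code): the amount indices with the largest keys can be selected with a full sort, the O(n log n) step that slice_partition_at_index would reduce to O(n).

```rust
// Illustrative sketch (not the PR's code): step 2 of Algorithm A on
// stable Rust, selecting the `amount` indices with the largest keys via
// a descending sort. `partial_cmp(...).unwrap()` is safe here only if
// the keys are NaN-free, which upfront weight validation guarantees.
fn select_largest_keys(keys: &[f64], amount: usize) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    indices.sort_unstable_by(|&a, &b| keys[b].partial_cmp(&keys[a]).unwrap());
    indices.truncate(amount);
    indices
}

fn main() {
    let keys = [0.1, 0.9, 0.5, 0.7];
    // Indices 1 and 3 hold the two largest keys (0.9 and 0.7).
    assert_eq!(select_largest_keys(&keys, 2), vec![1, 3]);
    println!("largest-key indices: {:?}", select_largest_keys(&keys, 2));
}
```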
Finally, one last thing I'm uncertain about is the behavior the algorithm should take when given invalid weights - negative values and NaN. The approach I've taken so far is to add a check that the result of the user-specified weight function is valid before computing with it, but the alternative is simply documenting that the behavior of the algorithm is not specified if the weights are invalid.

Here are some benchmarking results taken on Windows on a Ryzen 1700X: