Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rand vs. sample #6003

Closed
StefanKarpinski opened this issue Mar 1, 2014 · 27 comments
Closed

rand vs. sample #6003

StefanKarpinski opened this issue Mar 1, 2014 · 27 comments
Assignees
Labels
needs decision A decision on this change is needed randomness Random number generation and the Random stdlib
Milestone

Comments

@StefanKarpinski
Copy link
Member

Base Julia provides the rand and rand! functions, which do random sampling over various domains. The StatsBase package provides sample and sample! with partially overlapping functionality. There have been various threads (e.g. this one) where people are discussion additional functionality for one or the other. I'm opening this issue to discuss whether these interfaces should be separate or if they should be merged and if they are to remain separate, to clarify the distinction between rand and sample.

@JeffBezanson
Copy link
Member

Looking at the docs, I don't think I see a case where the functions couldn't be combined.

@StefanKarpinski
Copy link
Member Author

Among others, there's this:

julia> using StatsBase

julia> rand(1:3)
1

julia> rand([1:3])
ERROR: no method rand(Array{Int64,1})

julia> sample(1:3)
3

julia> sample([1:3])
3

sample(a) where a is an array could just be a method of rand and since we already support rand(3:100) and a range is just an efficiently stored vector, there is precedent for this.

@StefanKarpinski
Copy link
Member Author

Ah, I misread your comment as "could" rather than "couldn't". So you're in agreement :-)

@StefanKarpinski
Copy link
Member Author

I wonder if the idea of a weighted collection shouldn't be generalized. There's a lot of overlap with multisets and dicts where the values are weights. The features that don't appear in Base's rand function are weights, replacement and ordering. It would be nice if we could arrange things so that there doesn't need to be a separate function just to add those features. We can either backports support for those things into Base or figure out a way to layer support onto the base rand functions in a package.

@andreasnoack
Copy link
Member

cc: @lindahua

@lindahua
Copy link
Contributor

lindahua commented Mar 1, 2014

I am all for migrating the sample function to base, and merging it with rand. The use of a separate function name in StatsBase is just a stop gap.

cc: @johnmyleswhite

@johnmyleswhite
Copy link
Member

I wholeheartedly agree. I would love to see sample in Base.

@porterjamesj
Copy link
Contributor

+1000000 for sample in Base.

@StefanKarpinski
Copy link
Member Author

Well, it's more the other direction we're proposing: using rand in stats to sample things.

@johnmyleswhite
Copy link
Member

I thought the idea was to merge the names rand and sample.

@StefanKarpinski
Copy link
Member Author

Yes, that's the idea. It remains to be seen which name will be left at the end.

@johnmyleswhite
Copy link
Member

Presumably you're going to keep randn, right? That seems sufficient reason to keep rand over sample?

@jiahao
Copy link
Member

jiahao commented Mar 4, 2014

I was thinking about that. Samplen is pretty odd.

@lindahua
Copy link
Contributor

lindahua commented Mar 4, 2014

Agree to use rand in the place of sample.

@StefanKarpinski
Copy link
Member Author

Ok, so the naming issue is decided. It remains to determine what needs to be done to merge them.

@lindahua
Copy link
Contributor

lindahua commented Mar 4, 2014

If you may wait a little bit, I can make a PR for this after the deadline in this weekend. I am also fine if someone else can take a lead to do that.

@StefanKarpinski
Copy link
Member Author

No hurry on this. It's not going into 0.3.

@lindahua
Copy link
Contributor

I have now get to the point to tackle this issue. I am proposing the API below:

The main idea is to reuse rand for sampling from population:

# randomly draw one sample from a
rand(a::AbstractArray) 

# randomly draw multiple samples from a, with or without replacement
# this subsumes the rand(a::Range, n) methods
rand(a::AbstractArray, n::Int; replace::Bool=true)
rand(a::AbstractArray, dim::Dims; replace::Bool=true)

The rand function will be extended in StatsBase to accept a weight-vector for weighted sampling.

With this implemented, the sample function in StatsBase will then be deprecated in favor of rand.

@JeffBezanson
Copy link
Member

Sounds good. There's a slight potential for confusion with rand((m,n)), which constructs a random array instead of sampling from the tuple. But this doesn't seem like a huge problem.

rfourquet added a commit to rfourquet/julia that referenced this issue Sep 11, 2014
This implements the generalization suggested by @ivarne
(JuliaLang#8257 (comment))
or by @lindahua
(JuliaLang#6003 (comment)).
This change is very simple thanks to commit 48f27bc.
rfourquet added a commit to rfourquet/julia that referenced this issue Sep 11, 2014
This implements the generalization suggested by @ivarne
(JuliaLang#8257 (comment))
or by @lindahua
(JuliaLang#6003 (comment)).
This change is very simple thanks to commit 48f27bc.
rfourquet added a commit to rfourquet/julia that referenced this issue Sep 17, 2014
This implements the generalization suggested by @ivarne
(JuliaLang#8257 (comment))
or by @lindahua
(JuliaLang#6003 (comment)).
This change is very simple thanks to commit 48f27bc.
rfourquet added a commit to rfourquet/julia that referenced this issue Sep 30, 2014
This implements the generalization suggested by @ivarne
(JuliaLang#8257 (comment))
or by @lindahua
(JuliaLang#6003 (comment)).
This change is very simple thanks to commit 48f27bc.
rfourquet added a commit to rfourquet/julia that referenced this issue Sep 30, 2014
This implements the generalization suggested by @ivarne
(JuliaLang#8257 (comment))
or by @lindahua
(JuliaLang#6003 (comment)).
This change is very simple thanks to commit 48f27bc.
@ViralBShah
Copy link
Member

@lindahua Is this worth revisiting now in the 0.4 cycle?

@ViralBShah ViralBShah added the randomness Random number generation and the Random stdlib label Nov 22, 2014
@lindahua
Copy link
Contributor

I may look into this again next week.

@ViralBShah ViralBShah added this to the 0.4 milestone Nov 22, 2014
@tkelman
Copy link
Contributor

tkelman commented Sep 15, 2015

Did not make it into 0.4, removing from milestone.

@tkelman tkelman removed this from the 0.4.x milestone Sep 15, 2015
@StefanKarpinski StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016
@StefanKarpinski
Copy link
Member Author

Someone needs to own this to make it happen or it's going to slip again.

@StefanKarpinski
Copy link
Member Author

It's also a bit unclear what still needs to be done at this point.

@simonbyrne
Copy link
Contributor

simonbyrne commented Nov 10, 2016

The main difference now is that sample provides keyword arguments for replacement and ordering, e.g.

sample(1:10,8,replace=false)

The only reason we can't combine them is that we can't add more keyword arguments to existing method signatures without getting a method overwritten warning.

@simonbyrne
Copy link
Contributor

and the problem @JeffBezanson mentioned above:

julia> rand([1,2])
1

julia> rand((1,2))
1×2 Array{Float64,2}:
 0.966145  0.590825

(sample throws a MethodError on the tuple one)

@simonbyrne
Copy link
Contributor

simonbyrne commented Nov 10, 2016

After some discussion, the decision was made to leave it as is. rand and sample are indeed subtly different:

  • rand samples independently (either uniform [0,1], uniform from a discrete set, or from a specific distribution if used with Distributions),
  • sample allows for non-independent methods (e.g. without replacement), which are only applicable to finite sets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs decision A decision on this change is needed randomness Random number generation and the Random stdlib
Projects
None yet
Development

No branches or pull requests

10 participants