Change the definition of `wmedian` to `wquantile(., 0.5)` #436
Merged
Changes from 13 commits
25 commits
c66f39d Change the definition of `wmedian` to `wquantile(., 0.5)` (matthieugomez)
8b3cb26 deprecate wmedian and wquantile (matthieugomez)
49cdf8f Update weights.jl (matthieugomez)
0b96dd0 Update (matthieugomez)
551f440 Update deprecates.jl (matthieugomez)
2941e1f Update deprecates.jl (matthieugomez)
a368b4e Merge branch 'master' into master (matthieugomez)
4ea8934 rmv i (matthieugomez)
d87239b Merge branch 'master' of https://github.com/matthieugomez/StatsBase.jl (matthieugomez)
7849064 Update deprecates.jl (matthieugomez)
6ed920d Update deprecates.jl (matthieugomez)
c709017 Update weights.jl (matthieugomez)
6bdcf4e add non integer test (matthieugomez)
2cb7a5f update (matthieugomez)
43e2203 pass tests (matthieugomez)
fc136db Update weights.jl (matthieugomez)
125a30c empty lines (matthieugomez)
a7c9086 Update weights.jl (matthieugomez)
4a20b4c Update weights.jl (matthieugomez)
20f04af Update weights.jl (matthieugomez)
86b02d0 Update weights.jl (matthieugomez)
2a53a01 Update weights.jl (matthieugomez)
8a2ed42 Update weights.jl (matthieugomez)
85dd98a fetch (matthieugomez)
f09e1f8 Use pweights in deprecation (nalimilan)
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -473,116 +473,50 @@ _mean(A::AbstractArray{T}, w::AbstractWeights{W}, dims::Nothing) where {T<:Numbe | |
_mean(A::AbstractArray{T}, w::AbstractWeights{W}, dims::Int) where {T<:Number,W<:Real} = | ||
_mean!(similar(A, wmeantype(T, W), Base.reduced_indices(axes(A), dims)), A, w, dims) | ||
|
||
###### Weighted median ##### | ||
function median(v::AbstractArray, w::AbstractWeights) | ||
throw(MethodError(median, (v, w))) | ||
end | ||
|
||
""" | ||
median(v::RealVector, w::AbstractWeights) | ||
|
||
Compute the weighted median of `x`, using weights given by a weight vector `w` | ||
(of type `AbstractWeights`). The weight and data vectors must have the same length. | ||
|
||
The weighted median ``x_k`` is the element of `x` that satisfies | ||
``\\sum_{x_i < x_k} w_i \\le \\frac{1}{2} \\sum_{j} w_j`` and | ||
``\\sum_{x_i > x_k} w_i \\le \\frac{1}{2} \\sum_{j} w_j``. | ||
|
||
If a weight has value zero, then its associated data point is ignored. | ||
If none of the weights are positive, an error is thrown. | ||
`NaN` is returned if `x` contains any `NaN` values. | ||
An error is raised if `w` contains any `NaN` values. | ||
""" | ||
function median(v::RealVector, w::AbstractWeights{<:Real}) | ||
isempty(v) && error("median of an empty array is undefined") | ||
if length(v) != length(w) | ||
error("data and weight vectors must be the same size") | ||
end | ||
@inbounds for x in w.values | ||
isnan(x) && error("weight vector cannot contain NaN entries") | ||
end | ||
@inbounds for x in v | ||
isnan(x) && return x | ||
end | ||
mask = w.values .!= 0 | ||
if !any(mask) | ||
error("all weights are zero") | ||
end | ||
if all(w.values .<= 0) | ||
error("no positive weights found") | ||
end | ||
v = v[mask] | ||
wt = w[mask] | ||
midpoint = w.sum / 2 | ||
maxval, maxind = findmax(wt) | ||
if maxval > midpoint | ||
v[maxind] | ||
else | ||
permute = sortperm(v) | ||
cumulative_weight = zero(eltype(wt)) | ||
i = 0 | ||
for (_i, p) in enumerate(permute) | ||
i = _i | ||
if cumulative_weight == midpoint | ||
i += 1 | ||
break | ||
elseif cumulative_weight > midpoint | ||
cumulative_weight -= wt[p] | ||
break | ||
end | ||
cumulative_weight += wt[p] | ||
end | ||
if cumulative_weight == midpoint | ||
middle(v[permute[i-2]], v[permute[i-1]]) | ||
else | ||
middle(v[permute[i-1]]) | ||
end | ||
end | ||
end | ||
|
||
|
||
""" | ||
wmedian(v, w) | ||
|
||
Compute the weighted median of an array `v` with weights `w`, given as either a | ||
vector or an `AbstractWeights` vector. | ||
""" | ||
wmedian(v::RealVector, w::RealVector) = median(v, weights(w)) | ||
wmedian(v::RealVector, w::AbstractWeights{<:Real}) = median(v, w) | ||
|
||
###### Weighted quantile ##### | ||
|
||
|
||
""" | ||
quantile(v, w::AbstractWeights, p) | ||
|
||
Compute the weighted quantiles of a vector `v` at a specified set of probability | ||
values `p`, using weights given by a weight vector `w` (of type `AbstractWeights`). | ||
Weights must not be negative. The weights and data vectors must have the same length. | ||
|
||
With [`FrequencyWeights`](@ref), the function returns the same result as | ||
`quantile` for a vector with repeated values. | ||
With non `FrequencyWeights`, denote ``N`` the length of the vector, ``w`` the vector of weights, | ||
``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to the | ||
probability ``p`` and ``S_k = \\sum_{i<=k}w_i`` the cumulative weight for each | ||
`NaN` is returned if `x` contains any `NaN` values. An error is raised if `w` contains | ||
any `NaN` values. | ||
|
||
With [`FrequencyWeights`](@ref), the function returns the same result as the unweighted | ||
`quantile` applied to a vector with repeated values. The function returns an error if the | ||
weights are not of type `AbstractVector{<:Integer}`. | ||
|
||
With non `FrequencyWeights`, denote ``N`` the length of the vector, ``w`` the vector of | ||
weights, ``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to | ||
the probability ``p`` and ``S_k = \\sum_{i<=k}w_i`` the cumulative weight for each | ||
observation, define ``v_{k+1}`` the smallest element of `v` such that ``S_{k+1}`` | ||
is strictly superior to ``h``. The weighted ``p`` quantile is given by ``v_k + \\gamma (v_{k+1} -v_k)`` | ||
with ``\\gamma = (h - S_k)/(S_{k+1}-S_k)``. In particular, when `w` is a vector | ||
of ones, the function returns the same result as `quantile`. | ||
is strictly superior to ``h``. The weighted ``p`` quantile is given by | ||
``v_k + \\gamma (v_{k+1} -v_k)`` with ``\\gamma = (h - S_k)/(S_{k+1}-S_k)``. | ||
""" | ||
|
||
|
||
function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where {V,W<:Real} | ||
# checks | ||
isempty(v) && error("quantile of an empty array is undefined") | ||
isempty(p) && throw(ArgumentError("empty quantile array")) | ||
|
||
w.sum == 0 && error("weight vector cannot sum to zero") | ||
length(v) == length(w) || error("data and weight vectors must be the same size, got $(length(v)) and $(length(w))") | ||
length(v) == length(w) || error("data and weight vectors must be the same size, | ||
got $(length(v)) and $(length(w))") | ||
|
||
for x in w.values | ||
isnan(x) && error("weight vector cannot contain NaN entries") | ||
x < 0 && error("weight vector cannot contain negative entries") | ||
end | ||
|
||
|
||
|
||
isa(w, FrequencyWeights) && !(eltype(w) <: Integer) && any(!isinteger, w) && | ||
error("The values of the vector of `FrequencyWeights` must be numerically equal to | ||
integers. Use `ProbabilityWeights` or `AnalyticWeights` instead.") | ||
|
||
@inbounds for x in v | ||
[Inline review comment] To be 100% certain the return type is type-stable, better call
||
isnan(x) && return fill(x, length(p)) | ||
end | ||
|
||
# remove zeros weights and sort | ||
wsum = sum(w) | ||
nz = .!iszero.(w) | ||
|
@@ -598,7 +532,7 @@ function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where | |
out = Vector{typeof(zero(V)/1)}(undef, length(p)) | ||
fill!(out, vw[end][1]) | ||
|
||
# start looping on quantiles | ||
# loop on quantiles | ||
Sk, Skold = zero(W), zero(W) | ||
vk, vkold = zero(V), zero(V) | ||
k = 0 | ||
|
@@ -628,25 +562,23 @@ function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where | |
return out | ||
end | ||
|
||
# similarly to statistics.jl in Base | ||
# similar function in Base statistics.jl | ||
function bound_quantiles(qs::AbstractVector{T}) where T<:Real | ||
|
||
epsilon = 100 * eps() | ||
if (any(qs .< -epsilon) || any(qs .> 1+epsilon)) | ||
throw(ArgumentError("quantiles out of [0,1] range")) | ||
end | ||
T[min(one(T), max(zero(T), q)) for q = qs] | ||
end | ||
|
||
quantile(v::RealVector, w::AbstractWeights{<:Real}, p::Number) = quantile(v, w, [p])[1] | ||
|
||
|
||
|
||
###### Weighted median ##### | ||
""" | ||
wquantile(v, w, p) | ||
median(v::RealVector, w::AbstractWeights) | ||
|
||
Compute the `p`th quantile(s) of `v` with weights `w`, given as either a vector | ||
or an `AbstractWeights` vector. | ||
Compute the weighted median of `v` with weights `w` | ||
(of type `AbstractWeights`). See the documentation for [`quantile`](@ref) for more details. | ||
""" | ||
wquantile(v::RealVector, w::AbstractWeights{<:Real}, p::RealVector) = quantile(v, w, p) | ||
wquantile(v::RealVector, w::AbstractWeights{<:Real}, p::Number) = quantile(v, w, [p])[1] | ||
wquantile(v::RealVector, w::RealVector, p::RealVector) = quantile(v, weights(w), p) | ||
wquantile(v::RealVector, w::RealVector, p::Number) = quantile(v, weights(w), [p])[1] | ||
median(v::RealVector, w::AbstractWeights{<:Real}) = quantile(v, w, 0.5) |
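The interpolation rule described in the `quantile` docstring above (the non-`FrequencyWeights` case) can be transcribed into a few lines of plain Julia. This is only a sketch for reading along, not the code from this PR: the name `sketch_wquantile` is hypothetical, it takes plain vectors rather than `AbstractWeights`, it handles a single probability `p`, and it assumes that `p = 1` returns the largest observation.

```julia
# Hypothetical helper, NOT part of StatsBase: a direct transcription of the
# docstring formula for non-frequency weights, for one probability `p`.
function sketch_wquantile(v::AbstractVector{<:Real}, w::AbstractVector{<:Real}, p::Real)
    length(v) == length(w) || throw(ArgumentError("v and w must have the same length"))
    0 <= p <= 1 || throw(ArgumentError("p must be in [0, 1]"))
    any(x -> isnan(x) || x < 0, w) && throw(ArgumentError("weights must be non-negative and not NaN"))

    keep = findall(!iszero, w)              # zero-weight observations are ignored
    isempty(keep) && throw(ArgumentError("all weights are zero"))
    order = keep[sortperm(v[keep])]         # kept indices, ordered by data value
    vs, ws = v[order], w[order]

    h = p * (sum(ws) - ws[1]) + ws[1]       # cumulative weight matching probability p
    S = cumsum(ws)                          # S[k] = w_1 + ... + w_k
    k1 = findfirst(>(h), S)                 # index k+1: first S strictly above h
    # Assumption: when no S exceeds h (p == 1), return the largest value.
    k1 === nothing && return float(vs[end])

    k = k1 - 1                              # h >= w_1 = S[1], so k >= 1 here
    γ = (h - S[k]) / (S[k1] - S[k])
    return vs[k] + γ * (vs[k1] - vs[k])
end

sketch_wquantile([1, 2, 3, 4], [1, 1, 1, 1], 0.5)   # unit weights: 2.5, the ordinary median
sketch_wquantile([3, 1, 2], [1, 1, 2], 0.5)         # 1.75
```

With unit weights, h = 0.5 * (4 - 1) + 1 = 2.5 falls between S_2 = 2 and S_3 = 3, so γ = 0.5 and the result is 2 + 0.5 * (3 - 2) = 2.5.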
This makes me realize that we probably shouldn't accept `Weights` objects in `quantile` and `median` since their meaning is ambiguous: they could be frequency weights, or another type of weights. So here we'd better recommend using `pweights` or `fweights` (whichever is closer to the current behavior of `wmedian`).

Isn't it ok to assume that default weights are probability weights, as it is now? Honestly, most of the time these differences don't matter, and I don't want users starting to feel overwhelmed by the exact weight type they should use.
In Stata:

Most of the time, but clearly not in this function! :-)
Contrary to Stata, our commands don't "tell" the user what "idea of the 'natural' kind of weight" they have. So it would really not help users to make silent assumptions behind their back. Anyway if people have weights they would better declare them using the right type as early as possible so that they get the right answer automatically.
It's too bad that contrary to Stata our definition of quantiles isn't independent from the type of weights, but at this point we'll just have to bite the bullet.

I'd like that too. I don't think it's too late.

I don't really like commands that talk to the user. Either a command succeeds and it should just do what is requested, or it doesn't and it should indicate a way to make it succeed. Printing warnings during normal operation is just annoying, we'd better explain how to do things properly. Adding `f`, `p` or `a` in front of `weights` isn't that costly, and anyway if users don't know what letter to use they won't understand the warning (and likely do something incorrect).

Also I don't think it is common for Julia functions to print messages like that.

I was talking about the second point ;)
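
To make the point about weight types concrete, here is a small numerical sketch. It is not taken from the PR: the data, the weights, and all variable names are made up, and it uses only the `Statistics` standard library rather than StatsBase. It compares the two readings described in the docstring: frequency weights as repetition counts, and the interpolation formula on cumulative weights.

```julia
using Statistics

v = [1, 2, 3, 4]          # illustrative data, already sorted
w = [1, 1, 2, 4]          # illustrative weights, all positive integers

# Frequency reading: each value is repeated w[i] times, then the ordinary
# unweighted quantile is taken.
expanded = reduce(vcat, [fill(x, n) for (x, n) in zip(v, w)])
q_freq = quantile(expanded, 0.5)

# Interpolation reading: apply the docstring formula with p = 0.5.
h = 0.5 * (sum(w) - w[1]) + w[1]              # 4.5
S = cumsum(w)                                 # [1, 2, 4, 8]
k1 = findfirst(>(h), S)                       # first cumulative weight above h
γ = (h - S[k1 - 1]) / (S[k1] - S[k1 - 1])
q_interp = v[k1 - 1] + γ * (v[k1] - v[k1 - 1])

println((q_freq, q_interp))                   # prints (3.5, 3.125)
```

The same numbers give 3.5 under the repetition reading and 3.125 under the interpolation reading, which is the kind of discrepancy that makes an untyped weight vector ambiguous here.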