Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the definition of wmedian to wquantile(., 0.5) #436

Merged
merged 25 commits into from
Feb 18, 2019
Merged
Show file tree
Hide file tree
Changes from 13 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions src/deprecates.jl
Original file line number Diff line number Diff line change
Expand Up @@ -26,3 +26,10 @@ end
@deprecate scattermat_zm(x::DenseMatrix, wv::AbstractWeights, dims::Int) scattermat_zm(x::DenseMatrix, wv::AbstractWeights, dims=dims)
@deprecate mean!(R::AbstractArray, A::AbstractArray, w::AbstractWeights, dims::Int) mean!(R, A, w, dims=dims)
@deprecate mean(A::AbstractArray{T}, w::AbstractWeights{W}, dims::Int) where {T<:Number,W<:Real} mean(A, w, dims=dims)

@deprecate wquantile(v::RealVector, w::AbstractWeights{<:Real}, p::RealVector) quantile(v, w, p)
@deprecate wquantile(v::RealVector, w::AbstractWeights{<:Real}, p::Number) quantile(v, w, [p])[1]
@deprecate wquantile(v::RealVector, w::RealVector, p::RealVector) quantile(v, weights(w), p)
@deprecate wquantile(v::RealVector, w::RealVector, p::Number) quantile(v, weights(w), [p])[1]
@deprecate wmedian(v::RealVector, w::AbstractWeights{<:Real}) median(v, w)
@deprecate wmedian(v::RealVector, w::RealVector) median(v, weights(w))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me realize that we probably shouldn't accept Weights objects in quantile and median since their meaning is ambiguous: they could be frequency weights, or another type of weights. So here we'd better recommend using pweightsor fweights (whichever is closer to the current behavior of wmedian).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it ok to assume that default weights are probability weights, as it is now? Honestly, most of the time these differences don't matter, and I don't want users starting to feel overwhelmed by the exact weight type they should use.

In Stata:

Each command has its own idea of the "natural" kind of weight. The command will tell you what kind of weight it is assuming and perform the request as if you specified that kind of weight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the time, but clearly not in this function! :-)

Contrary to Stata, our commands don't "tell" the user what "idea of the 'natural' kind of weight" they have. So it would really not help users to make silent assumptions behind their back. Anyway if people have weights they would better declare them using the right type as early as possible so that they get the right answer automatically.

It's too bad that contrary to Stata our definition of quantiles isn't independent from the type of weights, but at this point we'll just have to bite the bullet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like that too. I don't think it's too late.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like commands that talk to the user. Either a command succeeds and it should just do what is requested, or it doesn't and it should indicate a way to make it succeed. Printing warnings during normal operation is just annoying, we'd better explain how to do things properly. Adding f, p or a in front of weights isn't that costly, and anyway if users doesn't know what letter to use they won't understand the warning (and likely do something incorrect).

Also I don't think is common for Julia functions to print messages like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking about the second point ;)

128 changes: 30 additions & 98 deletions src/weights.jl
Original file line number Diff line number Diff line change
Expand Up @@ -473,116 +473,50 @@ _mean(A::AbstractArray{T}, w::AbstractWeights{W}, dims::Nothing) where {T<:Numbe
_mean(A::AbstractArray{T}, w::AbstractWeights{W}, dims::Int) where {T<:Number,W<:Real} =
_mean!(similar(A, wmeantype(T, W), Base.reduced_indices(axes(A), dims)), A, w, dims)

###### Weighted median #####
function median(v::AbstractArray, w::AbstractWeights)
throw(MethodError(median, (v, w)))
end

"""
median(v::RealVector, w::AbstractWeights)

Compute the weighted median of `x`, using weights given by a weight vector `w`
(of type `AbstractWeights`). The weight and data vectors must have the same length.

The weighted median ``x_k`` is the element of `x` that satisfies
``\\sum_{x_i < x_k} w_i \\le \\frac{1}{2} \\sum_{j} w_j`` and
``\\sum_{x_i > x_k} w_i \\le \\frac{1}{2} \\sum_{j} w_j``.

If a weight has value zero, then its associated data point is ignored.
If none of the weights are positive, an error is thrown.
`NaN` is returned if `x` contains any `NaN` values.
An error is raised if `w` contains any `NaN` values.
"""
function median(v::RealVector, w::AbstractWeights{<:Real})
isempty(v) && error("median of an empty array is undefined")
if length(v) != length(w)
error("data and weight vectors must be the same size")
end
@inbounds for x in w.values
isnan(x) && error("weight vector cannot contain NaN entries")
end
@inbounds for x in v
isnan(x) && return x
end
mask = w.values .!= 0
if !any(mask)
error("all weights are zero")
end
if all(w.values .<= 0)
error("no positive weights found")
end
v = v[mask]
wt = w[mask]
midpoint = w.sum / 2
maxval, maxind = findmax(wt)
if maxval > midpoint
v[maxind]
else
permute = sortperm(v)
cumulative_weight = zero(eltype(wt))
i = 0
for (_i, p) in enumerate(permute)
i = _i
if cumulative_weight == midpoint
i += 1
break
elseif cumulative_weight > midpoint
cumulative_weight -= wt[p]
break
end
cumulative_weight += wt[p]
end
if cumulative_weight == midpoint
middle(v[permute[i-2]], v[permute[i-1]])
else
middle(v[permute[i-1]])
end
end
end


"""
wmedian(v, w)

Compute the weighted median of an array `v` with weights `w`, given as either a
vector or an `AbstractWeights` vector.
"""
wmedian(v::RealVector, w::RealVector) = median(v, weights(w))
wmedian(v::RealVector, w::AbstractWeights{<:Real}) = median(v, w)

###### Weighted quantile #####


"""
quantile(v, w::AbstractWeights, p)

Compute the weighted quantiles of a vector `v` at a specified set of probability
values `p`, using weights given by a weight vector `w` (of type `AbstractWeights`).
Weights must not be negative. The weights and data vectors must have the same length.

With [`FrequencyWeights`](@ref), the function returns the same result as
`quantile` for a vector with repeated values.
With non `FrequencyWeights`, denote ``N`` the length of the vector, ``w`` the vector of weights,
``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to the
probability ``p`` and ``S_k = \\sum_{i<=k}w_i`` the cumulative weight for each
`NaN` is returned if `x` contains any `NaN` values. An error is raised if `w` contains
any `NaN` values.

With [`FrequencyWeights`](@ref), the function returns the same result as the unweighted
`quantile` applied to a vector with repeated values. The function returns an error if the
weights are not of type `AbstractVector{<:Integer}`.
matthieugomez marked this conversation as resolved.
Show resolved Hide resolved
With non `FrequencyWeights`, denote ``N`` the length of the vector, ``w`` the vector of
nalimilan marked this conversation as resolved.
Show resolved Hide resolved
weights, ``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to
the probability ``p`` and ``S_k = \\sum_{i<=k}w_i`` the cumulative weight for each
observation, define ``v_{k+1}`` the smallest element of `v` such that ``S_{k+1}``
is strictly superior to ``h``. The weighted ``p`` quantile is given by ``v_k + \\gamma (v_{k+1} -v_k)``
with ``\\gamma = (h - S_k)/(S_{k+1}-S_k)``. In particular, when `w` is a vector
of ones, the function returns the same result as `quantile`.
is strictly superior to ``h``. The weighted ``p`` quantile is given by
``v_k + \\gamma (v_{k+1} -v_k)`` with ``\\gamma = (h - S_k)/(S_{k+1}-S_k)``.
"""

matthieugomez marked this conversation as resolved.
Show resolved Hide resolved
function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where {V,W<:Real}
# checks
isempty(v) && error("quantile of an empty array is undefined")
isempty(p) && throw(ArgumentError("empty quantile array"))

w.sum == 0 && error("weight vector cannot sum to zero")
length(v) == length(w) || error("data and weight vectors must be the same size, got $(length(v)) and $(length(w))")
length(v) == length(w) || error("data and weight vectors must be the same size,
got $(length(v)) and $(length(w))")
matthieugomez marked this conversation as resolved.
Show resolved Hide resolved
for x in w.values
isnan(x) && error("weight vector cannot contain NaN entries")
x < 0 && error("weight vector cannot contain negative entries")
end


matthieugomez marked this conversation as resolved.
Show resolved Hide resolved
isa(w, FrequencyWeights) && !(eltype(w) <: Integer) && any(!isinteger, w) &&
error("The values of the vector of `FrequencyWeights` must be numerically equal to
integers. Use `ProbabilityWeights` or `AnalyticWeights` instead.")

@inbounds for x in v
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be 100% certain the return type is type-stable, better call fill!(x, out) below, after allocating out.

isnan(x) && return fill(x, length(p))
end

# remove zeros weights and sort
wsum = sum(w)
nz = .!iszero.(w)
Expand All @@ -598,7 +532,7 @@ function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where
out = Vector{typeof(zero(V)/1)}(undef, length(p))
fill!(out, vw[end][1])

# start looping on quantiles
# loop on quantiles
Sk, Skold = zero(W), zero(W)
vk, vkold = zero(V), zero(V)
k = 0
Expand Down Expand Up @@ -628,25 +562,23 @@ function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where
return out
end

# similarly to statistics.jl in Base
# similar function in Base statistics.jl
function bound_quantiles(qs::AbstractVector{T}) where T<:Real
matthieugomez marked this conversation as resolved.
Show resolved Hide resolved
epsilon = 100 * eps()
if (any(qs .< -epsilon) || any(qs .> 1+epsilon))
throw(ArgumentError("quantiles out of [0,1] range"))
end
T[min(one(T), max(zero(T), q)) for q = qs]
end

quantile(v::RealVector, w::AbstractWeights{<:Real}, p::Number) = quantile(v, w, [p])[1]



###### Weighted median #####
"""
wquantile(v, w, p)
median(v::RealVector, w::AbstractWeights)

Compute the `p`th quantile(s) of `v` with weights `w`, given as either a vector
or an `AbstractWeights` vector.
Compute the weighted median of `v` with weights `w`
(of type `AbstractWeights`). See the documentation for [`quantile`](@ref) for more details.
"""
wquantile(v::RealVector, w::AbstractWeights{<:Real}, p::RealVector) = quantile(v, w, p)
wquantile(v::RealVector, w::AbstractWeights{<:Real}, p::Number) = quantile(v, w, [p])[1]
wquantile(v::RealVector, w::RealVector, p::RealVector) = quantile(v, weights(w), p)
wquantile(v::RealVector, w::RealVector, p::Number) = quantile(v, weights(w), [p])[1]
median(v::RealVector, w::AbstractWeights{<:Real}) = quantile(v, w, 0.5)
120 changes: 28 additions & 92 deletions test/weights.jl
Original file line number Diff line number Diff line change
Expand Up @@ -233,94 +233,6 @@ end
end
end

@testset "Median $f" for f in weight_funcs
data = (
[7, 1, 2, 4, 10],
[7, 1, 2, 4, 10],
[7, 1, 2, 4, 10, 15],
[1, 2, 4, 7, 10, 15],
[0, 10, 20, 30],
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[30, 40, 50, 60, 35],
[2, 0.6, 1.3, 0.3, 0.3, 1.7, 0.7, 1.7, 0.4],
[3.7, 3.3, 3.5, 2.8],
[100, 125, 123, 60, 45, 56, 66],
[2, 2, 2, 2, 2, 2],
[2.3],
[-2, -3, 1, 2, -10],
[1, 2, 3, 4, 5],
[5, 4, 3, 2, 1],
[-2, 2, -1, 3, 6],
[-10, 1, 1, -10, -10],
[2, 4],
[2, 2, 4, 4],
[2, 2, 2, 4]
)
wt = (
[1, 1/3, 1/3, 1/3, 1],
[1, 1, 1, 1, 1],
[1, 1/3, 1/3, 1/3, 1, 1],
[1/3, 1/3, 1/3, 1, 1, 1],
[30, 191, 9, 0],
[10, 1, 1, 1, 9],
[10, 1, 1, 1, 900],
[1, 3, 5, 4, 2],
[2, 2, 0, 1, 2, 2, 1, 6, 0],
[5, 5, 4, 1],
[30, 56, 144, 24, 55, 43, 67],
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
[12],
[7, 1, 1, 1, 6],
[1, 0, 0, 0, 2],
[1, 2, -3, 4, -5],
[0.1, 0.2, 0.3, -0.2, 0.1],
[-1, -1, -1, -1, 1],
[1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1]
)
median_answers = (7.0, 4.0, 8.5,
8.5, 10.0, 2.5,
5.0, 50.0, 1.7,
3.5, 100.0, 2.0,
2.3, -2.0, 5.0,
2.0, -1.0, -10.0,
3.0, 3.0, 2.0)
num_tests = length(data)
for i = 1:num_tests
@test wmedian(data[i], wt[i]) == median_answers[i]
@test wmedian(data[i], f(wt[i])) == median_answers[i]
@test median(data[i], f(wt[i])) == median_answers[i]
for j = 1:100
# Make sure the weighted median does not change if the data
# and weights are reordered.
reorder = sortperm(rand(length(data[i])))
@test median(data[i][reorder], f(wt[i][reorder])) == median_answers[i]
end
end
data = [4, 3, 2, 1]
wt = [0, 0, 0, 0]
@test_throws MethodError wmedian(data[1])
@test_throws ErrorException median(data, f(wt))
@test_throws ErrorException wmedian(data, wt)
@test_throws ErrorException median((Float64)[], f((Float64)[]))
wt = [1, 2, 3, 4, 5]
@test_throws ErrorException median(data, f(wt))
@test_throws MethodError median([4 3 2 1 0], f(wt))
@test_throws MethodError median([[1 2];[4 5];[7 8];[10 11];[13 14]], f(wt))
data = [1, 3, 2, NaN, 2]
@test isnan(median(data, f(wt)))
wt = [1, 2, NaN, 4, 5]
@test_throws ErrorException median(data, f(wt))
data = [1, 3, 2, 1, 2]
@test_throws ErrorException median(data, f(wt))
wt = [-1, -1, -1, -1, -1]
@test_throws ErrorException median(data, f(wt))
wt = [-1, -1, -1, 0, 0]
@test_throws ErrorException median(data, f(wt))
end


# Quantile fweights
@testset "Quantile fweights" begin
Expand Down Expand Up @@ -392,6 +304,10 @@ end
@test quantile([1, 2, 3, 4, 5], fweights([0,1,2,1,0]), p) ≈ quantile([2, 3, 3, 4], p)
@test quantile([1, 2], fweights([1, 1]), 0.25) ≈ 1.25
@test quantile([1, 2], fweights([2, 2]), 0.25) ≈ 1.0

# test non integer frequency weights
quantile([1, 2], fweights([1.0, 2.0]), 0.25) == quantile([1, 2], fweights([1, 2]), 0.25)
@test_throws ErrorException quantile([1, 2], fweights([1.5, 2.0]), 0.25)
end

@testset "Quantile aweights, pweights and weights" for f in (aweights, pweights, weights)
Expand Down Expand Up @@ -491,10 +407,30 @@ end
w = [1, 1/3, 1/3, 1/3, 1]
answer = 6.0
@test quantile(data[1], f(w), 0.5) ≈ answer atol = 1e-5
@test wquantile(data[1], f(w), [0.5]) ≈ [answer] atol = 1e-5
@test wquantile(data[1], f(w), 0.5) ≈ answer atol = 1e-5
@test wquantile(data[1], w, [0.5]) ≈ [answer] atol = 1e-5
@test wquantile(data[1], w, 0.5) ≈ answer atol = 1e-5
end


@testset "Median $f" for f in weight_funcs
data = [4, 3, 2, 1]
wt = [0, 0, 0, 0]
@test_throws ErrorException median(data, f(wt))
@test_throws ErrorException median((Float64)[], f((Float64)[]))
matthieugomez marked this conversation as resolved.
Show resolved Hide resolved
wt = [1, 2, 3, 4, 5]
@test_throws ErrorException median(data, f(wt))
if VERSION >= v"1.0"
@test_throws MethodError median([4 3 2 1 0], f(wt))
@test_throws MethodError median([[1 2];[4 5];[7 8];[10 11];[13 14]], f(wt))
matthieugomez marked this conversation as resolved.
Show resolved Hide resolved
end
data = [1, 3, 2, NaN, 2]
@test isnan(median(data, f(wt)))
wt = [1, 2, NaN, 4, 5]
@test_throws ErrorException median(data, f(wt))
data = [1, 3, 2, 1, 2]
@test_throws ErrorException median(data, f(wt))
wt = [-1, -1, -1, -1, -1]
@test_throws ErrorException median(data, f(wt))
wt = [-1, -1, -1, 0, 0]
@test_throws ErrorException median(data, f(wt))
end
matthieugomez marked this conversation as resolved.
Show resolved Hide resolved

end # @testset StatsBase.Weights