-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
modify weighted quantile (aweights + fweights) #316
Conversation
Codecov Report
@@ Coverage Diff @@
## master #316 +/- ##
=========================================
Coverage ? 91.07%
=========================================
Files ? 18
Lines ? 1960
Branches ? 0
=========================================
Hits ? 1785
Misses ? 175
Partials ? 0
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Do you have a reference regarding the choice of normalization for non frequency weights? In particular, why should we normalize to 1 rather than e.g. to N
or sum(w)
? If not, I can survey what other software do.
src/weights.jl
Outdated
This corresponds to R-7, Excel, SciPy-(1,1) and Maple-6 when `w` contains only ones | ||
(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)). | ||
With frequency weights, the function returns the same result as `quantile` for a vector with repeated values. | ||
With non frequency weights, denote N the length of the vector, w the vector of weights normalized to sum to 1, `h = p (N - 1) + 1` and ``S_k = 1 + (k-1) * wk + (N-1) \\sum_{i<=k}w_i/\\sum_{i<=N}w_i``, define ``x_{k+1}`` the smallest element of `x` such that ``S_{k+1}`` is strictly superior to `h`. The function returns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you say what S
and h
represent?
Remove the double spaces, use double backticks everywhere (including around variable names) and break lines at 92 chars. Also better have [frequency weights](@ref FrequencyWeights)
.
src/weights.jl
Outdated
""" | ||
function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where {V,W<:Real} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No line breaks before nor after the signature. I think the style of this package is not to have spaces after commas for type parameters.
src/weights.jl
Outdated
vw = sort!(collect(zip(v, w.values))) | ||
wvalues = w.values | ||
nz = find(w.values) | ||
#normalize if non frequencyweight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after # (also below) and between "frequency" and "weights". Better also say that the sum will be 1.
src/weights.jl
Outdated
# full sort | ||
vw = sort!(collect(zip(v, w.values))) | ||
wvalues = w.values | ||
nz = find(w.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use find(!iszero, w.values)
since the previous form is deprecated on 0.7. Or probably even better, just do nz = .!iszero.(w.values)
, since a boolean vector will be more efficient (you know the size in advance and it's smaller).
Also move this below to the place where it's used for clarity.
Finally, why is that operation needed for frequency weights? Shouldn't the algorithm be able to skip entries with zero weights on its own?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the algorithm does not work with zero weights in its current form. There is an issue for instance if the highest value has zero weight. The algorithm also only keeps track of the last visited value, whereas it should keep track of the last visiting value with non zero weight. I just think it's simpler to remove the zero values.
src/weights.jl
Outdated
cumulative_weight, Sk, Skold = zero(W), zero(W), zero(W) | ||
vk, vkold = zero(V), zero(V) | ||
k = 1 | ||
Sk, Skold = zero(W), zero(W) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double space.
src/weights.jl
Outdated
# happens when N or p or wsum equal zero | ||
out[ppermute[i]] = vw[1][1] | ||
else | ||
if isa(w, FrequencyWeights) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels weird to use a completely different path for frequency weights. Couldn't a common path be defined, moving some type-specific computations out of the loop like you did for normalization? For example, for h = p[i] * (wsum - 1) + 1
, wsum
just needs to be replaced with N
for non frequency weights, so you could as well define the variable before the loop.
BTW, wouldn't it make more sense to normalize the weights to sum to N
(rather than 1), since it plays a role equivalent to wsum
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have thought about this but I think it is better to do two different paths. Joining the two looks more confusing than enlightening.
test/weights.jl
Outdated
# zero don't count | ||
x = [1, 2, 3, 4, 5] | ||
@test quantile(x, fweights([0,1,1,1,0]), p) ≈ quantile([2, 3, 4], p) | ||
# repetitions dont count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"don't"
test/weights.jl
Outdated
end | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed.
test/weights.jl
Outdated
) | ||
p = [0.0, 0.25, 0.5, 0.75, 1.0] | ||
|
||
srand(10) | ||
for i = 1:length(data) | ||
@test quantile(data[i], wt[i], p) ≈ quantile_answers[i] | ||
@test quantile(data[i], aweights(wt[i]), p) ≈ quantile_answers[i] atol = 1e-3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1e-3 sounds quite high, why not keep more precision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment hasn't been addressed.
BTW, it would be nice to duplicate each @test
line to test pweights
too.
test/weights.jl
Outdated
) | ||
p = [0.0, 0.25, 0.5, 0.75, 1.0] | ||
for x in data | ||
@test quantile(x, fweights(ones(Int64, length(x))), p) ≈ quantile(x, p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to test more thoroughly the results by combining various inputs with various weights, as done below with non frequency weights. You could use a rep
helper function (JuliaLang/julia#16443) to generate vectors with repeated entries to call the unweighted quantile
method on.
Also, there aren't any zeros no negative non-integer values. Wouldn't hurt to add some.
Thanks for the detailed comments. I commented on the ones I disagreed with. |
But why is it more problematic to give different results depending on the sum of weights than to normalize them arbitrarily to 1? Is there a rationale or a precedent in other software? That combined with the fact that we don't even use the same algorithm for frequency and other kinds of weights makes it sound like a totally arbitrary choice. For example, Hmisc::wtd.quantile normalizes to |
Take an algorithm that does not depend on the sum of weight. You can always
rewrite it as an algorithm that takes a vector of weight that sum to one.
Rewriting it like that allows to replace all the expressions in sum wi by
1. That’s all there is. This does not mean that the normalization is
arbitrary.
…On Thu, Nov 9, 2017 at 2:23 PM Milan Bouchet-Valat ***@***.***> wrote:
But why is it more problematic to give different results depending on the
sum of weights than to normalize them arbitrarily to 1? Is there a
rationale or a precedent in other software? That combined with the fact
that we don't even use the same algorithm for frequency and other kinds of
weights makes it sound like a totally arbitrary choice. For example,
Hmisc::wtd.quantile normalizes to N when normwt=TRUE, and does not
provide an argument to normalize to 1. Is that better? Is that worse? I
wouldn't want to make a choice without solid arguments.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#316 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AF733Zfl61eoMzyFFc9gOhFQ0FzVSX8gks5s03t6gaJpZM4QXQaz>
.
|
Ok I have updated the definition of aweights. There are now only two differences between fweights and aweights in the algorithm. These differences are important. The intuition for why they need to be different is that fweights give you more datapoints on the empirical CDF than aweights do. quantile([1, 2], fweights([1, 2]), 0.5) = 2
# but
quantile([1, 2], weights([1, 2]), 0.5) < 2 |
OK, I trust you when you say it's more intuitive, that's not completely obvious to me. ;-) Is there any way to check our results against another implementation? We don't return the same results as I've also found a MIT-licensed implementation of weighted quantiles for MATLAB (they use the R-8 definition by default, but support all variants). Unfortunately, even if they say that equal weights should give the same results as the unweighted version, that doesn't appear to be the case: >> iosr.statistics.quantile([7, 1, 2, 4, 10], [0, .25, .5, .75, 1], [], ['R-7'])
ans =
1 2 4 7 10
>> iosr.statistics.quantile([7, 1, 2, 4, 10], [0, .25, .5, .75, 1], [], ['R-7'], [1, 1, 1, 1, 1])
ans =
1.0000 1.5000 3.0000 5.5000 10.0000 It's incredible how hard it is to find completely reliable implementations, and the existence of so many variants makes it difficult to compare them. |
test/weights.jl
Outdated
f([1, 2, 3, 4, 5]), | ||
f([0.1, 0.2, 0.3, 0.2, 0.1]), | ||
f([1, 1, 1, 1, 1]), | ||
Int[3, 1, 1, 1, 3], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why specify Int
?
src/weights.jl
Outdated
|
||
""" | ||
quantile(v, w::AbstractWeights, p) | ||
|
||
Compute the weighted quantiles of a vector `x` at a specified set of probability | ||
values `p`, using weights given by a weight vector `w` (of type `AbstractWeights`). | ||
Compute the weighted quantiles of a vector ``x`` at a specified set of probability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, references to Julia objects should use single backticks... :-) And v
should be changed to x
in the signature above to match descriptions below (or the other way around).
src/weights.jl
Outdated
|
||
This corresponds to R-7, Excel, SciPy-(1,1) and Maple-6 when `w` contains only ones | ||
(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)). | ||
With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing backticks here and below.
src/weights.jl
Outdated
(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)). | ||
With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as | ||
`quantile` for a vector with repeated values. | ||
With non FrequencyWeights, denote N the length of the vector, w the vector of weights, ``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to the probability `p` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cut line at 92 chars. Still missing backticks.
src/weights.jl
Outdated
define ``x_{k+1}`` the smallest element of ``x`` such that ``S_{k+1}`` is strictly | ||
superior to ``h``. The function returns``x_k + \\gamma (x_{k+1} -x_k)`` | ||
with ``\\gamma = (h - S_k)/(S_{k+1}-S_k)``. In particular, when ``w`` is a vector | ||
of one, the function returns the same result as `quantile`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"of ones". Also remove indentation on these lines.
src/weights.jl
Outdated
With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as | ||
`quantile` for a vector with repeated values. | ||
With non FrequencyWeights, denote N the length of the vector, w the vector of weights, ``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to the probability `p` and | ||
``S_k = \\sum_{i<=k}w_i`` the cumulative weight for to each observation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"for to"
src/weights.jl
Outdated
|
||
wsum = w.sum | ||
#remove zeros weights and sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after #
.
src/weights.jl
Outdated
|
||
wsum = w.sum | ||
#remove zeros weights and sort | ||
wsum = sum(w.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use sum(w)
, which is equivalent to w.sum
thanks to a special method. I don't think you need to access the values
field below either: AbstractWeight
implements the AbstractVector
interface (but does not guarantee that the values
field exists).
src/weights.jl
Outdated
vk, vkold = zero(V), zero(V) | ||
k = 1 | ||
Sk, Skold = zero(W), zero(W) | ||
vk, vkold= zero(V), zero(V) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space before =
.
test/weights.jl
Outdated
) | ||
p = [0.0, 0.25, 0.5, 0.75, 1.0] | ||
|
||
srand(10) | ||
for i = 1:length(data) | ||
@test quantile(data[i], wt[i], p) ≈ quantile_answers[i] | ||
@test quantile(data[i], aweights(wt[i]), p) ≈ quantile_answers[i] atol = 1e-3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment hasn't been addressed.
BTW, it would be nice to duplicate each @test
line to test pweights
too.
BTW, should |
Bump. Do you want to finish this? |
Ok. I've updated it to incorporate your comments. I think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I've pushed a few fixes to get the tests to pass.
Thanks a lot! Really appreciate your help |
What is the status on this PR / branch? It seems like the functionality is there. Are the tests not passing? |
It's just that I had left the PR open in case somebody wanted to comment, but nobody objected nor merged it. |
Solves #313
The quantile method for frequency weights is now equivalent to the unweighted method with a vector of repeated values
The quantile method for non frequency weight now removes zeros.