Quantiles don't respect weights of 0 #313
@matthieugomez Remarks?
The implementation in StatsBase points to http://stats.stackexchange.com/questions/13169/defining-quantiles-over-a-weighted-sample, which doesn't provide a reference or theoretical guarantees. My understanding is that any subtype of

For context, my use case is creating approximate quantiles from a histogram (where bins may contain zero observations). If anyone has a good reference for weighted quantiles (a few Google searches suggest there's no well-established method), I would be happy to take a look.
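To make the histogram use case concrete, here is a minimal sketch, assuming StatsBase's `fit(Histogram, ...)`, `fweights`, and the weighted `quantile` method; the bin-center computation is ad hoc and the edge grid is arbitrary:

```julia
using Statistics, StatsBase

# Build a histogram whose outermost bins may well end up with zero counts.
data = randn(1000)
h = fit(Histogram, data, -5:0.5:5)

# Use bin midpoints as values and bin counts as frequency weights.
centers = collect((h.edges[1][1:end-1] .+ h.edges[1][2:end]) ./ 2)

# Approximate quantiles from the histogram; empty bins should not affect the result.
quantile(centers, fweights(h.weights), [0.25, 0.5, 0.75])
```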
These issues are really tricky. As Wikipedia notes, there are 9 different ways of computing sample quantiles. AFAIK we use the same definition as R (R-7), and indeed we get the same result for unweighted quantiles. For weighted quantiles, the implementation by @matthieugomez from #117 gives the same result as the unweighted quantiles when all weights are 1. But then it doesn't appear to follow the (a priori natural) property that it's equivalent to replicating each entry the number of times corresponding to its integer weight:

```julia
julia> quantile([2,3,3,4], [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0

julia> quantile([2,3,3,4], fweights([1,1,1,1]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0

julia> quantile([2,3,4], fweights([1,2,1]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0
 2.5
 3.0
 3.5
 4.0

julia> quantile([1,2,3,4,5], fweights([0,1,2,1,0]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 1.0
 2.42857
 3.0
 3.57143
 5.0
```

Note that with R's Hmisc package, weighted and unweighted results are consistent, except in the presence of zero weights (last case). I've filed harrelfe/Hmisc#81 to try to get more details.

```r
> quantile(c(2, 3, 3, 4))
  0%  25%  50%  75% 100%
2.00 2.75 3.00 3.25 4.00
> Hmisc::wtd.quantile(c(2,3,3,4), c(1,1,1,1), seq(0, 1, by=.25))
  0%  25%  50%  75% 100%
2.00 2.75 3.00 3.25 4.00
> Hmisc::wtd.quantile(c(2,3,4), c(1,2,1), seq(0, 1, by=.25))
  0%  25%  50%  75% 100%
2.00 2.75 3.00 3.25 4.00
> Hmisc::wtd.quantile(c(1,2,3,4,5), c(0,1,2,1,0), seq(0, 1, by=.25))
   0%   25%   50%   75%  100%
2.000 2.750 3.000 3.375 4.500
```
More precisely, this is the case for
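For reference, the replication property being invoked above can be checked directly against the unweighted definition. This is only a sketch using the Statistics stdlib; the expansion comprehension is an ad-hoc helper, not StatsBase code:

```julia
using Statistics: quantile

# Replicate each value by its integer weight; zero-weight entries simply disappear.
x, w = [1, 2, 3, 4, 5], [0, 1, 2, 1, 0]
expanded = [xi for (xi, wi) in zip(x, w) for _ in 1:wi]    # == [2, 3, 3, 4]

# Under the replication interpretation, the weighted quantiles should match these:
quantile(expanded, [0, .25, .5, .75, 1])                   # [2.0, 2.75, 3.0, 3.25, 4.0]
```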
I don't know of a good reference, but I've found this PR against Numpy which claims to be equivalent to repeating values: numpy/numpy#9211. Since Numpy is BSD, you can take inspiration from the code without licensing issues. There are also various implementations at https://stackoverflow.com/questions/21844024/weighted-percentile-using-numpy and https://stackoverflow.com/questions/11585564/definitive-way-to-match-stata-weighted-xtile-command-using-python (the latter claiming to reproduce R's results). You could also look at what Hmisc does (from a quick glance it looks quite short) and describe the algorithm here so that somebody else can implement it without making StatsBase a derivative work of Hmisc.
It makes sense to change the current definition so that the output does not depend on values with zero weights. That being said, it is not possible to have a quantile definition that corresponds to frequency weights and that does not depend on the normalization of the weights (w.sum). For instance, intuitively, we want

```julia
quantile([1, 2], fweights([2, 2]), [0.25]) = 1
# but
quantile([1, 2], fweights([1, 1]), [0.25]) > 1
```

In particular, we cannot use the frequency definition to handle non-integer weights:

```julia
quantile([1, 2], fweights([0.1, 0.1]), [0.25])
```

So I guess we need two definitions of weighted quantile, one for
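To spell out the normalization point, under the R-7 rule used for unweighted quantiles the p-quantile sits at position h = (n - 1)p + 1 in the sorted sample, so doubling all frequency weights doubles the replicated sample size and moves the quantile. A hedged check via explicit replication (Statistics stdlib only, not a proposed implementation):

```julia
using Statistics: quantile

# fweights([1, 1]) corresponds to the replicated sample [1, 2]:
#   h = (2 - 1) * 0.25 + 1 = 1.25
quantile([1, 2], 0.25)        # 1.25, which is > 1

# fweights([2, 2]) corresponds to the replicated sample [1, 1, 2, 2]:
#   h = (4 - 1) * 0.25 + 1 = 1.75, still between the two 1s
quantile([1, 1, 2, 2], 0.25)  # 1.0
```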
Right. Though currently the problem seems to be more general: the weighted quantile is not equivalent to the unweighted quantile with replicated values. Is that expected?
That would certainly be possible (that's one of the advantages of having multiple types of weights), but I don't understand why it would be needed. Shouldn't we just take the values of the weights as they are in all cases? What kind of normalization would you want to apply? I'd have thought we should use a formula which is a generalization of what we expect for the special case of integer weights. BTW, how confident are you that the algorithm you implemented is good? Do you have more references? I guess you have a much better understanding of the possible implementations than I do.
I don't think that the notion of quantile with frequency weights can be extended to non-integer weights. The observation that the results with weights = 1 should differ from the results with weights = 2 says two things.
When you say "cannot be extended to non-integer weights", do you mean that there are several ways of extending it (with no clear reason to choose one solution over another), or that it's really not possible? The fact that "the frequency weight paradigm cannot give an answer on what to do" doesn't mean we can't choose a particular behavior.
I mean that there is no clear way to do it. Moreover, any method that does it would make the result depend on the sum of weights. So it would be a bad idea, because the result of an operation with non-frequency weights should not depend on the sum of weights.
We could certainly restrict the method to
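If the restriction being alluded to is to integer frequency weights, one hedged way to express it is dispatch on the weight type plus an integrality check. The name and signature below are illustrative only, not StatsBase's actual API:

```julia
using Statistics: quantile
using StatsBase: FrequencyWeights

# Illustrative sketch: only accept frequency weights with integer counts and fall back
# to the replication definition; other weight types would need a separate definition.
function replication_quantile(x::AbstractVector, w::FrequencyWeights, p)
    all(isinteger, w) || throw(ArgumentError("replication definition needs integer counts"))
    expanded = [xi for (xi, wi) in zip(x, w) for _ in 1:Int(wi)]
    return quantile(expanded, p)
end
```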
This has been fixed by #316.

```julia
julia> quantile([1,2,3,4,5], fweights([0,1,2,1,0]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0

julia> quantile([2,3,3,4])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0
```
I would expect these two to be the same, but maybe I'm missing something?