Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the definition of wmedian to wquantile(., 0.5) #436

Merged
merged 25 commits into from
Feb 18, 2019
Merged

Change the definition of wmedian to wquantile(., 0.5) #436

merged 25 commits into from
Feb 18, 2019

Conversation

matthieugomez
Copy link
Contributor

@matthieugomez matthieugomez commented Dec 9, 2018

and returns an error for non integer FrequencyWeights.
See #435
I wish there was a NEWS section to document the change, because wmedian now gives a different result.

@codecov
Copy link

codecov bot commented Dec 10, 2018

Codecov Report

Merging #436 into master will decrease coverage by 0.11%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #436      +/-   ##
==========================================
- Coverage   85.09%   84.97%   -0.12%     
==========================================
  Files          18       18              
  Lines        1992     1963      -29     
==========================================
- Hits         1695     1668      -27     
+ Misses        297      295       -2
Impacted Files Coverage Δ
src/weights.jl 78.94% <100%> (-1.88%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2913b7a...c66f39d. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 10, 2018

Codecov Report

Merging #436 into master will decrease coverage by 0.11%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #436      +/-   ##
==========================================
- Coverage   85.06%   84.95%   -0.12%     
==========================================
  Files          21       21              
  Lines        2116     2080      -36     
==========================================
- Hits         1800     1767      -33     
+ Misses        316      313       -3
Impacted Files Coverage Δ
src/deprecates.jl 0% <ø> (ø) ⬆️
src/weights.jl 78.19% <100%> (-2.17%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ce5225...f09e1f8. Read the comment docs.

@nalimilan
Copy link
Member

We should revive this PR. Sorry for the delay.

For future reference, can you explain why median differs from quantile currently? What definition does it implement? @andreasnoack thinks there aren't so many different definitions of the weighted median as there are for quantiles, but I'm not sure that's the case.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Feb 1, 2019

median currently is:

  • take the maximum value such that sum of weights of values below the value is lower or equal than 1/2
  • take the minimum value such that sum of weights of values above is lower or equal than 1/2.
  • If these are the same values, you've found the median. Otherwise, take the average.

It corresponds to a nice extension of the median definition from unweighted to weighted vectors. The issue is this generalization does not help when generalizing other quantiles, so I think is better to use a generalization that is defined for any quantile.

The other issue with the current implementation is that, with frequency weights, it does not give the same thing as the unweighted median of a repeated variable. Both issues are solved by this PR but they are conceptually different.

src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
@deprecate wquantile(v::RealVector, w::RealVector, p::RealVector) quantile(v, weights(w), p)
@deprecate wquantile(v::RealVector, w::RealVector, p::Number) quantile(v, weights(w), [p])[1]
@deprecate wmedian(v::RealVector, w::AbstractWeights{<:Real}) median(v, w)
@deprecate wmedian(v::RealVector, w::RealVector) median(v, weights(w))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me realize that we probably shouldn't accept Weights objects in quantile and median since their meaning is ambiguous: they could be frequency weights, or another type of weights. So here we'd better recommend using pweightsor fweights (whichever is closer to the current behavior of wmedian).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it ok to assume that default weights are probability weights, as it is now? Honestly, most of the time these differences don't matter, and I don't want users starting to feel overwhelmed by the exact weight type they should use.

In Stata:

Each command has its own idea of the "natural" kind of weight. The command will tell you what kind of weight it is assuming and perform the request as if you specified that kind of weight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the time, but clearly not in this function! :-)

Contrary to Stata, our commands don't "tell" the user what "idea of the 'natural' kind of weight" they have. So it would really not help users to make silent assumptions behind their back. Anyway if people have weights they would better declare them using the right type as early as possible so that they get the right answer automatically.

It's too bad that contrary to Stata our definition of quantiles isn't independent from the type of weights, but at this point we'll just have to bite the bullet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like that too. I don't think it's too late.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like commands that talk to the user. Either a command succeeds and it should just do what is requested, or it doesn't and it should indicate a way to make it succeed. Printing warnings during normal operation is just annoying, we'd better explain how to do things properly. Adding f, p or a in front of weights isn't that costly, and anyway if users doesn't know what letter to use they won't understand the warning (and likely do something incorrect).

Also I don't think is common for Julia functions to print messages like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking about the second point ;)

src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
test/weights.jl Outdated Show resolved Hide resolved
test/weights.jl Outdated Show resolved Hide resolved
test/weights.jl Show resolved Hide resolved
@matthieugomez
Copy link
Contributor Author

Incorporated your suggestion. Sorry I answered to your remarks as new comments — you can see them in "Files Changed".

src/weights.jl Outdated Show resolved Hide resolved
@deprecate wquantile(v::RealVector, w::RealVector, p::RealVector) quantile(v, weights(w), p)
@deprecate wquantile(v::RealVector, w::RealVector, p::Number) quantile(v, weights(w), [p])[1]
@deprecate wmedian(v::RealVector, w::AbstractWeights{<:Real}) median(v, w)
@deprecate wmedian(v::RealVector, w::RealVector) median(v, weights(w))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the time, but clearly not in this function! :-)

Contrary to Stata, our commands don't "tell" the user what "idea of the 'natural' kind of weight" they have. So it would really not help users to make silent assumptions behind their back. Anyway if people have weights they would better declare them using the right type as early as possible so that they get the right answer automatically.

It's too bad that contrary to Stata our definition of quantiles isn't independent from the type of weights, but at this point we'll just have to bite the bullet.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Feb 5, 2019

Honestly the difference does not matter for 99.99% of vectors. For me it'd be a bit like having to specify the type of quantile (1, 2... 7) etc every time I use the quantile function: while it would be more explicit, it would be much more cumbersome.
In any case, if you still want to do it, maybe it'd be better to do it in a separate pull request?

@nalimilan
Copy link
Member

Honestly the difference does not matter for 99.99% of vectors. For me it'd be a bit like having to specify the type of quantile (1, 2... 7) etc every time I use the quantile function: while it would be more explicit, it would be much more cumbersome.

That's not really comparable IMHO, as these quantile types all aim to estimate the same quantity (and they converge with large sample sizes). OTC if you have frequency weights and you pass them without noticing they are treated as analytical weights, the interpretation is clearly incorrect. What's the point in spending so much work in finding the right definitions if in the end we don't think it matters? Also, people could be confused if they pass weights(w) for frequency weights and don't get the same result as replicating observations with the unweighted quantile.

In any case, if you still want to do it, maybe it'd be better to do it in a separate pull request?

Adding an error can wait for another PR, but the deprecation should suggest using pweights immediately, so that people don't use something which will throw an error soon.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@nalimilan nalimilan merged commit 814219a into JuliaStats:master Feb 18, 2019
@matthieugomez
Copy link
Contributor Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants