-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix problems in group rank when both nans and infinity are present #20561 #20681
Conversation
Codecov Report
@@ Coverage Diff @@
## master #20681 +/- ##
=========================================
Coverage ? 91.84%
=========================================
Files ? 153
Lines ? 49295
Branches ? 0
=========================================
Hits ? 45274
Misses ? 4021
Partials ? 0
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. can you add a whatsnew note in groupby section
pandas/_libs/groupby_helper.pxi.in
Outdated
* keep: leave NA values where they are | ||
* top: smallest rank if ascending | ||
* bottom: smallest rank if descending | ||
ties_method : {'average', 'min', 'max', 'first', 'dense'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is average the derault?
pandas/_libs/groupby_helper.pxi.in
Outdated
ascending : boolean | ||
False for ranks by high (1) to low (N) | ||
pct : boolean | ||
Compute percentage rank of data within each group | ||
na_option : {'keep', 'top', 'bottom'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is na_option the default? (mark if it is)
pandas/tests/groupby/test_groupby.py
Outdated
]) | ||
def test_infs_n_nans(self, grps, vals, ties_method, ascending, na_option, | ||
exp): | ||
key = np.repeat(grps, len(vals)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number
@@ -1965,6 +1965,55 @@ def test_rank_args(self, grps, vals, ties_method, ascending, pct, exp): | |||
exp_df = DataFrame(exp * len(grps), columns=['val']) | |||
assert_frame_equal(result, exp_df) | |||
|
|||
@pytest.mark.parametrize("grps", [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also happy to have in another PR, move all of the rank tests to test_functional (you can do it here as well). We may want to move other things too, so maybe new PR. test_groupby is getting large.
pls rebase as well. |
cc @WillAyd if you could review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice change
pandas/tests/groupby/test_groupby.py
Outdated
('average', True, 'keep', [1.5, 1.5, np.nan, 3, np.nan, 4.5, 4.5]), | ||
('average', True, 'top', [3.5, 3.5, 1.5, 5., 1.5, 6.5, 6.5]), | ||
('average', True, 'bottom', [1.5, 1.5, 6.5, 3., 6.5, 4.5, 4.5]), | ||
('average', False, 'keep', [1.5, 1.5, np.nan, 3, np.nan, 4.5, 4.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistically I think it would be better to write this (and other similar lists) in reverse order rather than using the step size in the indexer to do that. It's one less operation and more importantly I find easier to read when comparing to vals above
pandas/_libs/groupby_helper.pxi.in
Outdated
* top: smallest rank if ascending | ||
* bottom: smallest rank if descending | ||
ties_method : {'average', 'min', 'max', 'first', 'dense'} | ||
* average: average rank of group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps clearer to say average rank of tied values
- saying "group" can be confused with the larger Group object. Similar change needed for below points
* min: lowest rank in group | ||
* max: highest rank in group | ||
* first: ranks assigned in order they appear in the array | ||
* dense: like 'min', but rank always increases by 1 between groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tough to describe in one line so I'm not sure of the best way but I think it can be improved by simply changing "groups" to "values"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/pandas/core/groupby/groupby.py
Lines 1848 to 1857 in 5edc5c4
method : {'average', 'min', 'max', 'first', 'dense'}, efault 'average' | |
* average: average rank of group | |
* min: lowest rank in group | |
* max: highest rank in group | |
* first: ranks assigned in order they appear in the array | |
* dense: like 'min', but rank always increases by 1 between groups | |
method : {'keep', 'top', 'bottom'}, default 'keep' | |
* keep: leave NA values where they are | |
* top: smallest rank if ascending | |
* bottom: smallest rank if descending |
Yes, I agree. It is hard to describe those methods. So I copied from there to save some time. Btw, there are some typos there too. I've raise another issue #20694. I think we should come up with something consistent for both places.
@pytest.mark.parametrize("vals", [ | ||
[-np.inf, -np.inf, np.nan, 1., np.nan, np.inf, np.inf], | ||
]) | ||
@pytest.mark.parametrize("ties_method,ascending,na_option,exp", [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To err on the side of caution is it possible to test percentage display here as well? The other tests appear to do so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I just found out when pct is true, and ties_method is "dense". The ranks are not calculated as expected. ( with and without inf/nan)
In [61]: df_test = pd.DataFrame({"A":[1,1,2,2],"B":[1,1,1,1]})
In [62]: df_test.groupby("B").rank(method="dense", ascending=True, pct=False, na_option='top')
Out[62]:
A
0 1.0
1 1.0
2 2.0
3 2.0
In [63]: df_test.groupby("B").rank(method="dense", ascending=True, pct=True, na_option='top')
Out[63]:
A
0 0.25
1 0.25
2 0.50
3 0.50
The expected output should be
In [65]: df_test['A'].rank(method="dense", ascending=True, pct=True, na_option='top')
Out[65]:
0 0.5
1 0.5
2 1.0
3 1.0
Name: A, dtype: float64
Maybe another PR ? Or fix it here? It is similar to #15639 @jreback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's broken even without np.inf
then I think another PR is fine - can you open an issue for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to fix conflict with whatsnew merge but otherwise lgtm. Thanks for opening the other issues |
thanks @peterpanmj and @WillAyd nice patches! keep em coming both of you! |
Please include the output of the validation script below between the "```" ticks:
Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):
git diff upstream/master -u -- "*.py" | flake8 --diff