BUG: Fix problems in group rank when both nans and infinity are present #20561 #20681

Merged
merged 4 commits into from
Apr 21, 2018

Conversation

peterpanmj
Contributor

Please include the output of the validation script below between the "```" ticks:


################################################################################
############# Docstring (pandas._libs.groupby.group_rank_float64)  #############
################################################################################

Provides the rank of values within each group

Parameters
----------
out : array of float64_t values which this method will write its results to
values : array of float64_t values to be ranked
labels : array containing unique label for each group, with its ordering
    matching up to the corresponding record in `values`
is_datetimelike : bool
    unused in this method but provided for call compatibility with other
    Cython transformations
ties_method :  {'keep', 'top', 'bottom'}
    * keep: leave NA values where they are
    * top: smallest rank if ascending
    * bottom: smallest rank if descending
ascending : boolean
    False for ranks by high (1) to low (N)
pct : boolean
    Compute percentage rank of data within each group

Notes
-----
This method modifies the `out` parameter rather than returning an object

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
        Docstring text (summary) should start in the line immediately after the opening quotes (not in the same line, or leaving a blank line in between)
        Summary does not end with dot
        No extended summary found
        Errors in parameters section
                Unknown parameters {'values', 'pct', 'labels', 'out', 'ties_method', 'ascending', 'is_datetimelike'}
                Parameter "out" has no description
                Parameter "values" has no description
                Parameter "labels" description should start with capital letter
                Parameter "labels" description should finish with "."
                Parameter "is_datetimelike" description should start with capital letter
                Parameter "is_datetimelike" description should finish with "."
                Parameter "ties_method" description should start with capital letter
                Parameter "ties_method" description should finish with "."
                Parameter "ascending" description should finish with "."
                Parameter "pct" description should finish with "."
        No returns section found
        See Also section not found
        No examples section found

Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):

@codecov

codecov bot commented Apr 14, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@7e75e4a). Click here to learn what that means.
The diff coverage is n/a.


@@            Coverage Diff            @@
##             master   #20681   +/-   ##
=========================================
  Coverage          ?   91.84%           
=========================================
  Files             ?      153           
  Lines             ?    49295           
  Branches          ?        0           
=========================================
  Hits              ?    45274           
  Misses            ?     4021           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.23% <ø> (?)
#single 41.89% <ø> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e75e4a...6eb1d8f. Read the comment docs.

Contributor

@jreback jreback left a comment

lgtm. can you add a whatsnew note in groupby section

* keep: leave NA values where they are
* top: smallest rank if ascending
* bottom: smallest rank if descending
ties_method : {'average', 'min', 'max', 'first', 'dense'}
Contributor

Is average the default?

ascending : boolean
False for ranks by high (1) to low (N)
pct : boolean
Compute percentage rank of data within each group
na_option : {'keep', 'top', 'bottom'}
Contributor

is na_option the default? (mark if it is)

    ])
    def test_infs_n_nans(self, grps, vals, ties_method, ascending, na_option,
                         exp):
        key = np.repeat(grps, len(vals))
Contributor

can you add the issue number

@@ -1965,6 +1965,55 @@ def test_rank_args(self, grps, vals, ties_method, ascending, pct, exp):
        exp_df = DataFrame(exp * len(grps), columns=['val'])
        assert_frame_equal(result, exp_df)

    @pytest.mark.parametrize("grps", [
Contributor

Also, I'm happy to have this in another PR: move all of the rank tests to test_functional (you can do it here as well). We may want to move other things too, so maybe a new PR. test_groupby is getting large.

@jreback jreback added the Bug, Groupby, and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Apr 14, 2018
@jreback
Contributor

jreback commented Apr 14, 2018

pls rebase as well.

@jreback
Contributor

jreback commented Apr 14, 2018

cc @WillAyd if you could review

Member

@WillAyd WillAyd left a comment

Nice change

('average', True, 'keep', [1.5, 1.5, np.nan, 3, np.nan, 4.5, 4.5]),
('average', True, 'top', [3.5, 3.5, 1.5, 5., 1.5, 6.5, 6.5]),
('average', True, 'bottom', [1.5, 1.5, 6.5, 3., 6.5, 4.5, 4.5]),
('average', False, 'keep', [1.5, 1.5, np.nan, 3, np.nan, 4.5, 4.5
Member

@WillAyd WillAyd Apr 14, 2018

Stylistically, I think it would be better to write this (and other similar lists) in reverse order rather than using the step size in the indexer to do that. It's one less operation and, more importantly, I find it easier to read when comparing to vals above.
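
For readers checking the expected lists in the parametrization above by hand, here is a minimal pure-Python sketch of average ranking with the three na_option values. It is not the Cython code changed in this PR; the helper name `average_rank` and its structure are assumptions made for illustration only. It treats -inf and inf as ordinary sortable floats and handles NaN separately, which is the behavior the test rows describe.

```python
import math


def average_rank(values, ascending=True, na_option="keep"):
    """Hypothetical helper: average-method ranks for a flat list of floats."""
    nan_idx = [i for i, v in enumerate(values) if math.isnan(v)]
    rest = [i for i, v in enumerate(values) if not math.isnan(v)]
    # Infinities sort like any other float; only NaN needs special handling.
    rest.sort(key=lambda i: values[i], reverse=not ascending)

    # Collect runs of equal values (ties) in sorted order.
    groups = []
    for i in rest:
        if groups and values[groups[-1][0]] == values[i]:
            groups[-1].append(i)
        else:
            groups.append([i])

    # NaNs form one tie group placed first ('top') or last ('bottom').
    if nan_idx and na_option == "top":
        groups.insert(0, nan_idx)
    elif nan_idx and na_option == "bottom":
        groups.append(nan_idx)

    ranks = [math.nan] * len(values)  # 'keep' leaves NaN where it is
    position = 0  # number of values already ranked
    for group in groups:
        # The group occupies ranks position+1 .. position+len(group).
        avg = position + (len(group) + 1) / 2
        for i in group:
            ranks[i] = avg
        position += len(group)
    return ranks


vals = [-math.inf, -math.inf, math.nan, 1.0, math.nan, math.inf, math.inf]
print(average_rank(vals, na_option="keep"))    # [1.5, 1.5, nan, 3.0, nan, 4.5, 4.5]
print(average_rank(vals, na_option="top"))     # [3.5, 3.5, 1.5, 5.0, 1.5, 6.5, 6.5]
print(average_rank(vals, na_option="bottom"))  # [1.5, 1.5, 6.5, 3.0, 6.5, 4.5, 4.5]
```

The three printed lists match the ascending 'average' rows quoted above.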

* top: smallest rank if ascending
* bottom: smallest rank if descending
ties_method : {'average', 'min', 'max', 'first', 'dense'}
* average: average rank of group
Member

Perhaps clearer to say average rank of tied values - saying "group" can be confused with the larger Group object. Similar change needed for below points

* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
Member

This is tough to describe in one line so I'm not sure of the best way but I think it can be improved by simply changing "groups" to "values"

Contributor Author

@peterpanmj peterpanmj Apr 15, 2018

method : {'average', 'min', 'max', 'first', 'dense'}, efault 'average'
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
method : {'keep', 'top', 'bottom'}, default 'keep'
* keep: leave NA values where they are
* top: smallest rank if ascending
* bottom: smallest rank if descending

Yes, I agree. It is hard to describe those methods, so I copied the descriptions from there to save some time. By the way, there are some typos there too. I've raised another issue, #20694. I think we should come up with something consistent for both places.
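
Since the one-line descriptions are hard to word, a small worked example may help show what each method does to tied values. The sketch below is plain Python written for illustration, with NaN handling left out; `rank_ties` is a hypothetical name and this is not how pandas implements ranking.

```python
def rank_ties(values, method="average"):
    """Hypothetical helper: rank a flat list of non-NaN numbers."""
    # Stable sort, so equal values keep their original relative order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0  # sorted position where the current run of tied values begins
    dense = 0  # number of distinct values seen so far
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        dense += 1
        for pos in range(start, end + 1):
            if method == "average":
                rank = (start + end) / 2 + 1  # mean of the tied positions
            elif method == "min":
                rank = start + 1              # lowest position among the ties
            elif method == "max":
                rank = end + 1                # highest position among the ties
            elif method == "first":
                rank = pos + 1                # order of appearance breaks ties
            elif method == "dense":
                rank = dense                  # +1 per distinct value, no gaps
            else:
                raise ValueError(f"unknown method: {method!r}")
            ranks[order[pos]] = float(rank)
        start = end + 1
    return ranks


values = [10, 20, 10, 30, 20]
print(rank_ties(values, "average"))  # [1.5, 3.5, 1.5, 5.0, 3.5]
print(rank_ties(values, "min"))      # [1.0, 3.0, 1.0, 5.0, 3.0]
print(rank_ties(values, "max"))      # [2.0, 4.0, 2.0, 5.0, 4.0]
print(rank_ties(values, "first"))    # [1.0, 3.0, 2.0, 5.0, 4.0]
print(rank_ties(values, "dense"))    # [1.0, 2.0, 1.0, 3.0, 2.0]
```

Note that 'min' and 'dense' agree on which tied values share a rank but differ in the gap that follows a tie: 'min' jumps from 1 to 3, while 'dense' goes from 1 to 2.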

@pytest.mark.parametrize("vals", [
[-np.inf, -np.inf, np.nan, 1., np.nan, np.inf, np.inf],
])
@pytest.mark.parametrize("ties_method,ascending,na_option,exp", [
Member

To err on the side of caution is it possible to test percentage display here as well? The other tests appear to do so

Contributor Author

@peterpanmj peterpanmj Apr 18, 2018

You are right. I just found out that when pct is True and ties_method is "dense", the ranks are not calculated as expected (with and without inf/nan).

In [61]: df_test = pd.DataFrame({"A":[1,1,2,2],"B":[1,1,1,1]})

In [62]: df_test.groupby("B").rank(method="dense", ascending=True, pct=False, na_option='top')
Out[62]:
     A
0  1.0
1  1.0
2  2.0
3  2.0

In [63]: df_test.groupby("B").rank(method="dense", ascending=True, pct=True, na_option='top')
Out[63]:
      A
0  0.25
1  0.25
2  0.50
3  0.50

The expected output should be

In [65]: df_test['A'].rank(method="dense", ascending=True, pct=True, na_option='top')
Out[65]:
0    0.5
1    0.5
2    1.0
3    1.0
Name: A, dtype: float64

Maybe another PR ? Or fix it here? It is similar to #15639 @jreback
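
The two outputs above differ only in the denominator. The grouped result is consistent with dividing the dense rank by the number of rows in the group, and the Series result with dividing by the number of distinct values. That reading is inferred from the numbers shown here, not from the pandas source, and the sketch below uses a hypothetical helper, `dense_pct_rank`, to reproduce both.

```python
def dense_pct_rank(values, divide_by_distinct=True):
    """Hypothetical helper: dense rank of non-NaN values as a fraction."""
    distinct = sorted(set(values))
    dense = {v: i + 1 for i, v in enumerate(distinct)}  # value -> dense rank
    # Assumption: the two results differ only in which count is the divisor.
    denom = len(distinct) if divide_by_distinct else len(values)
    return [dense[v] / denom for v in values]


a = [1, 1, 2, 2]
print(dense_pct_rank(a, divide_by_distinct=False))  # [0.25, 0.25, 0.5, 0.5]
print(dense_pct_rank(a, divide_by_distinct=True))   # [0.5, 0.5, 1.0, 1.0]
```

With the distinct-count denominator the largest value always gets 1.0, which matches the expected output quoted above.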

Member

If it's broken even without np.inf then I think another PR is fine - can you open an issue for it?

Contributor Author

@WillAyd
Member

WillAyd commented Apr 18, 2018

Need to fix conflict with whatsnew merge but otherwise lgtm. Thanks for opening the other issues

@jreback jreback added this to the 0.23.0 milestone Apr 21, 2018
@jreback jreback merged commit 0d199e4 into pandas-dev:master Apr 21, 2018
@jreback
Contributor

jreback commented Apr 21, 2018

thanks @peterpanmj and @WillAyd nice patches!

keep em coming both of you!

@peterpanmj peterpanmj deleted the group_rank branch May 9, 2018 10:22

Successfully merging this pull request may close these issues.

GroupBy Rank Operations With Infinity Incorrect
3 participants