BUG: Fix problems in group rank when both nans and infinity are present #20561 #20681

peterpanmj · 2018-04-13T09:35:40Z

Please include the output of the validation script below between the "```" ticks:


################################################################################
############# Docstring (pandas._libs.groupby.group_rank_float64)  #############
################################################################################

Provides the rank of values within each group

Parameters
----------
out : array of float64_t values which this method will write its results to
values : array of float64_t values to be ranked
labels : array containing unique label for each group, with its ordering
    matching up to the corresponding record in `values`
is_datetimelike : bool
    unused in this method but provided for call compatibility with other
    Cython transformations
ties_method :  {'keep', 'top', 'bottom'}
    * keep: leave NA values where they are
    * top: smallest rank if ascending
    * bottom: smallest rank if descending
ascending : boolean
    False for ranks by high (1) to low (N)
pct : boolean
    Compute percentage rank of data within each group

Notes
-----
This method modifies the `out` parameter rather than returning an object

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
        Docstring text (summary) should start in the line immediately after the opening quotes (not in the same line, or leaving a blank line in between)
        Summary does not end with dot
        No extended summary found
        Errors in parameters section
                Unknown parameters {'values', 'pct', 'labels', 'out', 'ties_method', 'ascending', 'is_datetimelike'}
                Parameter "out" has no description
                Parameter "values" has no description
                Parameter "labels" description should start with capital letter
                Parameter "labels" description should finish with "."
                Parameter "is_datetimelike" description should start with capital letter
                Parameter "is_datetimelike" description should finish with "."
                Parameter "ties_method" description should start with capital letter
                Parameter "ties_method" description should finish with "."
                Parameter "ascending" description should finish with "."
                Parameter "pct" description should finish with "."
        No returns section found
        See Also section not found
        No examples section found

Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):

closes GroupBy Rank Operations With Infinity Incorrect #20561
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

codecov · 2018-04-14T04:59:27Z

Codecov Report

❗ No coverage uploaded for pull request base (master@7e75e4a).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #20681   +/-   ##
=========================================
  Coverage          ?   91.84%           
=========================================
  Files             ?      153           
  Lines             ?    49295           
  Branches          ?        0           
=========================================
  Hits              ?    45274           
  Misses            ?     4021           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.23% <ø> (?)`
#single	`41.89% <ø> (?)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e75e4a...6eb1d8f.

jreback

lgtm. can you add a whatsnew note in groupby section

jreback · 2018-04-14T12:41:49Z

pandas/_libs/groupby_helper.pxi.in

-        * keep: leave NA values where they are
-        * top: smallest rank if ascending
-        * bottom: smallest rank if descending
+    ties_method : {'average', 'min', 'max', 'first', 'dense'}


is average the default?

jreback · 2018-04-14T12:42:04Z

pandas/_libs/groupby_helper.pxi.in

    ascending : boolean
        False for ranks by high (1) to low (N)
    pct : boolean
        Compute percentage rank of data within each group
+    na_option : {'keep', 'top', 'bottom'}


is na_option the default? (mark if it is)

jreback · 2018-04-14T12:42:26Z

pandas/tests/groupby/test_groupby.py

+    ])
+    def test_infs_n_nans(self, grps, vals, ties_method, ascending, na_option,
+                         exp):
+        key = np.repeat(grps, len(vals))


can you add the issue number

jreback · 2018-04-14T12:44:09Z

pandas/tests/groupby/test_groupby.py

@@ -1965,6 +1965,55 @@ def test_rank_args(self, grps, vals, ties_method, ascending, pct, exp):
        exp_df = DataFrame(exp * len(grps), columns=['val'])
        assert_frame_equal(result, exp_df)

+    @pytest.mark.parametrize("grps", [


also happy to have in another PR, move all of the rank tests to test_functional (you can do it here as well). We may want to move other things too, so maybe new PR. test_groupby is getting large.

jreback · 2018-04-14T12:45:11Z

pls rebase as well.

jreback · 2018-04-14T12:45:33Z

cc @WillAyd if you could review

WillAyd

Nice change

WillAyd · 2018-04-14T15:20:40Z

pandas/tests/groupby/test_groupby.py

+        ('average', True, 'keep', [1.5, 1.5, np.nan, 3, np.nan, 4.5, 4.5]),
+        ('average', True, 'top', [3.5, 3.5, 1.5, 5., 1.5, 6.5, 6.5]),
+        ('average', True, 'bottom', [1.5, 1.5, 6.5, 3., 6.5, 4.5, 4.5]),
+        ('average', False, 'keep', [1.5, 1.5, np.nan, 3, np.nan, 4.5, 4.5


Stylistically I think it would be better to write this (and other similar lists) in reverse order rather than using the step size in the indexer to do that. It's one less operation and, more importantly, I find it easier to read when comparing to vals above.
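For reference, the expected values in the `average`, ascending rows above can be reproduced with a small plain-Python sketch. It is not the Cython code in `groupby_helper.pxi.in`; `rank_average` is a hypothetical helper written only to show `-inf` and `inf` sorting as ordinary values while NaN is placed according to `na_option`. The input is the `vals` list from this test's parametrization.

```python
import math


def rank_average(values, na_option="keep"):
    """Ascending, average-method rank of a flat list (illustrative only)."""
    is_nan = [isinstance(v, float) and math.isnan(v) for v in values]
    n_nan = sum(is_nan)
    # Sort only the non-NaN positions; -inf and inf compare like any float.
    order = sorted((i for i in range(len(values)) if not is_nan[i]),
                   key=lambda i: values[i])
    # With 'top', NaNs take ranks 1..n_nan, so real values start after them.
    offset = n_nan if na_option == "top" else 0
    ranks = [math.nan] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        # Sorted positions pos..end are tied: average their 1-based ranks.
        avg = offset + (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    if na_option != "keep" and n_nan:
        # NaNs tie with each other in a block at the top or the bottom.
        first = 1 if na_option == "top" else len(order) + 1
        nan_rank = (first + first + n_nan - 1) / 2
        ranks = [nan_rank if is_nan[i] else r for i, r in enumerate(ranks)]
    return ranks


vals = [-math.inf, -math.inf, math.nan, 1.0, math.nan, math.inf, math.inf]
keep = rank_average(vals, "keep")      # NaN stays NaN
top = rank_average(vals, "top")        # NaNs share the smallest ranks
bottom = rank_average(vals, "bottom")  # NaNs share the largest ranks
```

In this sketch NaN is never converted to a sentinel such as infinity, so real infinite values and missing values cannot be mistaken for one another.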

WillAyd · 2018-04-14T15:24:59Z

pandas/_libs/groupby_helper.pxi.in

-        * top: smallest rank if ascending
-        * bottom: smallest rank if descending
+    ties_method : {'average', 'min', 'max', 'first', 'dense'}
+        * average: average rank of group


Perhaps clearer to say average rank of tied values - saying "group" can be confused with the larger Group object. Similar change needed for below points

WillAyd · 2018-04-14T15:33:05Z

pandas/_libs/groupby_helper.pxi.in

+        * min: lowest rank in group
+        * max: highest rank in group
+        * first: ranks assigned in order they appear in the array
+        * dense: like 'min', but rank always increases by 1 between groups


This is tough to describe in one line so I'm not sure of the best way but I think it can be improved by simply changing "groups" to "values"

pandas/pandas/core/groupby/groupby.py

Lines 1848 to 1857 in 5edc5c4

    method : {'average', 'min', 'max', 'first', 'dense'}, efault 'average'
        * average: average rank of group
        * min: lowest rank in group
        * max: highest rank in group
        * first: ranks assigned in order they appear in the array
        * dense: like 'min', but rank always increases by 1 between groups
    method : {'keep', 'top', 'bottom'}, default 'keep'
        * keep: leave NA values where they are
        * top: smallest rank if ascending
        * bottom: smallest rank if descending

Yes, I agree. It is hard to describe those methods, so I copied from there to save some time. Btw, there are some typos there too. I've raised another issue, #20694. I think we should come up with something consistent for both places.
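As a concrete illustration of the five options, here is a plain-Python sketch that applies each method to a list containing ties. The helper `rank_ties` is hypothetical and is not how pandas implements ranking; it only shows what each method assigns to a run of tied values.

```python
def rank_ties(values, method="average"):
    """Ascending rank of a list without missing values (illustrative only)."""
    # sorted() is stable, so tied values keep their order of appearance.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    dense = 0  # counts distinct values seen so far
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        dense += 1
        for k in range(pos, end + 1):
            if method == "average":
                r = (pos + 1 + end + 1) / 2  # mean rank of the tied values
            elif method == "min":
                r = pos + 1                  # lowest rank among the ties
            elif method == "max":
                r = end + 1                  # highest rank among the ties
            elif method == "first":
                r = k + 1                    # order of appearance
            elif method == "dense":
                r = dense                    # +1 per distinct value
            else:
                raise ValueError(f"unknown method: {method!r}")
            ranks[order[k]] = float(r)
        pos = end + 1
    return ranks


data = [10, 10, 20, 30, 30]
results = {m: rank_ties(data, m)
           for m in ("average", "min", "max", "first", "dense")}
```

Only `dense` produces ranks with no gaps after a tie, which is the sense in which its rank "always increases by 1" between distinct values.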

WillAyd · 2018-04-14T15:35:08Z

pandas/tests/groupby/test_groupby.py

+    @pytest.mark.parametrize("vals", [
+        [-np.inf, -np.inf, np.nan, 1., np.nan, np.inf, np.inf],
+    ])
+    @pytest.mark.parametrize("ties_method,ascending,na_option,exp", [


To err on the side of caution is it possible to test percentage display here as well? The other tests appear to do so

You are right. I just found out that when pct is True and ties_method is "dense", the ranks are not calculated as expected (with and without inf/nan).

    In [61]: df_test = pd.DataFrame({"A":[1,1,2,2],"B":[1,1,1,1]})

    In [62]: df_test.groupby("B").rank(method="dense", ascending=True, pct=False, na_option='top')
    Out[62]:
         A
    0  1.0
    1  1.0
    2  2.0
    3  2.0

    In [63]: df_test.groupby("B").rank(method="dense", ascending=True, pct=True, na_option='top')
    Out[63]:
          A
    0  0.25
    1  0.25
    2  0.50
    3  0.50

The expected output should be

    In [65]: df_test['A'].rank(method="dense", ascending=True, pct=True, na_option='top')
    Out[65]:
    0    0.5
    1    0.5
    2    1.0
    3    1.0
    Name: A, dtype: float64

Maybe another PR? Or fix it here? It is similar to #15639 @jreback

If it's broken even without np.inf then I think another PR is fine - can you open an issue for it?
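Judging only from the numbers reported above, the two outputs differ in the denominator: 0.25 and 0.50 are the dense ranks 1 and 2 divided by the group size 4, while 0.5 and 1.0 divide by the number of distinct values, 2. A minimal sketch of that arithmetic, using a hypothetical `dense_rank` helper rather than pandas itself:

```python
def dense_rank(values):
    """Dense ascending rank: each distinct value gets the next integer."""
    distinct = sorted(set(values))
    position = {v: i + 1 for i, v in enumerate(distinct)}
    return [position[v] for v in values]


a = [1, 1, 2, 2]
ranks = dense_rank(a)

# Dividing by the group size reproduces the groupby output reported above.
by_group_size = [r / len(a) for r in ranks]

# Dividing by the number of distinct values (the largest dense rank)
# reproduces the Series.rank output given as the expected result.
by_distinct = [r / max(ranks) for r in ranks]
```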

WillAyd · 2018-04-18T16:48:29Z

Need to fix conflict with whatsnew merge but otherwise lgtm. Thanks for opening the other issues

jreback · 2018-04-21T18:24:20Z

thanks @peterpanmj and @WillAyd nice patches!

keep em coming both of you!

jreback requested changes Apr 14, 2018

jreback added the Bug, Groupby, and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Apr 14, 2018

jreback mentioned this pull request Apr 14, 2018

TST: split test_groupby.py #20696

Closed

WillAyd requested changes Apr 14, 2018

BUG: Fix problems in group rank when both nans and infinity are present (pandas-dev#20561)

aa63df3

peterpanmj force-pushed the group_rank branch from acc89db to aa63df3 Compare April 18, 2018 11:10

WillAyd approved these changes Apr 18, 2018

peterpanmj and others added 3 commits April 19, 2018 13:39

Merge branch 'master' of https://github.com/pandas-dev/pandas into group_rank

feaccd4

Merge branch 'master' into PR_TOOL_MERGE_PR_20681

89341d8

doc - move issues together

6eb1d8f

jreback added this to the 0.23.0 milestone Apr 21, 2018

jreback approved these changes Apr 21, 2018

jreback merged commit 0d199e4 into pandas-dev:master Apr 21, 2018

peterpanmj deleted the group_rank branch May 9, 2018 10:22
