Raise ValueError When Attempting to Rank Object Dtypes #19560

WillAyd · 2018-02-07T00:50:37Z

Referencing the comments in #19481, right now rank operations performed against objects have a few issues, namely that they:

Are inherently ambiguous, relying on lexical encoding AND
Are not consistent across Series, DataFrame and GroupBy objects with various arguments

To illustrate the latter:

In [1]: vals = ['apple', 'orange', 'banana']
In [2]: pd.Series(vals).rank()  # this will "work"
Out[2]:
0    1.0
1    3.0
2    2.0
dtype: float64

In [3]: pd.Series(vals).rank(method='first')  # raises
ValueError: first not supported for non-numeric data

In [4]: pd.DataFrame({'key': ['foo'] * 3, 'vals': vals}).groupby('key').rank(method='first')  # should raise?
Out[4]: 
Empty DataFrame
Columns: []
Index: []

(see also #19482)
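The ambiguity of lexical ranking can be sketched without pandas. The snippet below is a minimal pure-Python illustration, not the pandas implementation: `average_rank` is a hypothetical helper that ranks values by Python's default string ordering and averages the ranks of ties. It reproduces the Series output above and shows that changing the case of a single letter reorders the result.

```python
def average_rank(values):
    """Rank values by their natural sort order; ties share the mean rank.

    Hypothetical helper for illustration only, not pandas code.
    """
    # Indices of `values` in ascending order (Python's sort is stable).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of equal values that starts at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Average the 1-based positions pos+1 .. end+1 over the tie group.
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


print(average_rank(['apple', 'orange', 'banana']))  # [1.0, 3.0, 2.0]
# Uppercase letters sort before lowercase ones, so the order changes:
print(average_rank(['apple', 'Orange', 'banana']))  # [2.0, 1.0, 3.0]
```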

With this change I'd propose that we simply raise ValueError consistently for rank against object dtypes regardless of which type of object performs the transformation and regardless of arguments.

One known caveat is that Categorical types currently use the rank_object methods in algos. My assumption is that we would want to continue supporting ranking for ordered Categoricals but raise for unordered Categoricals.


jreback · 2018-02-10T18:01:45Z

this is manifesting in #11759 as well. just need a better error message.

ishaan007 · 2018-03-10T05:01:04Z

@jreback I'd like to work on this. Any pointers on which classes I should start inspecting?

WillAyd · 2018-03-11T17:10:40Z

@ishaan007 you want to be looking at the various rank implementations in pandas.core.generic.NDFrame, pandas.core.algorithms and pandas.core.groupby.GroupBy

mapehe · 2018-04-07T21:52:32Z

In case @ishaan007 isn't still on this I could have a look since this is tagged "good first issue" and would seem like a good way to get to know the codebase a bit.

WillAyd · 2018-04-08T18:30:59Z

@mapehe sure give it a shot

mapehe · 2018-04-09T15:07:03Z

> Are not consistent across Series, DataFrame and GroupBy objects with various arguments

So I've had a look at the above point; please correct me if I've misunderstood something. Series and DataFrame both inherit rank() from pandas.core.generic.NDFrame, which essentially executes rank() in pandas.core.algorithms. This eventually calls a Cython implementation in algos_rank_helper.pxi. GroupBy.rank() goes through a completely distinct set of functions and eventually calls an implementation in groupby_helper.pxi.

Running pd.Series(vals).rank(method='first') would cause an error because of this check:

pandas/pandas/_libs/algos_rank_helper.pxi.in

Lines 166 to 167 in 2431641

    elif tiebreak == TIEBREAK_FIRST:
        raise ValueError('first not supported for non-numeric data')

On the other hand, pd.Series(vals).rank() works because no similar error is raised for method='average'.

In the case of GroupBy, an implementation for object ranking doesn't seem to exist because of line 412?

pandas/pandas/_libs/groupby_helper.pxi.in

Lines 412 to 419 in 2431641

    {{if name != 'object'}}
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
                            ndarray[{{c_type}}, ndim=2] values,
                            ndarray[int64_t] labels,
                            bint is_datetimelike, object ties_method,
                            bint ascending, bint pct, object na_option):

I don't know the codebase well enough to make a good decision about how these issues should be handled. If GroupBy.rank() is not supposed to support non-numeric data, maybe a type check would be appropriate around here?

pandas/pandas/core/groupby/groupby.py

Lines 2453 to 2461 in 2431641

    def _cython_operation(self, kind, values, how, axis, min_count=-1,
                          **kwargs):
        assert kind in ['transform', 'aggregate']
        # can we do this operation with our cython functions
        # if not raise NotImplementedError
        # we raise NotImplemented if this is an invalid operation
        # entirely, e.g. adding datetimes

I guess the inconsistency between pd.Series(vals).rank() and pd.Series(vals).rank(method='first') could be solved by modifying algos_rank_helper.pxi.in, but I have no idea how many things it would break. :)

WillAyd · 2018-04-09T15:47:27Z

Nice research. The best approach would be to simply raise at the start of the method you found in algos for object dtypes, rather than only doing it for the TIEBREAK_FIRST argument.
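To make the difference between the two placements concrete, here is a minimal pure-Python sketch. All names (`rank_current`, `rank_proposed`, `_rank`) and the second error message are hypothetical and only mimic the behaviour discussed in this thread; the real check lives in the Cython template quoted above.

```python
def _rank(values, method):
    """Toy ranking: 'first' breaks ties by position, 'average' by mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)  # on its own this is method='first'
    if method == "average":
        groups = {}
        for index, value in enumerate(values):
            groups.setdefault(value, []).append(index)
        for indices in groups.values():
            mean = sum(ranks[i] for i in indices) / len(indices)
            for i in indices:
                ranks[i] = mean
    return ranks


def _is_numeric(values):
    return all(isinstance(v, (int, float)) for v in values)


def rank_current(values, method="average"):
    """Mimics the behaviour described above: only method='first' is rejected."""
    if not _is_numeric(values) and method == "first":
        raise ValueError("first not supported for non-numeric data")
    return _rank(values, method)


def rank_proposed(values, method="average"):
    """Raises up front for non-numeric data, whatever the method."""
    if not _is_numeric(values):
        # Hypothetical message, chosen for this sketch.
        raise ValueError("rank not supported for non-numeric data")
    return _rank(values, method)


vals = ["apple", "orange", "banana"]
print(rank_current(vals))          # [1.0, 3.0, 2.0]
print(rank_proposed([3, 1, 3]))    # [2.5, 1.0, 2.5]
```

With `rank_proposed`, `vals` raises for every `method`, which removes the inconsistency between the default and `method='first'`.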

mapehe · 2018-04-10T07:58:09Z

Thanks for the reply @WillAyd. So you'd suggest replacing the current rank_1d_object() with a function that just runs raise ValueError('first not supported for non-numeric data')?

WillAyd · 2018-04-10T08:07:50Z

Hmm, well, per the original comment we would still want to support ranking of ordered categoricals, so I don't think we can entirely do away with the object function. The rank method that calls the Cython equivalent is located in pandas.core.algorithms. I'd suggest taking a look at that and seeing where it makes sense to do introspection and either allow the ranking (ordered Categorical) or raise.

mapehe · 2018-04-10T13:51:21Z

One could use something like this in algos.rank to only allow ordered Categoricals to be ranked:

    if is_object_dtype(values):
        def raise_non_numeric_error():
            raise ValueError("pandas.core.algorithms.rank "
                            "not supported for unordered "
                            "non-numeric data")
        if is_categorical_dtype(values):
            if not values.ordered:
                raise_non_numeric_error()
        else:
            raise_non_numeric_error()

However, that would cause e.g. this test to fail. These lines, for example, would directly contradict not allowing "apple" and "orange" to be ranked.

pandas/pandas/tests/frame/test_rank.py

Lines 74 to 77 in 4e6aa1c

    df = DataFrame([['b', 'c', 'a'], ['a', 'c', 'b']])
    expected = DataFrame([[2.0, 3.0, 1.0], [1, 3, 2]])
    result = df.rank(1, numeric_only=False)
    tm.assert_frame_equal(result, expected)

Should the test be modified so that raising a ValueError is expected instead?
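For readers who want to try the dtype check outside the pandas internals, the following is a rough standalone sketch using only public pandas API. `check_rankable` and its error messages are hypothetical and not part of pandas. Because `is_object_dtype` returns False for a categorical dtype, this sketch tests the categorical case first rather than nesting it under the object check. The object Series is built with an explicit `dtype=object` so the example does not depend on how a given pandas version infers string dtypes.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def check_rankable(series):
    """Raise ValueError for object dtype and for unordered categoricals.

    Hypothetical helper for illustration; not part of pandas.
    """
    dtype = series.dtype
    if isinstance(dtype, pd.CategoricalDtype):
        if not dtype.ordered:
            raise ValueError("rank not supported for unordered categorical data")
    elif is_object_dtype(dtype):
        raise ValueError("rank not supported for object dtype")


ordered = pd.Series(pd.Categorical(["lo", "hi"], categories=["lo", "hi"], ordered=True))
unordered = pd.Series(pd.Categorical(["lo", "hi"]))
strings = pd.Series(["apple", "orange"], dtype=object)

check_rankable(pd.Series([1, 2, 3]))  # numeric: passes silently
check_rankable(ordered)               # ordered categorical: passes silently
for rejected in (unordered, strings):
    try:
        check_rankable(rejected)
    except ValueError as err:
        print(err)
```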

WillAyd · 2018-04-12T05:16:44Z

There might be a few tests that require updating as part of this. At this point I’d suggest opening a PR and you can get review of your code from there

mapehe · 2018-04-12T17:23:19Z

Sorry for the spam, I didn't know the commits are added here. I opened a PR related to this issue.

peterpanmj · 2018-07-26T02:06:38Z

@WillAyd @jreback
DataFrame.rank and Series.rank currently support non-categorical object dtype and unordered categorical dtype.

In [42]: df
Out[42]:
   A  B
0  a  b
1  b  c
2  c  c
3  a  d

In [43]: df.dtypes
Out[43]:
A    object
B    object
dtype: object

In [44]: df.rank()
Out[44]:
     A    B
0  1.5  1.0
1  3.0  2.5
2  4.0  2.5
3  1.5  4.0

If we no longer support rank calculation under those conditions, perhaps we should give a deprecation warning first? Personally, I use DataFrame.rank on string data all the time. Removing the support will cause some inconvenience for users like me.

WillAyd · 2018-07-26T02:20:35Z

@peterpanmj I think that is a good idea. We could target the deprecation for 0.24 and officially drop support in a future version.
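A deprecation-first path could look roughly like the sketch below. It uses only the standard library. `rank_with_deprecation`, its `is_object_dtype` flag, and the warning text are hypothetical, and `FutureWarning` is an assumption about which warning category would be chosen. The ranking itself is a toy that breaks ties by position.

```python
import warnings


def rank_with_deprecation(values, is_object_dtype):
    """Hypothetical wrapper: keep ranking object data, but warn about removal."""
    if is_object_dtype:
        warnings.warn(
            "Ranking object-dtype data is deprecated and will raise "
            "ValueError in a future version.",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
    # Toy ranking by natural sort order; ties are broken by position.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)
    return ranks


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(rank_with_deprecation(["b", "a", "c"], is_object_dtype=True))  # [2.0, 1.0, 3.0]
    print(caught[0].category.__name__)  # FutureWarning
```

Existing callers keep their results during the deprecation window and get an explicit signal before the behaviour changes to raising.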

WillAyd mentioned this issue Feb 7, 2018

PERF: Cythonize Groupby Rank #19481

Merged

4 tasks

jreback added the Groupby, Error Reporting (Incorrect or improved errors from pandas), Effort Low, and good first issue labels Feb 9, 2018

jreback added this to the Next Major Release milestone Feb 9, 2018

jreback mentioned this issue Feb 10, 2018

TypeError: rank() got an unexpected keyword argument 'numeric_only' #11759

Closed

mapehe added a commit to mapehe/pandas that referenced this issue Apr 12, 2018

ENH: Consistent errors when attempting to rank NDFrames with unordera…

998f6a2

…ble non-numeric entries. (pandas-dev#19560)

mapehe added a commit to mapehe/pandas that referenced this issue Apr 12, 2018

ENH: Consistent error messages for ranking non-numeric data. (pandas-…

a062003

…dev#19560)

mapehe added a commit to mapehe/pandas that referenced this issue Apr 12, 2018

ENH: Consistent error messages for ranking non-numeric data. (pandas-…

ff18426

…dev#19560)

mapehe added a commit to mapehe/pandas that referenced this issue Apr 12, 2018

ERR: Consistent errors for non-numeric ranking. (pandas-dev#19560)

0990a89

mapehe mentioned this issue Apr 12, 2018

ERR: Consistent errors for non-numeric ranking. (#19560) #20670

Closed

4 tasks

mapehe added a commit to mapehe/pandas that referenced this issue Apr 14, 2018

ERR: Consistent errors for non-numeric ranking. (pandas-dev#19560)

a0f38d7

mapehe added a commit to mapehe/pandas that referenced this issue Apr 14, 2018

ERR: Consistent errors for non-numeric ranking. (pandas-dev#19560)

d330a46

jreback modified the milestones: Next Major Release, 0.23.0 Apr 14, 2018

jreback added this to the Next Major Release milestone Apr 14, 2018

This was referenced Jun 20, 2018

BUG: bug in group by rank string #21554

Closed

Rank by multiple columns #4311

Closed

WillAyd mentioned this issue Oct 23, 2018

DOC: Updated the docstring of Series.rank / DataFrame.rank #23263

Closed

1 task

WillAyd mentioned this issue Oct 10, 2019

REF: use fused types for groupby_helper #28886

Merged

jbrockmendel removed the Effort Low label Oct 21, 2019

jbrockmendel mentioned this issue Oct 22, 2019

REF: avoid getattr pattern in libgroupby rank functions #29166

Merged

TomAugspurger mentioned this issue Dec 20, 2019

rank() throws TypeError: 'NoneType' object is not callable #30364

Closed

mzeitlin11 mentioned this issue Jun 10, 2021

BUG/REF: use sorted_rank_1d for rank_2d #41931

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Jun 17, 2021

mroeschke added the Enhancement label Jun 18, 2021

jreback closed this as completed in #41931 Jun 25, 2021
