Raise ValueError When Attempting to Rank Object Dtypes #19560

Closed
WillAyd opened this issue Feb 7, 2018 · 14 comments · Fixed by #41931
Labels
Enhancement Error Reporting Incorrect or improved errors from pandas good first issue Groupby
Milestone

Comments

@WillAyd
Member

WillAyd commented Feb 7, 2018

Referencing the comments in #19481, right now rank operations performed against objects have a few issues, namely that they:

  • Are inherently ambiguous, relying on lexical encoding AND
  • Are not consistent across Series, DataFrame and GroupBy objects with various arguments

To illustrate the latter:

In [1]: vals = ['apple', 'orange', 'banana']
In [2]: pd.Series(vals).rank()  # this will "work"
Out[2]: 
0    1.0
1    3.0
2    2.0
dtype: float64

In [3]: pd.Series(vals).rank(method='first')  # raises
ValueError: first not supported for non-numeric data

In [4]: pd.DataFrame({'key': ['foo'] * 3, 'vals': vals}).groupby('key').rank(method='first')  # should raise?
Out[4]: 
Empty DataFrame
Columns: []
Index: []

(see also #19482)

With this change I'd propose that we simply raise ValueError consistently for rank against object dtypes regardless of which type of object performs the transformation and regardless of arguments.

One known caveat is that Categorical types currently use the rank_object methods in algos. My assumption is that we would want to continue supporting ranking for ordered Categoricals but raise for unordered Categoricals.
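
To make the ambiguity point concrete, here is a minimal plain-Python sketch. It does not call pandas, so it only models the behaviour. The helper `dense_rank_by_key` is hypothetical and uses dense ranking for brevity, whereas pandas defaults to average ranking. It shows that the rank of strings depends on which comparison is chosen, while an explicit category order (as with an ordered Categorical) leaves no room for interpretation.

```python
def dense_rank_by_key(values, key=None):
    """Rank each value by its position among the sorted distinct values."""
    distinct = sorted(set(values), key=key)
    position = {v: float(i) for i, v in enumerate(distinct, start=1)}
    return [position[v] for v in values]


vals = ["apple", "Orange", "banana"]

# Python compares strings by Unicode code point, so "Orange" sorts first.
by_code_point = dense_rank_by_key(vals)                     # [2.0, 1.0, 3.0]
# A case-insensitive comparison gives a different answer for the same data.
case_insensitive = dense_rank_by_key(vals, key=str.lower)   # [1.0, 3.0, 2.0]

# With an explicit order, the ranking follows the declared categories.
categories = ["low", "medium", "high"]
sizes = ["high", "low", "medium"]
by_category = dense_rank_by_key(sizes, key=categories.index)  # [3.0, 1.0, 2.0]
lexical = dense_rank_by_key(sizes)                            # [1.0, 2.0, 3.0]
```

The last two lines are the case for keeping ordered Categoricals rankable: the declared order and the lexical order disagree, and only the former is meaningful.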

@jreback jreback added Groupby Error Reporting Incorrect or improved errors from pandas Effort Low good first issue labels Feb 9, 2018
@jreback jreback added this to the Next Major Release milestone Feb 9, 2018
@jreback
Contributor

jreback commented Feb 10, 2018

this is manifesting in #11759 as well. just need a better error message.

@ishaan007

@jreback I am looking to work on this. Any pointers on which particular classes I should start inspecting?

@WillAyd
Member Author

WillAyd commented Mar 11, 2018

@ishaan007 you want to be looking at the various rank implementations in pandas.core.generic.NDFrame, pandas.core.algorithms and pandas.core.groupby.GroupBy

@mapehe
Contributor

mapehe commented Apr 7, 2018

In case @ishaan007 isn't still on this I could have a look since this is tagged "good first issue" and would seem like a good way to get to know the codebase a bit.

@WillAyd
Member Author

WillAyd commented Apr 8, 2018

@mapehe sure give it a shot

@mapehe
Contributor

mapehe commented Apr 9, 2018

  • Are not consistent across Series, DataFrame and GroupBy objects with various arguments

So I've had a look at the above point, please correct me if I've misunderstood something. Series and DataFrame both inherit rank() from pandas.core.generic.NDFrame which essentially executes rank() in pandas.core.algorithms. This eventually calls a cython implementation in algos_rank_helper.pxi. The function rank() in GroupBy goes through a completely distinct set of functions and eventually calls an implementation in groupby_helper.pxi.

Running pd.Series(vals).rank(method='first') would cause an error because of this check:

elif tiebreak == TIEBREAK_FIRST:
    raise ValueError('first not supported for non-numeric data')

On the other hand, pd.Series(vals).rank() works because a similar error is not raised for method='average'.
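
The asymmetry between the two calls can be modelled in plain Python. This is only a sketch of the control flow, not the Cython code: the function name `rank_1d_object_model` and the integer values of the `TIEBREAK_*` constants are assumptions made for illustration.

```python
TIEBREAK_AVERAGE = 0  # assumed value, for illustration only
TIEBREAK_FIRST = 1    # assumed value, for illustration only
_TIEBREAKERS = {"average": TIEBREAK_AVERAGE, "first": TIEBREAK_FIRST}


def rank_1d_object_model(values, method="average"):
    """Toy model: only the 'first' tiebreak rejects non-numeric data."""
    tiebreak = _TIEBREAKERS[method]
    # Indices of `values` in sorted order (strings compare lexically).
    order = sorted(range(len(values)), key=values.__getitem__)
    if tiebreak == TIEBREAK_AVERAGE:
        # Collect the 1-based sorted positions of each distinct value,
        # then give every occurrence the mean of those positions.
        positions = {}
        for pos, idx in enumerate(order, start=1):
            positions.setdefault(values[idx], []).append(pos)
        return [sum(positions[v]) / len(positions[v]) for v in values]
    elif tiebreak == TIEBREAK_FIRST:
        raise ValueError("first not supported for non-numeric data")


vals = ["apple", "orange", "banana"]
average_ranks = rank_1d_object_model(vals)  # [1.0, 3.0, 2.0]
```

Calling `rank_1d_object_model(vals, method="first")` raises the ValueError, which mirrors the inconsistency: the same data is accepted or rejected depending only on the tiebreak argument.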

In the case of GroupBy, an implementation for object ranking doesn't seem to exist because of line 412?

{{if name != 'object'}}
@cython.boundscheck(False)
@cython.wraparound(False)
def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
                        ndarray[{{c_type}}, ndim=2] values,
                        ndarray[int64_t] labels,
                        bint is_datetimelike, object ties_method,
                        bint ascending, bint pct, object na_option):

I don't know the codebase well enough to make a good decision about how these issues should be handled. If GroupBy.rank() is not supposed to support non-numeric data, maybe a type check would be appropriate around here?

def _cython_operation(self, kind, values, how, axis, min_count=-1,
                      **kwargs):
    assert kind in ['transform', 'aggregate']
    # can we do this operation with our cython functions
    # if not raise NotImplementedError
    # we raise NotImplemented if this is an invalid operation
    # entirely, e.g. adding datetimes

I guess the inconsistency between pd.Series(vals).rank() and pd.Series(vals).rank(method='first') could be solved by modifying algos_rank_helper.pxi.in, but I have no idea how many things it would break. :)

@WillAyd
Member Author

WillAyd commented Apr 9, 2018

Nice research. Best approach would be to simply raise at the start of the method you found in algos for object dtypes, rather than only doing it for the TIEBREAK_FIRST argument
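
As a rough illustration of that suggestion, the sketch below runs the dtype check before the method is even looked at, so every tiebreak fails the same way for non-numeric input. It is plain Python rather than the pandas code path; `ensure_numeric`, `rank_guarded` and the error message are hypothetical, and missing values are not handled.

```python
from numbers import Real


def ensure_numeric(values):
    """Raise up front if any value is not a real number."""
    for v in values:
        if not isinstance(v, Real):
            raise ValueError("rank not supported for non-numeric data")


def rank_guarded(values, method="average"):
    ensure_numeric(values)  # guard first, independent of `method`
    order = sorted(range(len(values)), key=values.__getitem__)
    # Positional ranks: ties are broken by order of appearance.
    first = [0.0] * len(values)
    for pos, idx in enumerate(order, start=1):
        first[idx] = float(pos)
    if method == "first":
        return first
    if method == "average":
        # Tied values share the mean of their positional ranks.
        groups = {}
        for idx, v in enumerate(values):
            groups.setdefault(v, []).append(first[idx])
        return [sum(groups[v]) / len(groups[v]) for v in values]
    raise ValueError(f"unknown method: {method!r}")


numeric_average = rank_guarded([2, 1, 2])               # [2.5, 1.0, 2.5]
numeric_first = rank_guarded([2, 1, 2], method="first")  # [2.0, 1.0, 3.0]
```

With this shape, `rank_guarded(["apple", "orange"])` and `rank_guarded(["apple", "orange"], method="first")` both raise the same ValueError.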

@mapehe
Contributor

mapehe commented Apr 10, 2018

Thanks for the reply @WillAyd. So you'd suggest replacing the current rank_1d_object() with a function that just runs raise ValueError('first not supported for non-numeric data')?

@WillAyd
Member Author

WillAyd commented Apr 10, 2018

Hmm well per the original comment we would still want to support ranking of ordered categoricals so I don't think we can entirely do away with the object function. The rank method that calls the Cython equivalent is located in pandas.core.algorithms - I'd suggest taking a look at that and seeing where it makes sense to do introspection and either allow the ranking (ordered Categorical) or raise

@mapehe
Contributor

mapehe commented Apr 10, 2018

One could use something like this in algos.rank to only allow ordered Categoricals to be ranked:

    if is_object_dtype(values):
        def raise_non_numeric_error():
            raise ValueError("pandas.core.algorithms.rank "
                            "not supported for unordered "
                            "non-numeric data")
        if is_categorical_dtype(values):
            if not values.ordered:
                raise_non_numeric_error()
        else:
            raise_non_numeric_error()

However, that would cause e.g. this test to fail. These lines, for example, would directly contradict not allowing "apple" and "orange" to be ranked.

df = DataFrame([['b', 'c', 'a'], ['a', 'c', 'b']])
expected = DataFrame([[2.0, 3.0, 1.0], [1, 3, 2]])
result = df.rank(1, numeric_only=False)
tm.assert_frame_equal(result, expected)

Should the test be modified so that raising a ValueError is expected instead?

@WillAyd
Member Author

WillAyd commented Apr 12, 2018

There might be a few tests that require updating as part of this. At this point I’d suggest opening a PR and you can get review of your code from there

mapehe added a commit to mapehe/pandas that referenced this issue Apr 12, 2018
mapehe added a commit to mapehe/pandas that referenced this issue Apr 12, 2018
…ble non-numeric entries. (pandas-dev#19560)

test_rank2() works

ENH: Consistent error messages when attempting to order unorderable non-numeric data. (pandas-dev#19560)
@mapehe
Contributor

mapehe commented Apr 12, 2018

Sorry for the spam, I didn't know the commits are added here. I opened a PR related to this issue.

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 14, 2018
@jreback jreback added this to the Next Major Release milestone Apr 14, 2018
@peterpanmj
Contributor

@WillAyd @jreback
DataFrame.rank and Series.rank currently support non-categorical object dtype and unordered categorical dtype.

In [42]: df
Out[42]:
   A  B
0  a  b
1  b  c
2  c  c
3  a  d

In [43]: df.dtypes
Out[43]:
A    object
B    object
dtype: object

In [44]: df.rank()
Out[44]:
     A    B
0  1.5  1.0
1  3.0  2.5
2  4.0  2.5
3  1.5  4.0

If we no longer support rank calculation under those conditions, perhaps we should give a deprecation warning first? Personally, I use DataFrame.rank on string data all the time. Removing the support will cause some inconvenience for users like me.

@WillAyd
Member Author

WillAyd commented Jul 26, 2018

@peterpanmj I think that is a good idea. We could target the deprecation for 0.24 and officially drop support in a future version.
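
A deprecation cycle along those lines might look like the following plain-Python sketch. It is an assumption about the general shape, not what pandas implemented: the function name `rank_with_deprecation` and the warning text are hypothetical. During the deprecation period the ranks are still returned, but a FutureWarning tells users that string ranking is going away.

```python
import warnings


def rank_with_deprecation(values):
    """Average-rank `values`, warning when any of them is a string."""
    if any(isinstance(v, str) for v in values):
        warnings.warn(
            "ranking non-numeric data is deprecated and will raise "
            "ValueError in a future version",
            FutureWarning,
            stacklevel=2,
        )
    order = sorted(range(len(values)), key=values.__getitem__)
    # 1-based sorted positions per distinct value; ties share the mean.
    positions = {}
    for pos, idx in enumerate(order, start=1):
        positions.setdefault(values[idx], []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ranks = rank_with_deprecation(["a", "b", "c", "a"])  # [1.5, 3.0, 4.0, 1.5]
```

The result matches column A of the df.rank() output earlier in the thread, and `caught` holds one FutureWarning, so existing code keeps working while users are given time to migrate.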
