-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raise ValueError When Attempting to Rank Object Dtypes #19560
Comments
this is manifesting in #11759 as well. just need a better error message. |
@jreback I am looking to work on this, any pointers on which particular classes should I start inspecting ? |
@ishaan007 you want to be looking at the various rank implementations in |
In case @ishaan007 isn't still on this I could have a look since this is tagged "good first issue" and would seem like a good way to get to know the codebase a bit. |
@mapehe sure give it a shot |
So I've had a look at the above point, please correct me if I've misunderstood something. Running pandas/pandas/_libs/algos_rank_helper.pxi.in Lines 166 to 167 in 2431641
On the other hand In the case of pandas/pandas/_libs/groupby_helper.pxi.in Lines 412 to 419 in 2431641
I don't know the codebase well enough to make a good decision about how these issues should be handled. If pandas/pandas/core/groupby/groupby.py Lines 2453 to 2461 in 2431641
I guess the inconsistency between pd.Series(vals).rank() and pd.Series(vals).rank(method='first') could be solved by modifying algos_rank_helper.pxi.in , but I have no idea how many things it would break. :)
|
Nice research. Best approach would be to simply raise at the start of the method you found in algos for object dtypes, rather than only doing it for the |
Thanks for the reply @WillAyd. So you'd suggest replacing the current |
Hmm well per the original comment we would still want to support ranking of ordered categoricals so I don't think we can entirely do away with the object function. The |
One could use something like this in if is_object_dtype(values):
def raise_non_numeric_error():
raise ValueError("pandas.core.algorithms.rank "
"not supported for unordered "
"non-numeric data")
if is_categorical_dtype(values):
if not values.ordered:
raise_non_numeric_error()
else:
raise_non_numeric_error() However, that would cause e.g. this test to fail. These lines, for example, would directly contradict not allowing "apple" and "orange" to be ranked. pandas/pandas/tests/frame/test_rank.py Lines 74 to 77 in 4e6aa1c
Should the test be modified so that raising a ValueError is expected instead?
|
There might be a few tests that require updating as part of this. At this point I’d suggest opening a PR and you can get review of your code from there |
…ble non-numeric entries. (pandas-dev#19560)
…ble non-numeric entries. (pandas-dev#19560) test_rank2() works ENH: Consistent error messages when attempting to order unorderable non-numeric data. (pandas-dev#19560)
Sorry for the spam, I did't know the commits are added here. I opened a PR related to this issue. |
@WillAyd @jreback
If we no longer support rank calculation on those conditions, perhaps should give a deprecation warning first ? Personally, I use |
@peterpanmj I think that is a good idea. Could target deprecate for 0.24 and officially drop in a future version |
Referencing the comments in #19481, right now rank operations performed against objects have a few issues, namely that they:
To illustrate the latter:
(see also #19482)
With this change I'd propose that we simply raise ValueError consistently for rank against
object
dtypes regardless of which type of object performs the transformation and regardless of arguments.One known caveat is that
Categorical
types currently use therank_object
methods inalgos
. My assumption is that we would want to continue supporting ranking for ordered Categoricals but raise for unordered Categoricals.The text was updated successfully, but these errors were encountered: