API: Make Categorical.searchsorted returns a scalar when supplied a scalar #23466
Hello @topper-123! Thanks for submitting the PR.
a45c7b0
to
7f976df
Compare
pandas/core/arrays/categorical.py
Outdated
if is_scalar(value):
    codes = self.categories.get_loc(value)
else:
    codes = [self.categories.get_loc(val) for val in value]
you can use .get_indexer here
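For reference, a minimal sketch of the two lookups under discussion, assuming a small unique `pd.Index` standing in for `self.categories` (the names `categories`, `scalar_code`, and `list_codes` are illustrative and not taken from the PR):

```python
import pandas as pd

# Stand-in for self.categories; the labels are illustrative only.
categories = pd.Index(["a", "b", "c"])

# Scalar lookup: get_loc returns the integer position of a single label.
scalar_code = categories.get_loc("b")

# List-like lookup: get_indexer returns an integer array of positions.
list_codes = categories.get_indexer(["b", "c"])

print(scalar_code)  # 1
print(list_codes)   # [1 2]
```

Both calls map labels to positions; they differ in the shape of what they accept and return.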
Unfortunately get_indexer is much slower than get_loc:
>>> %timeit c.categories.get_loc('b')
6.12 µs # this PR
>>> %timeit c.categories.get_indexer(['b'])
257 µs
I've made the update to use .get_indexer anyway, and will use this as an opportunity to look for a way to make get_indexer faster, as that will yield benefits beyond .searchsorted. Alternatively, I can roll back this last commit and add the get_indexer part later, when I figure out why get_indexer is slow.
.get_indexer is for many items, where it will be much faster than an iteration of .get_loc, but for a small number of items the reverse may be true, i.e. there will be a cross-over point.
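One way to look for such a cross-over point is a rough timing sketch like the one below. The helper `time_lookups`, the 1000-label index, and the repeat count are all assumptions made for illustration; absolute numbers will vary with the pandas version and the machine, so no expected timings are shown.

```python
import timeit

import pandas as pd

# Illustrative index of 1000 unique string labels.
categories = pd.Index([f"cat{i}" for i in range(1000)])


def time_lookups(n_items, repeats=50):
    """Return the mean seconds per call for a get_loc loop and for get_indexer."""
    items = list(categories[:n_items])
    loop_total = timeit.timeit(
        lambda: [categories.get_loc(x) for x in items], number=repeats
    )
    vectorized_total = timeit.timeit(
        lambda: categories.get_indexer(items), number=repeats
    )
    return loop_total / repeats, vectorized_total / repeats


for n in (1, 10, 100, 1000):
    loop_s, vec_s = time_lookups(n)
    print(
        f"{n:>5} items: get_loc loop {loop_s * 1e6:10.1f} µs, "
        f"get_indexer {vec_s * 1e6:10.1f} µs"
    )
```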
I don't think that is true here: get_loc makes a call to get_indexer, so get_indexer shouldn't be slower, or at the very least not this much slower. My guess is that there is some unneeded type conversion or parameter usage happening.
I'll look into it. If everything is in get_indexer for the right reasons, I just won't pursue the case further.
7f976df
to
083e9c0
Compare
looks fine, ping on green.
Codecov Report
@@            Coverage Diff             @@
##           master   #23466      +/-   ##
==========================================
+ Coverage   92.24%   92.24%   +<.01%
==========================================
  Files         161      161
  Lines       51433    51434       +1
==========================================
+ Hits        47446    47447       +1
  Misses       3987     3987
Green.
pandas/core/arrays/categorical.py
Outdated
if -1 in values_as_codes:
    raise ValueError("Value(s) to be inserted must be in categories.")
if is_scalar(value):
i think this is confusing code because get_loc raises, i would rather just use .get_indexer here
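The difference in how the two methods report a missing label can be sketched as follows, assuming an illustrative three-label `pd.Index` (the names `categories`, `codes`, and `raised` are not from the PR):

```python
import pandas as pd

# Stand-in for self.categories; the labels are illustrative only.
categories = pd.Index(["a", "b", "c"])

# get_indexer marks a missing label with -1 instead of raising,
# which is what an explicit "-1 in codes" check relies on.
codes = categories.get_indexer(["b", "z"])
print(codes)  # [ 1 -1]

# get_loc raises KeyError for a missing label.
try:
    categories.get_loc("z")
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```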
The point of searchsorted is fast searching. get_indexer is currently very slow, as it always creates an array. get_loc, OTOH, can return a scalar or a slice, which is both faster to create and faster to use.
So I think we need to keep get_loc, unless get_indexer gets a redesign.
and this actually makes a difference? show this specific case
i am sure that optimizing get_indexer would not be hard and is a better soln
>>> c = pd.Categorical(list('a' + 'b' + 'c' ))
>>> %timeit c.categories.get_loc('b')
1.19 µs
>>> %timeit c.categories.get_indexer(['b'])
261 µs
I can take a look at optimizing get_indexer.
It turns out that making So I've reverted to make minimal changes in I'll take a look at making
53aab0f
to
b9e5149
Compare
tiny typo. ping on green.
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -960,6 +960,8 @@ Other API Changes
- Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`)
- :class:`DateOffset` attribute `_cacheable` and method `_should_cache` have been removed (:issue:`23118`)
- Comparing :class:`Timedelta` to be less or greater than unknown types now raises a ``TypeError`` instead of returning ``False`` (:issue:`20829`)
- :meth:`Categorical.searchsorted`, when supplied a scalar value to search for, now returns a scalar instead of an array (:issue:`23466`).
- :meth:`Categorical.searchsorted` now raises a ``keyError`` rather that a ``ValueError``, if a searched for key is not found in its categories (:issue:`23466`).
KeyError
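The scalar-in, scalar-out behaviour described in the first new whatsnew entry can be checked with a short sketch, assuming a pandas version that includes this change (0.24 or later); the Categorical `c` and its values are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative ordered Categorical whose values are already sorted.
c = pd.Categorical(["a", "b", "c"], ordered=True)

# Scalar input gives a scalar insertion point.
scalar_result = c.searchsorted("b")

# List-like input gives an array of insertion points.
array_result = c.searchsorted(["b", "c"])

print(np.ndim(scalar_result))  # 0
print(np.ndim(array_result))   # 1
```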
b9e5149
to
d7b6873
Compare
thanks @topper-123
…fixed
* upstream/master: (46 commits)
  DEPS: bump xlrd min version to 1.0.0 (pandas-dev#23774)
  BUG: Don't warn if default conflicts with dialect (pandas-dev#23775)
  BUG: Fixing memory leaks in read_csv (pandas-dev#23072)
  TST: Extend datetime64 arith tests to array classes, fix several broken cases (pandas-dev#23771)
  STYLE: Specify bare exceptions in pandas/tests (pandas-dev#23370)
  ENH: between_time, at_time accept axis parameter (pandas-dev#21799)
  PERF: Use is_utc check to improve performance of dateutil UTC in DatetimeIndex methods (pandas-dev#23772)
  CLN: io/formats/html.py: refactor (pandas-dev#22726)
  API: Make Categorical.searchsorted returns a scalar when supplied a scalar (pandas-dev#23466)
  TST: Add test case for GH14080 for overflow exception (pandas-dev#23762)
  BUG: Don't extract header names if none specified (pandas-dev#23703)
  BUG: Index.str.partition not nan-safe (pandas-dev#23558) (pandas-dev#23618)
  DEPR: tz_convert in the Timestamp constructor (pandas-dev#23621)
  PERF: Datetime/Timestamp.normalize for timezone naive datetimes (pandas-dev#23634)
  TST: Use new arithmetic fixtures, parametrize many more tests (pandas-dev#23757)
  REF/TST: Add more pytest idiom to parsers tests (pandas-dev#23761)
  DOC: Add ignore-deprecate argument to validate_docstrings.py (pandas-dev#23650)
  ENH: update pandas-gbq to 0.8.0, adds credentials arg (pandas-dev#23662)
  DOC: Improve error message to show correct order (pandas-dev#23652)
  ENH: Improve error message for empty object array (pandas-dev#23718)
  ...
git diff upstream/master -u -- "*.py" | flake8 --diff
Categorical.searchsorted returns the wrong shape for scalar input. NumPy arrays and all other array types return a scalar if the input is a scalar, but Categorical does not. For example:
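A minimal sketch of the NumPy reference behaviour, using an illustrative sorted array rather than the PR's own example:

```python
import numpy as np

# Illustrative sorted array.
arr = np.array([1, 2, 3])

# Scalar input gives a scalar insertion point.
scalar_pos = arr.searchsorted(2)

# Array-like input gives an array of insertion points.
array_pos = arr.searchsorted([2])

print(np.ndim(scalar_pos), np.ndim(array_pos))  # 0 1
```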
This new implementation is, BTW, quite a bit faster than the old implementation, because we avoid recoding the codes when doing the self.codes.searchsorted(code, ...) bit:
A consequence of the new implementation is that KeyError is now raised when a key isn't found. Previously a ValueError was raised.
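The approach described above (translate the value to its category code, then search the integer codes directly) can be sketched roughly as below. This is not the PR's actual code: `searchsorted_via_codes` is a hypothetical helper written for illustration, and it assumes the Categorical's values are already sorted in category order.

```python
import numpy as np
import pandas as pd


def searchsorted_via_codes(cat, value, side="left"):
    """Hypothetical helper: map value(s) to category codes, then search cat.codes."""
    if pd.api.types.is_scalar(value):
        # get_loc raises KeyError if the value is not one of the categories.
        code = cat.codes.dtype.type(cat.categories.get_loc(value))
    else:
        code = np.array(
            [cat.categories.get_loc(v) for v in value], dtype=cat.codes.dtype
        )
    # Search the integer codes directly; no recoding of cat.codes is needed.
    return cat.codes.searchsorted(code, side=side)


# Illustrative sorted, ordered Categorical; its codes are [0, 1, 1, 2].
c = pd.Categorical(["a", "b", "b", "c"], ordered=True)

print(searchsorted_via_codes(c, "b"))                # 1
print(searchsorted_via_codes(c, "b", side="right"))  # 3
print(searchsorted_via_codes(c, ["a", "c"]))         # [0 3]
```

Because the lookup goes through `get_loc`, a value that is not among the categories surfaces as a `KeyError` in this sketch, which matches the change in exception type noted above.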