API: Make Categorical.searchsorted returns a scalar when supplied a scalar #23466

topper-123 · 2018-11-02T22:27:02Z

closes BUG: CategoricalIndex.searchsorted doesn't return a scalar if input was scalar #21019
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Categorical.searchsorted returns the wrong shape for scalar input. NumPy arrays and all other array types return a scalar if the input is a scalar, but Categorical does not.

For example:

>>> import numpy as np
>>> np.array([1, 2, 3]).searchsorted(1)
0
>>> np.array([1, 2, 3]).searchsorted([1])
array([0])
>>> import pandas as pd
>>> d = pd.date_range('2018', periods=4)
>>> d.searchsorted(d[0])
0
>>> d.searchsorted(d[:1])
array([0])

>>> n = 100_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n), ordered=True)
>>> c.searchsorted('b')
array([100000], dtype=int32)  # master
100000  # this PR. Scalar input should lead to scalar output
>>> c.searchsorted(['b'])
array([100000], dtype=int32)  # master and this PR
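The shape contract in the example above can be checked with a short script. This is only a sketch: it assumes a pandas version that already includes this change (0.24 or later), and `np.ndim` is used purely to tell scalars apart from arrays.

```python
import numpy as np
import pandas as pd

# Small ordered categorical; its integer codes are [0, 0, 1, 1, 2, 2].
c = pd.Categorical(list("aabbcc"), ordered=True)

scalar_result = c.searchsorted("b")    # scalar in
array_result = c.searchsorted(["b"])   # list in

# np.ndim is 0 for scalars and 1 for one-dimensional arrays.
print(np.ndim(scalar_result), np.ndim(array_result))

# NumPy follows the same contract, which is what this PR aligns with.
arr = np.array([1, 2, 3])
print(np.ndim(arr.searchsorted(1)), np.ndim(arr.searchsorted([1])))
```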

The new implementation is also considerably faster than the old one, because it avoids recoding the codes when calling self.codes.searchsorted(code, ...):

>>> %timeit c.searchsorted('b')
237 µs  # master
6.12 µs  # this PR

A consequence of the new implementation is that a KeyError is now raised when a key isn't found. Previously a ValueError was raised.
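A minimal sketch of the missing-key behaviour follows. The exact exception type is an assumption that depends on the installed pandas version (this PR describes a move from ValueError to KeyError, and other releases may differ), so the sketch catches several types and only reports which one was raised.

```python
import pandas as pd

c = pd.Categorical(list("aabbcc"), ordered=True)

try:
    c.searchsorted("z")  # "z" is not one of the categories
    raised = None
except (KeyError, ValueError, TypeError) as exc:
    # Record the exception type instead of assuming a specific one.
    raised = type(exc).__name__

print(raised)
```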

pep8speaks · 2018-11-02T22:27:07Z

Hello @topper-123! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/core/arrays/categorical.py !
There are no PEP8 issues in the file pandas/core/indexes/category.py !
There are no PEP8 issues in the file pandas/tests/arrays/categorical/test_analytics.py !

jreback · 2018-11-03T13:43:09Z

pandas/core/arrays/categorical.py

+        if is_scalar(value):
+            codes = self.categories.get_loc(value)
+        else:
+            codes = [self.categories.get_loc(val) for val in value]


you can use .get_indexer here

Unfortunately get_indexer is much slower than get_loc:

>>> %timeit c.categories.get_loc('b')
6.12 µs  # this PR
>>> %timeit c.categories.get_indexer(['b'])
257 µs

I've made the update to use .get_indexer anyway, and will use this as an opportunity to look for a way to make get_indexer faster, as that will yield benefits beyond .searchsorted. Alternatively I can roll back this last commit and add the get_indexer part later, once I figure out why get_indexer is slow.

.get_indexer is meant for many items, where it will be much faster than iterating over .get_loc, but for a small number of items the reverse may be true, i.e. there will be a cross-over point.

I don't think that is true here: get_loc makes a call to get_indexer, so get_indexer shouldn't be slower, or at the very least not this much slower. My guess is that there is some unneeded type conversion or parameter usage happening.

I'll look into it. If everything is in get_indexer for the right reasons, I just won't pursue the case further.
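For readers following the thread, the get_loc-based approach from the diff under review can be sketched as a standalone function. The helper name `searchsorted_via_get_loc` is hypothetical and this is not the code that was merged; it only illustrates looking up category positions with get_loc and then searching the integer codes.

```python
import pandas as pd
from pandas.api.types import is_scalar


def searchsorted_via_get_loc(cat, value, side="left"):
    """Hypothetical helper mirroring the diff under review."""
    if is_scalar(value):
        # A scalar key maps to a single integer position in the categories.
        codes = cat.categories.get_loc(value)
    else:
        # One lookup per element; get_loc raises KeyError for unknown keys.
        codes = [cat.categories.get_loc(val) for val in value]
    # Search the integer codes directly instead of the category values.
    return cat.codes.searchsorted(codes, side=side)


c = pd.Categorical(list("aabbcc"), ordered=True)
print(searchsorted_via_get_loc(c, "b"))                # scalar result
print(searchsorted_via_get_loc(c, ["b", "c"]))         # array result
print(searchsorted_via_get_loc(c, "b", side="right"))  # insertion point after the "b" run
```

Because a scalar key is passed straight through to `ndarray.searchsorted`, the scalar-in, scalar-out behaviour comes for free in this variant.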

jreback · 2018-11-03T14:42:59Z

looks fine, ping on green.

codecov · 2018-11-03T15:47:52Z

Codecov Report

Merging #23466 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #23466      +/-   ##
==========================================
+ Coverage   92.24%   92.24%   +<.01%     
==========================================
  Files         161      161              
  Lines       51433    51434       +1     
==========================================
+ Hits        47446    47447       +1     
  Misses       3987     3987

Flag	Coverage Δ
#multiple	`90.64% <100%> (ø)`	⬆️
#single	`42.28% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/arrays/categorical.py	`95.35% <100%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 91d1c50...d7b6873.

topper-123 · 2018-11-03T16:32:07Z

Green.

jreback · 2018-11-04T15:55:31Z

pandas/core/arrays/categorical.py

-
-        if -1 in values_as_codes:
-            raise ValueError("Value(s) to be inserted must be in categories.")
+        if is_scalar(value):


i think this is confusing code because get_loc raises, i would rather just use .get_indexer here

The point of searchsorted is fast searching. get_indexer is currently very very slow, as it always creates an array. get_loc OTOH can return scalar or a slice, which is both faster to create and faster to use.

So I think we need to keep get_loc, unless get_indexer gets a redesign

and this actually makes a difference? show this specific case

i am sure that optimizing get_indexer would not be hard and is a better soln

>>> c = pd.Categorical(list('a' + 'b' + 'c'))
>>> %timeit c.categories.get_loc('b')
1.19 µs
>>> %timeit c.categories.get_indexer(['b'])
261 µs

I can take a look at optimizing get_indexer
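The comparison above relies on IPython's `%timeit`. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module. The call count `n_calls` is an arbitrary choice, and absolute timings will differ by machine and pandas version, so no expected numbers are given.

```python
import timeit

import pandas as pd

c = pd.Categorical(list("abc"))
categories = c.categories

# Both calls locate "b": get_loc returns an int, get_indexer an array.
loc = categories.get_loc("b")
indexer = categories.get_indexer(["b"])

n_calls = 1_000  # arbitrary repetition count for the micro-benchmark
t_loc = timeit.timeit(lambda: categories.get_loc("b"), number=n_calls)
t_indexer = timeit.timeit(lambda: categories.get_indexer(["b"]), number=n_calls)

print(f"get_loc:     {t_loc / n_calls * 1e6:.2f} µs per call")
print(f"get_indexer: {t_indexer / n_calls * 1e6:.2f} µs per call")
```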

topper-123 · 2018-11-14T20:20:22Z

It turns out that making get_indexer fast is not easy. The issue is that the method needs an Index as its argument, or converts its input to an Index. Converting to an Index is a very slow process, and it's probably best to make get_indexer use arrays/ExtensionArrays (lower overhead when creating, presumably), but that's a completely different issue.

So I've reverted to making minimal changes in searchsorted, and only changing the API (scalar input leads to scalar output).

I'll take a look at making get_indexer faster in a separate PR and then, if I succeed, make searchsorted faster using get_indexer.
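The minimal change described above (always take the array path, then unwrap the result for scalar input) can be sketched as follows. The helper name `searchsorted_scalar_aware` is hypothetical, and the sketch uses the public `get_indexer` rather than the private `_codes_for_values` helper mentioned in the commit list, so it illustrates the idea rather than reproducing the merged code.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_scalar


def searchsorted_scalar_aware(cat, value, side="left"):
    """Hypothetical helper: array path first, unwrap for scalar input."""
    # Always work on a 1-d array of keys, whatever the input shape was.
    keys = np.atleast_1d(np.asarray(value, dtype=object))
    # get_indexer returns -1 for keys that are not among the categories.
    values_as_codes = cat.categories.get_indexer(keys)
    if -1 in values_as_codes:
        raise KeyError("Value(s) to be inserted must be in categories.")
    result = cat.codes.searchsorted(values_as_codes, side=side)
    # Scalar input leads to scalar output; list-like input keeps the array.
    return result[0] if is_scalar(value) else result


c = pd.Categorical(list("aabbcc"), ordered=True)
print(searchsorted_scalar_aware(c, "b"))
print(searchsorted_scalar_aware(c, ["b", "c"]))
```

This keeps a single code path for lookups, at the cost of building an array even for scalar input, which is the overhead discussed earlier in the thread.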

jreback

tiny typo. ping on green.

jreback · 2018-11-18T18:34:29Z

doc/source/whatsnew/v0.24.0.rst

@@ -960,6 +960,8 @@ Other API Changes
 - Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`)
 - :class:`DateOffset` attribute `_cacheable` and method `_should_cache` have been removed (:issue:`23118`)
 - Comparing :class:`Timedelta` to be less or greater than unknown types now raises a ``TypeError`` instead of returning ``False`` (:issue:`20829`)
+- :meth:`Categorical.searchsorted`, when supplied a scalar value to search for, now returns a scalar instead of an array (:issue:`23466`).
+- :meth:`Categorical.searchsorted` now raises a ``keyError`` rather that a ``ValueError``, if a searched for key is not found in its categories (:issue:`23466`).


jreback · 2018-11-18T22:29:58Z

thanks @topper-123

topper-123 force-pushed the Categorical.searchsorted_II branch 3 times, most recently from a45c7b0 to 7f976df Compare November 3, 2018 06:48

jreback requested changes Nov 3, 2018

jreback added the Performance and Categorical labels Nov 3, 2018

jreback added this to the 0.24.0 milestone Nov 3, 2018

topper-123 force-pushed the Categorical.searchsorted_II branch from 7f976df to 083e9c0 Compare November 3, 2018 14:29

jreback approved these changes Nov 3, 2018

jreback requested changes Nov 4, 2018

topper-123 changed the title ~~API/PERF: Categorical.searchsorted is faster and returns a scalar, when supplied a scalar~~ API: Categorical.searchsorted returns a scalar, when supplied a scalar Nov 14, 2018

topper-123 changed the title ~~API: Categorical.searchsorted returns a scalar, when supplied a scalar~~ API: Make Categorical.searchsorted returns a scalar when supplied a scalar Nov 14, 2018

topper-123 force-pushed the Categorical.searchsorted_II branch from 53aab0f to b9e5149 Compare November 14, 2018 22:08

jreback approved these changes Nov 18, 2018

topper-123 added 3 commits November 18, 2018 19:50

API/PERF: Categorical.searchsorted faster and returns scalar

9756869

Updated according to comments

1476a3e

track back to use _codes_for_values

d7b6873

topper-123 force-pushed the Categorical.searchsorted_II branch from b9e5149 to d7b6873 Compare November 18, 2018 19:56

jreback merged commit deedb5f into pandas-dev:master Nov 18, 2018

topper-123 deleted the Categorical.searchsorted_II branch November 19, 2018 07:19

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018

API: Make Categorical.searchsorted returns a scalar when supplied a s…

288b796

…calar (pandas-dev#23466)

topper-123 mentioned this pull request Nov 20, 2018

API: Make Series.searchsorted return a scalar, when supplied a scalar #23801

Merged

4 tasks

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

API: Make Categorical.searchsorted returns a scalar when supplied a s…

1679582

…calar (pandas-dev#23466)

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

API: Make Categorical.searchsorted returns a scalar when supplied a s…

c2ea877

…calar (pandas-dev#23466)
