Series.isin fails (errors) for categoricals #16639

aviolov · 2017-06-08T14:40:47Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
#%%
print(pd.__version__)
vals = np.array([0, 1, 2, 0])
cats = ['a', 'b', 'c']

DFtrades = pd.DataFrame({'id': pd.Series(pd.Categorical.from_codes(vals, cats))})
DFscores = pd.DataFrame({'id': pd.Series(pd.Categorical.from_codes(np.array([0, 1]), cats))})

print(DFtrades)
print(DFscores)

select_ids = DFtrades['id'].isin(DFscores['id'])  # raises ValueError in pandas 0.20.1

Problem description

Passing a categorical Series to Series.isin raises the following error in pandas 0.20.1:

File "", line 12, in <module>
select_ids = DFtrades['id'].isin(DFscores['id']);

File "C:\Users\alexandre\Anaconda3\lib\site-packages\pandas\core\series.py", line 2555, in isin
result = algorithms.isin(_values_from_object(self), values)

File "C:\Users\alexandre\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 421, in isin
return f(comps, values)

File "C:\Users\alexandre\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 399, in <lambda>
f = lambda x, y: htable.ismember_object(x, values)

File "pandas\_libs\hashtable_func_helper.pxi", line 428, in pandas._libs.hashtable.ismember_object (pandas\_libs\hashtable.c:29677)

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'

Expected Output

A boolean Series indicating that the third row of DFtrades is not in DFscores but the other three are.

For reference, this worked (I did not get an error) in 0.19.(something).

Also, this code works as expected:

select_ids = DFtrades['id'].isin(DFscores['id'].values)
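To make the expected output concrete, here is a minimal sketch that runs both the direct call and the `.values` workaround on the data from the report. It assumes a pandas release that already includes the fix (on 0.20.1 the direct call raises the error above). The names `trades` and `scores` are shortened stand-ins for the `id` columns and are not from the original code.

```python
import numpy as np
import pandas as pd

cats = ['a', 'b', 'c']
# Same data as the 'id' columns of DFtrades and DFscores in the report.
trades = pd.Series(pd.Categorical.from_codes(np.array([0, 1, 2, 0]), cats))
scores = pd.Series(pd.Categorical.from_codes(np.array([0, 1]), cats))

# Direct call: the one that fails on 0.20.1 (assumed fixed in the installed version).
mask_direct = trades.isin(scores)

# Workaround from the report: pass the underlying values instead of the Series.
mask_workaround = trades.isin(scores.values)

print(mask_direct.tolist())      # [True, True, False, True]
print(mask_workaround.tolist())  # [True, True, False, True]
```

Only the third row ('c') is missing from `scores`, so it is the only False entry.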

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.1
pytest: 3.1.1
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.19.0
xarray: 0.9.5
IPython: 6.1.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.10
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


chris-b1 · 2017-06-08T17:48:51Z

I'm guessing the fix for this looks something like #16543. There was some refactoring of the algorithms file, and this is a case that probably got missed.

jreback · 2017-06-09T10:42:58Z

This fixes it, though I think we should add some asv benchmarks with categoricals to make sure they are hitting the right path.

diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index d74c5e6..a651817 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -113,7 +113,8 @@ def _ensure_data(values, dtype=None):
 
         return values.asi8, dtype, 'int64'
 
-    elif is_categorical_dtype(values) or is_categorical_dtype(dtype):
+    elif (is_categorical_dtype(values) and
+          (is_categorical_dtype(dtype) or dtype is None)):
         values = getattr(values, 'values', values)
         values = values.codes
         dtype = 'category'
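One reading of the diff, sketched as plain Python: the condition that sends values down the "use the categorical codes" branch is tightened from "either side is categorical" to "the values are categorical and the requested dtype is categorical or unspecified". The helper names below are hypothetical, and the `is_categorical_dtype` results are passed in as plain booleans so the sketch runs without pandas. That the failing call corresponds to the case shown is an inference from the traceback (integer codes reaching an object hashtable), not something stated in the thread.

```python
def takes_codes_old(values_is_categorical, dtype_is_categorical):
    """Branch condition before the patch (hypothetical helper)."""
    return values_is_categorical or dtype_is_categorical


def takes_codes_new(values_is_categorical, dtype_is_categorical, dtype_is_none):
    """Branch condition after the patch (hypothetical helper)."""
    return values_is_categorical and (dtype_is_categorical or dtype_is_none)


# Categorical values, but the caller requested a non-categorical dtype
# (assumed to be the situation in the failing isin call).
old = takes_codes_old(values_is_categorical=True, dtype_is_categorical=False)
new = takes_codes_new(values_is_categorical=True, dtype_is_categorical=False,
                      dtype_is_none=False)

print(old)  # True: old code converts to integer codes
print(new)  # False: patched code skips the codes branch
```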

jreback · 2017-07-06T12:28:29Z

@aviolov want to push a PR for the above fix?

aviolov · 2017-07-06T14:04:14Z

@jreback at the risk of sounding ignorant - how would I do that (maybe a link to some documentation / how-to)?

TomAugspurger · 2017-07-06T14:07:02Z

@aviolov which part, specifically? All the contributing docs are at http://pandas.pydata.org/pandas-docs/stable/contributing.html. If you have any additional questions, just ask them here.

aviolov · 2017-07-06T14:22:44Z

@TomAugspurger, thanks for the link. I guess a 'PR' is a pull request in this case. Is the idea that I download version 0.20.3 and check that my minimal example above works now, or that I branch the current version, implement the fix suggested above, and then try to push it back, or...? I haven't made a branch off pandas before, but it would be fun to try. The how-to looks quite comprehensive.

TomAugspurger · 2017-07-06T14:27:06Z

@aviolov you'll fork the repo as described in http://pandas.pydata.org/pandas-docs/stable/contributing.html#forking

Then create a new branch

Then apply your changes:

Add a test in pandas/tests/test_categorical.py with your original example.
Run the tests with something like pytest pandas/tests/test_categorical.py -k <test name> to verify that it fails.
Add the fix from @jreback.
Add a release note in doc/source/whatsnew/v0.21.0.txt.

Then push and make a pull request (PR)
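As a rough illustration of the test step above, a regression test built from the original example might look like the sketch below. The function name is hypothetical, and the test that was actually merged may be organized differently. It should fail with the ValueError before the fix is applied and pass afterwards.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_isin_categorical_series():
    # Hypothetical test name; data taken from the original report (GH 16639).
    cats = ['a', 'b', 'c']
    trades = pd.Series(pd.Categorical.from_codes(np.array([0, 1, 2, 0]), cats))
    scores = pd.Series(pd.Categorical.from_codes(np.array([0, 1]), cats))

    result = trades.isin(scores)

    # Only the third row ('c') is absent from scores.
    expected = pd.Series([True, True, False, True])
    tm.assert_series_equal(result, expected)
```

Saved in a test module, it can be selected by name with pytest's -k option, as in the steps above.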

aviolov · 2017-07-06T14:35:44Z

@TomAugspurger cool, I'll give it a try

aviolov · 2017-07-07T21:30:59Z

I could not get git rebase -i HEAD~2 to work for squashing two commits into one (possibly because I had pushed the first one before committing the second one). I got:

$ git rebase -i HEAD-2
fatal: Needed a single revision
invalid upstream HEAD-2

jreback · 2017-07-07T21:31:47Z

you don't need to squash

chris-b1 added the Bug, Categorical, and Regression labels Jun 8, 2017

chris-b1 added this to the 0.20.3 milestone Jun 8, 2017

chris-b1 mentioned this issue Jun 10, 2017

API/BUG: Categorical.is_dtype_equal doesn't compare to Series #16659

Closed

jreback modified the milestones: 0.21.0, 0.20.3 Jul 6, 2017

aviolov mentioned this issue Jul 7, 2017

BUG: Series.isin fails or categoricals #16858

Merged

jreback closed this as completed in #16858 Jul 11, 2017
