
API: preferred way to check if column/Series has Categorical dtype #8814

Closed
jorisvandenbossche opened this issue Nov 14, 2014 · 20 comments · Fixed by #9629

@jorisvandenbossche
Member

From http://stackoverflow.com/questions/26924904/check-if-dataframe-column-is-categorical/26925340#26925340

What is the preferred way to check for categorical dtype?

I now answered:

In [42]: isinstance(df.cat_column.dtype, pd.core.common.CategoricalDtype)
Out[42]: True

In [43]: pd.core.common.is_categorical_dtype(df.cat_column)
Out[43]: True

But:

  • this seems somewhat buried in pandas. Should there be a more top-level function to do this?
  • we should add the preferred way to the categorical docs.
@onesandzeroes
Contributor

That was me asking the question. I originally started writing it up while working on a PR for pandas, and in the process I discovered is_categorical_dtype(). Since I was already working on internal pandas code, that works for my current usage.

Having something that's not so deeply buried would be good though. I tried df.col.dtype == 'category' because I thought that was a pretty standard way of doing a quick type check. Unless there are strong reasons not to use this method, it should probably work the same for categoricals as it does for other types (e.g. df.col.dtype == 'float64')

@shoyer
Member

shoyer commented Nov 14, 2014

df.col.dtype == 'category' does appear to work for me on pandas 0.15.1. As @onesandzeroes says, I think this should be the preferred way to check for categorical types.

It does look like Categorical.dtype.__ne__ needs to be defined, though -- currently it isn't, so Python falls back to something arbitrary.

@jorisvandenbossche
Member Author

@shoyer as you can see in my answer on SO, it does indeed work, but the problem is that it raises for other dtypes instead of returning False, which is not very handy (and that is a numpy thing).
With the example from SO:

In [86]: df
Out[86]:
  cat_column   x   y
0          c   0   0
1          d  10   4
2          f  20   8
3          a  30  12
4          b  40  16
5          e  50  20

In [87]: df.cat_column.dtype == 'category'
Out[87]: True

In [88]: df.x.dtype
Out[88]: dtype('float64')

In [89]: df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-89-aff611b16544> in <module>()
----> 1 df.x.dtype == 'category'

TypeError: data type "category" not understood

So exactly because that does not work as you would expect (returning False), I think we should provide a common way to do this (or at least document in the categorical docs what the best way is).
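One way around the raising comparison is to catch the TypeError; a minimal sketch (the helper name `is_category_dtype` is made up for illustration, not a pandas function):

```python
import numpy as np
import pandas as pd

def is_category_dtype(series):
    # Compare against 'category', but treat the TypeError that numpy
    # raises for unknown dtype strings as "not categorical".
    try:
        return series.dtype == 'category'
    except TypeError:
        return False

df = pd.DataFrame({'cat_column': pd.Series(list('cdfabe')).astype('category'),
                   'x': np.arange(0, 60, 10, dtype='float64')})
print(is_category_dtype(df['cat_column']))  # True
print(is_category_dtype(df['x']))           # False
```

Newer numpy versions return False instead of raising for un-parseable dtype strings, in which case the try/except is simply a no-op.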

@jorisvandenbossche jorisvandenbossche added the Categorical Categorical Data Type label Nov 14, 2014
@shoyer
Member

shoyer commented Nov 14, 2014

Ah, I see. A reasonable solution might be to wrap the dtype in str, e.g., str(df.x.dtype) == 'category'.
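A quick check of the str-wrapping idea, assuming str() of a categorical dtype is simply 'category':

```python
import pandas as pd

s_cat = pd.Series(list('aab')).astype('category')
s_num = pd.Series([1.0, 2.0])

# Wrapping the dtype in str() sidesteps numpy's dtype-string parsing,
# so the comparison is safe for any column.
print(str(s_cat.dtype) == 'category')  # True
print(str(s_num.dtype) == 'category')  # False
```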

@jorisvandenbossche
Member Author

cc @JanSchulz

@jreback
Contributor

jreback commented Nov 14, 2014

In [10]: df = DataFrame({'A' : np.random.randn(5), 'B' : Series(list('aabbc')).astype('category')})

In [11]: df
Out[11]: 
          A  B
0 -0.064981  a
1  0.852717  a
2  0.693611  b
3  0.411486  b
4 -1.425537  c

In [12]: df.dtypes
Out[12]: 
A     float64
B    category
dtype: object

In [13]: df.dtypes == 'category'
Out[13]: 
A    False
B     True
dtype: bool

In [14]: df.select_dtypes(include=['category'])
Out[14]: 
   B
0  a
1  a
2  b
3  b
4  c

In [16]: pd.core.common.is_categorical_dtype(df.A.dtype)
Out[16]: False

So the preferred method of checking dtypes is simply to use select_dtypes, or [13] works as well. Checking np.dtype('float64') == 'category' blows up - I think maybe a bug report should be created to have this fixed upstream; not much we can do for this. Of course, as @jtratner pointed out, `np.dtype('float64').name == 'category'` will work correctly with numpy dtypes.

So I don't think it's necessary to have the user actually use anything internal.

If pressed, com.is_categorical_dtype(...) would be ok

I would not suggest any mention/use of com.CategoricalDtype (as an instance check) though - this is TOO internal.

To be honest this should rarely if ever come up. If the OP is trying to check individual dtypes for category then this is the wrong approach (and most certainly .select_dtypes() is the correct method).

So if someone wants to add a small doc section, ok.

@jankatins
Contributor

Shouldn't this work?

if s.cat:
    print("It's a categorical!")

@jreback
Contributor

jreback commented Nov 14, 2014

@JanSchulz

that will raise for non-cat types (as does .dt)

That prevents user error. I think this is correct (these should raise).

@jorisvandenbossche
Member Author

@JanSchulz no, because it gives a TypeError instead of False if it is not a categorical:

In [143]: df
Out[143]:
          A  B
0  0.299586  a
1  0.335853  a
2 -0.135405  b
3  1.247738  b
4 -0.232270  c

In [144]: bool(df.B.cat)
Out[144]: True

In [145]: bool(df.A.cat)
....
TypeError: Can only use .cat accessor with a 'category' dtype

@jreback
Contributor

jreback commented Nov 30, 2014

I think this is an issue that should be raised with numpy:

In [9]: np.dtype('i8') == 'foo'
TypeError: data type "foo" not understood

closing on the pandas side, as pandas behaves sanely here

@jreback
Contributor

jreback commented Dec 1, 2014

well, for all who care, I tried to push upstream. The user is now subject to random numpyisms that are really hard to fix downstream (impossible in this case).

@shoyer
Member

shoyer commented Dec 1, 2014

The string coercion for dtype equality is an ugly API, and @njsmith is right that we are probably misusing it here with dtype == 'category'. The category dtype should probably include all the metadata for the type of category (i.e., the categories and sortedness). This is how it currently works in dynd, for example.

So I think it would indeed be better to do this differently. Perhaps dtype.kind == 'C'? Or we could even make pd.is_categorical part of the API. Both options are compatible with numpy and not too terrible, IMO (and the first has even fewer characters).

Either way, I think this should probably evaluate to False (because the categories are different):

In [9]: pd.Categorical([1]).dtype == pd.Categorical([2]).dtype
Out[9]: True

@jankatins
Contributor

If the last example should work (I think that was discussed during the design of Categoricals), then we have to put the categories into the dtype.

@jreback
Contributor

jreback commented Dec 1, 2014

@shoyer I disagree.

DyND does support categorical as a full-fledged datashape (see here), but using that implementation is probably a ways away.

CategoricalDtype is basically a super-type for categories. You could implement a concrete sub-class that allows a categories comparison, and maybe we should do that; it is nicer from a theoretical point of view.

But to be honest it's a fair amount of complexity, and I'm not sure how much we would gain from it.

I am not sure anything is actually gained from explicit type checking with a pd.is_categorical().
pandas is meant to be practical, and I think s.dtype == 'category' is useful and in the spirit of all other numpy dtype comparisons.

@shoyer
Member

shoyer commented Dec 2, 2014

@jreback It's one thing for pandas to take a pragmatic approach instead of waiting for a full solution, but designing an API that is incompatible with that full solution seems like a bad idea. s.dtype == 'category' is quite practical but it probably will/should break when we switch to dynd. The dynd API is certainly more flexible, but IMO Nathaniel raised some good points that will likely apply there as well.

In any case, perhaps it was premature to close this issue? s.dtype != 'category' does not currently work -- do we have a preferred alternative? I do understand that you are frustrated with the response from upstream, but even if numpy changed things tomorrow this would still be an issue.

(I do agree it's probably not worth refining CategoricalDtype given that it's pretty well hidden from the public API.)

@jreback
Contributor

jreback commented Dec 2, 2014

@shoyer I'll buy that s.dtype != 'category' should work. Please create a separate issue for that.

This is closed because pandas does all it can to facilitate s.dtype == 'category' and provides many solutions which don't need it.

Changing to use the DyND type system will likely cause a bit of pain all around (good pain, though), and will have to be revisited when DyND is more of a fixture.

If you have a better API idea which doesn't break anything, all ears.

@jorisvandenbossche
Member Author

This is closed because pandas does all it can to facilitate s.dtype == 'category' and provides many solutions which don't need it.

The reason I think this should not be closed already is the reason I initially opened this issue: just to document this in the categorical.rst docs.
And this isn't done yet, and we all know s.dtype == 'category' has its problems (whether these are limitations on the numpy side or not is another discussion, but that does not really matter for the current situation and the users who run into this problem).

So I can do a quick PR to include this in the docs; to do that, we just need a quick decision on what I should put in there:

  • pd.core.common.is_categorical_dtype(df['cat'])
  • df['cat'].dtype.name == 'category'
  • ..

Or provide this is_categorical_dtype (or is_categorical) as a top-level function.
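For comparison, the second listed option can be sketched like this; it works because .name is a plain string on numpy dtypes and on the categorical dtype alike, so the comparison never hits numpy's dtype-string parsing:

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Series(list('abc')).astype('category'),
                   'x': [1.0, 2.0, 3.0]})

# .name is always a str, so == never raises, unlike dtype == 'category'.
print(df['cat'].dtype.name == 'category')  # True
print(df['x'].dtype.name == 'category')    # False
```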

@jreback
Contributor

jreback commented Dec 2, 2014

well, neither of those is preferred at all

df.dtypes == 'category'
df.select_dtypes(include=['category'])

are the most correct ways to do this
if you want to mention in a very small note that df['cat'].dtype.name == 'category' works, then I'm ok with that
using com.is_categorical_dtype(...) is actually ok too, but that is so far from what the average user normally does that I don't think it should be advertised.

of course s.dtype == 'category' WILL work if it's actually a categorical type.....

amazing that this works!

In [2]: Series([1,2,3],dtype='int32').dtype=='i123'
Out[2]: True

@jreback jreback reopened this Dec 2, 2014
@jreback
Contributor

jreback commented Dec 11, 2014

going to bump this

@jreback jreback modified the milestones: 0.16.0, 0.15.2 Dec 11, 2014
@jreback jreback removed this from the 0.16.0 milestone Mar 6, 2015
@jreback jreback modified the milestones: Next Major Release, 0.16.0 Mar 6, 2015
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.16.0, Next Major Release Mar 8, 2015
@shoyer
Member

shoyer commented Mar 11, 2015

Another option (see #9629) is that the preferred way to check if a series is categorical should be hasattr(s, 'cat'). This will work with pandas 0.16 or newer and sidesteps the numpy comparison issues...
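A sketch of that approach: on pandas versions where the .cat accessor raises AttributeError for non-categorical data, hasattr swallows the error and returns False:

```python
import pandas as pd

s_cat = pd.Series(list('aab')).astype('category')
s_num = pd.Series([1, 2, 3])

# hasattr returns False when accessing s.cat raises AttributeError,
# so it doubles as a categorical check without touching the dtype.
print(hasattr(s_cat, 'cat'))  # True
print(hasattr(s_num, 'cat'))  # False
```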
