
API: preferred way to check if column/Series has Categorical dtype #8814

Closed
jorisvandenbossche opened this issue Nov 14, 2014 · 20 comments · Fixed by #9629

@jorisvandenbossche
Member

From http://stackoverflow.com/questions/26924904/check-if-dataframe-column-is-categorical/26925340#26925340

What is the preferred way to check for categorical dtype?

I now answered:

In [42]: isinstance(df.cat_column.dtype, pd.core.common.CategoricalDtype)
Out[42]: True

In [43]: pd.core.common.is_categorical_dtype(df.cat_column)
Out[43]: True

But:

  • this seems somewhat buried in pandas. Should there be a more top-level function to do this?
  • we should add the preferred way to the categorical docs.
@onesandzeroes
Contributor

That was me asking the question. I originally started writing it up while working on a PR for pandas, and in the process I discovered is_categorical_dtype(). Since I was already working on internal pandas code, that works for my current usage.

Having something that's not so deeply buried would be good though. I tried df.col.dtype == 'category' because I thought that was a pretty standard way of doing a quick type check. Unless there are strong reasons not to use this method, it should probably work the same for categoricals as it does for other types (e.g. df.col.dtype == 'float64')

@shoyer
Member

shoyer commented Nov 14, 2014

df.col.dtype == 'category' does appear to work for me on pandas 0.15.1. As @onesandzeroes says, I think this should be the preferred way to check for categorical types.

It does look like Categorical.dtype.__ne__ needs to be defined, though -- currently it isn't, so Python falls back to something arbitrary.

@jorisvandenbossche
Member Author

@shoyer as you can see in my answer on SO, it does indeed work, but the problem is that it raises for other dtypes instead of returning False, which is not very handy (and that is a numpy thing).
With the example from SO:

In [86]: df
Out[86]:
  cat_column   x   y
0          c   0   0
1          d  10   4
2          f  20   8
3          a  30  12
4          b  40  16
5          e  50  20

In [87]: df.cat_column.dtype == 'category'
Out[87]: True

In [88]: df.x.dtype
Out[88]: dtype('float64')

In [89]: df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-89-aff611b16544> in <module>()
----> 1 df.x.dtype == 'category'

TypeError: data type "category" not understood

So exactly because that does not work as you would expect (returning False), I think we should provide a common way to do this (or at least document in the categorical docs what the best way is).
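One way around the raising comparison is to catch the TypeError; a minimal sketch (the helper name `is_category_dtype` is made up for illustration, not a pandas function):

```python
import numpy as np
import pandas as pd

def is_category_dtype(series):
    # Compare against 'category', but treat the TypeError that numpy
    # raises for unknown dtype strings as "not categorical".
    try:
        return series.dtype == 'category'
    except TypeError:
        return False

df = pd.DataFrame({'cat_column': pd.Series(list('cdfabe')).astype('category'),
                   'x': np.arange(0, 60, 10, dtype='float64')})
print(is_category_dtype(df['cat_column']))  # True
print(is_category_dtype(df['x']))           # False
```

Newer numpy versions return False instead of raising for un-parseable dtype strings, in which case the try/except is simply a no-op.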

@jorisvandenbossche jorisvandenbossche added the Categorical Categorical Data Type label Nov 14, 2014
@shoyer
Member

shoyer commented Nov 14, 2014

Ah, I see. A reasonable solution might be to wrap the dtype in str, e.g., str(df.x.dtype) == 'category'.
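A quick check of the str-wrapping idea, assuming str() of a categorical dtype is simply 'category':

```python
import pandas as pd

s_cat = pd.Series(list('aab')).astype('category')
s_num = pd.Series([1.0, 2.0])

# Wrapping the dtype in str() sidesteps numpy's dtype-string parsing,
# so the comparison is safe for any column.
print(str(s_cat.dtype) == 'category')  # True
print(str(s_num.dtype) == 'category')  # False
```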

@jorisvandenbossche
Member Author

cc @JanSchulz

@jreback
Contributor

jreback commented Nov 14, 2014

In [10]: df = DataFrame({'A' : np.random.randn(5), 'B' : Series(list('aabbc')).astype('category')})

In [11]: df
Out[11]: 
          A  B
0 -0.064981  a
1  0.852717  a
2  0.693611  b
3  0.411486  b
4 -1.425537  c

In [12]: df.dtypes
Out[12]: 
A     float64
B    category
dtype: object

In [13]: df.dtypes == 'category'
Out[13]: 
A    False
B     True
dtype: bool

In [14]: df.select_dtypes(include=['category'])
Out[14]: 
   B
0  a
1  a
2  b
3  b
4  c

In [16]: pd.core.common.is_categorical_dtype(df.A.dtype)
Out[16]: False

So the preferred method of checking dtypes is simply to use select_dtypes, or [13] works as well. Checking np.dtype('float64') == 'category' blows up - I think maybe a bug report should be created to have this fixed upstream; not much we can do for this. Of course, as @jtratner pointed out, `np.dtype('float64').name == 'category'` will work correctly with numpy dtypes.

So I don't think it's necessary to have the user actually use anything internal.

If pressed, com.is_categorical_dtype(...) would be ok

I would not suggest any mention/use of com.CategoricalDtype (as an instance check) though - this is TOO internal.

To be honest this should rarely if ever come up. If the OP is trying to check individual dtypes for category then this is the wrong approach (and most certainly .select_dtypes() is the correct method).

So if someone wants to add a small doc section, ok.

@jankatins
Contributor

Shouldn't this work?

if s.cat:
    print("It's a categorical!")

@jreback
Contributor

jreback commented Nov 14, 2014

@JanSchulz

that will raise for non-cat types (as does .dt)

That prevents user error. I think this is correct (these should raise).

@jorisvandenbossche
Member Author

@JanSchulz no, because it gives a TypeError instead of False if it is not a categorical:

In [143]: df
Out[143]:
          A  B
0  0.299586  a
1  0.335853  a
2 -0.135405  b
3  1.247738  b
4 -0.232270  c

In [144]: bool(df.B.cat)
Out[144]: True

In [145]: bool(df.A.cat)
....
TypeError: Can only use .cat accessor with a 'category' dtype

@jreback
Contributor

jreback commented Nov 30, 2014

I think this is an issue that should be raised with numpy:

In [9]: np.dtype('i8') == 'foo'
TypeError: data type "foo" not understood

closing on the pandas side, as pandas behaves sanely here

@jreback
Contributor

jreback commented Dec 1, 2014

well, for all who care, I tried to push upstream. The user is now subject to random numpyisms that are really hard to fix downstream (impossible in this case).

@shoyer
Member

shoyer commented Dec 1, 2014

The string coercion for dtype equality is an ugly API, and @njsmith is right that we are probably misusing it here with dtype == 'category'. The category dtype should probably include all the metadata for the type of category (i.e., the categories and sortedness). This is how it currently works in dynd, for example.

So I think it would indeed be better to do this differently. Perhaps dtype.kind == 'C'? Or we could even make pd.is_categorical part of the API. Both options are compatible with numpy and not too terrible, IMO (and the first has even fewer characters).

Either way, I think this should probably evaluate to False (because the categories are different):

In [9]: pd.Categorical([1]).dtype == pd.Categorical([2]).dtype
Out[9]: True

@jankatins
Contributor

If the last example should work (I think that was discussed during the design of Categoricals), then we have to put the categories into the dtype.

@jreback
Contributor

jreback commented Dec 1, 2014

@shoyer I disagree.

DyND does support categorical as a full-fledged datashape (see here), but using that implementation is probably a ways away.

CategoricalDtype is basically a super-type for categories. You could implement a concrete sub-class that allows a categories comparison, and maybe we should do that; it is nicer from a theoretical point of view.

But to be honest it's a fair amount of complexity, and I'm not sure how much we would gain from it.

I am not sure anything is actually gained from explicit type checking with a pd.is_categorical().
pandas is meant to be practical, and I think s.dtype == 'category' is useful and in the spirit of all other numpy dtype comparisons.

@shoyer
Member

shoyer commented Dec 2, 2014

@jreback It's one thing for pandas to take a pragmatic approach instead of waiting for a full solution, but designing an API that is incompatible with that full solution seems like a bad idea. s.dtype == 'category' is quite practical but it probably will/should break when we switch to dynd. The dynd API is certainly more flexible, but IMO Nathaniel raised some good points that will likely apply there as well.

In any case, perhaps it was premature to close this issue? s.dtype != 'category' does not currently work -- do we have a preferred alternative? I do understand that you are frustrated with the response from upstream, but even if numpy changed things tomorrow this would still be an issue.

(I do agree it's probably not worth refining CategoricalDtype given that it's pretty well hidden from the public API.)

@jreback
Contributor

jreback commented Dec 2, 2014

@shoyer I'll buy that s.dtype != 'category' should work. Please create a separate issue for that.

This is closed because pandas does all it can to facilitate s.dtype == 'category' and provides many solutions which don't need it.

Changing to use the DyND type system will likely cause a bit of pain all around (good pain, though), and will have to be revisited when DyND is more of a fixture.

If you have a better API idea which doesn't break anything, all ears.

@jorisvandenbossche
Member Author

This is closed because pandas does all it can to facilitate s.dtype == 'category' and provides many solutions which don't need it.

The reason I think this should not be closed already is the reason I initially opened this issue: just to document this in the categorical.rst docs.
And this isn't done yet, and we all know s.dtype == 'category' has its problems (whether these are limitations on the numpy side or not is another discussion, but that does not really matter for the current situation and the users who run into this problem).

So I can do a quick PR to include this in the docs; to do that, we just need a quick decision on what I should put in there:

  • pd.core.common.is_categorical_dtype(df['cat'])
  • df['cat'].dtype.name == 'category'
  • ..

Or provide this is_categorical_dtype (or is_categorical) as a top-level function.
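For comparison, the second listed option can be sketched like this; it works because .name is a plain string on numpy dtypes and on the categorical dtype alike, so the comparison never hits numpy's dtype-string parsing:

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Series(list('abc')).astype('category'),
                   'x': [1.0, 2.0, 3.0]})

# .name is always a str, so == never raises, unlike dtype == 'category'.
print(df['cat'].dtype.name == 'category')  # True
print(df['x'].dtype.name == 'category')    # False
```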

@jreback
Contributor

jreback commented Dec 2, 2014

well, neither of those is preferred at all

df.dtypes == 'category'
df.select_dtypes(include=['category'])

are the most correct ways to do this
if you want to mention in a very small note that df['cat'].dtype.name == 'category' works, then I'm ok with that
using com.is_categorical_dtype(...) is actually ok too, but that is so far from what the average user normally does that I don't think it should be advertised.

of course s.dtype == 'category' WILL work if it's actually a categorical type.....

amazing that this works!

In [2]: Series([1,2,3],dtype='int32').dtype=='i123'
Out[2]: True

@jreback jreback reopened this Dec 2, 2014
@jreback
Contributor

jreback commented Dec 11, 2014

going to bump this

@jreback jreback modified the milestones: 0.16.0, 0.15.2 Dec 11, 2014
@jreback jreback removed this from the 0.16.0 milestone Mar 6, 2015
@jreback jreback modified the milestones: Next Major Release, 0.16.0 Mar 6, 2015
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.16.0, Next Major Release Mar 8, 2015
@shoyer
Member

shoyer commented Mar 11, 2015

Another option (see #9629) is that the preferred way to check if a series is categorical should be hasattr(s, 'cat'). This will work with pandas 0.16 or newer and sidesteps the numpy comparison issues...
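A sketch of that approach: on pandas versions where the .cat accessor raises AttributeError for non-categorical data, hasattr swallows the error and returns False:

```python
import pandas as pd

s_cat = pd.Series(list('aab')).astype('category')
s_num = pd.Series([1, 2, 3])

# hasattr returns False when accessing s.cat raises AttributeError,
# so it doubles as a categorical check without touching the dtype.
print(hasattr(s_cat, 'cat'))  # True
print(hasattr(s_num, 'cat'))  # False
```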
