ENH: Add support for Categoricals in BlockManager #5313

Closed
jtratner opened this issue Oct 24, 2013 · 19 comments · Fixed by #7217
Labels: API Design; Categorical (Categorical Data Type); Internals (Related to non-user accessible pandas implementation); Performance (Memory or execution speed performance)

Comments

@jtratner
Contributor

tl;dr - add true support for Categoricals in NDFrame.

There was an issue on the mailing list about using cut and sorting the results that brought this to mind. The issue is that (I believe) a categorical loses its representation when you put it in a DataFrame, so the output of cut has to just be strings. I propose the following:

  1. Add a CategoricalBlock (or FactorBlock) internally that can handle categoricals like those produced from cut that could share most of MI's internals, as a 2D int ndarray with an associated list of indexes for each column (again, nearly the same as MI except most ops would be working on just one 'level' and underlying could/would be 2D rather than list of Int64Index). Probably also would mean abstracting common operations to a separate mixin class.
  2. Change Categorical to be a Series subclass with a SingleBlockManager that's a CategoricalBlock. This would not change its API, but it would gain Series methods.
  3. Add a to_categorical method to Series (bonus points if we change convert_objects to detect if there are < Some_Max number of labels and convert object dtypes to categoricals).
  4. Add a registration method to make_block so it iterates over a set of functions that either return a klass or None before falling back to ObjectBlock (so abstract existing else clause into a function and make the list of functions semi-public).

I'm going to work on this and I don't think it will be that difficult to implement, but it would make pandas more useful for representing level sets and other normalized data.

@ghost ghost assigned jtratner Oct 24, 2013
@jreback
Contributor

jreback commented Oct 24, 2013

+1

  1. just have to be careful, since this is done a lot, that we don't sacrifice performance here on the type inference (currently it's set up for most common types first, and the fast path skips this anyhow)

@jtratner
Contributor Author

Right - could even keep existing else and then iterate over additional
functions if they are there.

@jreback
Contributor

jreback commented Oct 24, 2013

obviously related to #4551 which is essentially the common op mixin - I may work on this

@jtratner
Contributor Author

Yeah, definitely similarities. I'm most interested in getting an efficient internal representation set up (and maybe changing MI to use the same setup of 2D int ndarray + list of indexes rather than list of int ndarray + list of indexes) and then building out. The nice thing is that, if you know that labels are sorted, it's trivial to get min and max, plus value_counts are the same as for int block with a mini wrapper around result.
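
A rough pure-Python sketch of that representation for a single column: integer codes plus a sorted list of levels. The helper names (`factorize`, `factor_min`, `factor_max`, `factor_value_counts`) are made up for illustration, and plain lists stand in for the int ndarray.

```python
from collections import Counter


def factorize(values):
    """Split values into integer codes and a sorted list of levels."""
    levels = sorted(set(values))
    position = {level: i for i, level in enumerate(levels)}
    codes = [position[v] for v in values]
    return codes, levels


def factor_min(codes, levels):
    # Levels are sorted, so the smallest code maps to the smallest value.
    return levels[min(codes)]


def factor_max(codes, levels):
    return levels[max(codes)]


def factor_value_counts(codes, levels):
    # Count the integer codes, then wrap the result by translating codes to levels.
    return {levels[code]: n for code, n in Counter(codes).items()}


codes, levels = factorize(["b", "a", "c", "a", "a"])
print(codes, levels)
print(factor_min(codes, levels), factor_max(codes, levels))
print(factor_value_counts(codes, levels))
```

Only the final translation step touches the levels; the reductions themselves work on integers.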

@jtratner
Contributor Author

@jseabold - anything from the statsmodels side on this? Not sure if there's anything on statsmodels wishlist for Categorical that we should keep in mind (I think you're one of the right people to ask).

@jtratner
Contributor Author

Heck, MI's from_arrays already uses Categorical anyways, so there's already quite a bit of overlap.

@jseabold
Contributor

This will be a welcome feature for us. We'll work to make any changes we need to support it, though I don't suspect we'll need to do anything.

@jtratner
Contributor Author

Yeah I wouldn't expect anything.

@jseabold
Contributor

The status quo won't change, but we will be able to catch categoricals and handle them specially now. This is only available through formulas now and if we get a DataFrame we more or less do np.asarray(df). I'll file a ticket for this.

@ghost

ghost commented Feb 4, 2014

Closed #6219 in favor of this, just noting here the large memory hit associated
with not storing categorical string columns as a factored data structure.
Just one more reason for doing it.

@jankatins
Contributor

Closed #6242 in favor of this issue.

#6242 has some "what should work with categoricals" (in the form of pseudo test code): #6242 (comment) and #6242 (comment)

"[regarding print(df) behaviour] I would think it would look the same as if you printed out Series(factor(...)), only the dtypes and so on would show that it is a factor."

@jreback
Contributor

jreback commented Feb 14, 2014

@jtratner going to be able to do this at some point?

@jtratner
Contributor Author

I'd like to, I'm just not sure how much time I have. We're releasing a new
version Monday, so I'm hoping I can set aside some time for pandas next
week :)


@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 28, 2014
@jankatins
Contributor

Any news on this? What is actually needed here? Is there any code I can have a look at and try to copy it?

@jreback
Contributor

jreback commented Mar 31, 2014

well it's an internal enhancement to support categorical as a real data type

a bit non-trivial

welcome to have a stab at it

@jankatins
Contributor

I tried to make sense of pandas.core.internals, but this seems to be too much for my pandas knowledge. As far as I can make out, such support would be quite difficult because each value needs to hold a representation (could be an int) and also needs access to metadata about all levels (1 -> "very supportive" and so on).

@jreback
Contributor

jreback commented May 23, 2014

ha....see #7217, which I just pushed.

This block is built on the Categorical class, internally using it as needed for dispatching, but externally representing in the long form, similar to the way it was done for Sparse features.

What I could really use would be some examples of uses...can you provide some?
e.g. simple operations for testing and such

@jreback
Contributor

jreback commented May 23, 2014

@JanSchulz

it seems to me that an op like + doesn't mean anything for categoricals? (as well as the rest of the numeric ops) (going to raise TypeError in case you try it).

however things like min/max are ok

can you give me an example of some ops? thanks

@jankatins
Contributor

As far as I understand R's factors (https://www.stat.berkeley.edu/classes/s133/factors.html), all numeric operations should fail. If you want to do them, you first have to convert the factor to numeric:

> fert = c(10,20,20,50,10,20,10,50,20)
> fert = factor(fert,levels=c(10,20,50),ordered=TRUE)
> fert
[1] 10 20 20 50 10 20 10 50 20
Levels: 10 < 20 < 50

# If we wished to calculate the mean of the original numeric values of the fert variable, we would have to convert the values using the levels function: 
> mean(fert)
[1] NA
Warning message:
argument is not numeric or logical: 
      returning NA in: mean.default(fert)
> mean(as.numeric(levels(fert)[fert]))
[1] 23.33333

Also, min/max are only defined when the factor is ordered:

> fert = c(10,20,20,50,10,20,10,50,20)
> min(fert)
[1] 10
> fert = factor(fert,levels=c(10,20,50))
> min(fert)
Error in Summary.factor(c(1L, 2L, 2L, 3L, 1L, 2L, 1L, 3L, 2L), na.rm = FALSE) : 
  'min' not meaningful for factors
> fert = factor(fert,levels=c(10,20,50),ordered=TRUE)
> min(fert)
[1] 10
Levels: 10 < 20 < 50
> fert = factor(fert,levels=c(50,20,10),ordered=TRUE)
> min(fert)
[1] 50
Levels: 50 < 20 < 10
> fert = factor(fert,levels=c(0,20,10),ordered=TRUE)
> min(fert)
[1] <NA>
Levels: 0 < 20 < 10
> max(fert)
[1] <NA>
Levels: 0 < 20 < 10
> fert
[1] 10   20   20   <NA> 10   20   10   <NA> 20  
Levels: 0 < 20 < 10

If you specify labels, the original values are lost:

> fert = c(10,20,20,50,10,20,10,50,20)
> fert = factor(fert,levels=c(10,20,50), labels=c("I","II","III"),ordered=TRUE)
> min(fert)
[1] I
Levels: I < II < III
> as.numeric(levels(fert)[fert])
[1] NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion 

Interestingly, each entry keeps a reference to the levels:

> fert
[1] I   II  II  III I   II  I   III II 
Levels: I < II < III
> fert[4]
[1] III
Levels: I < II < III
> levels(fert[4])
[1] "I"   "II"  "III"

So translating that to pandas

s = pd.Series([1,2,3,1])
c = s.astype('category')
c+1 # runtime error or Series([nan, nan, nan, nan]), R returns NAs and prints a warning
# I would vote for error, as nans just hide the problem
c.astype('int') == s # True
c.min() # fails, not ordered
c.max() # fails, not ordered
co = pd.Categorical(s, levels=[1,2,3], ordered=True)
co+1 # still fails
co.min() == 1 # True
co.max() == 3 # True
# There should probably also be a method to convert to an ordered factor:
cr = s.to_ordered(levels=[3,2,1]) # reversed
cr.min() == 3 # True
cr.max() == 1 # True
cnan = pd.Categorical(s, levels=[1,5,3], ordered=True) # 2 not a level
cnan.values == [1,nan,3,1]
cnan.levels == [1,5,3]
cnan.min() == 1 # pandas ignores nan in min/max
croman = pd.Categorical(s, levels=[1,2,3], labels=["I","II","III"]) 
croman.astype(int) # fails or Series([nan, nan, nan, nan]); 
# R returns NAs and prints a coercion warning
croman[2] == "III"
# not sure how a single value should behave: is it in this case a string or 
# does it have some reference to the levels?

Note that here c.values is a "translated" list of values, not the pointer list it previously was in #6242 (comment) (changed that to val_pointer). As far as I understand pandas, series.values is the underlying data, so that should be the "translated" value: c.values == [c.labels[x] for x in c.val_pointer]
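
To make the proposed semantics concrete, here is a small standard-library sketch of a factor that behaves the way the pseudo test code above asks for. `SimpleFactor` is a hypothetical class written for this illustration, not the pandas implementation; using -1 as the code for a value that is not a level is an assumption of the sketch, and `codes` plays the role of `val_pointer`.

```python
import math


class SimpleFactor:
    """Hypothetical minimal factor: integer codes plus a list of levels."""

    def __init__(self, values, levels, ordered=False):
        self.levels = list(levels)
        self.ordered = ordered
        position = {level: i for i, level in enumerate(self.levels)}
        # Assumption: -1 marks a value that is not one of the levels.
        self.codes = [position.get(v, -1) for v in values]

    @property
    def values(self):
        # The "translated" view: each code mapped back to its level, NaN if missing.
        return [self.levels[c] if c >= 0 else math.nan for c in self.codes]

    def __add__(self, other):
        # Raise instead of returning NaNs, which would hide the problem.
        raise TypeError("numeric operations are not defined for a factor")

    def _reduce(self, func, name):
        if not self.ordered:
            raise TypeError(f"{name} is not defined for an unordered factor")
        valid = [c for c in self.codes if c >= 0]  # skip missing values
        return self.levels[func(valid)] if valid else math.nan

    def min(self):
        return self._reduce(min, "min")

    def max(self):
        return self._reduce(max, "max")


co = SimpleFactor([1, 2, 3, 1], levels=[1, 2, 3], ordered=True)
cr = SimpleFactor([1, 2, 3, 1], levels=[3, 2, 1], ordered=True)  # reversed
cnan = SimpleFactor([1, 2, 3, 1], levels=[1, 5, 3], ordered=True)  # 2 not a level
print(co.min(), co.max(), cr.min(), cr.max(), cnan.values)
```

Since min and max compare codes rather than the level values, the order of `levels` alone decides the result, which is why the reversed factor reports 3 as its minimum.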


4 participants