ENH: Add support for Categoricals in BlockManager #5313
Comments
+1
|
Right - could even keep existing else and then iterate over additional |
obviously related to #4551 which is essentially the common op mixin - I may work on this |
Yeah, definitely similarities. I'm most interested in getting an efficient internal representation set up (and maybe changing MI to use the same setup of 2D int ndarray + list of indexes rather than list of int ndarray + list of indexes) and then building out. The nice thing is that, if you know that labels are sorted, it's trivial to get min and max, plus value_counts are the same as for int block with a mini wrapper around result. |
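To make the point about sorted labels concrete, here is a minimal sketch using plain NumPy and public pandas calls. It is not code from this thread or from pandas internals; the toy data and the variable names (values, labels, codes) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a column of repeated string labels.
values = np.array(["b", "a", "c", "a", "b", "a"])

# np.unique returns the sorted unique labels and, with return_inverse=True,
# the integer code of each value within that sorted label array.
labels, codes = np.unique(values, return_inverse=True)

# Because the labels are sorted, the smallest code maps to the smallest label
# and the largest code maps to the largest label.
smallest = labels[codes.min()]
largest = labels[codes.max()]

# value_counts reduces to counting integers, plus a thin wrapper that
# re-attaches the labels as the index of the result.
counts = pd.Series(np.bincount(codes, minlength=len(labels)), index=labels)
```

The only label lookups happen at the very end; all the actual work is done on the integer codes, which is what makes an int-backed block attractive.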
@jseabold - anything from the statsmodels side on this? Not sure if there's anything on statsmodels wishlist for Categorical that we should keep in mind (I think you're one of the right people to ask). |
Heck, MI's from_arrays already uses Categorical anyways, so there's already quite a bit of overlap. |
This will be a welcome feature for us. We'll work to make any changes we need to support it, though I don't suspect we'll need to do anything. |
Yeah I wouldn't expect anything. |
The status quo won't change, but we will be able to catch categoricals and handle them specially now. This is only available through formulas now and if we get a DataFrame we more or less do np.asarray(df). I'll file a ticket for this. |
Closed #6219 in favor of this, just noting here the large memory hit associated |
Closed #6242 in favor of this issue. #6242 has some "what should work with categoricals" (in the form of pseudo test code): #6242 (comment) and #6242 (comment) "[regarding |
@jtratner going to be able to do this at some point? |
I'd like to, I'm just not sure how much time I have. We're releasing a new
|
Any news on this? What is actually needed here? Is there any code I can have a look at and try to copy it? |
Well, it's an internal enhancement to support categorical as a real data type. It's a bit non-trivial, but you're welcome to have a stab at it. |
I tried to make sense of pandas.core.internals but this seems to be too much for my pandas knowledge. As far as I can make out, such support would be quite difficult because each value needs to hold both a representation (could be |
ha....see #7217, which I just pushed. This block is built on the What I could really use would be some examples of uses...can you provide some? |
it seems to me that an op like however things like can you give me an example of some ops? thanks |
As far as I understand R's factors (https://www.stat.berkeley.edu/classes/s133/factors.html), all numeric operations should fail. If you want to do them, you first have to convert them to numeric:
Also, min/max are only defined when the factor is ordered:
If you specify labels, the original values are lost:
Interestingly, each entry keeps a reference to the levels:
So translating that to pandas
Note that here |
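The code snippets from the comment above are not preserved here. As a stand-in, the following sketch shows the same factor-like semantics (arithmetic fails, min/max only work on ordered categoricals) using the categorical dtype in present-day pandas. It is an illustration under the assumption of a recent pandas release, not the commenter's original code; the helper raises_type_error is hypothetical.

```python
import pandas as pd


def raises_type_error(func):
    """Hypothetical helper: report whether calling func raises TypeError."""
    try:
        func()
    except TypeError:
        return True
    return False


# An unordered categorical whose categories happen to be integers.
s = pd.Series([3, 1, 2, 1], dtype="category")

# Numeric operations are rejected on the categorical itself...
add_fails = raises_type_error(lambda: s + 1)
# ...and so is min(), because the categories carry no order.
min_fails = raises_type_error(lambda: s.min())

# Converting to a numeric dtype first makes arithmetic work again.
as_numbers = s.astype("int64") + 1

# With an explicit category order, min and max become well defined.
ordered = pd.Series(
    pd.Categorical(
        ["low", "high", "medium", "low"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)
ordered_min = ordered.min()
ordered_max = ordered.max()
```

Note that the ordering comes from the declared category order, not from sorting the strings: lexically, "high" would sort before "low".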
tl;dr - add true support for Categoricals in NDFrame.
There was an issue on the mailing list about using cut and sorting the results that brought this to mind. The issue is that (I believe) a categorical loses its representation when you put it in a DataFrame, so the output of cut has to just be strings. I propose the following:
- A CategoricalBlock (or FactorBlock) internally that can handle categoricals like those produced from cut that could share most of MI's internals, as a 2D int ndarray with an associated list of indexes for each column (again, nearly the same as MI except most ops would be working on just one 'level' and underlying could/would be 2D rather than list of Int64Index). Probably also would mean abstracting common operations to a separate mixin class.
- Categorical to be a Series subclass with a SingleBlockManager that's a CategoricalBlock. This would not change its API, but it would gain Series methods.
- A to_categorical method on Series (bonus points if we change convert_objects to detect if there are < Some_Max number of labels and convert object dtypes to categoricals).

I'm going to work on this and I don't think it will be that difficult to implement, but it would make pandas more useful for representing level sets and other normalized data.
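A rough sketch of the proposed layout may help: one 2D integer array of codes plus one index of labels per column, and a threshold-based conversion of object columns. Everything below is hypothetical illustration; ToyCategoricalBlock and to_categorical_if_few_labels are invented names, not pandas internals or pandas API, and missing values are assumed absent.

```python
import numpy as np
import pandas as pd


class ToyCategoricalBlock:
    """Hypothetical stand-in for the proposed block; not pandas' internals."""

    def __init__(self, codes, levels):
        # codes has shape (n_columns, n_rows); row i holds the codes of column i.
        self.codes = np.asarray(codes, dtype=np.int64)
        # One Index of labels per column, as in the proposal.
        self.levels = [pd.Index(level) for level in levels]

    @classmethod
    def from_columns(cls, columns):
        # Assumes equal-length columns without missing values.
        codes, levels = [], []
        for column in columns:
            col_codes, col_levels = pd.factorize(
                np.asarray(column, dtype=object), sort=True
            )
            codes.append(col_codes)
            levels.append(col_levels)
        return cls(np.vstack(codes), levels)

    def column_values(self, i):
        # Map the integer codes of column i back to its labels.
        return self.levels[i].take(self.codes[i])


def to_categorical_if_few_labels(series, max_labels=50):
    """Hypothetical helper: convert an object Series that has few distinct labels."""
    if series.dtype == object and series.nunique() < max_labels:
        return series.astype("category")
    return series


block = ToyCategoricalBlock.from_columns([["b", "a", "b"], ["x", "y", "x"]])
raw = pd.Series(["u", "v", "u"], dtype=object)
converted = to_categorical_if_few_labels(raw, max_labels=3)
unchanged = to_categorical_if_few_labels(raw, max_labels=2)
```

The max_labels default of 50 is an arbitrary placeholder for the Some_Max threshold mentioned in the proposal; the issue does not specify a value.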