ENH: get_dummies on DataFrames #8133

Closed
TomAugspurger opened this issue Aug 28, 2014 · 4 comments · Fixed by #8140
Labels
API Design · Categorical (Categorical Data Type) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Comments

@TomAugspurger
Contributor

get_dummies currently just expects a Series.

In [17]: data
Out[17]: 
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   

   Parch     Ticket     Fare Cabin Embarked  
0      0  A/5 21171   7.2500   NaN        S  
1      0   PC 17599  71.2833   C85        C  

If it took DataFrames we could change the required call from

features = pd.concat([data.get(['Fare', 'Age']),
                      pd.get_dummies(data.Sex, prefix='Sex'),
                      pd.get_dummies(data.Pclass, prefix='Pclass'),
                      pd.get_dummies(data.Embarked, prefix='Embarked')],
                     axis=1)

to

features = pd.get_dummies(data)

We'll infer that things with object dtype need to be encoded as 0's and 1's, but also take arguments to explicitly encode a column, or not.
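A rough sketch of the inferred versus explicit behaviour, using a tiny stand-in for the Titanic frame above. The `columns` keyword is what released pandas versions use for the explicit case; the proposal here doesn't name the argument, so treat that spelling as an assumption about where the API ended up.

```python
import pandas as pd

# Small stand-in for the Titanic frame shown earlier (values are illustrative).
data = pd.DataFrame({
    "Fare": [7.25, 71.2833],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Embarked": ["S", "C"],
})

# Inferred: only string/object columns are encoded; numeric columns pass through.
inferred = pd.get_dummies(data)

# Explicit: also encode the integer-coded Pclass by naming the columns.
explicit = pd.get_dummies(data, columns=["Pclass", "Sex", "Embarked"])

print(list(inferred.columns))
print(list(explicit.columns))
```

In the inferred call `Pclass` stays a single numeric column, which is why an explicit opt-in is useful for integer-coded categories.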

The column names in the output will automatically include the original column name as a prefix, which can be overridden by the prefix kwarg by passing a list or dictionary.

Same thing with prefix separators.
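For illustration, overriding the prefix and separator might look like the sketch below. The prefixes `s` and `port` and the `-` separator are made-up values; `prefix` and `prefix_sep` are the keyword names in released pandas.

```python
import pandas as pd

data = pd.DataFrame({
    "Fare": [7.25, 71.2833],
    "Sex": ["male", "female"],
    "Embarked": ["S", "C"],
})

# Dictionary form: map each encoded column to the prefix it should get.
by_dict = pd.get_dummies(
    data,
    prefix={"Sex": "s", "Embarked": "port"},
    prefix_sep="-",
)

# List form: one prefix per encoded column, in column order.
by_list = pd.get_dummies(data, prefix=["s", "port"], prefix_sep="-")

print(list(by_dict.columns))
```

The list form has to supply exactly one prefix per column being encoded, so the dictionary form is usually the less error-prone of the two.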

On NaN handling, I think we'll have one {prefix}_NaN output column per original column when dummy_na is True.
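A small sketch of the NaN case, assuming the `dummy_na` keyword behaves on a DataFrame the way it does in released pandas. Note that the indicator column there is labelled with a lowercase `nan` rather than `NaN`, as far as I know, so the code below avoids hard-coding the name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Fare": [7.25, 71.2833],
    "Cabin": [np.nan, "C85"],
    "Embarked": ["S", "C"],
})

# dummy_na=True adds one missing-value indicator per encoded column,
# even for columns (like Embarked here) that contain no missing values.
encoded = pd.get_dummies(data, dummy_na=True)

cabin_cols = [c for c in encoded.columns if str(c).startswith("Cabin_")]
embarked_cols = [c for c in encoded.columns if str(c).startswith("Embarked_")]

print(cabin_cols)
print(embarked_cols)
```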

I've got some tests written already.

@TomAugspurger
Contributor Author

Ha, something like this already exists! convert_dummies in pandas/core/reshape, but it isn't exported under the pd. namespace, and I didn't find it in the documentation.

I'll think about whether to adjust that function at all, or just document it as is. I think the defaults can be improved a bit (which would be API changing), but I wonder if this function is ever used.

@TomAugspurger
Contributor Author

Actually, what I have in mind should be backwards compatible. It's changing a positional argument to a keyword argument, so we should be fine.

Turns out it was Wes who wrote this originally.

@jreback
Contributor

jreback commented Aug 28, 2014

doesn't look like convert_dummies is used anywhere (internal/external).

so you can go ahead and integrate it with get_dummies for the functionality described above (which is prob more useful)

@jorisvandenbossche
Member

also not a single mention of convert_dummies on SO. I would also just integrate it into get_dummies with the API we want, instead of adding (or better publicizing) another function.
