Pandas get_dummies() and n-1 Categorical Encoding Option to avoid Collinearity? #12042
Comments
Sounds good, interested in submitting a pull request?
👍
cldy pushed a commit to cldy/pandas that referenced this issue Feb 11, 2016
…iables out of n levels. closes pandas-dev#12042

Some times it's useful to only accept n-1 variables out of n categorical levels.

Author: Bran Yang <[email protected]>

Closes pandas-dev#12092 from BranYang/master and squashes the following commits:

0528c57 [Bran Yang] Compare with empty DataFrame, not just check empty
0d99c2a [Bran Yang] Test the case that `drop_first` is on and categorical variable only has one level.
45f14e8 [Bran Yang] ENH: GH12042 Add parameter `drop_first` to get_dummies to get k-1 variables out of n levels.
It would be advantageous to allow dropping a specific value, not just the 'first'. The omitted category (reference group) influences the interpretation of coefficients. For example, one best practice is to omit the 'largest' value as the reference category.
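A possible workaround, sketched under assumptions: encode every level and then drop the chosen reference column by hand. Here 'largest' is assumed to mean the most frequent level, and the `color` data is made up purely for illustration.

```python
import pandas as pd

# Hypothetical data; "green" is the most frequent level.
s = pd.Series(["red", "green", "blue", "green"], name="color")

# Assumption: the "largest" category is the one with the most rows.
reference = s.value_counts().idxmax()

# Encode all levels, then drop the reference level explicitly.
encoded = pd.get_dummies(s, dtype=int).drop(columns=reference)
print(reference)
print(encoded)
```

Rows belonging to the reference level end up as all zeros, so the remaining coefficients are read relative to that level.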
When doing linear regression and encoding categorical variables, perfect collinearity can be a problem. To get around this, the suggested approach is to use n-1 columns. It would be useful if `pd.get_dummies()` had a boolean parameter that returns n-1 columns for each categorical column that gets encoded.

Example:
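The original example is not preserved here, so the following is only a minimal sketch of the default behavior, using a made-up `color` column. It shows that the n dummy columns always sum to 1 per row, which makes them perfectly collinear with an intercept column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with n = 3 levels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Default encoding: one column per level (n columns).
dummies = pd.get_dummies(df["color"], dtype=int)
print(dummies)

# Design matrix with an intercept: the intercept equals the sum of the dummies.
X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)  # number of columns vs. rank
```

The rank coming out lower than the number of columns is the collinearity problem described above.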
Instead, I'd like to have some parameter such as `drop_first=True` in `get_dummies()` and it does something like this:

Sources
http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
http://stackoverflow.com/questions/31498390/how-to-get-pandas-get-dummies-to-emit-n-1-variables-to-avoid-co-lineraity
http://dss.princeton.edu/online_help/analysis/dummy_variables.htm
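For reference, a minimal sketch of the `drop_first=True` behavior requested above, as it works in pandas versions that include the parameter (it was added in response to this issue). The `color` column is again made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with n = 3 levels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first=True removes the first level, leaving n-1 columns.
reduced = pd.get_dummies(df["color"], drop_first=True, dtype=int)
print(reduced)

# With an intercept added, the design matrix now has full column rank.
X = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)
```

The dropped level ("blue" here, the first in sorted order) becomes the reference group and is represented by a row of zeros.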