[FEA] Deprecate one_hot_encoding in favor of get_dummies #9330

Closed
isVoid opened this issue Sep 28, 2021 · 0 comments · Fixed by #9435

Labels
feature request (New feature or request), Python (Affects Python cuDF API.)

Comments

isVoid commented Sep 28, 2021

Is your feature request related to a problem? Please describe.
This is a general pandas API alignment request. pandas does not provide a one_hot_encoding API, and its get_dummies API offers no way to configure which categories are included in the final encoding matrix. While implementing one_hot_encode in libcudf, the majority agreed that encoding only a subset of the categories is a rare use case. Deprecating support for specifying the categories from Python is therefore considered reasonable and should reduce the complexity of get_dummies.
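For illustration, here is a minimal sketch in pandas (not cuDF) of what this means in practice. `get_dummies` encodes every distinct value, and a caller who really needs only a subset of categories can still get one by reindexing the result. The sample data and the `wanted` list are made up for this example.

```python
import pandas as pd

# Made-up sample data for illustration.
colors = pd.Series(["red", "green", "blue", "red"], name="color")

# get_dummies has no parameter for choosing categories:
# it emits one indicator column per distinct value.
full = pd.get_dummies(colors, prefix="color")

# Workaround for the rare subset case: keep only the wanted columns.
# reindex also creates an all-zero column for a category absent from the data.
wanted = ["color_red", "color_green"]
subset = full.reindex(columns=wanted, fill_value=0)

print(full.columns.tolist())
print(subset.astype(int))
```

Because the subset can be recovered after the fact, dropping the category argument removes a code path without removing a capability.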

Describe the solution you'd like
Add a deprecation warning to series.one_hot_encoding and dataframe.one_hot_encoding in this release. Remove both in the next release and implement cudf.get_dummies directly on top of libcudf's one_hot_encode.
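The deprecation step could look roughly like the sketch below. This is not the cuDF implementation: both functions are hypothetical pure-Python stand-ins operating on plain lists, and the warning text and the choice of `FutureWarning` are assumptions made for the example.

```python
import warnings


def get_dummies(values):
    """Hypothetical stand-in: one indicator list per distinct value."""
    categories = sorted(set(values))
    return {cat: [int(v == cat) for v in values] for cat in categories}


def one_hot_encoding(values, cats):
    """Hypothetical stand-in for the deprecated API with explicit categories."""
    # Warn on every call; stacklevel=2 attributes the warning to the caller.
    warnings.warn(
        "one_hot_encoding is deprecated and will be removed in a future "
        "release. Use get_dummies instead.",
        FutureWarning,
        stacklevel=2,
    )
    return {cat: [int(v == cat) for v in values] for cat in cats}


# Old call: still works for one release, but emits a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_result = one_hot_encoding(["a", "b", "a"], cats=["a"])

# New call: no category argument, every distinct value is encoded.
new_result = get_dummies(["a", "b", "a"])
```

Keeping the old entry point functional for one release while it warns gives callers time to migrate before the removal.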

Describe alternatives you've considered
N/A

Additional context
Related: #8608

@isVoid added the feature request and Needs Triage labels Sep 28, 2021
@beckernick added the Python label and removed the Needs Triage label Oct 5, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 14, 2021
…mmies` with Cython API (#9435)

Closes #9330 

This PR adds a deprecation warning to `one_hot_encoding` and implements `get_dummies` directly with the Cython API. In a benchmark of a simple one-hot encoding of a DataFrame, Python overhead is roughly halved:

```
-------------------------------- benchmark 'None': 2 tests --------------------------------
Name (time in ms)                       Min                Max               Mean          
-------------------------------------------------------------------------------------------
get_dummies_simple[None] (afte)      7.7220 (1.0)      11.6960 (1.0)       7.8378 (1.0)    
get_dummies_simple[None] (befo)     15.3472 (1.99)     15.8865 (1.36)     15.4148 (1.97)   
-------------------------------------------------------------------------------------------

-------------------------------- benchmark 'pre': 2 tests --------------------------------
Name (time in ms)                      Min                Max               Mean          
------------------------------------------------------------------------------------------
get_dummies_simple[pre] (afte)      7.6924 (1.0)      12.7497 (1.0)       7.7758 (1.0)    
get_dummies_simple[pre] (befo)     15.3385 (1.99)     19.3915 (1.52)     15.4682 (1.99)   
------------------------------------------------------------------------------------------
```

<details>

<summary> Data Setup </summary>

```python
import cudf

# Benchmark input: an integer column, a string column, and a categorical column.
df = cudf.DataFrame(
    {
        'col1': list(range(10)),
        'col2': list('abcdefghij'),
        'col3': cudf.Series(list(range(100, 110)), dtype='category'),
    }
)
```

</details>

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Sheilah Kirui (https://github.com/skirui-source)

URL: #9435