[FEA] Deprecate label_encoding method in favor of categorical columns #8608

Closed
lastephey opened this issue Jun 25, 2021 · 6 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers), Python (Affects Python cuDF API)

Comments

@lastephey

Dear cuDF developers,

Is your feature request related to a problem? Please describe.

My feature request is for cuDF to add the ability for the label_encoding function to encode string data. My specific use-case is to encode string values as ints.

Describe the solution you'd like

I would like to use the label_encoding function to encode some values that are currently strings as ints. My larger goal is to use these data in CuPy's histogram2d, but since CuPy does not yet support string arrays, I was hoping that label encoding would help me get around this problem.

Describe alternatives you've considered

I have also tried the cuDF replace function for this, but it looks like it doesn't support encoding strings as ints either:

to_replace and value should be of same types,got to_replace dtype: object and value dtype: int64

I'd be equally happy with replace supporting the ability to replace objects of different types.

Additional context

If there are better ways to accomplish this task, I would be grateful for your suggestions.

Thank you very much,
Laurie

@lastephey lastephey added Needs Triage Need team to review and classify feature request New feature or request labels Jun 25, 2021
@davidwendt
Contributor

What does the string data look like? You may be able to use the subword_tokenize function to assign integers to words.

If you want to replace an entire string row with an integer, you could use replace with strings that contain integers and then cast the result to int64:

>>> import cudf
>>> s = cudf.Series(['a','b','a','c'])
>>> s.str.replace(['a','b','c'],['1','2','3'], regex=False).astype('int64')
0    1
1    2
2    1
3    3
dtype: int64

@beckernick
Member

In addition to @davidwendt's suggestion, you could do this in a couple of other ways that avoid enumerating your categories (the former probably being the preferred method in pandas):

import cudf
s = cudf.Series(["a","b","c"])
s.astype("category").cat.codes
0    0
1    1
2    2
dtype: uint8

import cudf
from cuml.preprocessing import LabelEncoder
s = cudf.Series(["a","b","c"])

le = LabelEncoder()
le.fit(s)
le.transform(s)
0    0
1    1
2    2
dtype: uint8
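
For anyone wondering what the categorical route computes, here is a minimal pure-Python sketch of the idea: collect the distinct values, sort them into a category list, and replace each value with its position in that list. The helper name `label_encode` is hypothetical and is not the cuDF method; the sketch assumes the values are sortable and that there are no missing values, and it says nothing about how cuDF implements this on the GPU.

```python
def label_encode(values):
    """Return (codes, categories) for a list of sortable, non-missing values.

    Hypothetical helper for illustration only; not part of cuDF or pandas.
    """
    # Distinct values in sorted order become the category list.
    categories = sorted(set(values))
    # Each category's position in that list is its integer code.
    code_of = {cat: i for i, cat in enumerate(categories)}
    codes = [code_of[v] for v in values]
    return codes, categories


codes, categories = label_encode(["b", "a", "b", "c"])
print(codes)       # [1, 0, 1, 2]
print(categories)  # ['a', 'b', 'c']

# The category list is enough to map the codes back to the original strings.
decoded = [categories[c] for c in codes]
print(decoded)     # ['b', 'a', 'b', 'c']
```

Keeping the category list around is what makes the encoding reversible, which is handy if the integer codes are only needed temporarily (for example, as input to a numeric routine).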

cc @shwina: perhaps we should consider deprecating label_encoding, as the core functionality is generally covered by categorical columns and cuML's LabelEncoder. From the git blame, it dates back to the early pyGDF days years ago. I don't think this method is supported in pandas.

@beckernick beckernick added bug Something isn't working Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jun 25, 2021
@lastephey
Author

Thank you @davidwendt and @beckernick for your suggestions. I think they all sound like they will do what I need. The strings I need to replace are all values in a single column.

I haven't had time to try these out yet but I just wanted to say thank you and that I really appreciate your quick answers and advice.

@lastephey
Author

Ok, I've had some time to read about and test the options you suggested.

  1. subword_tokenize would definitely work for what I need to do, but there's some overhead to set this up.
  2. Using replace as @davidwendt suggested works well, but it does result in this warning for me using cuDF 21.08.00a:
/opt/miniconda3/lib/python3.8/site-packages/cudf/core/column/string.py:945: UserWarning: `n` parameter is not supported when `pat` and `repl` are list-like inputs
  warnings.warn(
  3. Using categorical data is the simplest and my favorite. I didn't even know about this functionality within Pandas, so thank you very much for the pointer.
  4. The cuML solution also seems fine, although in my case I'm not using cuML so I don't want to add this extra dependency unless I have to.

Thank you both again for your help and suggestions. I really appreciate it. My question is resolved. Should I close this, or would you like me to keep it open for issue tracking?

@beckernick
Member

Glad this is resolved. Let's keep this open and edit the title to be a discussion around deprecating label_encoding.

@beckernick beckernick changed the title [FEA] enable label_encoding function to encode string values [FEA] Deprecate label_encoding method in favor of categorical columns Jun 28, 2021
@beckernick beckernick added tech debt and removed feature request New feature or request labels Jun 28, 2021
@beckernick beckernick added the good first issue Good for newcomers label Sep 8, 2021
rapids-bot bot pushed a commit that referenced this issue Sep 24, 2021
This PR addresses issue #8608 by adding a deprecation warning before we remove the functionality entirely.

Authors:
  - Mayank Anand (https://github.com/mayankanand007)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #9289
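
As a rough illustration of the "warn first, remove later" pattern the commit message describes, here is a minimal sketch using Python's standard `warnings` module. Everything in it is an assumption made for illustration: the function name `legacy_label_encoding`, its signature, the warning class, and the message text are hypothetical and are not taken from the actual PR.

```python
import warnings


def legacy_label_encoding(values, cats):
    """Hypothetical stand-in for a method that is being deprecated."""
    # Warn on every call so users learn about the replacement before removal.
    warnings.warn(
        "legacy_label_encoding is deprecated and will be removed in a "
        "future release; convert to a categorical type and use its codes "
        "instead.",
        FutureWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    # The old behavior keeps working during the deprecation period.
    code_of = {cat: i for i, cat in enumerate(cats)}
    return [code_of.get(v, -1) for v in values]


# Capture the warning to confirm it is raised while the result is unchanged.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = legacy_label_encoding(["a", "b", "a"], cats=["a", "b"])

print(result)                       # [0, 1, 0]
print(caught[0].category.__name__)  # FutureWarning
```

The point of the intermediate step is that existing code keeps running for a release or two while callers see a clear pointer to the replacement.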
@vyasr
Contributor

vyasr commented Jul 15, 2022

This was done in #9535 and #9289

@vyasr vyasr closed this as completed Jul 15, 2022
4 participants