[FEA] Supporting get_feature_names for TfidfVectorizer #4219

Closed
mayankanand007 opened this issue Sep 21, 2021 · 2 comments
Assignees
Labels
doc Documentation inactive-30d

Comments

@mayankanand007
Contributor

Is your feature request related to a problem? Please describe.
I'd like TfidfVectorizer to offer the same get_feature_names() functionality that CountVectorizer already has.

Describe the solution you'd like

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

# Output: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Describe alternatives you've considered
vec.vocabulary_ is an available attribute that returns an array mapping feature integer indices to feature names. However, since sklearn provides this method, it would be great to have it in cuML as well.
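
For comparison, here is a minimal sketch of how an ordered list of feature names can be recovered from a vocabulary by hand. It assumes a scikit-learn-style vocabulary_, i.e. a dict mapping each term to its column index; the dict literal and the helper name feature_names_from_vocabulary are made up for illustration and are not part of either library. In cuML the attribute may have a different container type, so treat this only as an outline of the idea.

```python
# Hypothetical term -> column-index mapping in the scikit-learn style.
vocabulary = {
    "this": 8, "is": 3, "the": 6, "first": 2, "document": 1,
    "second": 5, "and": 0, "third": 7, "one": 4,
}


def feature_names_from_vocabulary(vocab):
    """Return the terms ordered by their column index."""
    return [term for term, _ in sorted(vocab.items(), key=lambda item: item[1])]


feature_names = feature_names_from_vocabulary(vocabulary)
print(feature_names)
```

A built-in get_feature_names() removes the need for this kind of helper, which is the motivation for the request.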

Additional context
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://docs.rapids.ai/api/cuml/stable/api.html#cuml.feature_extraction.text.TfidfVectorizer

@mayankanand007 added the "? - Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels Sep 21, 2021
@mayankanand007
Contributor Author

After looking through the codebase, I realized that because TfidfVectorizer inherits from the CountVectorizer class, it also inherits that class's implementation of .get_feature_names(). A quick example confirmed that it works:

from cuml.feature_extraction.text import TfidfVectorizer
import cudf

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
vectorizer = TfidfVectorizer()
corpus = cudf.Series(corpus)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

# output: 
# 0         and
# 1    document
# 2       first
# 3          is
# 4         one
# 5      second
# 6         the
# 7       third
# 8        this
# Name: token, dtype: object

However, I feel we should still add it to the documentation, since the functionality is already present (just as sklearn added here).
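
The inheritance behaviour described above can be illustrated without a GPU. The sketch below uses two stand-in classes (CountVectorizerStandIn and TfidfVectorizerStandIn, both hypothetical and not the cuML implementations) to show how a subclass picks up get_feature_names() from its parent, and how to check which class actually defines the method.

```python
class CountVectorizerStandIn:
    """Hypothetical stand-in for a count vectorizer; not the cuML class."""

    def __init__(self, vocabulary):
        # Store the vocabulary in sorted order, one entry per feature column.
        self.vocabulary_ = sorted(vocabulary)

    def get_feature_names(self):
        return list(self.vocabulary_)


class TfidfVectorizerStandIn(CountVectorizerStandIn):
    """Hypothetical subclass that defines no get_feature_names of its own."""


vectorizer = TfidfVectorizerStandIn(["first", "document", "and"])

# The method is usable on the subclass even though it is not defined there.
defined_locally = "get_feature_names" in vars(TfidfVectorizerStandIn)

# Walk the method resolution order to find the class that provides it.
defining_class = next(
    cls.__name__
    for cls in TfidfVectorizerStandIn.__mro__
    if "get_feature_names" in vars(cls)
)

print(vectorizer.get_feature_names(), defined_locally, defining_class)
```

The same vars() and __mro__ checks should apply to any pair of Python classes, which is one way to confirm where an undocumented method comes from.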

@VibhuJawa added the "doc" (Documentation) label and removed the "? - Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels Sep 22, 2021
rapids-bot bot pushed a commit that referenced this issue Sep 27, 2021
This PR resolves issue #4219 by adding docs for `.get_feature_names()` in the `TfidfVectorizer` class.

As mentioned in the linked issue, the method already exists in `CountVectorizer` and `TfidfVectorizer` inherits from that class, hence the functionality is present but not documented.

Authors:
  - Mayank Anand (https://github.com/mayankanand007)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #4226
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this issue Oct 9, 2023
…dsai#4226)
