[FEA] Supporting `get_feature_names` for `TfidfVectorizer` #4219

mayankanand007 · 2021-09-21T19:41:15Z

Is your feature request related to a problem? Please describe.
I would like TfidfVectorizer to provide get_feature_names(), the same way CountVectorizer already does.

Describe the solution you'd like

from cuml.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

# Output: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Describe alternatives you've considered
vec.vocabulary_ is available as an attribute and returns an array mapping feature integer indices to feature names. However, since sklearn provides this method, it would be great to have it in cuML as well.
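
To make the vocabulary_ alternative concrete, here is a minimal sketch of recovering ordered feature names from a vocabulary. The helper name feature_names_from_vocabulary is hypothetical and is not part of cuML or sklearn. It assumes the vocabulary is either a term-to-index dict (the form sklearn's vocabulary_ takes) or a plain Python sequence already ordered by feature index, as described above; a GPU-backed container would likely need converting to a host-side list first, which this sketch does not attempt.

```python
from collections.abc import Mapping


def feature_names_from_vocabulary(vocabulary):
    """Hypothetical helper: return feature names ordered by feature index.

    Accepts either a term -> index mapping or a sequence of names that is
    already ordered by feature index.
    """
    if isinstance(vocabulary, Mapping):
        # Sort the (term, index) pairs by index and keep only the terms.
        return [term for term, _ in sorted(vocabulary.items(), key=lambda kv: kv[1])]
    # Assumption: a sequence is already in feature-index order.
    return list(vocabulary)


# Term -> index mapping, deliberately given out of order.
dict_style = {"document": 1, "and": 0, "first": 2}
# Index -> name sequence.
array_style = ["and", "document", "first"]

names_from_dict = feature_names_from_vocabulary(dict_style)
names_from_array = feature_names_from_vocabulary(array_style)
print(names_from_dict)
print(names_from_array)
```

Both calls give the same ordered list, which is what get_feature_names() is expected to return directly.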

Additional context
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://docs.rapids.ai/api/cuml/stable/api.html#cuml.feature_extraction.text.TfidfVectorizer


mayankanand007 · 2021-09-21T21:42:36Z

After looking through the codebase, I realized that since TfidfVectorizer inherits from the CountVectorizer class, it also inherits the implementation of .get_feature_names() from there. I ran a quick example, which showed that it works:

from cuml.feature_extraction.text import TfidfVectorizer
import cudf

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
corpus = cudf.Series(corpus)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

# output: 
# 0         and
# 1    document
# 2       first
# 3          is
# 4         one
# 5      second
# 6         the
# 7       third
# 8        this
# Name: token, dtype: object

However, I feel we should still add it to the documentation, since the functionality is already present (just as sklearn added here).
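
The inheritance point above can be illustrated with a toy sketch. ToyCountVectorizer and ToyTfidfVectorizer are hypothetical stand-ins written only for this example; they are not the cuML classes and do not compute any counts or TF-IDF weights. The tokenizer (lowercased words of two or more characters) is also an assumption made for the sketch.

```python
import re


class ToyCountVectorizer:
    """Toy stand-in for a count vectorizer; not the cuML implementation."""

    def fit(self, docs):
        tokens = set()
        for doc in docs:
            # Assumed tokenizer: lowercase words of at least two characters.
            tokens.update(re.findall(r"\b\w\w+\b", doc.lower()))
        # Store the vocabulary as a list ordered by feature index.
        self.vocabulary_ = sorted(tokens)
        return self

    def get_feature_names(self):
        return list(self.vocabulary_)


class ToyTfidfVectorizer(ToyCountVectorizer):
    """Defines no get_feature_names of its own; it is inherited."""


corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
names = ToyTfidfVectorizer().fit(corpus).get_feature_names()
# False: the method is not defined on the subclass itself.
defined_on_subclass = "get_feature_names" in ToyTfidfVectorizer.__dict__
print(names)
print(defined_on_subclass)
```

The subclass exposes the method even though its own class body never mentions it, which is why the behaviour can exist while being absent from the subclass's documentation.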

This PR resolves issue #4219 by adding docs for `.get_feature_names()` in the `TfidfVectorizer` class. As mentioned in the linked issue, the method already exists in `CountVectorizer` and `TfidfVectorizer` inherits from that class, hence the functionality is present but not documented.

Authors:
- Mayank Anand (https://github.com/mayankanand007)

Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #4226

github-actions · 2021-11-23T20:03:19Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.


mayankanand007 added the ? - Needs Triage and feature request labels Sep 21, 2021

beckernick assigned mayankanand007 Sep 21, 2021

VibhuJawa added the doc label and removed the ? - Needs Triage and feature request labels Sep 22, 2021

mayankanand007 mentioned this issue Sep 22, 2021

Adding docs for .get_feature_names() inside TfidfVectorizer #4226

Merged

github-actions bot added the inactive-30d label Nov 23, 2021

mayankanand007 closed this as completed Nov 23, 2021
