[FEA] Supporting get_feature_names for TfidfVectorizer #4219
Comments
After looking through the codebase, I realized that since `TfidfVectorizer` inherits from `CountVectorizer`, `get_feature_names()` is already available:
from cuml.feature_extraction.text import TfidfVectorizer
import cudf
corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
corpus = cudf.Series(corpus)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# output:
# 0 and
# 1 document
# 2 first
# 3 is
# 4 one
# 5 second
# 6 the
# 7 third
# 8 this
# Name: token, dtype: object
However, I feel we should still add it to the documentation, since the functionality is already present (just as sklearn documents it).
This PR resolves issue #4219 by adding docs for `.get_feature_names()` in the `TfidfVectorizer` class. As mentioned in the linked issue, the method already exists in `CountVectorizer` and `TfidfVectorizer` inherits from that class, hence the functionality is present but not documented.

Authors:
- Mayank Anand (https://github.com/mayankanand007)

Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #4226
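To illustrate why the method was usable before it was documented, here is a minimal pure-Python sketch of the inheritance relationship described above. `ToyCountVectorizer` and `ToyTfidfVectorizer` are hypothetical stand-ins written for this example; they are not cuML's classes, and the tokenization is a deliberate simplification.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in for a count vectorizer (not cuML's implementation)."""

    def fit(self, docs):
        # Simplified tokenization: split on whitespace, strip basic
        # punctuation, and lowercase each word.
        tokens = {word.strip(".,?!").lower() for doc in docs for word in doc.split()}
        tokens.discard("")
        # Store the vocabulary in sorted order so the position is the feature index.
        self.vocabulary_ = sorted(tokens)
        return self

    def get_feature_names(self):
        # Defined only on the parent class.
        return self.vocabulary_


class ToyTfidfVectorizer(ToyCountVectorizer):
    """Defines no get_feature_names of its own; the method is inherited."""


corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

names = ToyTfidfVectorizer().fit(corpus).get_feature_names()
print(names)
```

The subclass never mentions `get_feature_names`, yet the call succeeds, which mirrors the situation in the issue: the functionality was present, only the docs were missing.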
Is your feature request related to a problem? Please describe.
I'm looking to get similar functionality from `TfidfVectorizer` as we have in `CountVectorizer`, in the form of `get_feature_names()`.
Describe the solution you'd like
Describe alternatives you've considered
`vec.vocabulary_` is an available attribute that returns an array mapping feature integer indices to feature names. However, since `sklearn` has this method available, it would be great if we could get it in `cuML` as well.
Additional context
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://docs.rapids.ai/api/cuml/stable/api.html#cuml.feature_extraction.text.TfidfVectorizer
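As a rough sketch of the `vocabulary_` alternative mentioned above, the snippet below recovers an ordered list of feature names from a token-to-index mapping. It assumes a plain dict in the style of sklearn's `vocabulary_`; the helper name `feature_names_from_vocabulary` and the sample mapping are hypothetical, and cuML's own attribute may be shaped differently.

```python
def feature_names_from_vocabulary(vocabulary):
    """Return tokens ordered by their feature index.

    `vocabulary` is assumed to be a dict of token -> column index,
    as in sklearn's `vocabulary_` (an assumption for this sketch).
    """
    ordered = sorted(vocabulary.items(), key=lambda item: item[1])
    return [token for token, _ in ordered]


# Hypothetical mapping built from the tokens shown in the comment above.
vocabulary = {
    "this": 8, "is": 3, "the": 6, "first": 2, "document": 1,
    "second": 5, "and": 0, "third": 7, "one": 4,
}

feature_names = feature_names_from_vocabulary(vocabulary)
print(feature_names)
```

A dedicated `get_feature_names()` method saves users from writing this kind of helper themselves, which is the convenience the request is about.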