Describe the bug
Fitting a CountVectorizer on the text series below raises an error in the _limit_features method: the length of the computed vocabulary and the length of the computed document frequencies don't match, so the mask applied to the stop_words_ and vocabulary_ variables fails.
The document frequencies returned by the document_frequency() method are one element shorter than the computed vocabulary.
On closer inspection, the vocabulary contains one extra entry, which sorts last alphabetically and shows up as <NA>. I'm not sure, but this entry seems to be what causes the off-by-one error. It only occurs when the last string shown below ('443') is included in the Series; otherwise the fit succeeds (see the comparison snippet after the reproducer below).
Steps/Code to reproduce bug
Minimum Code required to reproduce:
from cudf.core.series import Series
from cuml.feature_extraction.text import CountVectorizer

# make a small text series (9 rows)
text = Series(['1788', '1788', 'update.zip', '1788', '1788',
               'update.zip', '', '', '443'])

# create a CountVectorizer using character n-grams of length 2 to 3
vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='char')

# fit the vectorizer to the text series
vectorizer.fit(text)
Expected behavior
The CountVectorizer should fit successfully, even on a dataset as small as this one.
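In the meantime, a possible workaround sketch, based on my (untested) assumption that the <NA> vocabulary entry comes from rows that produce no character n-grams, is to drop the empty strings before fitting:

# hypothetical workaround, assuming the empty-string rows are what introduce
# the <NA> vocabulary entry; filter them out before fitting
text_nonempty = text[text.str.len() > 0].reset_index(drop=True)
vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='char')
vectorizer.fit(text_nonempty)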
Environment details (please complete the following information):
pip list: