[BUG] `str.character_ngrams` produces <NA> with strings < ngram length #14684
Labels: bug
Describe the bug
The `str.character_ngrams` function produces an `<NA>` token for strings that are shorter than the provided `n` (shown in the image for the case of bigrams).
I have debugged this and, as far as I understand it, the cause is an empty list returned by the `libstrings.generate_character_ngrams` function. This causes `<NA>` to be a part of the result when it is exploded in the problematic function.
This issue causes several bugs in downstream tasks (for example, when using cuml for `CountVectorizer`).
Steps/Code to reproduce bug
Minimum code required to reproduce the bug:
Expected behavior
`<NA>` should not be a part of the output. It causes several downstream tasks to fail because `<NA>` is not a valid token in the actual input string series.
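To make the expected behavior concrete, here is a minimal pure-Python sketch of the mechanism described in this report. It is a model only, not the cudf implementation: the function names `character_ngrams`, `explode`, `ngrams_buggy` and `ngrams_expected` are hypothetical, `None` stands in for `<NA>`, and the assumption (taken from the description above) is that an empty per-string n-gram list becomes a missing value when the lists are exploded.

```python
from typing import List, Optional


def character_ngrams(text: str, n: int) -> List[str]:
    """Return the character n-grams of text; the list is empty when len(text) < n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def explode(rows: List[List[str]]) -> List[Optional[str]]:
    """Flatten rows into one list, emitting None (our stand-in for <NA>) for an empty row."""
    out: List[Optional[str]] = []
    for row in rows:
        if row:
            out.extend(row)
        else:
            out.append(None)  # assumed explode behavior: empty list -> missing value
    return out


def ngrams_buggy(strings: List[str], n: int) -> List[Optional[str]]:
    """Model of the reported behavior: short strings leak a missing value."""
    return explode([character_ngrams(s, n) for s in strings])


def ngrams_expected(strings: List[str], n: int) -> List[Optional[str]]:
    """Model of the expected behavior: empty n-gram lists are dropped before exploding."""
    rows = [character_ngrams(s, n) for s in strings]
    return explode([row for row in rows if row])


print(ngrams_buggy(["abc", "a"], 2))     # ['ab', 'bc', None]
print(ngrams_expected(["abc", "a"], 2))  # ['ab', 'bc']
```

Dropping the empty lists before exploding is only one possible remedy; filtering the missing values after the explode would give the same result in this model.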
Environment overview (please complete the following information)
Environment details