[BUG] CountVectorizer Vocabulary Length Mismatch #5709

Vortexx2 · 2023-12-28T15:26:25Z

Describe the bug
Fitting a CountVectorizer on the text series below raises an error: the calculated vocabulary and the document frequencies have different lengths, so the _limit_features method fails when it applies a mask to the stop_words_ and vocabulary_ variables.
The document frequencies calculated by the document_frequency() method have one fewer entry than the calculated vocabulary.
On further inspection, the last entry of the vocabulary (when sorted alphabetically) is <NA>. I'm not sure, but this seems to be what causes the off-by-one error. The error only occurs when the last string shown below ('443') is included in the Series; without it, the error does not occur.
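To make the reported mismatch concrete, here is a minimal pure-Python sketch of the invariant involved. It is not cuML's or cuDF's implementation: the helper `char_ngrams` is hypothetical, it uses plain unpadded character n-grams, and the `None` entry only simulates the `<NA>` value described above. It assumes nothing about where the stray entry actually comes from.

```python
from collections import Counter

def char_ngrams(s, ngram_range=(2, 3)):
    """Hypothetical helper: all character n-grams of s for n in the given range."""
    lo, hi = ngram_range
    return [s[i:i + n] for n in range(lo, hi + 1) for i in range(len(s) - n + 1)]

# Same strings as in the reproduction case from this report.
docs = ['1788', '1788', 'update.zip', '1788', '1788', 'update.zip', '', '', '443']

# Document frequency: number of documents that contain each n-gram at least once.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(char_ngrams(doc)))

# Healthy state: one frequency per vocabulary entry, so a boolean mask
# computed from the frequencies can be applied to the vocabulary.
vocabulary = sorted(doc_freq)
freqs = [doc_freq[term] for term in vocabulary]
assert len(vocabulary) == len(freqs)

# Simulated reported state (assumption): a null entry at the end of the
# vocabulary has no matching frequency, so the two lengths differ by one.
vocab_with_null = vocabulary + [None]
mismatch = len(vocab_with_null) - len(freqs)
print(len(vocab_with_null), len(freqs), mismatch)
```

In this sketch the empty strings contribute no n-grams at all, so they add nothing to either the vocabulary or the frequencies. Any extra null entry in the vocabulary is enough to break a mask whose length is taken from the frequencies.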

Steps/Code to reproduce bug
Minimal code required to reproduce:

from cudf.core.series import Series
from cuml.feature_extraction.text import CountVectorizer

# make a small text series with 9 rows (the final '443' entry triggers the error)
text = Series(['1788', '1788', 'update.zip', '1788', '1788', 'update.zip', '', '', '443'])
# use the text series to create a CountVectorizer
vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='char')
# fit the vectorizer to the text series
vectorizer.fit(text)

Expected behavior
The CountVectorizer should fit without error, even on a dataset this small.

Environment details (please complete the following information):

Environment location: GCP cloud, pip used
Linux Distro/Architecture: [Ubuntu 16.04 amd64]
GPU Model/Driver: T4 / 525.105.17
CUDA: 11.8

Method of cuDF & cuML install: pip
pip list:

aiohttp                   3.9.1
aiosignal                 1.3.1
anyio                     4.2.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.1
async-timeout             4.0.3
attrs                     23.1.0
beautifulsoup4            4.12.2
bleach                    6.1.0
bokeh                     3.3.2
cachetools                5.3.2
certifi                   2023.11.17
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
click-plugins             1.1.1
cligj                     0.7.2
cloudpickle               3.0.0
colorcet                  3.0.1
comm                      0.2.0
contourpy                 1.2.0
cubinlinker-cu11          0.3.0.post1
cucim-cu11                23.12.1
cuda-python               11.8.3
cudf-cu11                 23.12.1
cugraph-cu11              23.12.0
cuml-cu11                 23.12.0
cuproj-cu11               23.12.1
cupy-cuda11x              12.3.0
cuspatial-cu11            23.12.1
cuxfilter-cu11            23.12.0
cycler                    0.12.1
dask                      2023.11.0
dask-cuda                 23.12.0
dask-cudf-cu11            23.12.0
datashader                0.16.0
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
distributed               2023.11.0
exceptiongroup            1.2.0
executing                 2.0.1
fastjsonschema            2.19.0
fastrlock                 0.8.2
filelock                  3.13.1
fiona                     1.9.5
fonttools                 4.47.0
fqdn                      1.5.1
frozenlist                1.4.1
fsspec                    2023.12.2
geopandas                 0.14.1
holoviews                 1.18.1
idna                      3.6
imageio                   2.33.1
importlib-metadata        7.0.1
iniconfig                 2.0.0
ipykernel                 6.28.0
ipython                   8.19.0
isoduration               20.11.0
jedi                      0.19.1
Jinja2                    3.1.2
joblib                    1.3.2
jsonpointer               2.4
jsonschema                4.20.0
jsonschema-specifications 2023.12.1
jupyter_client            8.6.0
jupyter_core              5.6.0
jupyter-events            0.9.0
jupyter_server            2.12.1
jupyter_server_proxy      4.1.0
jupyter_server_terminals  0.5.1
jupyterlab_pygments       0.3.0
kiwisolver                1.4.5
lazy_loader               0.3
linkify-it-py             2.0.2
llvmlite                  0.40.1
locket                    1.0.0
Markdown                  3.5.1
markdown-it-py            3.0.0
MarkupSafe                2.1.3
matplotlib                3.8.2
matplotlib-inline         0.1.6
mdit-py-plugins           0.4.0
mdurl                     0.1.2
mistune                   3.0.2
msgpack                   1.0.7
multidict                 6.0.4
multipledispatch          1.0.0
nbclient                  0.9.0
nbconvert                 7.13.1
nbformat                  5.9.2
nest-asyncio              1.5.8
networkx                  3.2.1
numba                     0.57.1
numpy                     1.24.4
nvtx                      0.2.8
overrides                 7.4.0
packaging                 23.2
pandas                    1.5.3
pandocfilters             1.5.0
panel                     1.3.6
param                     2.0.1
parso                     0.8.3
partd                     1.4.1
pexpect                   4.9.0
Pillow                    10.1.0
pip                       23.0.1
platformdirs              4.1.0
pluggy                    1.3.0
polars                    0.20.2
prometheus-client         0.19.0
prompt-toolkit            3.0.43
protobuf                  4.25.1
psutil                    5.9.7
ptxcompiler-cu11          0.7.0.post1
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   14.0.2
pycparser                 2.21
pyct                      0.5.0
Pygments                  2.17.2
pylibcugraph-cu11         23.12.0
pylibraft-cu11            23.12.0
pynvml                    11.4.1
pyparsing                 3.1.1
pyproj                    3.6.1
pytest                    7.4.3
python-dateutil           2.8.2
python-json-logger        2.0.7
pytz                      2023.3.post1
pyviz_comms               3.0.0
PyWavelets                1.5.0
PyYAML                    6.0.1
pyzmq                     25.1.2
raft-dask-cu11            23.12.0
rapids-dask-dependency    23.12.1
referencing               0.32.0
requests                  2.31.0
requests-file             1.5.1
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      13.7.0
rmm-cu11                  23.12.0
rpds-py                   0.15.2
scikit-image              0.21.0
scikit-learn              1.3.2
scipy                     1.11.4
Send2Trash                1.8.2
setuptools                65.5.0
shapely                   2.0.2
simpervisor               1.0.0
six                       1.16.0
sniffio                   1.3.0
sortedcontainers          2.4.0
soupsieve                 2.5
stack-data                0.6.3
tblib                     3.0.0
terminado                 0.18.0
threadpoolctl             3.2.0
tifffile                  2023.12.9
tinycss2                  1.2.1
tldextract                5.1.1
tomli                     2.0.1
toolz                     0.12.0
tornado                   6.4
tqdm                      4.66.1
traitlets                 5.14.0
treelite                  3.9.1
treelite-runtime          3.9.1
types-python-dateutil     2.8.19.14
typing_extensions         4.9.0
uc-micro-py               1.0.2
ucx-py-cu11               0.35.0
uri-template              1.3.0
urllib3                   2.1.0
wcwidth                   0.2.12
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.7.0
xarray                    2023.12.0
xyzservices               2023.10.1
yarl                      1.9.4
zict                      3.0.0
zipp                      3.17.0


Vortexx2 · 2023-12-29T10:42:28Z

It seems like this is a problem occurring not in cuML, but in cuDF. I have made a PR there to fix this issue as well.
cuDF issue

dantegd · 2024-01-05T00:05:33Z

Thanks for the issue @Vortexx2 and fix in cuDF! Looking forward to the review and merge process over there.

beckernick · 2024-11-19T14:28:47Z

Resolved by rapidsai/cudf#15371

Vortexx2 added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels Dec 28, 2023

beckernick closed this as completed Nov 19, 2024
