[BUG] CountVectorizer Vocabulary Length Mismatch #5709

Closed
Vortexx2 opened this issue Dec 28, 2023 · 3 comments
Labels
? - Needs Triage (Need team to review and classify), bug (Something isn't working)

Comments

@Vortexx2

Describe the bug
Fitting a CountVectorizer on the text Series below raises an error: the computed vocabulary and the computed document frequencies have different lengths, so the _limit_features method fails when it applies a mask to the stop_words_ and vocabulary_ variables.
The document frequencies returned by the document_frequency() method have one entry fewer than the computed vocabulary.
On closer inspection, the last entry of the vocabulary (when sorted alphabetically) is <NA>. I'm not sure, but this seems to be the cause of the off-by-one error. The error only occurs when the last string shown below (443) is included in the Series; without it, the fit succeeds.
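The length mismatch described above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not cuML's or cuDF's actual code: `limit_features` is a hypothetical stand-in for a mask-based filtering step, and `None` stands in for the `<NA>` entry. It only assumes that the vocabulary is built in a way that keeps a null entry while the document frequencies are computed in a way that drops it.

```python
def limit_features(vocabulary, doc_freqs, min_df=1):
    """Hypothetical stand-in: keep terms whose document frequency >= min_df."""
    if len(vocabulary) != len(doc_freqs):
        raise ValueError(
            f"vocabulary has {len(vocabulary)} entries, "
            f"document frequencies have {len(doc_freqs)}"
        )
    mask = [df >= min_df for df in doc_freqs]
    return [term for term, keep in zip(vocabulary, mask) if keep]


# Tokens for one document; None plays the role of the stray <NA> token.
tokens = ["44", "43", "443", None]

# Assumed behaviour 1: the vocabulary keeps the null entry (sorted last).
vocabulary = sorted({t for t in tokens if t is not None}) + [None]

# Assumed behaviour 2: the frequency count silently drops the null entry.
doc_freqs = [1 for term in vocabulary if term is not None]

try:
    limit_features(vocabulary, doc_freqs)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Dropping the null entry first makes the two sequences line up again.
clean_vocabulary = [term for term in vocabulary if term is not None]
kept = limit_features(clean_vocabulary, doc_freqs)
```

Under these assumptions the vocabulary is exactly one entry longer than the document frequencies, which matches the symptom reported here.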

Steps/Code to reproduce bug
Minimum Code required to reproduce:

import cudf
from cuml.feature_extraction.text import CountVectorizer

# build a text Series with 9 rows; the final '443' entry triggers the error
text = cudf.Series(['1788', '1788', 'update.zip', '1788', '1788', 'update.zip', '', '', '443'])
# create a CountVectorizer that extracts character 2-grams and 3-grams
vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='char')
# fitting raises the vocabulary / document-frequency length mismatch
vectorizer.fit(text)
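For comparison, the vocabulary one would expect from this input can be computed on the CPU. This is a rough pure-Python sketch, not the cuML implementation: `char_ngrams` is a hypothetical helper that lowercases the text and takes plain character windows, and it skips any whitespace normalization a real analyzer may perform.

```python
from collections import Counter


def char_ngrams(text, min_n=2, max_n=3):
    """Return all character n-grams of text for n in [min_n, max_n]."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


docs = ['1788', '1788', 'update.zip', '1788', '1788', 'update.zip', '', '', '443']

# Document frequency: number of documents containing each n-gram at least once.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(char_ngrams(doc)))

vocabulary = sorted(doc_freq)

# Empty strings contribute no n-grams, so no null entry appears, and the
# vocabulary and the document frequencies have the same length.
assert len(vocabulary) == len(doc_freq)
```

In this sketch the two empty strings simply produce no n-grams, so every vocabulary entry has a matching document frequency.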

Expected behavior
CountVectorizer should fit this small dataset without raising an error.

Environment details:

  • Environment location: GCP cloud, pip used
  • Linux Distro/Architecture: Ubuntu 16.04 amd64
  • GPU Model/Driver: T4 / 525.105.17
  • CUDA: 11.8
  • Method of cuDF & cuML install: pip
    pip list:
    aiohttp                   3.9.1
    aiosignal                 1.3.1
    anyio                     4.2.0
    argon2-cffi               23.1.0
    argon2-cffi-bindings      21.2.0
    arrow                     1.3.0
    asttokens                 2.4.1
    async-timeout             4.0.3
    attrs                     23.1.0
    beautifulsoup4            4.12.2
    bleach                    6.1.0
    bokeh                     3.3.2
    cachetools                5.3.2
    certifi                   2023.11.17
    cffi                      1.16.0
    charset-normalizer        3.3.2
    click                     8.1.7
    click-plugins             1.1.1
    cligj                     0.7.2
    cloudpickle               3.0.0
    colorcet                  3.0.1
    comm                      0.2.0
    contourpy                 1.2.0
    cubinlinker-cu11          0.3.0.post1
    cucim-cu11                23.12.1
    cuda-python               11.8.3
    cudf-cu11                 23.12.1
    cugraph-cu11              23.12.0
    cuml-cu11                 23.12.0
    cuproj-cu11               23.12.1
    cupy-cuda11x              12.3.0
    cuspatial-cu11            23.12.1
    cuxfilter-cu11            23.12.0
    cycler                    0.12.1
    dask                      2023.11.0
    dask-cuda                 23.12.0
    dask-cudf-cu11            23.12.0
    datashader                0.16.0
    debugpy                   1.8.0
    decorator                 5.1.1
    defusedxml                0.7.1
    distributed               2023.11.0
    exceptiongroup            1.2.0
    executing                 2.0.1
    fastjsonschema            2.19.0
    fastrlock                 0.8.2
    filelock                  3.13.1
    fiona                     1.9.5
    fonttools                 4.47.0
    fqdn                      1.5.1
    frozenlist                1.4.1
    fsspec                    2023.12.2
    geopandas                 0.14.1
    holoviews                 1.18.1
    idna                      3.6
    imageio                   2.33.1
    importlib-metadata        7.0.1
    iniconfig                 2.0.0
    ipykernel                 6.28.0
    ipython                   8.19.0
    isoduration               20.11.0
    jedi                      0.19.1
    Jinja2                    3.1.2
    joblib                    1.3.2
    jsonpointer               2.4
    jsonschema                4.20.0
    jsonschema-specifications 2023.12.1
    jupyter_client            8.6.0
    jupyter_core              5.6.0
    jupyter-events            0.9.0
    jupyter_server            2.12.1
    jupyter_server_proxy      4.1.0
    jupyter_server_terminals  0.5.1
    jupyterlab_pygments       0.3.0
    kiwisolver                1.4.5
    lazy_loader               0.3
    linkify-it-py             2.0.2
    llvmlite                  0.40.1
    locket                    1.0.0
    Markdown                  3.5.1
    markdown-it-py            3.0.0
    MarkupSafe                2.1.3
    matplotlib                3.8.2
    matplotlib-inline         0.1.6
    mdit-py-plugins           0.4.0
    mdurl                     0.1.2
    mistune                   3.0.2
    msgpack                   1.0.7
    multidict                 6.0.4
    multipledispatch          1.0.0
    nbclient                  0.9.0
    nbconvert                 7.13.1
    nbformat                  5.9.2
    nest-asyncio              1.5.8
    networkx                  3.2.1
    numba                     0.57.1
    numpy                     1.24.4
    nvtx                      0.2.8
    overrides                 7.4.0
    packaging                 23.2
    pandas                    1.5.3
    pandocfilters             1.5.0
    panel                     1.3.6
    param                     2.0.1
    parso                     0.8.3
    partd                     1.4.1
    pexpect                   4.9.0
    Pillow                    10.1.0
    pip                       23.0.1
    platformdirs              4.1.0
    pluggy                    1.3.0
    polars                    0.20.2
    prometheus-client         0.19.0
    prompt-toolkit            3.0.43
    protobuf                  4.25.1
    psutil                    5.9.7
    ptxcompiler-cu11          0.7.0.post1
    ptyprocess                0.7.0
    pure-eval                 0.2.2
    pyarrow                   14.0.2
    pycparser                 2.21
    pyct                      0.5.0
    Pygments                  2.17.2
    pylibcugraph-cu11         23.12.0
    pylibraft-cu11            23.12.0
    pynvml                    11.4.1
    pyparsing                 3.1.1
    pyproj                    3.6.1
    pytest                    7.4.3
    python-dateutil           2.8.2
    python-json-logger        2.0.7
    pytz                      2023.3.post1
    pyviz_comms               3.0.0
    PyWavelets                1.5.0
    PyYAML                    6.0.1
    pyzmq                     25.1.2
    raft-dask-cu11            23.12.0
    rapids-dask-dependency    23.12.1
    referencing               0.32.0
    requests                  2.31.0
    requests-file             1.5.1
    rfc3339-validator         0.1.4
    rfc3986-validator         0.1.1
    rich                      13.7.0
    rmm-cu11                  23.12.0
    rpds-py                   0.15.2
    scikit-image              0.21.0
    scikit-learn              1.3.2
    scipy                     1.11.4
    Send2Trash                1.8.2
    setuptools                65.5.0
    shapely                   2.0.2
    simpervisor               1.0.0
    six                       1.16.0
    sniffio                   1.3.0
    sortedcontainers          2.4.0
    soupsieve                 2.5
    stack-data                0.6.3
    tblib                     3.0.0
    terminado                 0.18.0
    threadpoolctl             3.2.0
    tifffile                  2023.12.9
    tinycss2                  1.2.1
    tldextract                5.1.1
    tomli                     2.0.1
    toolz                     0.12.0
    tornado                   6.4
    tqdm                      4.66.1
    traitlets                 5.14.0
    treelite                  3.9.1
    treelite-runtime          3.9.1
    types-python-dateutil     2.8.19.14
    typing_extensions         4.9.0
    uc-micro-py               1.0.2
    ucx-py-cu11               0.35.0
    uri-template              1.3.0
    urllib3                   2.1.0
    wcwidth                   0.2.12
    webcolors                 1.13
    webencodings              0.5.1
    websocket-client          1.7.0
    xarray                    2023.12.0
    xyzservices               2023.10.1
    yarl                      1.9.4
    zict                      3.0.0
    zipp                      3.17.0
    
@Vortexx2
Author

This appears to be a problem in cuDF rather than in cuML. I have opened a PR there to fix it as well.
cuDF issue

@dantegd
Member

dantegd commented Jan 5, 2024

Thanks for the issue @Vortexx2 and fix in cuDF! Looking forward to the review and merge process over there.

@beckernick
Member

Resolved by rapidsai/cudf#15371
