Fix issue with below limit strings in ngram calculation #14685

Vortexx2 · 2023-12-29T10:39:21Z

When the provided strings were shorter than `n` characters, the ngram function returned an empty list. The result was previously exploded without filtering out the empty lists, causing the `<NA>` token to occur erroneously. The empty lists are now filtered out.

Description

Changes Proposed:
Should close issue [BUG] #14684.
Addresses a bug in the ngram function: when an input string is shorter than the specified length (n), the function returns an empty list. Previously, these empty lists were not filtered out before the result was exploded, so a `<NA>` token appeared in the output.

Solution:
Added a filtering step that removes the empty lists produced by the ngram function for strings shorter than n, before the result is exploded. This keeps the `<NA>` token from being introduced.

Impact:
The ngram function now handles the edge case where input strings are shorter than the specified length. Filtering out the empty lists prevents the erroneous `<NA>` token in that case.
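
As a rough illustration of the edge case described above, the following pure-Python sketch mimics the behaviour without using cuDF. The helpers `char_ngrams` and `explode` are hypothetical stand-ins, not the cuDF API, and `None` stands in for the `<NA>` entry. It assumes an explode step that emits one null per empty list.

```python
def char_ngrams(text, n):
    """Return all character n-grams of text (empty list if len(text) < n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def explode(rows, drop_empty):
    """Flatten a list of lists into one flat list.

    Assumption: an empty list explodes to a single null (None) entry
    unless empty lists are filtered out beforehand.
    """
    if drop_empty:
        # The proposed fix: remove zero-length lists before exploding.
        rows = [row for row in rows if len(row) > 0]
    flat = []
    for row in rows:
        if row:
            flat.extend(row)
        else:
            flat.append(None)  # empty list becomes one null entry
    return flat


strings = ["ab", "abcd", ""]
lists = [char_ngrams(s, 3) for s in strings]   # [[], ['abc', 'bcd'], []]
before = explode(lists, drop_empty=False)      # [None, 'abc', 'bcd', None]
after = explode(lists, drop_empty=True)        # ['abc', 'bcd']
```

With the filter in place, only real n-grams remain in the exploded output. The trade-off is that rows for short strings disappear from the result entirely.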

Checklist

[✅] I am familiar with the Contributing Guidelines.
[❌] New or existing tests cover these changes.
[✅] The documentation is up to date with these changes.

copy-pr-bot · 2023-12-29T10:39:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vyasr · 2024-01-11T03:16:40Z

/ok to test

vyasr · 2024-01-11T03:17:09Z

/ok to test

davidwendt

Just wondering if this fix is an appropriate, breaking change to cuDF behavior.

davidwendt · 2024-01-11T13:07:16Z

python/cudf/cudf/core/column/string.py

@@ -4843,6 +4843,7 @@ def character_ngrams(
        result = self._return_or_inplace(lc, retain_index=True)

        if isinstance(result, cudf.Series) and not as_list:
+            result = result[result.list.len() > 0] # before exploding, removes those lists which have 0 length


Seems this fix could go in the calling code (cuml?).
Just pass `as_list=True` and do the filter step and explode outside this function.

Vortexx2

This fix should be included in the cuDF repo because the bug is in the `character_ngrams` function itself. For example, sklearn's `sklearn.feature_extraction.text.CountVectorizer` class does not have this underlying issue of NA tokens being returned for empty strings.

davidwendt

Ok. Could you add a pytest for this?

davidwendt

Any progress on this, @Vortexx2?

davidwendt · 2024-03-21T23:11:35Z

Closing this in favor of #15371

Fixes `character_ngrams` function to not include empty entries when `as_list=False`. That is, the exploded view (non-list result) should not contain empty or NA elements.

This PR changes the `nvtext::generate_character_ngrams()` API to return a lists column instead of a flat strings column. The python code had been converting the return object into lists column and then exploding it if `as_list=False`. Returning as a list column simplifies the logic and prevents the double conversion. There is almost no impact to the nvtext code since the offsets for the output lists column were already being generated. All tests were updated to expect the new result.

Also changed some exception types from `cudf::logic_error` to `std::invalid_argument` as appropriate.

Continues work of abandoned PR #14685

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15371
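
To make the lists-column idea in the #15371 summary more concrete, here is a simplified pure-Python model of a lists column as an offsets array plus a flat child array. It is only a sketch under that assumption and does not reflect the actual libcudf implementation; `ngram_lists_column` and `row` are hypothetical names.

```python
from itertools import accumulate


def ngram_lists_column(strings, n):
    """Build (offsets, child) holding the character n-grams of each string."""
    # A string of length L yields max(0, L - n + 1) n-grams.
    counts = [max(0, len(s) - n + 1) for s in strings]
    # offsets[i]..offsets[i + 1] delimits row i inside the flat child.
    offsets = [0] + list(accumulate(counts))
    child = [s[i:i + n] for s in strings for i in range(len(s) - n + 1)]
    return offsets, child


def row(offsets, child, i):
    """Return row i of the modelled lists column as a Python list."""
    return child[offsets[i]:offsets[i + 1]]


offsets, child = ngram_lists_column(["ab", "abcd", "", "xyz"], 3)
# offsets == [0, 0, 2, 2, 3]; child == ['abc', 'bcd', 'xyz']
```

In this model a short string shows up as two equal consecutive offsets, so the flat child, which corresponds to the exploded view, never contains an empty or null entry.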

Vortexx2 requested a review from a team as a code owner December 29, 2023 10:39

Vortexx2 requested review from mroeschke and galipremsagar December 29, 2023 10:39

github-actions bot added the Python label Dec 29, 2023

Merge branch 'branch-24.02' into patch-1

6c286f8

vyasr added the bug and non-breaking labels Jan 11, 2024

davidwendt requested changes Jan 11, 2024

davidwendt added the breaking label and removed the non-breaking label Jan 12, 2024

davidwendt mentioned this pull request Mar 21, 2024

Remove empty elements from exploded character-ngrams output #15371

Merged

3 tasks

davidwendt closed this Mar 21, 2024

Vortexx2 commented Dec 29, 2023

copy-pr-bot bot commented Dec 29, 2023

vyasr commented Jan 11, 2024

vyasr commented Jan 11, 2024

davidwendt left a comment

davidwendt Jan 11, 2024

Vortexx2 Jan 19, 2024

davidwendt Jan 19, 2024

Vortexx2 Jan 20, 2024

davidwendt Mar 4, 2024

davidwendt commented Mar 21, 2024
