Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with below limit strings in ngram calculation #14685

Closed
wants to merge 2 commits into from

Conversation

Vortexx2
Copy link

When strings provided were below n characters, the ngram function returns an empty list. This was previously exploded, without filtering out the empty lists, causing the token to occur erroneously. Now, the empty lists should be filtered out.

Description

Changes Proposed:
Should close the issue [BUG] #14684.
Addressed an issue in the ngram function where input strings below a specified length (n) resulted in the function returning an empty list. Previously, this empty list was not filtered out during further processing, leading to the inadvertent occurrence of the token.

Solution:
Modified the code to include a filtering step that removes empty lists generated by the ngram function when input strings are below the specified length. This ensures that the token is no longer erroneously introduced.

Impact:
This enhancement improves the accuracy and reliability of the ngram function by handling edge cases where input strings are shorter than the specified length. The filtering of empty lists prevents unintended consequences, specifically the occurrence of the token in such scenarios.

Checklist

  • [✅ ] I am familiar with the Contributing Guidelines.
  • [ ❌ ] New or existing tests cover these changes.
  • [✅ ] The documentation is up to date with these changes.

When strings provided were below `n` characters, the ngram function returns an empty list. This was previously exploded, without filtering out the empty lists, causing the <NA> token to occur erroneously. Now, the empty lists should be filtered out.
@Vortexx2 Vortexx2 requested a review from a team as a code owner December 29, 2023 10:39
Copy link

copy-pr-bot bot commented Dec 29, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Dec 29, 2023
@vyasr
Copy link
Contributor

vyasr commented Jan 11, 2024

/ok to test

@vyasr vyasr added bug Something isn't working non-breaking Non-breaking change labels Jan 11, 2024
@vyasr
Copy link
Contributor

vyasr commented Jan 11, 2024

/ok to test

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if this fix is an appropriate, breaking change to cuDF behavior.

@@ -4843,6 +4843,7 @@ def character_ngrams(
result = self._return_or_inplace(lc, retain_index=True)

if isinstance(result, cudf.Series) and not as_list:
result = result[result.list.len() > 0] # before exploding, removes those lists which have 0 length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this fix could go in the calling code (cuml?).
Just pass as_list=True and do the filter step and explode outside this function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix should be included in the cuDF repo due to it being a bug in the character_ngrams function itself. sklearn does not have this underlying issue with the sklearn.feature_extraction.text.CountVectorizer class that causes NA tokens to be returned on empty strings for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Could you add a pytest for this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any progress on this @Vortexx2 ?

@davidwendt davidwendt added breaking Breaking change and removed non-breaking Non-breaking change labels Jan 12, 2024
@davidwendt
Copy link
Contributor

Closing this in favor of #15371

@davidwendt davidwendt closed this Mar 21, 2024
rapids-bot bot pushed a commit that referenced this pull request Apr 4, 2024
Fixes `character_ngrams` function to not include empty entries when `as_list=False`. That is, the exploded view (non-list result) should not contain empty or NA elements.

This PR changes the `nvtext::generate_character_ngrams()` API to return a lists column instead of a flat strings column. The python code had been converting the return object into lists column and then exploding it if `as_list=False`. Returning as a list column simplifies the logic and prevents the double conversion. There is almost no impact to the nvtext code since the offsets for the output lists column were already being generated.

All tests were updated to expect the new result. Also changed some exception types from `cudf::logic_error` to `std::invalid_argument` as appropriate.

Continues work of abandoned PR #14685

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15371
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants