-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue with below limit strings in ngram calculation #14685
Conversation
When strings provided were below `n` characters, the ngram function returns an empty list. This was previously exploded, without filtering out the empty lists, causing the <NA> token to occur erroneously. Now, the empty lists should be filtered out.
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering if this fix is an appropriate, breaking change to cuDF behavior.
@@ -4843,6 +4843,7 @@ def character_ngrams( | |||
result = self._return_or_inplace(lc, retain_index=True) | |||
|
|||
if isinstance(result, cudf.Series) and not as_list: | |||
result = result[result.list.len() > 0] # before exploding, removes those lists which have 0 length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this fix could go in the calling code (cuml?).
Just pass as_list=True
and do the filter step and explode outside this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix should be included in the cuDF repo due to it being a bug in the character_ngrams
function itself. sklearn does not have this underlying issue with the sklearn.feature_extraction.text.CountVectorizer
class that causes NA tokens to be returned on empty strings for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Could you add a pytest for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any progress on this @Vortexx2 ?
Closing this in favor of #15371 |
Fixes `character_ngrams` function to not include empty entries when `as_list=False`. That is, the exploded view (non-list result) should not contain empty or NA elements. This PR changes the `nvtext::generate_character_ngrams()` API to return a lists column instead of a flat strings column. The python code had been converting the return object into lists column and then exploding it if `as_list=False`. Returning as a list column simplifies the logic and prevents the double conversion. There is almost no impact to the nvtext code since the offsets for the output lists column were already being generated. All tests were updated to expect the new result. Also changed some exception types from `cudf::logic_error` to `std::invalid_argument` as appropriate. Continues work of abandoned PR #14685 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #15371
When strings provided were below
n
characters, the ngram function returns an empty list. This was previously exploded, without filtering out the empty lists, causing the token to occur erroneously. Now, the empty lists should be filtered out.Description
Changes Proposed:
Should close the issue [BUG] #14684.
Addressed an issue in the ngram function where input strings below a specified length (n) resulted in the function returning an empty list. Previously, this empty list was not filtered out during further processing, leading to the inadvertent occurrence of the token.
Solution:
Modified the code to include a filtering step that removes empty lists generated by the ngram function when input strings are below the specified length. This ensures that the token is no longer erroneously introduced.
Impact:
This enhancement improves the accuracy and reliability of the ngram function by handling edge cases where input strings are shorter than the specified length. The filtering of empty lists prevents unintended consequences, specifically the occurrence of the token in such scenarios.
Checklist