Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove empty elements from exploded character-ngrams output #15371

Merged
merged 1 commit into from
Apr 4, 2024

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Mar 21, 2024

Description

Fixes character_ngrams function to not include empty entries when as_list=False. That is, the exploded view (non-list result) should not contain empty or NA elements.

This PR changes the nvtext::generate_character_ngrams() API to return a lists column instead of a flat strings column. The python code had been converting the return object into lists column and then exploding it if as_list=False. Returning as a list column simplifies the logic and prevents the double conversion. There is almost no impact to the nvtext code since the offsets for the output lists column were already being generated.

All tests were updated to expect the new result. Also changed some exception types from cudf::logic_error to std::invalid_argument as appropriate.

Continues work of abandoned PR #14685

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added bug Something isn't working 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) breaking Breaking change labels Mar 21, 2024
@davidwendt davidwendt self-assigned this Mar 21, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 21, 2024
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 25, 2024
@davidwendt davidwendt marked this pull request as ready for review March 25, 2024 14:32
@davidwendt davidwendt requested review from a team as code owners March 25, 2024 14:32
@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit d7b8fc4 into rapidsai:branch-24.06 Apr 4, 2024
68 checks passed
@davidwendt davidwendt deleted the fix-char-ngram branch April 4, 2024 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants