Change thrust::count_if call to raw kernel in strings split APIs #15762

Merged: 17 commits into rapidsai:branch-24.08 from split-count-if on May 30, 2024

Conversation

Contributor

@davidwendt davidwendt commented May 15, 2024

Description

Replaces calls to thrust::count_if in the strings split APIs with a raw kernel so that they handle large strings correctly.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added the 2 - In Progress, libcudf, strings, improvement, and non-breaking labels May 15, 2024
@davidwendt davidwendt self-assigned this May 15, 2024
@github-actions github-actions bot added the CMake label May 15, 2024
@davidwendt davidwendt changed the base branch from branch-24.06 to branch-24.08 May 21, 2024 16:47
@davidwendt davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label May 24, 2024
@davidwendt davidwendt marked this pull request as ready for review May 28, 2024 14:39
@davidwendt davidwendt requested a review from a team as a code owner May 28, 2024 14:39
@davidwendt davidwendt requested a review from nvdbaranec May 28, 2024 14:39
Contributor

@karthikeyann karthikeyann left a comment

LGTM 👍

@pmattione-nvidia
Contributor

What is the benefit of rolling your own kernel vs using count_if? In what way is it better?

@davidwendt
Contributor Author

What is the benefit of rolling your own kernel vs using count_if? In what way is it better?

Yes, good question. In this case, thrust::count_if would technically not work because the begin/end iterators span a range larger than the maximum int32 value: the number of bytes in a strings column can now exceed the maximum size_type. We've purposely compiled out some of the thrust calls that iterate beyond max int32 to help reduce the size of our binary. This affects only a couple of places, including this one. We could certainly call count_if repeatedly, but a custom kernel here is simple enough and will perform better.
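
For illustration only, a counting kernel of the kind described above might look roughly like the following. This is a minimal sketch and not the code merged in this PR: the names `count_delimiters_kernel`, `count_delimiters`, and `check_cuda` are hypothetical, the predicate is assumed to be a single-byte comparison, and the block size and grid cap are arbitrary. The point it shows is that a grid-stride loop with a 64-bit index can walk a byte range larger than max int32 in a single launch.

```cuda
#include <cuda_runtime.h>

#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from libcudf): throw on any CUDA runtime error.
inline void check_cuda(cudaError_t status)
{
  if (status != cudaSuccess) { throw std::runtime_error(cudaGetErrorString(status)); }
}

// Hypothetical kernel: counts bytes equal to `delimiter`.
// The index and stride are 64-bit, so num_bytes may exceed max int32.
__global__ void count_delimiters_kernel(char const* d_chars,
                                        int64_t num_bytes,
                                        char delimiter,
                                        unsigned long long* d_count)
{
  auto const start  = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  auto const stride = static_cast<int64_t>(gridDim.x) * blockDim.x;

  unsigned long long local_count = 0;
  for (int64_t idx = start; idx < num_bytes; idx += stride) {
    if (d_chars[idx] == delimiter) { ++local_count; }
  }
  // At most one atomic per thread; a block-level reduction would reduce
  // contention further but is omitted to keep the sketch short.
  if (local_count > 0) { atomicAdd(d_count, local_count); }
}

// Hypothetical host wrapper: copies the bytes to the device, launches the
// kernel once, and returns the count as a 64-bit value.
int64_t count_delimiters(std::string const& chars, char delimiter)
{
  auto const num_bytes = static_cast<int64_t>(chars.size());
  if (num_bytes == 0) { return 0; }

  char* d_chars               = nullptr;
  unsigned long long* d_count = nullptr;
  check_cuda(cudaMalloc(&d_chars, chars.size()));
  check_cuda(cudaMalloc(&d_count, sizeof(unsigned long long)));
  check_cuda(cudaMemcpy(d_chars, chars.data(), chars.size(), cudaMemcpyHostToDevice));
  check_cuda(cudaMemset(d_count, 0, sizeof(unsigned long long)));

  // Assumed launch parameters: cap the grid and let the stride loop cover the rest.
  constexpr int block_size = 256;
  auto const blocks_needed = (num_bytes + block_size - 1) / block_size;
  auto const num_blocks    = static_cast<int>(std::min<int64_t>(blocks_needed, 1024));
  count_delimiters_kernel<<<num_blocks, block_size>>>(d_chars, num_bytes, delimiter, d_count);
  check_cuda(cudaGetLastError());

  unsigned long long h_count = 0;
  // This blocking copy also waits for the kernel on the default stream.
  check_cuda(cudaMemcpy(&h_count, d_count, sizeof(h_count), cudaMemcpyDeviceToHost));
  check_cuda(cudaFree(d_chars));
  check_cuda(cudaFree(d_count));
  return static_cast<int64_t>(h_count);
}
```

Calling `count_delimiters("a,b,,c", ',')` in this sketch counts three commas. Because the loop index is an `int64_t`, no chunking into int32-sized ranges is needed, which is the alternative that repeated count_if calls would require.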

@davidwendt
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 3e9cff2 into rapidsai:branch-24.08 May 30, 2024
71 checks passed
@davidwendt davidwendt deleted the split-count-if branch May 30, 2024 13:33