Use offsetalator in cudf::strings::findall #14745

Merged
8 commits merged into rapidsai:branch-24.04 from findall-offsetalator
Feb 1, 2024

Conversation

davidwendt
Contributor

Description

Use make_offsets_child_column and offsetalator_iterator to build/access offsets instead of hardcoded types.
This cleans up the code nicely by automatically handling offset overflow and computing the total number of matches.
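
To illustrate what accessing offsets without hardcoded types can look like, here is a minimal host-side sketch. It is not libcudf code: `offsets_storage`, `offset_at`, and `row_size` are hypothetical names, and the sketch assumes the offsetalator's role is to read offsets stored as either 32-bit or 64-bit integers and hand them back as 64-bit values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical stand-in for an offsets column whose element type
// is only known at run time (32-bit or 64-bit).
using offsets_storage =
  std::variant<std::vector<std::int32_t>, std::vector<std::int64_t>>;

// Hypothetical accessor: reads element i and widens it to 64 bits,
// so callers never hardcode the stored offset type.
inline std::int64_t offset_at(offsets_storage const& offsets, std::size_t i)
{
  return std::visit(
    [i](auto const& v) {
      assert(i < v.size());
      return static_cast<std::int64_t>(v[i]);
    },
    offsets);
}

// Number of elements (for example, matches) in a row, independent of
// the width the offsets happen to be stored in.
inline std::int64_t row_size(offsets_storage const& offsets, std::size_t row)
{
  return offset_at(offsets, row + 1) - offset_at(offsets, row);
}
```

Code written against `offset_at` keeps working if the stored offsets change width, which is the property the description attributes to the offsetalator.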

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the labels 2 - In Progress, libcudf, strings, improvement, non-breaking on Jan 11, 2024
@davidwendt self-assigned this on Jan 11, 2024
@davidwendt changed the base branch from branch-24.02 to branch-24.04 on January 22, 2024 14:18
@davidwendt added the label 3 - Ready for Review and removed the label 2 - In Progress on Jan 22, 2024
@davidwendt marked this pull request as ready for review on January 22, 2024 19:05
@davidwendt requested a review from a team as a code owner on January 22, 2024 19:05
// Create indices vector with the total number of groups that will be extracted
auto const total_matches =
cudf::detail::get_value<size_type>(offsets->view(), strings_count, stream);
auto const sizes = count_matches(*d_strings, *d_prog, strings_count, stream, mr); //+1
Contributor


What does the //+1 mean here? I see previously the code passed strings_count + 1. Did the behavior change?

Contributor Author


Ah, good catch. That was a leftover note-to-self. The +1 is not needed because the number of counts equals strings_count. Previously we stored the counts temporarily in an offsets column, which needs strings_count + 1 elements.

Contributor Author


Removed the comment here: ba0d432
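
The difference between the two sizes discussed in this thread can be sketched on the host as follows. This is an illustration only, not the libcudf implementation: `counts_to_offsets` is a hypothetical helper that turns n per-row counts into n + 1 offsets and returns the total, which is the last offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side helper: converts n per-row counts into n + 1
// offsets and returns the total number of elements alongside them.
inline std::pair<std::vector<std::int64_t>, std::int64_t> counts_to_offsets(
  std::vector<std::int32_t> const& counts)
{
  // One extra slot: offsets[i] is where row i starts, offsets[n] is the total.
  std::vector<std::int64_t> offsets(counts.size() + 1, 0);
  for (std::size_t i = 0; i < counts.size(); ++i) {
    assert(counts[i] >= 0);
    offsets[i + 1] = offsets[i] + counts[i];
  }
  return {offsets, offsets.back()};
}
```

For counts `{2, 0, 3}` this produces offsets `{0, 2, 2, 5}` and a total of 5. Because the helper returns the total directly, the caller does not need a separate read of the element at index strings_count, which is what the removed `get_value` call did.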

@davidwendt requested a review from bdice on January 26, 2024 18:56
@davidwendt changed the title from "Use offsetalator in strings::findall" to "Use offsetalator in cudf::strings::findall" on Jan 26, 2024
@davidwendt
Contributor Author

/merge

@rapids-bot merged commit 9916395 into rapidsai:branch-24.04 on Feb 1, 2024
69 checks passed
@davidwendt deleted the findall-offsetalator branch on February 1, 2024 23:44
@rapids-bot pushed a commit that referenced this pull request on Feb 23, 2024
…15043)

Fixes `cudf::strings::extract_all()` to use `cudf::detail::make_offsets_child_column` so that it properly computes the output size and checks for overflow when building offsets for a lists column.
Also undoes some changes from #14745 that incorrectly called `cudf::strings::detail::make_offsets_child_column` to create offsets for a lists column.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15043
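
The overflow check mentioned in the commit message above can be sketched as follows. This is a host-side illustration under stated assumptions, not libcudf code: `checked_list_total` is a hypothetical name, and the sketch assumes that the total element count of a lists column must fit in a signed 32-bit integer, so exceeding it is reported as an error rather than handled by widening the offsets.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical check for lists-style offsets: sums the per-row sizes and
// fails if the total does not fit in a signed 32-bit integer.
inline std::int32_t checked_list_total(std::vector<std::int32_t> const& sizes)
{
  // Accumulate in 64 bits so the sum cannot wrap before it is checked.
  std::int64_t total = 0;
  for (auto const size : sizes) {
    assert(size >= 0);
    total += size;
  }
  if (total > std::numeric_limits<std::int32_t>::max()) {
    throw std::overflow_error("offsets total exceeds the 32-bit size limit");
  }
  return static_cast<std::int32_t>(total);
}
```

Summing in a wider type before comparing against the limit is the key step. Adding the sizes directly in 32 bits could wrap around and hide the overflow.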
Labels
3 - Ready for Review, improvement, libcudf, non-breaking, strings

3 participants