Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add experimental make_strings_children utility #15363

Merged
merged 39 commits into from
Apr 23, 2024

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Mar 21, 2024

Description

Adds new cudf::strings::detail::experimental::make_strings_children which uses the offsetalator to build output columns. The current d_offsets member required by the given functors no longer stores sizes and offsets but is now split into d_sizes and d_offsets where d_sizes is computed in the first pass and then d_offsets is set to an offsetalator for building output in d_chars.

Once all the uses of make_strings_children (~50 or so) are converted to use the experimental implementation, this will replace the old implementation and the 'experimental' namespace will be removed.

This PR includes 2 changes, repeat_strings and concatenate (per row) since each use different overloaded make_strings_children functions to verify the code does not break any current tests.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 21, 2024
@davidwendt davidwendt self-assigned this Mar 21, 2024
@davidwendt davidwendt changed the title Add experimental make-strings-children utility Add experimental make_strings_children utility Mar 21, 2024
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Apr 10, 2024
@davidwendt davidwendt marked this pull request as ready for review April 10, 2024 17:38
@davidwendt davidwendt requested a review from a team as a code owner April 10, 2024 17:38
@davidwendt davidwendt marked this pull request as draft April 18, 2024 20:49
rapids-bot bot pushed a commit that referenced this pull request Apr 18, 2024
Updates the `replace_re()` and `replace_with_backrefs()` internal logic to support large strings.
These functions use a regex-specific version of make-strings-children.

Depends on #15363

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15524
@davidwendt davidwendt marked this pull request as ready for review April 19, 2024 14:23
@karthikeyann
Copy link
Contributor

The functor is passed by value. The members stored in them won't be passed back to caller. The caller's functor object scope won't matter. Do we need to change the functor members to functor's arguments?
The user anyway should be aware of passing functor by value or reference. So, the scope of these members should be clear that it's only internal to make_strings_children.
(Also, increasing arguments to functor isn't desirable).

@davidwendt
Copy link
Contributor Author

The functor is passed by value. The members stored in them won't be passed back to caller. The caller's functor object scope won't matter. Do we need to change the functor members to functor's arguments? The user anyway should be aware of passing functor by value or reference. So, the scope of these members should be clear that it's only internal to make_strings_children. (Also, increasing arguments to functor isn't desirable).

You are right. I had forgotten about the functor is passed by value so there are 2 independent copies and therefore really 2 scopes for the data.
Also, I looked through the 50 or so usages and the additional parameters would to troublesome for a significant number of implementations. Like where member utility functions use the members and now would need to forward these parameters. The concatenate implementation in this PR is an example of that.

I'm happy to change it back to member variables unless there is more compelling reason for keeping them as parameters.

Requiring a base-class does not really help much here in my opinion since they would hide the members in a separate header file that are directly and frequently being referenced by the functor logic. I'd like to keep them close the logic that uses them.

Also, I want to preserve the raw kernel introduced here for calling the functors. This allows functors to legally use shared memory and warp and block intrinsics, etc which are considered UB in thrust lambdas/functors.

@bdice
Copy link
Contributor

bdice commented Apr 22, 2024

@davidwendt I’m not able to re-review until next week but please feel free to move forward with your plan described above. I proposed the alternative structure but I’m happy with your reasoning for rejecting that proposal.

Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A clarification on the docs -

cpp/include/cudf/strings/detail/strings_children_ex.cuh Outdated Show resolved Hide resolved
@davidwendt davidwendt requested review from ttnghia and shrshi April 22, 2024 22:54
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@mythrocks
Copy link
Contributor

LGTM!

@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 702706d into rapidsai:branch-24.06 Apr 23, 2024
70 checks passed
@davidwendt davidwendt deleted the exp-make-strings-children branch April 23, 2024 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants