Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cudf::strings::reverse function #12227

Merged
merged 9 commits into from
Dec 1, 2022

Conversation

davidwendt
Copy link
Contributor

Description

Adds cudf::strings::reverse function.
This is to support NVIDIA/spark-rapids#6885

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Nov 22, 2022
@davidwendt davidwendt self-assigned this Nov 22, 2022
@github-actions github-actions bot added CMake CMake build issue conda labels Nov 22, 2022
@codecov
Copy link

codecov bot commented Nov 22, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.02@f063d9e). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 46f6baa differs from pull request most recent head 4cceacd. Consider uploading reports for the commit 4cceacd to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.02   #12227   +/-   ##
===============================================
  Coverage                ?   88.19%           
===============================================
  Files                   ?      137           
  Lines                   ?    22660           
  Branches                ?        0           
===============================================
  Hits                    ?    19984           
  Misses                  ?     2676           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

if (input.is_empty()) { return make_empty_column(type_id::STRING); }

// copy the column; replace data in the chars column
auto result = std::make_unique<column>(input.parent(), stream, mr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this making a full copy of the string data and then replacing it in the functor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was the intention of the comment. Should've been worded better?

Copy link
Contributor

@ttnghia ttnghia Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the char child of the result column will be overwritten. Should we just copy the offsets over? This seems a bit over-optimization thus I'm fine without having it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one step does everything we need. It builds the output column and normalizes the offsets if the input is sliced.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Nov 28, 2022
@davidwendt davidwendt marked this pull request as ready for review November 28, 2022 13:33
@davidwendt davidwendt requested review from a team as code owners November 28, 2022 13:33
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving CMakes changes

Copy link
Contributor

@cwharris cwharris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really nice!

{"abcdef", "12345", "", "", "aébé", "A é Z", "X", "é"}, {1, 1, 1, 0, 1, 1, 1, 1});
auto results = cudf::strings::reverse(cudf::strings_column_view(input));
auto expected = cudf::test::strings_column_wrapper(
{"fedcba", "54321", "", "", "ébéa", "Z é A", "X", "é"}, {1, 1, 1, 0, 1, 1, 1, 1});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of these multiple byte characters? Should we have multiple byte characters in the test? I don't recall if we support them or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the accented ones (é) are multi-byte UTF-8 characters.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

One interesting thing that came up while thinking hard about reversed strings: Emoji ZWJ sequences. I don't think we need special handling here, just wanted to share my findings for curiosity.

An emoji like Woman Shrugging, Medium Skin Tone 🤷🏽‍♀️ is composed of a sequence of 5 code points:
🤷 U+1F937
🏽 U+1F3FD
U+200D
♀ U+2640
U+FE0F

Python reverses the sequence literally if you use [::-1], with the code points in the opposite order, thereby no longer making a single emoji but a string that renders like ️♀‍🏽🤷. I would expect this PR to behave similarly because each Unicode code point is handled individually, so I don't think additional testing or work is needed.

@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit be26260 into rapidsai:branch-23.02 Dec 1, 2022
@davidwendt davidwendt deleted the reverse-strings branch December 1, 2022 23:36
rapids-bot bot pushed a commit that referenced this pull request Dec 3, 2022
This adds JNI and corresponding Java function for `strings::reverse`.

Depends on:
 * #12227

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #12283
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants