Add cudf::strings::reverse function #12227
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## branch-23.02 #12227 +/- ##
===============================================
Coverage ? 88.19%
===============================================
Files ? 137
Lines ? 22660
Branches ? 0
===============================================
Hits ? 19984
Misses ? 2676
Partials ? 0
if (input.is_empty()) { return make_empty_column(type_id::STRING); }

// copy the column; replace data in the chars column
auto result = std::make_unique<column>(input.parent(), stream, mr);
Is this making a full copy of the string data and then replacing it in the functor?
Yes, that was the intention of the comment. Should've been worded better?
I see that the chars child of the result column will be overwritten. Should we just copy the offsets over? That seems like a bit of an over-optimization, so I'm fine without it.
This one step does everything we need. It builds the output column and normalizes the offsets if the input is sliced.
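As a rough illustration of what "normalizes the offsets if the input is sliced" means, here is a minimal host-side sketch. The `toy_strings` struct and `copy_slice` function are hypothetical stand-ins invented for this example and are not libcudf types or APIs; the sketch only assumes that a strings column is stored as an offsets array plus a concatenated chars buffer.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a strings column (not a libcudf type).
struct toy_strings {
  std::vector<int> offsets;  // one entry per row, plus a final end offset
  std::string chars;         // concatenated bytes of all rows
};

// Copy rows [first, last) of `parent` into a new column.
// The copied offsets are rebased so that they start at 0, and only the
// bytes that the slice actually references are copied.
inline toy_strings copy_slice(toy_strings const& parent, std::size_t first, std::size_t last)
{
  toy_strings out;
  int const base = parent.offsets[first];
  for (std::size_t i = first; i <= last; ++i) {
    out.offsets.push_back(parent.offsets[i] - base);
  }
  out.chars = parent.chars.substr(static_cast<std::size_t>(base),
                                  static_cast<std::size_t>(parent.offsets[last] - base));
  return out;
}
```

With a parent holding the rows "abc", "de", "fgh" (offsets 0, 3, 5, 8), slicing rows 1 through 2 yields offsets 0, 2, 5 and the chars "defgh". The result has the right shape for the reversed bytes to be written over the copied chars.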
Approving CMake changes
really nice!
{"abcdef", "12345", "", "", "aébé", "A é Z", "X", "é"}, {1, 1, 1, 0, 1, 1, 1, 1});
auto results = cudf::strings::reverse(cudf::strings_column_view(input));
auto expected = cudf::test::strings_column_wrapper(
  {"fedcba", "54321", "", "", "ébéa", "Z é A", "X", "é"}, {1, 1, 1, 0, 1, 1, 1, 1});
Are any of these multiple byte characters? Should we have multiple byte characters in the test? I don't recall if we support them or not.
All of the accented ones (é) are multi-byte UTF-8 characters.
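To make the multi-byte point concrete, here is a small host-side sketch of reversing a UTF-8 string by character rather than by byte. It is not the PR's GPU implementation; `utf8_char_width` and `reverse_utf8` are hypothetical helper names, and the sketch assumes well-formed UTF-8 input.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Assumes well-formed input; continuation bytes are not validated.
inline std::size_t utf8_char_width(unsigned char lead)
{
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;  // stray continuation or invalid byte: treat as one byte
}

// Reverse the order of characters while keeping the bytes of each
// character in their original order.
inline std::string reverse_utf8(std::string const& in)
{
  std::string out(in.size(), '\0');
  std::size_t read  = 0;
  std::size_t write = in.size();
  while (read < in.size()) {
    std::size_t width = utf8_char_width(static_cast<unsigned char>(in[read]));
    width = std::min(width, in.size() - read);  // clamp a truncated tail
    write -= width;
    std::copy_n(in.begin() + read, width, out.begin() + write);
    read += width;
  }
  return out;
}
```

A plain byte-wise reversal would swap the two bytes of é (0xC3 0xA9) and produce invalid UTF-8, which is why test rows such as "aébé" are useful.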
LGTM.
One interesting thing that came up while thinking hard about reversed strings: Emoji ZWJ sequences. I don't think we need special handling here, just wanted to share my findings for curiosity.
An emoji like Woman Shrugging, Medium Skin Tone 🤷🏽♀️ is composed of a sequence of 5 code points:
🤷 U+1F937
🏽 U+1F3FD
U+200D
♀ U+2640
U+FE0F
Python reverses the sequence literally if you use [::-1], with the code points in the opposite order, thereby no longer making a single emoji but a string that renders like ️♀🏽🤷. I would expect this PR to behave similarly because each Unicode code point is handled individually, so I don't think additional testing or work is needed.
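The same observation can be sketched in C++ on already-decoded code points. This is only an illustration of code-point-wise reversal, not a test of the PR itself; `reverse_code_points` and `woman_shrugging` are hypothetical names introduced for this example.

```cpp
#include <algorithm>
#include <string>

// Reverse a string that has already been decoded into code points (UTF-32).
inline std::u32string reverse_code_points(std::u32string s)
{
  std::reverse(s.begin(), s.end());
  return s;
}

// The five code points listed above for Woman Shrugging, Medium Skin Tone.
inline std::u32string woman_shrugging()
{
  return U"\U0001F937\U0001F3FD\u200D\u2640\uFE0F";
}
```

Reversing `woman_shrugging()` gives U+FE0F, U+2640, U+200D, U+1F3FD, U+1F937. Every code point survives intact, but the joiner no longer sits between the two glyphs it was meant to combine, so the result does not render as a single emoji.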
@gpucibot merge
This adds JNI and corresponding Java function for `strings::reverse`.
Depends on:
* #12227
Authors:
- Nghia Truong (https://github.com/ttnghia)
Approvers:
- Jason Lowe (https://github.com/jlowe)
URL: #12283
Description
Adds cudf::strings::reverse function. This is to support NVIDIA/spark-rapids#6885