-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance for cudf::strings::count_characters for long strings #12779
Improve performance for cudf::strings::count_characters for long strings #12779
Conversation
Benchmark results show most improvement for string widths greater 64 bytes on average.
|
Will the performance with strings shorter than 64 bytes be impacted? |
No. The optimized code did not perform as well as the original code under 64 bytes so the solution implemented here is to include both code paths. |
auto strings_view = cudf::strings_column_view(strings); | ||
|
||
auto results = cudf::strings::count_characters(strings_view); | ||
std::vector<int32_t> h_expected(h_strings.size(), 64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expected that these tests should all be using cudf::size_type
for the counts, but I also see that count_characters
is defined to return an INT32
in its docstring. That seems undesirable, right? Shouldn't we be using cudf::size_type
instead of hardcoding int32_t
for the function return value and test comparisons?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there is no SIZE_TYPE
column type, I think it is more correct to use the appropriate data type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If changing the definition of size type would affect the correctness or consistency of this code, we should use size type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree but this seems out of scope for this PR.
There are at least 6 APIs that do this today.
Perhaps you can create a separate issue if you want to reopen the discussion.
We document what data-type is used for the output of the column of these APIs.
I'm not comfortable saying the column type is whatever size-type equates to at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed a bit offline. The latest documented status is #3958 (comment) but from conversation since that was posted, @GregoryKimball seemed interested in moving towards option 3. I won't hold this PR up on that account.
/merge |
Adds section to developer guide about `cudf::size_type` and adds links to it from other relevant parts of the document. The fundamental nature of this type seems important enough to mention in the developer guide since it is the basis for how much of the code is designed and implemented. Also updates some doxygen for public APIs that are return `size_type` column values but had cited `INT32` specifically. Reference: #12779 (comment) Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Lawrence Mitchell (https://github.com/wence-) - Yunsong Wang (https://github.com/PointKernel) - Karthikeyan (https://github.com/karthikeyann) URL: #12904
Description
Adds more efficient counting algorithm specifically for columns with long strings--greater than 64 bytes on average.
The internal detail method will be used to help improve performance in other strings functions.
Checklist