-
Notifications
You must be signed in to change notification settings - Fork 922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add UTF-8 chars to create_random_column<string_view> benchmark utility #7292
Add UTF-8 chars to create_random_column<string_view> benchmark utility #7292
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good change. One perf-related suggestion:
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmake lgtm
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7292 +/- ##
==============================================
Coverage ? 82.20%
==============================================
Files ? 100
Lines ? 16966
Branches ? 0
==============================================
Hits ? 13947
Misses ? 3019
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple small changes / question about generate
if (ch < '\x7F') return static_cast<char>(ch); | ||
// x7F is at the top edge of ASCII; | ||
// the next set of characters are assigned two bytes | ||
column_data.chars.push_back('\xC4'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment from dependent PR: #7316 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consensus from C++ meetup was to change this to a for-loop.
#7316 (comment)
@gpucibot merge |
Reference #5698 This creates a gbenchmark for the `cudf::strings::to_lower`. The device logic is the same for `cudf::strings::to_upper` and `cudf::strings::swapcase` so this a good measure for the 3 APIs. This PR is dependent on changes in PR #7292 These are mostly in the `generate_benchmark_input.cpp` The initial results were as follows: ``` -------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... -------------------------------------------------------------------------------------------------- StringCase/strings/4096/manual_time 0.278 ms 0.296 ms 2514 bytes_per_second=248.756M/s StringCase/strings/32768/manual_time 0.289 ms 0.307 ms 2421 bytes_per_second=1.86625G/s StringCase/strings/262144/manual_time 0.419 ms 0.438 ms 1662 bytes_per_second=10.2869G/s StringCase/strings/2097152/manual_time 2.59 ms 2.61 ms 269 bytes_per_second=13.3449G/s StringCase/strings/16777216/manual_time 25.9 ms 25.9 ms 27 bytes_per_second=10.6531G/s ``` The `convert_case` code here is a bit old. I changed it to use the more efficient `make_strings_children` utility and found the performance improved by 2x ``` -------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... -------------------------------------------------------------------------------------------------- StringCase/strings/4096/manual_time 0.117 ms 0.135 ms 5877 bytes_per_second=592.795M/s StringCase/strings/32768/manual_time 0.122 ms 0.140 ms 5641 bytes_per_second=4.42664G/s StringCase/strings/262144/manual_time 0.274 ms 0.292 ms 2535 bytes_per_second=15.768G/s StringCase/strings/2097152/manual_time 1.59 ms 1.61 ms 441 bytes_per_second=21.759G/s StringCase/strings/16777216/manual_time 12.1 ms 12.1 ms 58 bytes_per_second=22.8626G/s ``` So these changes are also included in this PR. Authors: - David (@davidwendt) Approvers: - Conor Hoekstra (@codereport) - Vukasin Milovanovic (@vuule) - Mark Harris (@harrism) URL: #7316
This updates the
create_random_column<string_view>
benchmark generate utility to support multi-byte UTF-8 characters. The original code only created columns with ASCII characters. The update also adds the space character which will be useful for text-based benchmarks in the future. Only 10 UTF-8 characters are included so a default distribution will still be mostly ASCII strings.This will help in providing accurate measurements for adding benchmarks for strings APIs #5698.
This change also includes renaming
DURATION_TO_STRING_BENCH_SRC
to justSTRINGS_BENCH
since this benchmark will be folded into a general strings benchmark executable.