Add UTF-8 chars to create_random_column<string_view> benchmark utility #7292

davidwendt · 2021-02-03T19:01:48Z

This updates the create_random_column<string_view> benchmark data-generation utility to support multi-byte UTF-8 characters. The original code only created columns with ASCII characters. The update also adds the space character, which will be useful for text-based benchmarks in the future. Only 10 UTF-8 characters are included, so a default distribution will still produce mostly ASCII strings.

This will help provide accurate measurements when adding benchmarks for strings APIs (#5698).

This change also includes renaming DURATION_TO_STRING_BENCH_SRC to just STRINGS_BENCH since this benchmark will be folded into a general strings benchmark executable.

vuule

Good change. One perf-related suggestion:

cpp/benchmarks/common/generate_benchmark_input.cpp

vuule · 2021-02-03T20:28:11Z

rerun tests

kkraus14

cmake lgtm

codecov · 2021-02-04T01:35:41Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.19@8215b5b).
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.19    #7292   +/-   ##
==============================================
  Coverage               ?   82.20%           
==============================================
  Files                  ?      100           
  Lines                  ?    16966           
  Branches               ?        0           
==============================================
  Hits                   ?    13947           
  Misses                 ?     3019           
  Partials               ?        0

cpp/benchmarks/common/generate_benchmark_input.cpp

codereport

Couple small changes / question about generate

cpp/benchmarks/common/generate_benchmark_input.cpp

codereport · 2021-02-04T23:57:11Z

cpp/benchmarks/common/generate_benchmark_input.cpp

+    if (ch < '\x7F') return static_cast<char>(ch);
+    // x7F is at the top edge of ASCII;
+    // the next set of characters are assigned two bytes
+    column_data.chars.push_back('\xC4');


See comment from dependent PR: #7316 (comment)

Consensus from C++ meetup was to change this to a for-loop.
#7316 (comment)
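
For readers following the thread, here is a minimal sketch of what the for-loop form of the hunk above might look like. It is not the code merged in this PR: the helper names, the use of `std::string` in place of `column_data.chars`, and the exact character range (0x20 through 0x88, giving 10 two-byte characters) are assumptions made for illustration. The threshold `0x7F` and the lead byte `0xC4` are taken from the hunk. With a plain loop, the lead byte and the trailing byte are both pushed from the same place, instead of pushing into the container from inside a `generate_n` generator.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Hypothetical helper, not the cuDF implementation: appends `len` random
// characters to `chars`. Values below 0x7F are written as one ASCII byte;
// values at or above 0x7F are written as a two-byte UTF-8 sequence that
// starts with the lead byte 0xC4.
inline void append_random_chars(std::string& chars, int len, std::mt19937& engine)
{
  assert(len >= 0);
  // Assumed range: space (0x20) through 0x88. That is 95 ASCII values plus
  // 10 values (0x7F..0x88) that become two-byte characters.
  std::uniform_int_distribution<int> char_dist{0x20, 0x88};
  for (int i = 0; i < len; ++i) {
    int ch = char_dist(engine);
    if (ch >= 0x7F) {
      chars.push_back('\xC4');  // lead byte of the two-byte sequence
      ch += 1;                  // 0x7F..0x88 become continuation bytes 0x80..0x89
    }
    chars.push_back(static_cast<char>(ch));
  }
}

// Counts UTF-8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
inline std::size_t count_code_points(std::string const& chars)
{
  std::size_t count = 0;
  for (unsigned char const b : chars) {
    if ((b & 0xC0) != 0x80) { ++count; }
  }
  return count;
}
```

Because some characters occupy two bytes, the byte length of the generated buffer can exceed the requested character count, which is the property a strings benchmark needs in order to exercise multi-byte code paths.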

davidwendt · 2021-02-05T21:58:19Z

@gpucibot merge

@davidwendt

Reference #5698

This creates a gbenchmark for `cudf::strings::to_lower`. The device logic is the same for `cudf::strings::to_upper` and `cudf::strings::swapcase`, so this is a good measure for all 3 APIs.

This PR is dependent on changes in PR #7292. These are mostly in `generate_benchmark_input.cpp`.

The initial results were as follows:

```
--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
StringCase/strings/4096/manual_time          0.278 ms        0.296 ms         2514 bytes_per_second=248.756M/s
StringCase/strings/32768/manual_time         0.289 ms        0.307 ms         2421 bytes_per_second=1.86625G/s
StringCase/strings/262144/manual_time        0.419 ms        0.438 ms         1662 bytes_per_second=10.2869G/s
StringCase/strings/2097152/manual_time        2.59 ms         2.61 ms          269 bytes_per_second=13.3449G/s
StringCase/strings/16777216/manual_time       25.9 ms         25.9 ms           27 bytes_per_second=10.6531G/s
```

The `convert_case` code here is a bit old. I changed it to use the more efficient `make_strings_children` utility and found the performance improved by roughly 2x:

```
--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
StringCase/strings/4096/manual_time          0.117 ms        0.135 ms         5877 bytes_per_second=592.795M/s
StringCase/strings/32768/manual_time         0.122 ms        0.140 ms         5641 bytes_per_second=4.42664G/s
StringCase/strings/262144/manual_time        0.274 ms        0.292 ms         2535 bytes_per_second=15.768G/s
StringCase/strings/2097152/manual_time        1.59 ms         1.61 ms          441 bytes_per_second=21.759G/s
StringCase/strings/16777216/manual_time       12.1 ms         12.1 ms           58 bytes_per_second=22.8626G/s
```

So these changes are also included in this PR.

Authors:
- David (@davidwendt)

Approvers:
- Conor Hoekstra (@codereport)
- Vukasin Milovanovic (@vuule)
- Mark Harris (@harrism)

URL: #7316
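
As a quick check on the speedup quoted in the commit message, the sketch below divides the reported "before" time by the "after" time for each benchmark size. The times are copied from the two result tables. The struct and function names are illustrative only, and `size` is simply the numeric argument in the benchmark name, whose exact meaning is not stated in this thread.

```cpp
#include <array>
#include <cstdio>

// Times (ms) copied from the manual_time column of the two result tables.
// `size` is the numeric argument that appears in the benchmark name.
struct timing {
  int size;
  double before_ms;
  double after_ms;
};

constexpr std::array<timing, 5> timings{{
  {4096, 0.278, 0.117},
  {32768, 0.289, 0.122},
  {262144, 0.419, 0.274},
  {2097152, 2.59, 1.59},
  {16777216, 25.9, 12.1},
}};

// Speedup is the ratio of the old time to the new time.
constexpr double speedup(timing const& t) { return t.before_ms / t.after_ms; }

// Prints one line per benchmark size with the computed ratio.
inline void print_speedups()
{
  for (auto const& t : timings) {
    std::printf("%8d: %.2fx\n", t.size, speedup(t));
  }
}
```

The ratios are not uniform across sizes, so "2x" is best read as an approximate figure for this run rather than a constant factor.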

Add UTF-8 chars to create_random_column utility

cb8b1ef

davidwendt added the 3 - Ready for Review, libcudf, improvement, and non-breaking labels Feb 3, 2021

davidwendt self-assigned this Feb 3, 2021

davidwendt requested review from a team as code owners February 3, 2021 19:01

davidwendt requested review from cwharris, devavret and vuule February 3, 2021 19:01

vuule requested changes Feb 3, 2021

davidwendt added 2 commits February 3, 2021 16:05

update column_data.chars directly in append_string

c385976

Merge branch 'branch-0.19' into benchmark-random-utf8

b0a999a

kkraus14 approved these changes Feb 4, 2021

davidwendt requested a review from vuule February 4, 2021 12:47

Merge branch 'branch-0.19' into benchmark-random-utf8

d4c5cac

vuule approved these changes Feb 4, 2021

davidwendt mentioned this pull request Feb 4, 2021

Add gbenchmark for cudf::strings::to_lower #7316

Merged

Merge branch 'branch-0.19' into benchmark-random-utf8

0efaa8c

codereport suggested changes Feb 4, 2021

davidwendt added 3 commits February 4, 2021 20:12

add const to variable declaration

7df4c95

Merge branch 'branch-0.19' into benchmark-random-utf8

ecc7823

change generate_n to for-loop

301b906

davidwendt requested a review from codereport February 5, 2021 16:34

codereport approved these changes Feb 5, 2021

vuule added the 5 - Ready to Merge label and removed the 3 - Ready for Review label Feb 5, 2021

rapids-bot bot merged commit 7e0437d into rapidsai:branch-0.19 Feb 5, 2021

davidwendt deleted the benchmark-random-utf8 branch February 5, 2021 21:58
