Add gbenchmark for nvtext normalize functions #7668

davidwendt · 2021-03-22T15:03:51Z

Reference #5696
Creates a gbenchmark for the nvtext::normalize_spaces() and nvtext::normalize_characters() functions.
The benchmarks measure various string lengths and numbers of rows.
I found that normalize_spaces() is used in haproxy parsing along with extract, so having this benchmark helps measure possible performance improvements there.
normalize_characters() is the same code used as part of the subword_tokenizer.
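For readers unfamiliar with what normalize_spaces() computes, here is a minimal host-side sketch of the per-string behavior as I understand it from the libcudf documentation: runs of whitespace collapse to a single space, and leading and trailing whitespace is dropped. The name `normalize_spaces_ref` is hypothetical, and this is not the libcudf code, which runs on the GPU over a whole strings column.

```cpp
#include <string>

// Hypothetical CPU reference, for illustration only.
// Assumption: "whitespace" means any byte with value <= ' '.
std::string normalize_spaces_ref(std::string const& input)
{
  std::string out;
  out.reserve(input.size());
  bool pending_space = false;  // a whitespace run was seen after real output
  for (unsigned char ch : input) {
    if (ch <= ' ') {
      // Remember the run, but never emit a leading space.
      pending_space = !out.empty();
    } else {
      if (pending_space) out.push_back(' ');
      pending_space = false;
      out.push_back(static_cast<char>(ch));
    }
  }
  // A trailing whitespace run is never flushed, so it is dropped.
  return out;
}
```

A reference like this can be handy for checking benchmark inputs and expected output sizes on the host.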

Since each function requires a different memory footprint, my initial goal of having them share a common benchmark structure did not work out. So the two tests are separate gbenchmark test files.

I refactored some of this code to use the more efficient make_strings_children utility, and this improved the performance of normalize_spaces by 2-3x.
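As a rough illustration of the pattern behind that utility, the sketch below shows a host-side analogue of a two-pass "size, then fill" scheme: the same functor runs once per row to report its output size, the sizes are scanned into offsets, and the functor runs again to write bytes into a single output buffer. The names `make_children_sketch` and `collapse_spaces_fn` are hypothetical, and the real make_strings_children operates on device memory with a different signature.

```cpp
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical functor: called with out == nullptr to count bytes,
// and with a valid pointer to write the normalized row.
struct collapse_spaces_fn {
  std::vector<std::string> const& rows;
  int operator()(int idx, char* out) const
  {
    int bytes          = 0;
    bool pending_space = false;
    for (unsigned char ch : rows[idx]) {
      if (ch <= ' ') {
        pending_space = bytes > 0;  // never emit a leading space
        continue;
      }
      if (pending_space) {
        if (out) *out++ = ' ';
        ++bytes;
        pending_space = false;
      }
      if (out) *out++ = static_cast<char>(ch);
      ++bytes;
    }
    return bytes;
  }
};

// Hypothetical host-side analogue of the two-pass pattern.
template <typename Fn>
std::pair<std::vector<int>, std::string> make_children_sketch(Fn fn, int n_rows)
{
  std::vector<int> offsets(n_rows + 1, 0);
  // Pass 1: record each row's output size at offsets[i + 1].
  for (int i = 0; i < n_rows; ++i) offsets[i + 1] = fn(i, nullptr);
  // Inclusive scan turns the sizes into offsets.
  std::partial_sum(offsets.begin(), offsets.end(), offsets.begin());
  // Pass 2: allocate once, then write each row at its offset.
  std::string chars(offsets[n_rows], '\0');
  for (int i = 0; i < n_rows; ++i) fn(i, &chars[0] + offsets[i]);
  return {offsets, chars};
}
```

The appeal of this shape is that the output buffer is allocated exactly once at its final size, and the sizing and writing logic live in one functor so they cannot drift apart.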

The current subword-tokenizer gbenchmark is also incorporated into the TEXT_BENCHMARK gbenchmark.


codecov · 2021-03-22T18:10:59Z

Codecov Report

Merging #7668 (6598945) into branch-0.19 (5d7767e) will increase coverage by 0.38%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.19    #7668      +/-   ##
===============================================
+ Coverage        82.08%   82.47%   +0.38%     
===============================================
  Files              101      101              
  Lines            17036    17397     +361     
===============================================
+ Hits             13984    14348     +364     
+ Misses            3052     3049       -3

Impacted Files	Coverage Δ
python/cudf/cudf/utils/gpu_utils.py	`53.65% <0.00%> (-4.88%)`	⬇️
python/cudf/cudf/core/abc.py	`87.23% <0.00%> (-1.14%)`	⬇️
python/cudf/cudf/io/feather.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/comm/serialize.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_fuzz_testing/io.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/column/struct.py	`100.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/_version.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_fuzz_testing/fuzzer.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/hash_vocab_utils.py	`100.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/io/tests/test_csv.py	`100.00% <0.00%> (ø)`
... and 40 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

codereport · 2021-03-23T03:19:17Z

cpp/benchmarks/text/normalize_benchmark.cpp

+  auto const n_rows         = static_cast<cudf::size_type>(state.range(0));
+  auto const max_str_length = static_cast<cudf::size_type>(state.range(1));
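The two values read from state.range() above are the row count and the maximum string length for one benchmark run. Below is a minimal sketch of how such a grid of (n_rows, max_str_length) pairs might be generated. The function name `make_bench_args`, its bounds and multipliers, and the rule of skipping oversized combinations are all assumptions for illustration, not taken from this PR; `size_type` is a stand-in for cudf::size_type.

```cpp
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

using size_type = std::int32_t;  // stand-in for cudf::size_type

// Hypothetical helper: build the cartesian product of row counts and
// string lengths, each growing geometrically, and skip any pair whose
// total character count would not fit in size_type.
std::vector<std::pair<std::int64_t, std::int64_t>> make_bench_args(
  std::int64_t min_rows, std::int64_t max_rows, std::int64_t row_mult,
  std::int64_t min_len, std::int64_t max_len, std::int64_t len_mult)
{
  std::vector<std::pair<std::int64_t, std::int64_t>> args;
  for (auto rows = min_rows; rows <= max_rows; rows *= row_mult) {
    for (auto len = min_len; len <= max_len; len *= len_mult) {
      if (rows * len < std::numeric_limits<size_type>::max()) {
        args.emplace_back(rows, len);
      }
    }
  }
  return args;
}
```

In a Google Benchmark file, each pair would typically be registered as one argument set, and the benchmark body would narrow the 64-bit range values with static_cast as in the snippet above.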


harrism

cmake approval.

harrism · 2021-03-23T03:25:22Z

@gpucibot merge

davidwendt added 3 commits March 22, 2021 10:23

Add gbenchmark for nvtext normalize functions

392f6c9

refactor normalize.cu to use more efficient make_strings_children uti…

b76ddcb

…lity

Merge branch 'branch-0.19' into benchmark-nvtext-normalize

6598945

davidwendt added the 3 - Ready for Review, libcudf, Performance, strings, improvement, and non-breaking labels Mar 22, 2021

davidwendt self-assigned this Mar 22, 2021

davidwendt requested review from a team as code owners March 22, 2021 15:03

davidwendt requested review from vuule and codereport March 22, 2021 15:03

github-actions bot added the CMake label Mar 22, 2021

vuule approved these changes Mar 22, 2021

codereport approved these changes Mar 23, 2021

harrism approved these changes Mar 23, 2021

rapids-bot bot merged commit e0056ed into rapidsai:branch-0.19 Mar 23, 2021

harrism removed the 3 - Ready for Review label Mar 23, 2021

davidwendt deleted the benchmark-nvtext-normalize branch March 23, 2021 11:59
