Add gbenchmark for nvtext normalize functions #7668

Merged

Conversation

davidwendt
Contributor

Reference #5696
Creates gbenchmarks for the nvtext::normalize_spaces() and nvtext::normalize_characters() functions.
The benchmarks measure various string lengths and numbers of rows.
I found that normalize_spaces() is used in haproxy parsing along with extract, so having this benchmark helps measure possible performance improvements there.
normalize_characters() is the same code used as part of the subword_tokenizer.
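
For readers unfamiliar with what is being benchmarked, here is a minimal host-side sketch of whitespace normalization as I understand it: each run of whitespace collapses to a single space and both ends are trimmed. The function name `normalize_spaces_host` is hypothetical, whitespace is assumed to mean any byte at or below `' '`, and this is plain CPU code for illustration, not the libcudf device implementation.

```cpp
#include <string>

// Hypothetical host-side illustration (not libcudf code).
// Assumption: "whitespace" means any byte value <= ' '.
std::string normalize_spaces_host(std::string const& input)
{
  std::string output;
  output.reserve(input.size());
  bool pending_space = false;  // a whitespace run was seen after some token
  for (unsigned char ch : input) {
    if (ch <= ' ') {
      // Remember the run only if a token was already written (trims the front).
      pending_space = !output.empty();
    } else {
      if (pending_space) { output.push_back(' '); }
      pending_space = false;
      output.push_back(static_cast<char>(ch));
    }
  }
  // A trailing whitespace run is never emitted, which trims the end.
  return output;
}
```

The output is never longer than the input, so the cost depends on both the number of rows and the per-row string length, the two dimensions the benchmark varies.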

Since each requires a different memory footprint, my initial goal of having them share a common benchmark structure did not work out, so the two tests are separate gbenchmark source files.

I refactored some of this code to use the more efficient make_strings_children, which improved the performance of normalize_spaces by 2-3x.
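
As a rough picture of why a size-then-fill approach can help, the sketch below builds a strings-style output (one contiguous character buffer plus offsets) in two passes: the first computes each row's output size, a prefix sum turns sizes into offsets, and the second writes every row directly into a single allocation. This is a host-side sketch under my own assumptions; `StringsChildren`, `normalize_row`, and `build_two_pass` are hypothetical names and do not reproduce the make_strings_children signature or implementation.

```cpp
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for a strings column's children:
// offsets.size() == number of rows + 1, chars holds all rows back to back.
struct StringsChildren {
  std::vector<std::size_t> offsets;
  std::string chars;
};

// Hypothetical per-row operation. Returns the output size for `row`; when
// `out` is non-null it also writes the output bytes there.
// Assumption: whitespace is any byte <= ' '; runs collapse to one space.
std::size_t normalize_row(std::string const& row, char* out)
{
  std::size_t size   = 0;
  bool pending_space = false;
  for (unsigned char ch : row) {
    if (ch <= ' ') {
      pending_space = size > 0;  // ignore leading whitespace
      continue;
    }
    if (pending_space) {
      if (out) { out[size] = ' '; }
      ++size;
      pending_space = false;
    }
    if (out) { out[size] = static_cast<char>(ch); }
    ++size;
  }
  return size;
}

StringsChildren build_two_pass(std::vector<std::string> const& rows)
{
  StringsChildren result;
  result.offsets.assign(rows.size() + 1, 0);
  // Pass 1: compute sizes only, storing each at offsets[i + 1].
  for (std::size_t i = 0; i < rows.size(); ++i) {
    result.offsets[i + 1] = normalize_row(rows[i], nullptr);
  }
  // Inclusive prefix sum over [0, s0, s1, ...] yields the row offsets.
  std::partial_sum(result.offsets.begin(), result.offsets.end(), result.offsets.begin());
  // One allocation for all output characters.
  result.chars.resize(result.offsets.back());
  // Pass 2: write each row at its final position.
  for (std::size_t i = 0; i < rows.size(); ++i) {
    normalize_row(rows[i], &result.chars[0] + result.offsets[i]);
  }
  return result;
}
```

The trade-off is that the per-row logic runs twice, but no intermediate per-row buffers are created and the output is allocated exactly once.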

The current subword-tokenizer gbenchmark is also incorporated into the TEXT_BENCHMARK gbenchmark.

@davidwendt davidwendt added the labels 3 - Ready for Review, libcudf, Performance, strings, improvement, non-breaking Mar 22, 2021
@davidwendt davidwendt self-assigned this Mar 22, 2021
@davidwendt davidwendt requested review from a team as code owners March 22, 2021 15:03
@davidwendt davidwendt requested review from vuule and codereport March 22, 2021 15:03
@github-actions github-actions bot added the CMake CMake build issue label Mar 22, 2021
codecov bot commented Mar 22, 2021

Codecov Report

Merging #7668 (6598945) into branch-0.19 (5d7767e) will increase coverage by 0.38%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.19    #7668      +/-   ##
===============================================
+ Coverage        82.08%   82.47%   +0.38%     
===============================================
  Files              101      101              
  Lines            17036    17397     +361     
===============================================
+ Hits             13984    14348     +364     
+ Misses            3052     3049       -3     
Impacted Files Coverage Δ
python/cudf/cudf/utils/gpu_utils.py 53.65% <0.00%> (-4.88%) ⬇️
python/cudf/cudf/core/abc.py 87.23% <0.00%> (-1.14%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/struct.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/hash_vocab_utils.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
... and 40 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines +34 to +35
auto const n_rows = static_cast<cudf::size_type>(state.range(0));
auto const max_str_length = static_cast<cudf::size_type>(state.range(1));

❤️

@harrism harrism left a comment

cmake approval.

harrism commented Mar 23, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e0056ed into rapidsai:branch-0.19 Mar 23, 2021
@harrism harrism removed the 3 - Ready for Review Ready for review by team label Mar 23, 2021
@davidwendt davidwendt deleted the benchmark-nvtext-normalize branch March 23, 2021 11:59
4 participants