Add gbenchmark for nvtext normalize functions #7668
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7668 +/- ##
===============================================
+ Coverage 82.08% 82.47% +0.38%
===============================================
Files 101 101
Lines 17036 17397 +361
===============================================
+ Hits 13984 14348 +364
+ Misses 3052 3049 -3
auto const n_rows = static_cast<cudf::size_type>(state.range(0));
auto const max_str_length = static_cast<cudf::size_type>(state.range(1));
❤️
cmake approval.
@gpucibot merge
Reference #5696

Creates a gbenchmark for the `nvtext::normalize_spaces()` and `nvtext::normalize_characters()` functions. The benchmarks measure various string lengths and numbers of rows.

I found that `normalize_spaces()` is used in haproxy parsing along with `extract`, so having this benchmark helps measure possible performance improvements there. The `normalize_characters` code is the same code used as part of the `subword_tokenizer`.

Since each requires a different memory footprint, my initial goal of having them share a common benchmark structure did not work out, so the 2 tests are separate gbenchmark test files.

I refactored some of this code to use the more efficient `make_strings_children`, and this improved the performance of `normalize_spaces` by 2-3x. The current subword-tokenizer gbenchmark is also incorporated into the TEXT_BENCHMARK gbenchmark.
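
For readers unfamiliar with the size-then-write pattern that, as I understand it, helpers like `make_strings_children` are built around, here is a minimal CPU-only sketch. It is not the libcudf implementation: `normalize_row`, `StringsColumn`, and `normalize_spaces_sketch` are hypothetical names used only for illustration, and whitespace is assumed to mean any byte with code point <= ' '. The same per-row function runs twice, once to compute output sizes and once to write into a single pre-sized buffer.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: whitespace is any byte with code point <= ' '.
inline bool is_space(char c) { return static_cast<unsigned char>(c) <= ' '; }

// Normalizes one row: whitespace runs collapse to a single space and
// leading/trailing whitespace is dropped. When `out` is nullptr only the
// output size is computed; otherwise the bytes are also written to `out`.
std::size_t normalize_row(std::string const& in, char* out)
{
  std::size_t n      = 0;
  bool pending_space = false;  // a whitespace run followed some output
  for (char c : in) {
    if (is_space(c)) {
      pending_space = (n > 0);  // ignore whitespace before the first token
      continue;
    }
    if (pending_space) {
      if (out) out[n] = ' ';
      ++n;
      pending_space = false;
    }
    if (out) out[n] = c;
    ++n;
  }
  return n;  // a trailing whitespace run is never emitted
}

// Hypothetical stand-in for a strings column: offsets plus one chars buffer.
struct StringsColumn {
  std::vector<std::size_t> offsets;  // rows + 1 entries
  std::string chars;                 // all rows, back to back
};

StringsColumn normalize_spaces_sketch(std::vector<std::string> const& rows)
{
  StringsColumn result;
  result.offsets.assign(rows.size() + 1, 0);

  // Pass 1: compute each row's output size and accumulate into offsets.
  for (std::size_t i = 0; i < rows.size(); ++i) {
    result.offsets[i + 1] = result.offsets[i] + normalize_row(rows[i], nullptr);
  }

  // One allocation for the whole output.
  result.chars.resize(result.offsets.back());

  // Pass 2: write each row at its offset using the same function.
  for (std::size_t i = 0; i < rows.size(); ++i) {
    normalize_row(rows[i], &result.chars[0] + result.offsets[i]);
  }
  return result;
}
```

On a GPU each row would be handled by its own thread and the running sum would be a parallel scan, but the shape is the same: no per-row allocations and exactly one output buffer.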