Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BGZIP multibyte_split benchmark #11723

Merged
merged 7 commits into from
Oct 10, 2022

Conversation

upsj
Copy link
Contributor

@upsj upsj commented Sep 20, 2022

Description

This refactors #11652 to extract the BGZIP IO and adds another source_type to the multibyte_split benchmark, creating a compressed file using zlib.

A quick benchmark shows performance results around 2.5x slower than reading from a device buffer at around 1:5 compression ratio

[0] Tesla T4

source_type delim_size delim_percent size_approx byte_range_percent Time Peak Memory Usage Encoded file size
bgzip 1 1 2^30 = 1073741824 100 507.479 ms 4.022 GiB 1006.638 MiB
file 1 1 2^30 = 1073741824 100 339.860 ms 3.947 GiB 1006.638 MiB
device 1 1 2^30 = 1073741824 100 201.556 ms 3.947 GiB 1006.638 MiB

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@upsj upsj added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 20, 2022
@upsj upsj added this to the Genomics read_text support milestone Sep 20, 2022
@upsj upsj self-assigned this Sep 20, 2022
@upsj upsj requested review from a team as code owners September 20, 2022 19:34
@upsj upsj requested review from harrism and PointKernel September 20, 2022 19:34
@github-actions github-actions bot added the CMake CMake build issue label Sep 20, 2022
@PointKernel
Copy link
Member

@upsj Is this PR depending on #11652?

@upsj
Copy link
Contributor Author

upsj commented Sep 21, 2022

@PointKernel yes, should we wait with the review until #11652 is merged? Many of the suggestions also apply to the refactored code.

@upsj upsj requested review from vuule and removed request for harrism September 21, 2022 06:00
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes LGTM

@PointKernel
Copy link
Member

should we wait with the review until #11652 is merged?

Yes please, this will avoid unnecessary back-and-forth effort.

@upsj upsj removed the 3 - Ready for Review Ready for review by team label Sep 21, 2022
@upsj upsj mentioned this pull request Sep 26, 2022
7 tasks
@vuule vuule added the 0 - Waiting on Author Waiting for author to respond to review label Sep 27, 2022
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be larger than just a benchmark.

@vuule
Copy link
Contributor

vuule commented Sep 27, 2022

This seems to be larger than just a benchmark.

I think 22.10 needs to be merged so #11652 changes are excluded. @upsj do you want to keep this targeted for 22.10?

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Some minor comments/questions.

cpp/tests/io/text/data_chunk_source_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/text/data_chunk_source_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/text/data_chunk_source_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/text/data_chunk_source_test.cpp Show resolved Hide resolved
@upsj
Copy link
Contributor Author

upsj commented Oct 3, 2022

rerun tests

@codecov
Copy link

codecov bot commented Oct 3, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.12@a270ae6). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.12   #11723   +/-   ##
===============================================
  Coverage                ?   87.48%           
===============================================
  Files                   ?      133           
  Lines                   ?    21866           
  Branches                ?        0           
===============================================
  Hits                    ?    19130           
  Misses                  ?     2736           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@PointKernel
Copy link
Member

PointKernel commented Oct 3, 2022

source_type
bgzip
file
device

@upsj How did you print out (display) the source type without using NVBENCH_DECLARE_ENUM_TYPE_STRINGS?

@upsj
Copy link
Contributor Author

upsj commented Oct 4, 2022

@PointKernel I didn't 😉 I edited the output manually for clarity, but thanks for reminding me of the macro, I'll use it instead

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nitpick. Looks great!

cpp/src/io/text/bgzip_utils.cpp Outdated Show resolved Hide resolved
cpp/src/io/text/bgzip_utils.cpp Outdated Show resolved Hide resolved
cpp/src/io/text/bgzip_utils.cpp Outdated Show resolved Hide resolved
cpp/src/io/text/bgzip_utils.cpp Outdated Show resolved Hide resolved
@vuule vuule changed the base branch from branch-22.12 to branch-22.10 October 4, 2022 16:06
@vuule vuule changed the base branch from branch-22.10 to branch-22.12 October 4, 2022 16:06
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

cpp/benchmarks/io/text/multibyte_split.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/io/text/multibyte_split.cpp Outdated Show resolved Hide resolved
* move header to include
* more constexpr
* rename ostreams to output_stream
* fewer magic numbers
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple small comments that you can handle as you wish, otherwise LGTM.

cpp/include/cudf/io/text/detail/bgzip_utils.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/text/detail/bgzip_utils.hpp Outdated Show resolved Hide resolved
@upsj upsj added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 8, 2022
@upsj upsj requested a review from a team as a code owner October 9, 2022 08:30
@github-actions github-actions bot added the conda label Oct 9, 2022
@upsj upsj removed the 5 - Ready to Merge Testing and reviews complete, ready to merge label Oct 9, 2022
@upsj upsj added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Oct 10, 2022
@upsj
Copy link
Contributor Author

upsj commented Oct 10, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4eb9c6c into rapidsai:branch-22.12 Oct 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants