
[FEA] Add file size counter to cuIO benchmarks #10154

Merged
4 commits merged into rapidsai:branch-22.04 from fea-cuio-bm-file-size on Jan 29, 2022

Conversation

@vuule (Contributor) commented on Jan 28, 2022

Most cuIO benchmarks use dataframes of a fixed size as input. Once the data is written to a file in the given format, the file size can vary greatly depending on the encoding and compression.
This PR adds a counter that outputs the file size, since it is often correlated with reader/writer performance.
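For reference, the counter relies on Google Benchmark's user-defined counters (state.counters), the same mechanism already used for the peak_memory_usage counter. The sketch below is illustrative only: the BM_write_and_report_size benchmark and its temporary file are hypothetical, while the PR itself reports source_sink.size() from the existing cuIO benchmark fixtures.

#include <benchmark/benchmark.h>

#include <cstdio>
#include <vector>

// Hypothetical benchmark: write a fixed-size buffer to a file and report the
// resulting file size as a user-defined counter. The cuIO benchmarks report
// source_sink.size() instead, but the counter mechanism is the same.
static void BM_write_and_report_size(benchmark::State& state)
{
  std::vector<char> payload(1 << 20, 'x');  // 1 MiB of dummy data
  long file_size = 0;
  for (auto _ : state) {
    std::FILE* f = std::fopen("bench_output.bin", "wb");
    if (f == nullptr) {
      state.SkipWithError("failed to open output file");
      break;
    }
    std::fwrite(payload.data(), 1, payload.size(), f);
    file_size = std::ftell(f);  // size of the data just written
    std::fclose(f);
  }
  // Appears as an extra column in the output, e.g. "file_size=1.04858M".
  state.counters["file_size"] = static_cast<double>(file_size);
}
BENCHMARK(BM_write_and_report_size);
BENCHMARK_MAIN();

Counters registered this way are printed as extra columns next to bytes_per_second, which is where the file_size and peak_memory_usage values in the sample output below come from.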

@vuule added labels: feature request (New feature or request), cuIO (cuIO issue), Performance (Performance related issue), non-breaking (Non-breaking change) on Jan 28, 2022
@vuule self-assigned this on Jan 28, 2022
@github-actions bot added label: libcudf (Affects libcudf (C++/CUDA) code) on Jan 28, 2022
@vuule (Contributor, Author) commented on Jan 28, 2022

Sample output:

OrcRead/integral_file_input/30/0/1/1/0/manual_time               97.2 ms         63.0 ms            7 bytes_per_second=5.1425G/s file_size=389.961M peak_memory_usage=1096.41M
OrcRead/integral_file_input/30/1000/1/1/0/manual_time             112 ms         82.1 ms            6 bytes_per_second=4.47958G/s file_size=331.69M peak_memory_usage=1.14553G
OrcRead/integral_file_input/30/0/32/1/0/manual_time              69.9 ms         67.2 ms           10 bytes_per_second=7.15121G/s file_size=22.2183M peak_memory_usage=683.382M
OrcRead/integral_file_input/30/1000/32/1/0/manual_time           69.6 ms         67.2 ms           10 bytes_per_second=7.18037G/s file_size=20.5541M peak_memory_usage=683.241M
OrcRead/integral_file_input/30/0/1/0/0/manual_time               81.7 ms         47.1 ms            8 bytes_per_second=6.12364G/s file_size=396.36M peak_memory_usage=951.602M
OrcRead/integral_file_input/30/1000/1/0/0/manual_time            78.0 ms         43.5 ms            9 bytes_per_second=6.40907G/s file_size=396.308M peak_memory_usage=951.549M
OrcRead/integral_file_input/30/0/32/0/0/manual_time              65.1 ms         62.0 ms           11 bytes_per_second=7.6859G/s file_size=24.9541M peak_memory_usage=580.196M
OrcRead/integral_file_input/30/1000/32/0/0/manual_time           64.5 ms         61.4 ms           11 bytes_per_second=7.75274G/s file_size=24.916M peak_memory_usage=580.158M
OrcRead/integral_buffer_input/30/0/1/1/1/manual_time              105 ms          105 ms            7 bytes_per_second=4.76301G/s file_size=389.961M peak_memory_usage=1096.41M
OrcRead/integral_buffer_input/30/1000/1/1/1/manual_time           118 ms          118 ms            6 bytes_per_second=4.24934G/s file_size=331.69M peak_memory_usage=1.14553G
OrcRead/integral_buffer_input/30/0/32/1/1/manual_time            68.8 ms         68.8 ms           10 bytes_per_second=7.27146G/s file_size=22.2183M peak_memory_usage=683.382M
OrcRead/integral_buffer_input/30/1000/32/1/1/manual_time         68.6 ms         68.7 ms           10 bytes_per_second=7.28465G/s file_size=20.5542M peak_memory_usage=683.241M
OrcRead/integral_buffer_input/30/0/1/0/1/manual_time             88.9 ms         88.9 ms            7 bytes_per_second=5.62704G/s file_size=396.36M peak_memory_usage=951.602M
OrcRead/integral_buffer_input/30/1000/1/0/1/manual_time          87.2 ms         87.2 ms            8 bytes_per_second=5.73492G/s file_size=396.308M peak_memory_usage=951.549M
OrcRead/integral_buffer_input/30/0/32/0/1/manual_time            63.9 ms         63.9 ms           11 bytes_per_second=7.8284G/s file_size=24.9541M peak_memory_usage=580.196M
OrcRead/integral_buffer_input/30/1000/32/0/1/manual_time         64.1 ms         64.1 ms           11 bytes_per_second=7.80321G/s file_size=24.916M peak_memory_usage=580.158M

@@ -132,6 +130,7 @@ void BM_csv_read_varying_options(benchmark::State& state)
   auto const data_processed = data_size * cols_to_read.size() / view.num_columns();
   state.SetBytesProcessed(data_processed * state.iterations());
   state.counters["peak_memory_usage"] = mem_stats_logger.peak_memory_usage();
+  state.counters["file_size"] = source_sink.size();
@vuule (Contributor, Author) commented on the diff: or "encoded_size" maybe?

A reviewer (Contributor) replied: "encoded_file_size" ?

@codecov bot commented on Jan 28, 2022

Codecov Report

Merging #10154 (c426ce9) into branch-22.04 (e24fa8f) will increase coverage by 0.10%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-22.04   #10154      +/-   ##
================================================
+ Coverage         10.37%   10.48%   +0.10%     
================================================
  Files               119      122       +3     
  Lines             20149    20493     +344     
================================================
+ Hits               2091     2148      +57     
- Misses            18058    18345     +287     
Impacted Files Coverage Δ
python/cudf/cudf/errors.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 66 more

Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5dd1c39...c426ce9.

@vuule vuule marked this pull request as ready for review January 28, 2022 02:46
@vuule vuule requested a review from a team as a code owner January 28, 2022 02:46
@vuule (Contributor, Author) commented on Jan 28, 2022

CC @GregoryKimball who sort of asked for this feature

@rgsl888prabhu (Contributor) reviewed and left a comment:
Rest looks good

@@ -132,6 +130,7 @@ void BM_csv_read_varying_options(benchmark::State& state)
   auto const data_processed = data_size * cols_to_read.size() / view.num_columns();
   state.SetBytesProcessed(data_processed * state.iterations());
   state.counters["peak_memory_usage"] = mem_stats_logger.peak_memory_usage();
+  state.counters["file_size"] = source_sink.size();
@rgsl888prabhu (Contributor) commented on the diff: "encoded_file_size" ?

@vuule (Contributor, Author) commented on Jan 29, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cf81b1a into rapidsai:branch-22.04 Jan 29, 2022
@vuule vuule deleted the fea-cuio-bm-file-size branch January 29, 2022 09:24