Persist string statistics data across multiple calls to orc chunked write #10694

hyperbolic2346 · 2022-04-20T20:43:41Z

This is the second half of the chunked ORC write statistics work. This part persists the string data between write calls, builds the file-level statistics from the stripe data, and writes out the statistics in a chunked-write file. To make sure nothing is missed, the added test reuses the same variable for both writes, which was verified to invalidate the first table before the second call to write.

~~This will clean up once 10567 goes in as this is branched off that work.~~

depends on #10567
closes #5826
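To illustrate the idea in the description above, here is a minimal host-side sketch in Python. It is not the libcudf implementation. `StringStats` and `ChunkedStatsWriter` are hypothetical names, one stripe per `write` call is an assumption, and the statistics tracked (minimum, maximum, total length, compared byte-wise) are my reading of what ORC string statistics carry. The point it shows is that the writer keeps its own copies of the min/max strings, so the caller can overwrite or free the first table before the second `write`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class StringStats:
    """Hypothetical per-stripe (or per-file) statistics for one string column."""
    minimum: Optional[bytes] = None
    maximum: Optional[bytes] = None
    total_length: int = 0
    count: int = 0

    def merge(self, other: "StringStats") -> None:
        # Fold another set of statistics into this one.
        if other.count == 0:
            return
        if self.minimum is None or other.minimum < self.minimum:
            self.minimum = other.minimum
        if self.maximum is None or other.maximum > self.maximum:
            self.maximum = other.maximum
        self.total_length += other.total_length
        self.count += other.count


class ChunkedStatsWriter:
    """Stand-in for a chunked writer; it only tracks statistics."""

    def __init__(self) -> None:
        self._stripe_stats: List[StringStats] = []

    def write(self, column: Iterable[Optional[bytearray]]) -> None:
        # Assumption: each write call produces exactly one stripe.
        stats = StringStats()
        for value in column:
            if value is None:
                continue
            # Copy the bytes so the statistics outlive the caller's buffer.
            owned = bytes(value)
            stats.merge(StringStats(owned, owned, len(owned), 1))
        self._stripe_stats.append(stats)

    def close(self) -> StringStats:
        # File-level statistics are built from the persisted stripe statistics.
        file_stats = StringStats()
        for stats in self._stripe_stats:
            file_stats.merge(stats)
        return file_stats


writer = ChunkedStatsWriter()
table = [bytearray(b"pear"), bytearray(b"apple"), None]
writer.write(table)
# Clobber the first table's buffers before the second write, mimicking
# the test's reuse of the same variable for both writes.
for buf in table:
    if buf is not None:
        buf[:] = b"\xff" * len(buf)
table = [bytearray(b"mango"), bytearray(b"zebra")]
writer.write(table)
file_stats = writer.close()
```

If `write` stored the caller's `bytearray` objects instead of copying them, the clobbering step would corrupt the minimum and maximum from the first stripe, which is the failure mode the reused-variable test is meant to catch.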

codecov · 2022-04-21T17:37:59Z

Codecov Report

Merging #10694 (253f0a1) into branch-22.06 (8d861ce) will increase coverage by 0.04%.
The diff coverage is 96.49%.

@@               Coverage Diff                @@
##           branch-22.06   #10694      +/-   ##
================================================
+ Coverage         86.40%   86.44%   +0.04%     
================================================
  Files               143      143              
  Lines             22448    22493      +45     
================================================
+ Hits              19396    19445      +49     
+ Misses             3052     3048       -4

Impacted Files	Coverage Δ
python/cudf/cudf/core/indexed_frame.py	`91.70% <ø> (ø)`
python/cudf/cudf/core/dataframe.py	`93.77% <96.29%> (+0.08%)`	⬆️
python/cudf/cudf/testing/_utils.py	`94.05% <100.00%> (+0.06%)`	⬆️
python/cudf/cudf/core/column/string.py	`89.21% <0.00%> (+0.12%)`	⬆️
python/cudf/cudf/core/groupby/groupby.py	`91.79% <0.00%> (+0.22%)`	⬆️
python/cudf/cudf/core/tools/datetimes.py	`84.49% <0.00%> (+0.30%)`	⬆️
python/cudf/cudf/core/column/lists.py	`92.91% <0.00%> (+0.83%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 14b5169...253f0a1.

cpp/src/io/orc/writer_impl.cu

cpp/src/io/orc/writer_impl.cu

python/cudf/cudf/testing/_utils.py

vuule

Very partial review to flush my comments

python/cudf/cudf/tests/test_orc.py

cpp/src/io/orc/writer_impl.cu

vuule

Few suggestions for the new benchmark.
Thanks for adding it!

cpp/benchmarks/io/orc/orc_writer_chunks.cpp

vyasr

This looks great! But I think all our new benchmarks are supposed to use nvbench instead of Google benchmark (sorry to be that guy).

cpp/src/io/orc/writer_impl.cu

hyperbolic2346 · 2022-05-05T04:56:19Z

> But I think all our new benchmarks are supposed to use nvbench instead of Google benchmark (sorry to be that guy).

Not at all. Switched over to nvbench.

vuule · 2022-05-05T06:00:57Z

rerun tests

hyperbolic2346 · 2022-05-05T15:54:03Z

rerun tests

mythrocks · 2022-05-05T18:08:33Z

Well, this is odd.

09:12:25 CMakeFiles/cudf.dir/src/io/utilities/datasource.cpp.o: In function `kvikio::detail::open_flags(int)':
09:12:25 datasource.cpp:(.text+0x650): multiple definition of `kvikio::detail::open_flags(int)'
09:12:25 CMakeFiles/cudf.dir/src/io/utilities/data_sink.cpp.o:data_sink.cpp:(.text+0x468): first defined here
09:12:25 CMakeFiles/cudf.dir/src/io/utilities/datasource.cpp.o: In function `bool kvikio::detail::getenv_or<bool>(std::basic_string_view<char, std::char_traits<char> >, bool)':
09:12:25 datasource.cpp:(.text+0x950): multiple definition of `bool kvikio::detail::getenv_or<bool>(std::basic_string_view<char, std::char_traits<char> >, bool)'
09:12:25 CMakeFiles/cudf.dir/src/io/utilities/data_sink.cpp.o:data_sink.cpp:(.text+0x5c0): first defined here
09:12:25 collect2: error: ld returned 1 exit status

It doesn't look like kvikio has this wrong. The functions seem to be inline.
Edit: D'oh! It looks like I was viewing older code. Thanks for the fixes, @vyasr!

hyperbolic2346 · 2022-05-05T18:12:23Z

> Well, this is odd.
>
> 09:12:25 CMakeFiles/cudf.dir/src/io/utilities/datasource.cpp.o: In function `kvikio::detail::open_flags(int)':
> 09:12:25 datasource.cpp:(.text+0x650): multiple definition of `kvikio::detail::open_flags(int)'
> 09:12:25 CMakeFiles/cudf.dir/src/io/utilities/data_sink.cpp.o:data_sink.cpp:(.text+0x468): first defined here
> 09:12:25 CMakeFiles/cudf.dir/src/io/utilities/datasource.cpp.o: In function `bool kvikio::detail::getenv_or<bool>(std::basic_string_view<char, std::char_traits<char> >, bool)':
> 09:12:25 datasource.cpp:(.text+0x950): multiple definition of `bool kvikio::detail::getenv_or<bool>(std::basic_string_view<char, std::char_traits<char> >, bool)'
> 09:12:25 CMakeFiles/cudf.dir/src/io/utilities/data_sink.cpp.o:data_sink.cpp:(.text+0x5c0): first defined here
> 09:12:25 collect2: error: ld returned 1 exit status
>
> It doesn't look like kvikio has this wrong. The functions seem to be inline.

rapidsai/kvikio#69 should fix it.

hyperbolic2346 · 2022-05-05T20:01:45Z

rerun tests

cpp/src/io/orc/writer_impl.cu

mythrocks

Sorry it took so long to get to. Thanks for the pointers on writing benchmarks.
(P.S. The approval is on the C++ code. I didn't review the .py files.)

vuule

Good stuff! Few very minor suggestions.

cpp/src/io/orc/writer_impl.cu

hyperbolic2346 · 2022-05-06T05:10:21Z

@gpucibot merge

galipremsagar · 2022-05-09T20:38:46Z

python/cudf/cudf/tests/test_orc.py

@@ -724,6 +724,105 @@ def test_orc_write_statistics(tmpdir, datadir, nrows, stats_freq):
                assert stats_num_vals == actual_num_vals


+@pytest.mark.parametrize("stats_freq", ["STRIPE", "ROWGROUP"])
+@pytest.mark.parametrize("nrows", [2, 100, 6000000])
+def test_orc_chunked_write_statistics(tmpdir, datadir, nrows, stats_freq):


I ran into this during a refactor: `datadir` and `stats_freq` aren't used in this test. Do we want to keep them or remove them, @hyperbolic2346?
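As a rough way to spot this kind of leftover argument, the sketch below parses a function's source with the standard-library `ast` module and lists the parameters the body never references. `unused_parameters` is a hypothetical helper, not part of pytest or cuDF, and the test body in `EXAMPLE` is a made-up stand-in, since the real body is not shown in the hunk above.

```python
import ast


def unused_parameters(source: str) -> list:
    """Return parameter names that the first function in `source` never references."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    args = func.args
    params = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
    # Collect every bare name that appears anywhere in the function body.
    used = {
        node.id
        for stmt in func.body
        for node in ast.walk(stmt)
        if isinstance(node, ast.Name)
    }
    return [p for p in params if p not in used]


# Signature copied from the diff; the body is an invented placeholder.
EXAMPLE = '''
@pytest.mark.parametrize("stats_freq", ["STRIPE", "ROWGROUP"])
@pytest.mark.parametrize("nrows", [2, 100, 6000000])
def test_orc_chunked_write_statistics(tmpdir, datadir, nrows, stats_freq):
    path = tmpdir.join("chunked.orc")
    rows = list(range(nrows))
    return path, rows
'''

unused = unused_parameters(EXAMPLE)
```

Note that with pytest, an unused fixture argument such as `datadir` can still have side effects, and an unused parametrized argument such as `stats_freq` still multiplies the number of test runs, so a check like this only flags candidates for review.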

hyperbolic2346 requested review from a team as code owners April 20, 2022 20:43

hyperbolic2346 requested review from vyasr, charlesbluca and rgsl888prabhu April 20, 2022 20:43

github-actions bot added the Python and libcudf labels Apr 20, 2022

hyperbolic2346 added the feature request, 3 - Ready for Review, cuIO, and non-breaking labels Apr 21, 2022

hyperbolic2346 changed the title ~~Mwilson/cuio chunked stats p2~~ Persist string statistics data across multiple calls to orc chunked write Apr 23, 2022

mythrocks reviewed Apr 25, 2022

cpp/src/io/orc/writer_impl.cu (outdated)

mythrocks reviewed Apr 25, 2022

cpp/src/io/orc/writer_impl.cu (outdated)

mythrocks reviewed Apr 25, 2022

cpp/src/io/orc/writer_impl.cu

mythrocks reviewed Apr 25, 2022

cpp/src/io/orc/writer_impl.cu (outdated)

hyperbolic2346 added 10 commits April 26, 2022 05:29

first pass at splitting up stats

b345be8

cleanup

dabd18e

Fixing some dangling pointer issues

2ebf073

first pass at chunked writing stastistics

2077123

updates from review comments

b53432b

updating test and fixing merge issue

4cc292e

linting

0864760

fixing other tests that were unhappy with variable-length arrays

68230fe

Merge remote-tracking branch 'upstream/branch-22.06' into mwilson/cui…

5a06b36

…o-chunked-stats-p2

rebasing on branch-22.06

cf53ed4

hyperbolic2346 force-pushed the mwilson/cuio-chunked-stats-p2 branch from ef13290 to cf53ed4 Compare April 27, 2022 02:20

updating to use cooperative group memcpy_async

63116e5

fixing string persisting

63233db

hyperbolic2346 requested a review from vuule April 28, 2022 21:16

vyasr requested changes May 2, 2022

vuule reviewed May 3, 2022

python/cudf/cudf/tests/test_orc.py

cpp/src/io/orc/writer_impl.cu

hyperbolic2346 added 2 commits May 3, 2022 00:41

updating from review comments

7c67e5a

adding orc chunked writer benchmarks and adding a test comment

585a165

github-actions bot added the CMake label May 3, 2022

vuule reviewed May 4, 2022

vyasr requested changes May 4, 2022

cpp/src/io/orc/writer_impl.cu (outdated)

cpp/src/io/orc/writer_impl.cu

hyperbolic2346 added 2 commits May 5, 2022 04:52

switching to nvbench for orc benchmarks

0cba0a6

Merge remote-tracking branch 'upstream/branch-22.06' into mwilson/cui…

3447424

…o-chunked-stats-p2

vyasr approved these changes May 5, 2022

mythrocks reviewed May 5, 2022

cpp/src/io/orc/writer_impl.cu (outdated)

mythrocks approved these changes May 5, 2022

vuule approved these changes May 6, 2022

cpp/src/io/orc/writer_impl.cu (outdated)

cpp/src/io/orc/writer_impl.cu (outdated)

updating from review comments

253f0a1

rapids-bot bot merged commit d574c69 into rapidsai:branch-22.06 May 6, 2022

hyperbolic2346 deleted the mwilson/cuio-chunked-stats-p2 branch May 6, 2022 05:10

galipremsagar reviewed May 9, 2022

vuule mentioned this pull request Jun 8, 2022

[FEA] Add File Statistic when writing the ORC file #10075

Closed
