Persist string statistics data across multiple calls to orc chunked write #10694

Merged


@hyperbolic2346 hyperbolic2346 commented Apr 20, 2022

This is the second half of the chunked ORC write statistics work. This part persists the string data between write calls, builds the file-level statistics from the stripe data, and writes the statistics out in a chunked-write file. To make sure nothing is missed, the added test re-uses the same variable for both writes, so everything the writer needs has to be persisted. This re-use was verified to invalidate the first table before the second call to write.
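
The idea in the paragraph above can be sketched in plain Python. This is a minimal illustration, not the libcudf implementation: `StringStats`, `stripe_stats`, `merge` and `ChunkedStatsWriter` are hypothetical names, and the choice of fields (minimum, maximum, total byte length, value count) is an assumption made for the example. The writer keeps its own per-stripe statistics, so the caller can re-use the table variable between writes, and the file-level statistics are folded from the stripe statistics at close.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class StringStats:
    """Hypothetical per-column string statistics (for a stripe or a file)."""
    minimum: Optional[str]
    maximum: Optional[str]
    total_length: int  # sum of UTF-8 byte lengths of non-null values
    num_values: int    # count of non-null values


def stripe_stats(values: Iterable[Optional[str]]) -> StringStats:
    """Compute statistics for one stripe, ignoring nulls (None)."""
    present = [v for v in values if v is not None]
    return StringStats(
        minimum=min(present) if present else None,
        maximum=max(present) if present else None,
        total_length=sum(len(v.encode("utf-8")) for v in present),
        num_values=len(present),
    )


def merge(a: StringStats, b: StringStats) -> StringStats:
    """Combine two sets of statistics into one covering both."""
    mins = [m for m in (a.minimum, b.minimum) if m is not None]
    maxs = [m for m in (a.maximum, b.maximum) if m is not None]
    return StringStats(
        minimum=min(mins) if mins else None,
        maximum=max(maxs) if maxs else None,
        total_length=a.total_length + b.total_length,
        num_values=a.num_values + b.num_values,
    )


class ChunkedStatsWriter:
    """Toy writer: one stripe per write() call (an assumption of this sketch)."""

    def __init__(self) -> None:
        # Statistics owned by the writer; they must outlive the caller's table.
        self._stripes: List[StringStats] = []

    def write(self, values: Iterable[Optional[str]]) -> None:
        self._stripes.append(stripe_stats(values))

    def close(self) -> Optional[StringStats]:
        """Build file-level statistics from the persisted stripe statistics."""
        if not self._stripes:
            return None
        return reduce(merge, self._stripes)


writer = ChunkedStatsWriter()
table = ["pear", None, "apple"]
writer.write(table)
table = ["zebra", "mango"]  # same variable re-used; the first table is gone
writer.write(table)
file_stats = writer.close()
```

In this sketch the file-level minimum comes from the first chunk even though that chunk no longer exists when `close()` runs, which is the property the added test is meant to exercise.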

The diff will clean up once #10567 goes in, as this branch is based on that work.

depends on #10567
closes #5826

@hyperbolic2346 hyperbolic2346 requested review from a team as code owners April 20, 2022 20:43
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Apr 20, 2022
@hyperbolic2346 hyperbolic2346 added feature request New feature or request 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels Apr 21, 2022
codecov bot commented Apr 21, 2022

Codecov Report

Merging #10694 (253f0a1) into branch-22.06 (8d861ce) will increase coverage by 0.04%.
The diff coverage is 96.49%.

@@               Coverage Diff                @@
##           branch-22.06   #10694      +/-   ##
================================================
+ Coverage         86.40%   86.44%   +0.04%     
================================================
  Files               143      143              
  Lines             22448    22493      +45     
================================================
+ Hits              19396    19445      +49     
+ Misses             3052     3048       -4     
Impacted Files                               Coverage Δ
python/cudf/cudf/core/indexed_frame.py       91.70% <ø> (ø)
python/cudf/cudf/core/dataframe.py           93.77% <96.29%> (+0.08%) ⬆️
python/cudf/cudf/testing/_utils.py           94.05% <100.00%> (+0.06%) ⬆️
python/cudf/cudf/core/column/string.py       89.21% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py     91.79% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/tools/datetimes.py     84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py        92.91% <0.00%> (+0.83%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hyperbolic2346 hyperbolic2346 changed the title Mwilson/cuio chunked stats p2 Persist string statistics data across multiple calls to orc chunked write Apr 23, 2022
@hyperbolic2346 hyperbolic2346 force-pushed the mwilson/cuio-chunked-stats-p2 branch from ef13290 to cf53ed4 Compare April 27, 2022 02:20
@hyperbolic2346 hyperbolic2346 requested a review from vuule April 28, 2022 21:16

@vuule vuule left a comment

Very partial review to flush my comments

@github-actions github-actions bot added the CMake CMake build issue label May 3, 2022

@vuule vuule left a comment

Few suggestions for the new benchmark.
Thanks for adding it!


@vyasr vyasr left a comment

This looks great! But I think all our new benchmarks are supposed to use nvbench instead of Google benchmark (sorry to be that guy).

@hyperbolic2346

> But I think all our new benchmarks are supposed to use nvbench instead of Google benchmark (sorry to be that guy).

Not at all. Switched over to nvbench.

vuule commented May 5, 2022

rerun tests

@hyperbolic2346

rerun tests

mythrocks commented May 5, 2022

Well, this is odd.

09:12:25 CMakeFiles/cudf.dir/src/io/utilities/datasource.cpp.o: In function `kvikio::detail::open_flags(int)':
09:12:25 datasource.cpp:(.text+0x650): multiple definition of `kvikio::detail::open_flags(int)'
09:12:25 CMakeFiles/cudf.dir/src/io/utilities/data_sink.cpp.o:data_sink.cpp:(.text+0x468): first defined here
09:12:25 CMakeFiles/cudf.dir/src/io/utilities/datasource.cpp.o: In function `bool kvikio::detail::getenv_or<bool>(std::basic_string_view<char, std::char_traits<char> >, bool)':
09:12:25 datasource.cpp:(.text+0x950): multiple definition of `bool kvikio::detail::getenv_or<bool>(std::basic_string_view<char, std::char_traits<char> >, bool)'
09:12:25 CMakeFiles/cudf.dir/src/io/utilities/data_sink.cpp.o:data_sink.cpp:(.text+0x5c0): first defined here
09:12:25 collect2: error: ld returned 1 exit status

It doesn't look like kvikio has this wrong. The functions seem to be inline.
Edit: D'oh! It looks like I was viewing older code. Thanks for the fixes, @vyasr!

@hyperbolic2346

> Well, this is odd.
>
> 09:12:25 CMakeFiles/cudf.dir/src/io/utilities/datasource.cpp.o: In function `kvikio::detail::open_flags(int)':
> 09:12:25 datasource.cpp:(.text+0x650): multiple definition of `kvikio::detail::open_flags(int)'
> 09:12:25 CMakeFiles/cudf.dir/src/io/utilities/data_sink.cpp.o:data_sink.cpp:(.text+0x468): first defined here
> 09:12:25 CMakeFiles/cudf.dir/src/io/utilities/datasource.cpp.o: In function `bool kvikio::detail::getenv_or<bool>(std::basic_string_view<char, std::char_traits<char> >, bool)':
> 09:12:25 datasource.cpp:(.text+0x950): multiple definition of `bool kvikio::detail::getenv_or<bool>(std::basic_string_view<char, std::char_traits<char> >, bool)'
> 09:12:25 CMakeFiles/cudf.dir/src/io/utilities/data_sink.cpp.o:data_sink.cpp:(.text+0x5c0): first defined here
> 09:12:25 collect2: error: ld returned 1 exit status
>
> It doesn't look like kvikio has this wrong. The functions seem to be inline.

rapidsai/kvikio#69 should fix it.

@hyperbolic2346

rerun tests

@mythrocks mythrocks left a comment

Sorry it took so long to get to. Thanks for the pointers on writing benchmarks.
(P.S. The approval is on the C++ code. I didn't review the .py files.)

@vuule vuule left a comment

Good stuff! Few very minor suggestions.

@hyperbolic2346

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d574c69 into rapidsai:branch-22.06 May 6, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/cuio-chunked-stats-p2 branch May 6, 2022 05:10
@@ -724,6 +724,105 @@ def test_orc_write_statistics(tmpdir, datadir, nrows, stats_freq):
    assert stats_num_vals == actual_num_vals


@pytest.mark.parametrize("stats_freq", ["STRIPE", "ROWGROUP"])
@pytest.mark.parametrize("nrows", [2, 100, 6000000])
def test_orc_chunked_write_statistics(tmpdir, datadir, nrows, stats_freq):

I ran into this during a refactor: datadir and stats_freq aren't used in this test. Do we want to keep them or remove them, @hyperbolic2346?
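
One way to check a claim like this mechanically is to parse the test's source and list the parameters its body never mentions. The helper below is a rough sketch using only the standard library; `unused_parameters` is a hypothetical name, and the function body in `EXAMPLE` is invented for illustration because the real test body is not shown here.

```python
import ast
import textwrap
from typing import List


def unused_parameters(source: str) -> List[str]:
    """List parameters of the first function in `source` that its body never names."""
    tree = ast.parse(textwrap.dedent(source))
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    args = func.args
    params = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
    # Every bare name that appears anywhere in the function body.
    named = {
        node.id
        for stmt in func.body
        for node in ast.walk(stmt)
        if isinstance(node, ast.Name)
    }
    return [p for p in params if p not in named]


# Invented body: only the signature mirrors the test under discussion.
EXAMPLE = '''
def test_example(tmpdir, datadir, nrows, stats_freq):
    path = tmpdir / "out.orc"
    rows = list(range(nrows))
    assert len(rows) == nrows
'''

result = unused_parameters(EXAMPLE)
```

A parameter showing up in this list is a hint, not proof that it can go: pytest fixtures are sometimes requested only for their side effects, and a name passed to `@pytest.mark.parametrize` has to be dropped from the decorator as well as from the signature.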

Labels
3 - Ready for Review Ready for review by team CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[BUG] ORC file-level statistics omitted with chunked writes
5 participants