
Fix maximum page size estimate in Parquet writer #11962

Merged 2 commits into rapidsai:branch-22.12 on Oct 21, 2022

Conversation

@vuule (Contributor) commented on Oct 20, 2022

Description

Closes #11916

`cuda-memcheck` reports an out-of-bounds (OOB) write in one of the tests. The root cause is an under-allocated buffer for encoded pages.
This PR fixes the computation of the maximum size of RLE-encoded data pages when dictionary encoding is used (a sketch of such a bound follows below).
Other changes:
  • Refactored the max RLE page size computation to avoid code repetition.
  • Use the actual dictionary index width instead of the (outdated) worst case.
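
For context, here is a minimal sketch of how such an upper bound can be derived from Parquet's RLE/bit-packed hybrid encoding. This is an editor's illustration under the stated assumptions, not the function added by this PR; the name `max_rle_page_size_sketch` is hypothetical.

```cpp
#include <cstdint>

// Illustrative upper bound (not the libcudf implementation) on the encoded
// size of `num_values` values written with Parquet's RLE/bit-packed hybrid
// at `value_bit_width` bits per value. Assumed worst case: every value is
// bit-packed, so we pay for the packed bits, one ULEB128 header byte per
// run of up to 504 values, and padding of the final 8-value group.
constexpr uint32_t max_rle_page_size_sketch(uint8_t value_bit_width, uint32_t num_values)
{
  if (value_bit_width == 0) { return 0; }
  // Packed value bits, rounded up to whole bytes.
  uint32_t const packed_bytes = (num_values * value_bit_width + 7) / 8;
  // One header byte per bit-packed run; a single header byte covers at
  // most 63 groups of 8 values (see the illustration further below).
  uint32_t const header_bytes = (num_values + 503) / 504 + 1;
  // The final group is padded to a multiple of 8 values: at most 7 extra
  // values, i.e. at most `value_bit_width` extra bytes.
  return packed_bytes + header_bytes + value_bit_width;
}
```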

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule added the labels bug (Something isn't working), cuIO (cuIO issue), and non-breaking (Non-breaking change) on Oct 20, 2022
@vuule self-assigned this on Oct 20, 2022
@github-actions bot added the label libcudf (Affects libcudf C++/CUDA code) on Oct 20, 2022
```diff
-      page_size =
-        1 + 5 + ((values_in_page * ck_g.dict_rle_bits + 7) >> 3) + (values_in_page >> 8);
+      // Additional byte to store entry bit width
+      page_size = 1 + max_RLE_page_size(ck_g.dict_rle_bits, values_in_page);
```
@vuule (Contributor, Author) commented:
Based on the Apache Parquet format docs:

> Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width).
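
The "1 +" in the new estimate corresponds to that leading bit-width byte. A hypothetical sketch of the page body layout (illustrative, not libcudf code; `make_dict_page_body` is an invented name):

```cpp
#include <cstdint>
#include <vector>

// Dictionary data page body per the quoted spec text: one byte holding the
// bit width of the dictionary indices (max 32), followed by the indices
// encoded with the RLE/bit-packed hybrid.
std::vector<uint8_t> make_dict_page_body(uint8_t index_bit_width,
                                         std::vector<uint8_t> const& rle_indices)
{
  std::vector<uint8_t> body;
  body.reserve(1 + rle_indices.size());
  body.push_back(index_bit_width);  // the "1 +" byte in the size estimate
  body.insert(body.end(), rle_indices.begin(), rle_indices.end());  // RLE payload
  return body;
}
```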

A reviewer (Contributor) commented:

This is the bug fix and the rest of the diff is refactoring, right? Is a C++ test or Python comparison to another reader/writer needed?

A contributor replied:

Technically, there is already a gtest for this. The error was an OOB write that only manifests under memcheck without an RMM pool, which is exercised in the nightly tests.

The reviewer replied:

Good call. Approving.

@codecov codecov bot commented on Oct 21, 2022

Codecov Report

Base: 88.09% // Head: 88.14% // Increases project coverage by +0.04% 🎉

Coverage data is based on head (3a2fe47) compared to base (6ca2ceb).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11962      +/-   ##
================================================
+ Coverage         88.09%   88.14%   +0.04%     
================================================
  Files               133      133              
  Lines             21905    21982      +77     
================================================
+ Hits              19298    19376      +78     
+ Misses             2607     2606       -1     
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| python/dask_cudf/dask_cudf/backends.py | 84.90% <0.00%> | -0.37% ⬇️ |
| python/dask_cudf/dask_cudf/core.py | 73.92% <0.00%> | -0.21% ⬇️ |
| python/dask_cudf/dask_cudf/io/tests/test_csv.py | 100.00% <0.00%> | ø |
| python/dask_cudf/dask_cudf/io/tests/test_orc.py | 100.00% <0.00%> | ø |
| python/dask_cudf/dask_cudf/io/tests/test_json.py | 100.00% <0.00%> | ø |
| ...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py | 100.00% <0.00%> | ø |
| python/cudf/cudf/core/column/column.py | 88.46% <0.00%> | +0.04% ⬆️ |
| python/cudf/cudf/core/dataframe.py | 93.77% <0.00%> | +0.04% ⬆️ |
| python/dask_cudf/dask_cudf/tests/test_core.py | 95.37% <0.00%> | +0.09% ⬆️ |
| python/cudf/cudf/core/column/string.py | 88.65% <0.00%> | +0.12% ⬆️ |

... and 4 more



@vuule vuule marked this pull request as ready for review October 21, 2022 05:59
@vuule vuule requested a review from a team as a code owner October 21, 2022 05:59
```diff
@@ -217,6 +217,14 @@ __global__ void __launch_bounds__(128)
   if (frag_id < num_fragments_per_column and lane_id == 0) groups[column_id][frag_id] = *g;
 }
 
+constexpr uint32_t max_RLE_page_size(uint8_t value_bit_width, uint32_t num_values)
```
A reviewer (Contributor) commented:

Nice refactoring here.
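
For background on the per-run header overhead that such a bound must cover, here is a self-contained illustration (an editor's example, not libcudf code): in Parquet's RLE/bit-packed hybrid, a bit-packed run header is a ULEB128 varint with value `(num_groups << 1) | 1`, so a single header byte can describe at most 63 groups of 8 values.

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
  // One ULEB128 byte carries 7 payload bits; the low bit flags "bit-packed",
  // leaving 6 bits for the group count.
  uint32_t const max_groups_one_byte = 0x7F >> 1;               // 63 groups
  uint32_t const max_values_one_byte = max_groups_one_byte * 8; // 504 values
  std::printf("a one-byte bit-packed header covers up to %u values\n",
              max_values_one_byte);
  return 0;
}
```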


@vuule (Contributor, Author) commented on Oct 21, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 7940b5b into rapidsai:branch-22.12 Oct 21, 2022
@vuule vuule deleted the bug-dict-data-rle-page-size branch October 21, 2022 18:03
Labels
bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf C++/CUDA code), non-breaking (Non-breaking change)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Memcheck error in ParquetDictionaryTest/ParquetSizedTest.DictionaryTest/22
3 participants