
Fix maximum page size estimate in Parquet writer #11962

Merged 2 commits into rapidsai:branch-22.12 on Oct 21, 2022

Conversation

@vuule (Contributor) commented on Oct 20, 2022

Description

Closes #11916

`cuda-memcheck` reports an out-of-bounds (OOB) write in one of the tests. The root cause is an under-allocated buffer for encoded pages.
This PR fixes the computation of the maximum size of RLE-encoded data pages when dictionary encoding is used (a sketch of such a bound follows below).
Other changes:
  • Refactored the max RLE page size computation to avoid code repetition.
  • Use the actual dictionary index width instead of the (outdated) worst case.
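
For context, here is a minimal sketch of how such an upper bound can be derived from Parquet's RLE/bit-packed hybrid encoding. This is an editor's illustration under the stated assumptions, not the function added by this PR; the name `max_rle_page_size_sketch` is hypothetical.

```cpp
#include <cstdint>

// Illustrative upper bound (not the libcudf implementation) on the encoded
// size of `num_values` values written with Parquet's RLE/bit-packed hybrid
// at `value_bit_width` bits per value. Assumed worst case: every value is
// bit-packed, so we pay for the packed bits, one ULEB128 header byte per
// run of up to 504 values, and padding of the final 8-value group.
constexpr uint32_t max_rle_page_size_sketch(uint8_t value_bit_width, uint32_t num_values)
{
  if (value_bit_width == 0) { return 0; }
  // Packed value bits, rounded up to whole bytes.
  uint32_t const packed_bytes = (num_values * value_bit_width + 7) / 8;
  // One header byte per bit-packed run; a single header byte covers at
  // most 63 groups of 8 values (see the illustration further below).
  uint32_t const header_bytes = (num_values + 503) / 504 + 1;
  // The final group is padded to a multiple of 8 values: at most 7 extra
  // values, i.e. at most `value_bit_width` extra bytes.
  return packed_bytes + header_bytes + value_bit_width;
}
```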

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule added the labels bug (Something isn't working), cuIO (cuIO issue), and non-breaking (Non-breaking change) on Oct 20, 2022
@vuule self-assigned this on Oct 20, 2022
@github-actions bot added the label libcudf (Affects libcudf C++/CUDA code) on Oct 20, 2022
```diff
-      page_size =
-        1 + 5 + ((values_in_page * ck_g.dict_rle_bits + 7) >> 3) + (values_in_page >> 8);
+      // Additional byte to store entry bit width
+      page_size = 1 + max_RLE_page_size(ck_g.dict_rle_bits, values_in_page);
```
@vuule (Contributor, Author) commented:
Based on the Apache Parquet format docs:

> Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width).
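
The "1 +" in the new estimate corresponds to that leading bit-width byte. A hypothetical sketch of the page body layout (illustrative, not libcudf code; `make_dict_page_body` is an invented name):

```cpp
#include <cstdint>
#include <vector>

// Dictionary data page body per the quoted spec text: one byte holding the
// bit width of the dictionary indices (max 32), followed by the indices
// encoded with the RLE/bit-packed hybrid.
std::vector<uint8_t> make_dict_page_body(uint8_t index_bit_width,
                                         std::vector<uint8_t> const& rle_indices)
{
  std::vector<uint8_t> body;
  body.reserve(1 + rle_indices.size());
  body.push_back(index_bit_width);  // the "1 +" byte in the size estimate
  body.insert(body.end(), rle_indices.begin(), rle_indices.end());  // RLE payload
  return body;
}
```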

A reviewer (Contributor) commented:

This is the bug fix and the rest of the diff is refactoring, right? Is a C++ test or Python comparison to another reader/writer needed?

A contributor replied:

Technically, there is already a gtest for this. The error was an OOB write that only manifests under memcheck without an RMM pool, which is exercised in the nightly tests.

The reviewer replied:

Good call. Approving.

@codecov codecov bot commented on Oct 21, 2022

Codecov Report

Base: 88.09% // Head: 88.14% // Increases project coverage by +0.04% 🎉

Coverage data is based on head (3a2fe47) compared to base (6ca2ceb).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11962      +/-   ##
================================================
+ Coverage         88.09%   88.14%   +0.04%     
================================================
  Files               133      133              
  Lines             21905    21982      +77     
================================================
+ Hits              19298    19376      +78     
+ Misses             2607     2606       -1     
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| python/dask_cudf/dask_cudf/backends.py | 84.90% <0.00%> | -0.37% ⬇️ |
| python/dask_cudf/dask_cudf/core.py | 73.92% <0.00%> | -0.21% ⬇️ |
| python/dask_cudf/dask_cudf/io/tests/test_csv.py | 100.00% <0.00%> | ø |
| python/dask_cudf/dask_cudf/io/tests/test_orc.py | 100.00% <0.00%> | ø |
| python/dask_cudf/dask_cudf/io/tests/test_json.py | 100.00% <0.00%> | ø |
| ...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py | 100.00% <0.00%> | ø |
| python/cudf/cudf/core/column/column.py | 88.46% <0.00%> | +0.04% ⬆️ |
| python/cudf/cudf/core/dataframe.py | 93.77% <0.00%> | +0.04% ⬆️ |
| python/dask_cudf/dask_cudf/tests/test_core.py | 95.37% <0.00%> | +0.09% ⬆️ |
| python/cudf/cudf/core/column/string.py | 88.65% <0.00%> | +0.12% ⬆️ |

... and 4 more



@vuule vuule marked this pull request as ready for review October 21, 2022 05:59
@vuule vuule requested a review from a team as a code owner October 21, 2022 05:59
```diff
@@ -217,6 +217,14 @@ __global__ void __launch_bounds__(128)
   if (frag_id < num_fragments_per_column and lane_id == 0) groups[column_id][frag_id] = *g;
 }
 
+constexpr uint32_t max_RLE_page_size(uint8_t value_bit_width, uint32_t num_values)
```
A reviewer (Contributor) commented:

Nice refactoring here.
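
For background on the per-run header overhead that such a bound must cover, here is a self-contained illustration (an editor's example, not libcudf code): in Parquet's RLE/bit-packed hybrid, a bit-packed run header is a ULEB128 varint with value `(num_groups << 1) | 1`, so a single header byte can describe at most 63 groups of 8 values.

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
  // One ULEB128 byte carries 7 payload bits; the low bit flags "bit-packed",
  // leaving 6 bits for the group count.
  uint32_t const max_groups_one_byte = 0x7F >> 1;               // 63 groups
  uint32_t const max_values_one_byte = max_groups_one_byte * 8; // 504 values
  std::printf("a one-byte bit-packed header covers up to %u values\n",
              max_values_one_byte);
  return 0;
}
```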


@vuule (Contributor, Author) commented on Oct 21, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 7940b5b into rapidsai:branch-22.12 Oct 21, 2022
@vuule vuule deleted the bug-dict-data-rle-page-size branch October 21, 2022 18:03
Labels
bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf C++/CUDA code), non-breaking (Non-breaking change)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Memcheck error in ParquetDictionaryTest/ParquetSizedTest.DictionaryTest/22
3 participants