Fix page size estimation in Parquet writer #13364
Conversation
// a "worst-case" for page size estimation in the presence of a dictionary. have confirmed | ||
// that this fails for values over 16 in the final term of `max_RLE_page_size()`. | ||
auto elements0 = cudf::detail::make_counting_transform_iterator(0, [](auto i) { | ||
if ((i / 5) % 2 == 1) { |
why is the pattern changed every five elements?
It worked 🤣 This was totally empirical...I changed the divisor until I started getting overwrites in the page encoder. Printing out the run lengths, it seemed to give literal runs of 8 followed by repeated runs of 2. Pretty close to ideal for the purposes of the test.
it seemed to give literal runs of 8 followed by repeated runs of 2
Can you please add this info to the comment above?
Thinking seriously about it, the pattern is CCCCCRRRRRCCCCCRRRRR..., where C changes and R is repeating. The encoder will take the first CCCCCRRR and emit a literal run of 8, followed by a repeated run of RR. Then this repeats with another CCCCCRRR, RR. I'd like to say this was rationally designed, but I got lucky 😄
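For anyone who wants to see the pattern concretely, here is a minimal standalone sketch (not code from the PR; kRepeat is a placeholder, since the body of the test lambda is not shown in the excerpt above) that generates the same index pattern and prints the raw runs of identical values:

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
  constexpr int kRepeat = -1;  // placeholder for the repeating "R" value
  std::vector<int> values;
  for (int i = 0; i < 30; ++i) {
    // same predicate as the test lambda: every other group of 5 indices repeats
    values.push_back(((i / 5) % 2 == 1) ? kRepeat : i);
  }

  // print maximal runs of identical values: expect 5 changing values, then a run of 5, ...
  for (std::size_t pos = 0; pos < values.size();) {
    std::size_t len = 1;
    while (pos + len < values.size() && values[pos + len] == values[pos]) { ++len; }
    std::printf("value %d repeats %zu time(s)\n", values[pos], len);
    pos += len;
  }
  return 0;
}

Per the comment above, the page encoder then splits this stream into literal runs of 8 followed by repeated runs of 2, which is close to the worst case for the header-overhead term.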
/ok to test
/ok to test
Great detective work, thank you for the fix!
I've been thinking about it some more, and for small dictionary bitwidths I think this will still have issues. The real size for this 8/2 case is going to be [...]. Going to switch to draft until I can pin this down... don't want to have to hunt this down again 😅
I did more testing, and when trying to get to low enough bitwidths to trigger problems, the encoder wound up picking different splits, so the current formula was still an overestimation. I'm going to leave [...]
Do you have any benchmark for this fix?
/ok to test
I'm curious too what the downstream effects are of this. Is it something along the lines of: [...]?
By this point the decision to use a dictionary has been made. This is just calculating the size of the buffer needed to hold the encoded data. For the given data, the size estimate is too small, so the encoder overruns the buffer and injects bad keys into a different page's buffer. This is just a more conservative estimate of the space needed to encode. I'd expect a small increase in peak memory, but will be surprised if it's major. More like a few hundred bytes per page.
Ah, I see. Thanks for the explanation.
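For a rough sense of scale (these numbers are illustrative, not from the PR): for a hypothetical page holding 5,000 dictionary indices, the old term reserves 5000 / 256 ≈ 20 overhead bytes while the new term reserves 5000 / 8 = 625, so the estimate grows by roughly 600 bytes per page, consistent with the "few hundred bytes per page" expectation above.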
Here's the peak mem from nvbench where there was a difference. The overhead was a bit more than I thought, but still just under 2% overall.
cpp approval.
/merge
Description
Fixes #13250. The page size estimation for dictionary-encoded pages adds a term to estimate overhead bytes for the bit-packed-header used when encoding bit-packed literal runs. This term originally used a value of 256, but it's hard to see where that value comes from. This PR changes the value to 8, with a possible justification being that the minimum length of a literal run is 8 values. The worst case would be multiple runs of 8, with the required overhead bytes then being num_values/8. This also adds a test that has been verified to fail for values larger than 16 in the problematic term.
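To make the reasoning concrete, here is a simplified sketch of the changed term; this is not the actual max_RLE_page_size() implementation in libcudf, only an illustration of the header-overhead assumption before and after the fix (the data-size term and the example page parameters below are assumptions for illustration):

#include <cstddef>
#include <cstdio>

// Rough page size for num_values dictionary indices of bit_width bits each,
// plus one bit-packed-header byte per values_per_header values.
std::size_t rle_size_estimate(std::size_t num_values,
                              std::size_t bit_width,
                              std::size_t values_per_header)
{
  std::size_t const data_bytes   = (num_values * bit_width + 7) / 8;
  std::size_t const header_bytes = (num_values + values_per_header - 1) / values_per_header;
  return data_bytes + header_bytes;
}

int main()
{
  std::size_t const num_values = 20'000;
  std::size_t const bit_width  = 12;
  // before the fix: one overhead byte assumed per 256 values
  std::printf("old estimate: %zu bytes\n", rle_size_estimate(num_values, bit_width, 256));
  // after the fix: literal runs can be as short as 8 values, so reserve one
  // overhead byte per 8 values (num_values / 8 extra bytes in the worst case)
  std::printf("new estimate: %zu bytes\n", rle_size_estimate(num_values, bit_width, 8));
  return 0;
}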