Fix for Parquet writer when requested pages per row is smaller than fragment size #13806
/ok to test
Should we refactor this condition to use a few local variables? It's getting pretty hefty.
cpp/src/io/parquet/page_enc.cu
@@ -433,8 +433,9 @@ __global__ void __launch_bounds__(128)
max_RLE_page_size(col_g.num_rep_level_bits(), num_vals));

if (num_rows >= ck_g.num_rows ||
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size)) ||
rows_in_page + frag_g.num_rows > max_page_size_rows) {
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size ||
Why is values_in_page used here instead of rows_in_page?
History? On the reader side there's the problem of pages with 0 rows but many values, so maybe that's why values_in_page was originally used?
omg, I forgot about 0 row pages 😮💨
Need to stare at this some more now
cpp/src/io/parquet/page_enc.cu
rows_in_page + frag_g.num_rows > max_page_size_rows) {
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size ||
rows_in_page + frag_g.num_rows > max_page_size_rows)) ||
rows_in_page >= max_page_size_rows) {
When is this check relevant? As I understand this, if rows_in_page >= max_page_size_rows, then values_in_page > 0 is also true and the rows_in_page + frag_g.num_rows > max_page_size_rows condition will be checked anyway.
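The reasoning in the comment above can be checked by brute force on a small grid. The following is only a host-side sketch, not the kernel code: the function names are hypothetical, the is_last_chunk term is left out because it is OR-ed into both variants equally, and it assumes that every row contributes at least one value (so values_in_page >= rows_in_page).

```cpp
#include <cstddef>

// Hypothetical model of the condition WITHOUT the trailing
// `rows_in_page >= max_page_size_rows` clause.
bool close_without_extra_clause(std::size_t rows_in_page,
                                std::size_t values_in_page,
                                std::size_t frag_rows,
                                std::size_t max_rows,
                                bool bytes_exceeded)
{
  return values_in_page > 0 && (bytes_exceeded || rows_in_page + frag_rows > max_rows);
}

// Hypothetical model of the condition WITH the trailing clause.
bool close_with_extra_clause(std::size_t rows_in_page,
                             std::size_t values_in_page,
                             std::size_t frag_rows,
                             std::size_t max_rows,
                             bool bytes_exceeded)
{
  return close_without_extra_clause(
           rows_in_page, values_in_page, frag_rows, max_rows, bytes_exceeded) ||
         rows_in_page >= max_rows;
}

// Counts the grid points where the two variants disagree.
// Assumption: values_in_page = rows_in_page + extra, i.e. at least one value per row.
// `min_frag_rows` is the smallest fragment row count included in the grid.
std::size_t count_disagreements(std::size_t min_frag_rows)
{
  std::size_t disagreements = 0;
  for (std::size_t max_rows = 1; max_rows <= 8; ++max_rows) {
    for (std::size_t rows = 0; rows <= 12; ++rows) {
      for (std::size_t extra = 0; extra <= 2; ++extra) {
        for (std::size_t frag = min_frag_rows; frag <= 12; ++frag) {
          for (int b = 0; b < 2; ++b) {
            bool const bytes_exceeded = (b == 1);
            bool const with_clause =
              close_with_extra_clause(rows, rows + extra, frag, max_rows, bytes_exceeded);
            bool const without_clause =
              close_without_extra_clause(rows, rows + extra, frag, max_rows, bytes_exceeded);
            if (with_clause != without_clause) { ++disagreements; }
          }
        }
      }
    }
  }
  return disagreements;
}
```

Under these assumptions, count_disagreements(1) is 0, which agrees with the comment: when every fragment has at least one row, the extra clause never changes the outcome. The variants only differ when a zero-row fragment is allowed (count_disagreements(0) is nonzero), for example when rows_in_page equals max_page_size_rows and the fragment is empty. Whether the kernel can reach that state outside the last-chunk case is not established here.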
I'm not sure...put it back in just to be safe. I'll see if I can dream up some weird edge case where this is necessary 😅
cpp/src/io/parquet/page_enc.cu
if (num_rows >= ck_g.num_rows ||
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size)) ||
rows_in_page + frag_g.num_rows > max_page_size_rows) {
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size ||
rows_in_page + frag_g.num_rows > max_page_size_rows)) ||
rows_in_page >= max_page_size_rows) {
To be honest, I feel this entire if condition is too complex and error-prone. Can we break it into multiple conditions? For example:
auto const name1 = num_rows >= ck_g.num_rows;
auto const name2 = ....;
....
if(name1 && name2 || ....) {
...
}
// checks to see when we need to close the current page and start a new one
auto const is_last_chunk = num_rows >= ck_g.num_rows;
auto const is_page_bytes_exceeded = page_size + fragment_data_size > this_max_page_size;
auto const is_page_rows_exceeded = rows_in_page + frag_g.num_rows > max_page_size_rows;
// only check for limit overflow if there's already at least one fragment for this page
auto const is_page_too_big =
  values_in_page > 0 && (is_page_bytes_exceeded || is_page_rows_exceeded);

if (is_last_chunk || is_page_too_big) {
Fantastic! This is much cleaner and makes it easier to understand what's going on.
Looks great, thanks for cleaning up the condition!
Just one change request :)
/ok to test
/merge
Description
#12685 introduced a bug in page calculation. If the max_page_size_rows parameter is set smaller than the page fragment size, the writer will produce a spurious empty page. This PR fixes this by only checking the fragment size if there are already rows in the page, and then restores the old check for the number of rows exceeding the page limit. Interestingly, libcudf can read these files with empty pages just fine, but parquet-mr cannot.
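To illustrate the spurious empty page, here is a minimal host-side sketch of the page-splitting loop. It is not the actual kernel in page_enc.cu: the Fragment struct, the Condition enum, and split_into_pages are hypothetical names, and the loop is simplified to the counters that appear in the condition (page_size, rows_in_page, values_in_page). The "old" and "fixed" conditions follow the diff quoted in this PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a page fragment; only the fields used by the
// page-closing decision are modeled.
struct Fragment {
  std::size_t num_rows;
  std::size_t num_values;
  std::size_t data_size;
};

enum class Condition { Old, Fixed };

// Returns the row count of each page the simplified loop would emit.
// Assumes `frags` is non-empty; one extra iteration models the last-chunk
// case that flushes the final page.
std::vector<std::size_t> split_into_pages(std::vector<Fragment> const& frags,
                                          std::size_t max_page_size_bytes,
                                          std::size_t max_page_size_rows,
                                          Condition cond)
{
  std::vector<std::size_t> pages;
  std::size_t page_size      = 0;
  std::size_t rows_in_page   = 0;
  std::size_t values_in_page = 0;

  for (std::size_t i = 0; i <= frags.size(); ++i) {
    bool const is_last_chunk = (i == frags.size());
    Fragment const frag      = is_last_chunk ? Fragment{0, 0, 0} : frags[i];

    bool const is_page_bytes_exceeded = page_size + frag.data_size > max_page_size_bytes;
    bool const is_page_rows_exceeded  = rows_in_page + frag.num_rows > max_page_size_rows;

    bool close_page = false;
    if (cond == Condition::Old) {
      // Old condition: the row limit is checked even when the page is still empty.
      close_page = is_last_chunk || (values_in_page > 0 && is_page_bytes_exceeded) ||
                   is_page_rows_exceeded;
    } else {
      // Fixed condition: limits are only checked once the page holds a fragment.
      close_page =
        is_last_chunk || (values_in_page > 0 && (is_page_bytes_exceeded || is_page_rows_exceeded));
    }

    if (close_page) {
      pages.push_back(rows_in_page);
      page_size      = 0;
      rows_in_page   = 0;
      values_in_page = 0;
    }

    // Add the current fragment to the (possibly new) page.
    page_size += frag.data_size;
    rows_in_page += frag.num_rows;
    values_in_page += frag.num_values;
  }
  return pages;
}
```

With two 5000-row fragments and max_page_size_rows set to 1000, this model emits pages of 0, 5000, and 5000 rows under the old condition, and pages of 5000 and 5000 rows under the fixed one. The leading 0-row page corresponds to the spurious empty page described above.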
Checklist