Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for Parquet writer when requested pages per row is smaller than fragment size #13806

Merged
merged 3 commits into from
Aug 3, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Aug 2, 2023

Description

#12685 introduced a bug in page calculation. If the max_page_size_rows parameter is set smaller than the page fragment size, the writer will produce a spurious empty page. This PR fixes this by only checking the fragment size if there are already rows in the page, and then returns the old check for number of rows exceeding the page limit.

Interestingly, libcudf can read these files with empty pages just fine, but parquet-mr cannot.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner August 2, 2023 20:10
@etseidl etseidl requested review from harrism and ttnghia August 2, 2023 20:10
@rapids-bot
Copy link

rapids-bot bot commented Aug 2, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 2, 2023
@vuule vuule added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Aug 2, 2023
@vuule
Copy link
Contributor

vuule commented Aug 2, 2023

/ok to test

@vuule vuule self-requested a review August 2, 2023 20:29
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we refactor this condition to use a few local variables? It's getting pretty hefty.

@@ -433,8 +433,9 @@ __global__ void __launch_bounds__(128)
max_RLE_page_size(col_g.num_rep_level_bits(), num_vals));

if (num_rows >= ck_g.num_rows ||
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size)) ||
rows_in_page + frag_g.num_rows > max_page_size_rows) {
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is values_in_page used here instead of rows_in_page?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

History? On the reader side there's the problem of pages with 0 rows but many values, so maybe that's why values_in_page was originally used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

omg, I forgot about 0 row pages 😮‍💨
Need to stare at this some more now

rows_in_page + frag_g.num_rows > max_page_size_rows) {
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size ||
rows_in_page + frag_g.num_rows > max_page_size_rows)) ||
rows_in_page >= max_page_size_rows) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is this check relevant? As I understand this, if rows_in_page >= max_page_size_rows, then values_in_page > 0 is also true and the rows_in_page + frag_g.num_rows > max_page_size_rows condition will be checked anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure...put it back in just to be safe. I'll see if I can dream up some weird edge case where this is necessary 😅

Comment on lines 435 to 438
if (num_rows >= ck_g.num_rows ||
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size)) ||
rows_in_page + frag_g.num_rows > max_page_size_rows) {
(values_in_page > 0 && (page_size + fragment_data_size > this_max_page_size ||
rows_in_page + frag_g.num_rows > max_page_size_rows)) ||
rows_in_page >= max_page_size_rows) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I feel the entire of this if condition is too complex and error-prone. Can we break it into multiple conditions? For example:

auto const name1 = num_rows >= ck_g.num_rows;
auto const name2 = ....;
....

if(name1 && name2 || ....) {
...
}

@etseidl etseidl requested review from vuule and ttnghia August 2, 2023 21:49
Comment on lines +435 to +443
// checks to see when we need to close the current page and start a new one
auto const is_last_chunk = num_rows >= ck_g.num_rows;
auto const is_page_bytes_exceeded = page_size + fragment_data_size > this_max_page_size;
auto const is_page_rows_exceeded = rows_in_page + frag_g.num_rows > max_page_size_rows;
// only check for limit overflow if there's already at least one fragment for this page
auto const is_page_too_big =
values_in_page > 0 && (is_page_bytes_exceeded || is_page_rows_exceeded);

if (is_last_chunk || is_page_too_big) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic! This is much cleaner and better to understand what's going on.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for cleaning up the condition!
Just one change request :)

cpp/tests/io/parquet_test.cpp Outdated Show resolved Hide resolved
@etseidl etseidl requested a review from vuule August 2, 2023 23:01
@vuule
Copy link
Contributor

vuule commented Aug 2, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Aug 3, 2023

/merge

@rapids-bot rapids-bot bot merged commit 399efb9 into rapidsai:branch-23.10 Aug 3, 2023
@etseidl etseidl deleted the fix_small_row_count branch August 3, 2023 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants