
Add parameters to control page size in Parquet writer #10882

Merged: 20 commits merged into rapidsai:branch-22.06 from etseidl:feature/pagesize on May 24, 2022

Conversation

@etseidl (Contributor) commented May 17, 2022

This PR fixes #9615.

It adds two more parameters to the Parquet writer options objects to control page sizes: one sets a target page size in bytes (defaults to 512 KiB), and the other sets a maximum number of rows per page (defaults to 20000; see this for the rationale behind that choice).

This also removes the validation logic (and unit tests) from the row_group_size setters, since the page size is no longer fixed. Perhaps validation could be moved to the build() function.

I've tried to follow the naming convention from #9677.

This still needs help with Python bindings and unit tests.
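
For orientation, here is a hedged usage sketch of the builder interface these parameters hang off of; the method names max_page_size_bytes / max_page_size_rows come from the review threads below, and the exact signatures and defaults may differ from the merged code.

```cpp
// Sketch only: assumes the builder methods max_page_size_bytes and
// max_page_size_rows discussed in this PR; verify against the merged API.
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <string>

void write_with_page_limits(cudf::table_view const& table, std::string const& path)
{
  auto options =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{path}, table)
      .max_page_size_bytes(512 * 1024)  // target ~512 KiB of data per page
      .max_page_size_rows(20000)        // soft cap of 20000 rows per page
      .build();
  cudf::io::write_parquet(options);
}
```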

@etseidl etseidl requested a review from a team as a code owner May 17, 2022 21:34
@etseidl etseidl requested review from bdice and codereport May 17, 2022 21:34
@GPUtester (Collaborator) commented

Can one of the admins verify this patch?

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 17, 2022
@raydouglass (Member) commented

okay to test

@PointKernel PointKernel added breaking Breaking change cuIO cuIO issue feature request New feature or request labels May 18, 2022
@bdice (Contributor) left a comment

Hi @etseidl! Thanks for your pull request! I have a few comments / suggestions.

"The maximum row group size cannot be smaller than the page size, which is 512KB.");
_row_group_size_bytes = size_bytes;
}
void set_row_group_size_bytes(size_t size_bytes) { _row_group_size_bytes = size_bytes; }
@bdice (Contributor) commented May 18, 2022

Should this validate the maximum row group size against the maximum page size?

I see the comment in the tests below suggesting to test parameters on initialization, but that isn't safe if the value can be altered after initialization. Perhaps the best approach is to add validation to this method and make the "builder" interface always set the page size before the row group size. Then the error could be raised during the builder call OR after initialization. Edit: sorry, that suggestion doesn't make sense. I'll think about this some more.

Suggested change:

```diff
-void set_row_group_size_bytes(size_t size_bytes) { _row_group_size_bytes = size_bytes; }
+void set_row_group_size_bytes(size_t size_bytes)
+{
+  CUDF_EXPECTS(
+    size_bytes >= _target_page_size_bytes,
+    "The maximum row group size in bytes cannot be smaller than the maximum page size in bytes.");
+  _row_group_size_bytes = size_bytes;
+}
```

@etseidl (Contributor, Author) commented

Yes, I was definitely wanting some help on this logic. My only concern is the potential for confusion when modifying both the row group and page sizes; the page size would have to be changed first.

@vuule (Contributor) commented

To me it makes the most sense to validate in build(). That allows the values to be set in any order and does not delay the validation too much (unlike, e.g., validating in the reader).

@bdice (Contributor) commented May 18, 2022

@vuule Are you concerned about users calling set_target_page_size_bytes and creating an invalid state after the object is built? Is that allowed by the builder interface?

@vuule (Contributor) commented

You're right, that won't always work.
We have an equivalent interface in ORC. There, this is solved by having only a hard-coded minimum for these values instead of comparing the two in the setters. We ensure that the size of the "row group" (called a stripe in ORC) is not smaller than the size of a "page" (called a row group in ORC) in the getters instead of the setters! Setters allow stripes to be set to a smaller size, but the getters for the row group size then return the minimum of the two. Effectively, setting the stripe size low also sets the row group size to the same value.

@etseidl (Contributor, Author) commented

@bdice @vuule Should I take a stab at reproducing the ORC logic?

Contributor comment

@etseidl Yes, that approach seems fine.

Contributor comment

The ORC approach is better than validating in build(), because we allow modifying the options struct after it's built.

"The maximum row group size cannot be smaller than the page size, which is 5000 rows.");
_row_group_size_rows = size_rows;
}
void set_row_group_size_rows(size_type size_rows) { _row_group_size_rows = size_rows; }
Contributor comment

Should this validate the maximum row group size against the maximum page size?

Suggested change:

```diff
-void set_row_group_size_rows(size_type size_rows) { _row_group_size_rows = size_rows; }
+void set_row_group_size_rows(size_type size_rows)
+{
+  CUDF_EXPECTS(
+    size_rows >= _target_page_size_rows,
+    "The maximum row group size in rows cannot be smaller than the maximum page size in rows.");
+  _row_group_size_rows = size_rows;
+}
```

Seven further review threads on cpp/include/cudf/io/parquet.hpp were marked outdated/resolved.
Comment on lines 343 to 344:

```cpp
// override max_page_size if target is smaller
if (max_page_size > max_page_size_bytes) max_page_size = max_page_size_bytes;
```
@bdice (Contributor) commented

We may want a better name for max_page_size, because it isn't clear how this value's meaning differs from max_page_size_bytes.

> // TODO (dm): this convoluted logic to limit page size needs refactoring

@devavret Do you have suggestions on what should be done here?

@etseidl (Contributor, Author) commented May 19, 2022

It seems to me that the intent here is to prevent wildly lopsided page sizes within a column chunk. As the target page size shrinks, this should be less necessary. Maybe threshold_page_size would be a better name? Would something like the following make sense?

```cpp
auto threshold_page_size = max_page_size_bytes;
if (threshold_page_size >= SOME_CUTOFF_VALUE && (values_in_page * 3 >= ck_g.num_values)) {
  threshold_page_size = (values_in_page * 2 >= ck_g.num_values) ? max_page_size_bytes >> 1
                                                                : (max_page_size_bytes >> 2) * 3;
}
```

SOME_CUTOFF_VALUE would be 512 * 1024 to match the current code, but could be adjusted downward if need be.

@bdice (Contributor) commented May 20, 2022

I think that sounds reasonable, but I would like a second reviewer's opinion on this section. cc: @vuule

@devavret (Contributor) commented

I was thinking more along the lines of adding a condition to the if block below that says: cut off the page if the remaining size of the chunk is less than max_page_size_bytes. But that would require calculating the per-chunk byte size beforehand.

I think the current changes are fine for now. For the naming confusion, you could rename max_page_size to this_page_max_size.

BTW, I think this PR only allows the user to make pages smaller than they were going to be. Is there any need to allow larger pages?

Contributor comment

> It seems to me that the intent here is to prevent wildly lopsided page sizes within a column chunk.

I think the real reason may be to avoid pages with a very small number of values. I think we can remove this logic altogether, because the final page is going to have at least 5000 values.

@etseidl (Contributor, Author) commented

OK, so for now I'll just rename the variable but leave the TODO in place.

@etseidl (Contributor, Author) commented

> BTW, I think this PR only allows the user to make pages smaller than they were going to be. Is there any need to allow larger pages?

@devavret Yes, this PR will only make pages smaller. But many Parquet-on-Hadoop guides suggest setting the page size to 1 MB. Should we allow that here, or is that out of scope?

@codecov (bot) commented May 18, 2022

Codecov Report

Merging #10882 (39d027e) into branch-22.06 (6352b4e) will increase coverage by 0.02%.
The diff coverage is 100.00%.

```
@@               Coverage Diff                @@
##           branch-22.06   #10882      +/-   ##
================================================
+ Coverage         86.29%   86.32%   +0.02%
================================================
  Files               144      144
  Lines             22656    22668      +12
================================================
+ Hits              19552    19569      +17
+ Misses             3104     3099       -5
```

| Impacted Files | Coverage Δ |
| --- | --- |
| python/dask_cudf/dask_cudf/groupby.py | 97.42% <100.00%> (+<0.01%) ⬆️ |
| python/dask_cudf/dask_cudf/tests/test_groupby.py | 100.00% <100.00%> (ø) |
| python/cudf/cudf/utils/ioutils.py | 79.47% <0.00%> (-0.13%) ⬇️ |
| python/cudf/cudf/io/json.py | 97.56% <0.00%> (ø) |
| python/cudf/cudf/core/dataframe.py | 93.78% <0.00%> (+0.04%) ⬆️ |
| python/cudf/cudf/core/column/string.py | 88.78% <0.00%> (+0.12%) ⬆️ |
| python/cudf/cudf/core/groupby/groupby.py | 91.79% <0.00%> (+0.22%) ⬆️ |
| python/cudf/cudf/core/tools/datetimes.py | 84.49% <0.00%> (+0.30%) ⬆️ |
| python/cudf/cudf/core/column/lists.py | 91.70% <0.00%> (+0.97%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 80e4262...39d027e.

```cpp
 * @param val The page size to use.
 * @return this for chaining.
 */
parquet_writer_options_builder& max_page_size_bytes(size_t val)
```
Contributor comment

You used "uncompressed" below; it would be good to include that here too, as users are more likely to see it here.

Also, is this a hard limit or a soft limit? That is, what happens if a single row (e.g., a string or binary value) cannot be split and is larger than this setting?

@etseidl (Contributor, Author) commented

Understood. This is a soft limit, since the current implementation operates one fragment at a time, allowing a single fragment to over-fill a page. I believe this is consistent with other Parquet writers, such as parquet-mr.
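
A small sketch of why the limit is soft under fragment-at-a-time operation (hypothetical types and names, not the writer's real internals): pages are cut at fragment boundaries, so a single large fragment can push a page past the byte target.

```cpp
// Sketch: pages are assembled from whole fragments, so max_page_size_bytes
// is a soft limit -- one oversized fragment can exceed it. Hypothetical types.
#include <cstddef>
#include <vector>

struct fragment { std::size_t num_rows; std::size_t num_bytes; };
struct page     { std::size_t num_rows = 0; std::size_t num_bytes = 0; };

std::vector<page> split_into_pages(std::vector<fragment> const& frags,
                                   std::size_t max_page_size_bytes,
                                   std::size_t max_page_size_rows)
{
  std::vector<page> pages(1);
  for (auto const& f : frags) {
    auto& p = pages.back();
    // Close the current page only if it already holds at least one fragment
    // and adding this one would exceed either limit; fragments are never split.
    if (p.num_rows > 0 && (p.num_bytes + f.num_bytes > max_page_size_bytes ||
                           p.num_rows + f.num_rows > max_page_size_rows)) {
      pages.emplace_back();
    }
    pages.back().num_rows  += f.num_rows;
    pages.back().num_bytes += f.num_bytes;
  }
  return pages;
}
```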

```cpp
 * @param val The page size to use.
 * @return this for chaining.
 */
parquet_writer_options_builder& max_page_size_rows(size_type val)
```
Contributor comment

How does this impact lists? I assume the data column will be split on rows too, but it would be good to have that in the docs. For row groups it is the top-level number of rows; it cannot be anything else.

@etseidl (Contributor, Author) commented

As with row groups, this limit is on row count, not value count. I originally wasn't planning to add the row limit, but changed my mind later so that cudf has something akin to the row-count limit from parquet-mr. I will modify the documentation.

Question: should I change the default to 0 and not limit on row count in that case?

Contributor comment

> How does this impact lists? I assume the data column will be split on rows too

The cuIO Parquet writer guarantees that lists won't be split across pages. The upcoming DataPageHeaderV2 in the Parquet spec also guarantees this.

@vuule vuule requested a review from devavret May 19, 2022 19:51
@devavret (Contributor) left a comment

The page size calculations still don't sit well with me, because the size of the last few pages in a chunk is limited based on the number of values left to encode, when what we actually want to limit is the byte size. These are only loosely coupled, as the values may all be null.

If we wanted to avoid pages with a very small number of values, we don't need to worry anyway, because the final page in a chunk is going to have at least 5000 values (the fragment size).

@etseidl (Contributor, Author) commented May 20, 2022

@bdice @vuule @revans2 @devavret Thank you all for the feedback! I think I've addressed all your suggestions. Could you all take another look please?

@devavret (Contributor) left a comment

I still have some reservations about argument validation, but since it's similar to existing code, I think it's out of scope for this PR.

@vuule (Contributor) left a comment

Looks good, thank you for the contribution!

One further review thread on cpp/src/io/parquet/page_enc.cu was marked outdated/resolved.
@etseidl (Contributor, Author) commented May 21, 2022

Thanks again, everyone. I'm not sure if this is the right place to bring this up, but it seems something changed in the CI build environment. I keep getting build failures because the Arrow library references missing symbols. This seems to be happening to other PRs as well.

@bdice (Contributor) commented May 21, 2022

> ... I keep getting build failures because the Arrow library references missing symbols. This seems to be happening to other PRs as well.

Yes, this will be fixed as soon as possible. PR #10275 will resolve this.

@bdice (Contributor) left a comment

This looks great, @etseidl! Thanks for your quick iterations, and thanks to @vuule and @devavret for the design advice.

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label May 23, 2022
@vuule (Contributor) commented May 24, 2022

rerun tests

@vuule (Contributor) commented May 24, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 379cc9f into rapidsai:branch-22.06 May 24, 2022
@etseidl etseidl deleted the feature/pagesize branch June 1, 2022 18:42
Labels

- 5 - Ready to Merge: Testing and reviews complete, ready to merge
- breaking: Breaking change
- cuIO: cuIO issue
- feature request: New feature or request
- libcudf: Affects libcudf (C++/CUDA) code
Projects
None yet
Development

Successfully merging this pull request may close these issues:

- [FEA] Control rowgroup and page size when writing Parquet files
8 participants