Add parameters to control row group size in Parquet writer #9677

Merged
merged 17 commits into from
Nov 17, 2021

Contributor

@vuule vuule commented Nov 12, 2021

Closes #9615

Adds the following API to the Parquet writer:

  • Set maximum row group size, in bytes (minimum of 512KB);
  • Set maximum row group size, in rows (minimum of 5000).

The API is more limited than its ORC equivalent because of limitations in Parquet page size control/estimation.
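As a rough illustration of the two limits listed above, here is a minimal sketch of setters that validate their arguments. It is not the libcudf source: the class name `writer_options_sketch`, the member names, the exception type, and the default values are all assumptions made for this example. Only the two minimums (512KB, taken here as 512 * 1024 bytes, and 5000 rows) come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for a Parquet writer options class (not libcudf code).
class writer_options_sketch {
 public:
  // Minimums quoted in the PR description; 512KB is assumed to mean 512 * 1024 bytes.
  static constexpr std::size_t min_row_group_size_bytes = 512 * 1024;
  static constexpr int min_row_group_size_rows          = 5000;

  // Getters use the get_ prefix, the convention discussed for _options classes.
  std::size_t get_row_group_size_bytes() const { return _row_group_size_bytes; }
  int get_row_group_size_rows() const { return _row_group_size_rows; }

  // Rejects byte limits below the minimum instead of silently clamping them.
  void set_row_group_size_bytes(std::size_t size_bytes)
  {
    if (size_bytes < min_row_group_size_bytes) {
      throw std::invalid_argument("row group size must be at least 512KB");
    }
    _row_group_size_bytes = size_bytes;
  }

  // Rejects row limits below the minimum.
  void set_row_group_size_rows(int size_rows)
  {
    if (size_rows < min_row_group_size_rows) {
      throw std::invalid_argument("row group size must be at least 5000 rows");
    }
    _row_group_size_rows = size_rows;
  }

 private:
  // Default values are illustrative assumptions only.
  std::size_t _row_group_size_bytes = 128 * 1024 * 1024;
  int _row_group_size_rows          = 1000000;
};

// Example usage: valid values are stored, values below the minimum are rejected.
inline void usage_example()
{
  writer_options_sketch opts;
  opts.set_row_group_size_bytes(1024 * 1024);
  opts.set_row_group_size_rows(10000);
  assert(opts.get_row_group_size_bytes() == 1024u * 1024u);
  assert(opts.get_row_group_size_rows() == 10000);

  bool rejected = false;
  try {
    opts.set_row_group_size_rows(100);  // below the 5000-row minimum
  } catch (std::invalid_argument const&) {
    rejected = true;
  }
  assert(rejected);
}
```

Throwing on an out-of-range value, rather than clamping it, keeps a misconfigured limit from going unnoticed; whether the real writer reports the error this way is not stated in this thread.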

Other changes:

  • Fix naming in some ORC APIs to be consistent.
  • Change rowgroup to row_group in APIs, since Parquet specs refer to this as "row group", not "rowgroup".
  • Replace some uint32_t use in Parquet writer.
  • Remove unused target_page_size.

@vuule vuule added the feature request (New feature or request), cuIO (cuIO issue), and breaking (Breaking change) labels Nov 12, 2021
@vuule vuule self-assigned this Nov 12, 2021
@github-actions github-actions bot added the Python (Affects Python cuDF API) and libcudf (Affects libcudf (C++/CUDA) code) labels Nov 12, 2021
@vuule vuule marked this pull request as ready for review November 12, 2021 23:45
@vuule vuule requested review from a team as code owners November 12, 2021 23:45
@vuule vuule requested review from harrism, bdice and shwina November 12, 2021 23:45
Contributor

@bdice bdice left a comment

Nice cleanup. Comments attached.

@vuule vuule requested a review from bdice November 15, 2021 20:03
Contributor

@bdice bdice left a comment

The commentary about page size was a helpful improvement. Thanks!

@vuule vuule requested a review from PointKernel November 15, 2021 20:35
Member

@PointKernel PointKernel left a comment

LGTM. Just some coding-style comments for your reference.

cpp/include/cudf/io/orc.hpp

   /**
    * @brief Returns maximum stripe size, in bytes.
    */
-  auto stripe_size_bytes() const { return _stripe_size_bytes; }
+  auto get_stripe_size_bytes() const { return _stripe_size_bytes; }
Member

I remember that a while back, @codereport and @jrhemstad had a long discussion on whether the get prefix should be used, and the outcome was that omitting the prefix is more readable, IIRC.

Contributor Author

That's true in general. However, we made a conscious decision to use the get_ prefix for the _options classes. I'm unable to find the discussion that led to this (perhaps Jake had the discussion with Conor later on).
We can choose to remove the prefix, but that should be done for all options at once. Either way, I prefer to keep the API consistent, so I'm adding these options with the prefix.

vuule and others added 2 commits November 15, 2021 15:03
Co-authored-by: Yunsong Wang <[email protected]>
@vuule vuule requested a review from PointKernel November 15, 2021 23:54

codecov bot commented Nov 16, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.02@36b3344).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.02    #9677   +/-   ##
===============================================
  Coverage                ?   10.51%           
===============================================
  Files                   ?      118           
  Lines                   ?    20249           
  Branches                ?        0           
===============================================
  Hits                    ?     2130           
  Misses                  ?    18119           
  Partials                ?        0           

Last update 36b3344...a9e04f7.

@vuule vuule requested a review from shwina November 16, 2021 20:18
Contributor Author

vuule commented Nov 17, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9114104 into rapidsai:branch-22.02 Nov 17, 2021
@vuule vuule deleted the fea-parquet-rowgroup-size branch November 17, 2021 18:31
rapids-bot bot pushed a commit that referenced this pull request May 24, 2022
This PR fixes #9615 

Adds two more parameters to the Parquet writer options objects to control page sizes. One sets a target page size in bytes (defaults to 512 KiB), and the other sets a maximum number of rows per page (defaults to 20000, see [this](https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/) for the rationale behind this choice).
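
To make the interaction of the two page limits concrete, here is a minimal sketch of closing a page as soon as either limit would be exceeded. It is an assumption-laden illustration, not the writer's implementation: the function name `split_rows_into_pages` and its parameters are hypothetical, per-row byte sizes are taken as known inputs, and the byte value is treated as a hard cap even though the text above calls it a target. Only the two defaults (512 KiB and 20000 rows) come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not libcudf code): returns the number of rows placed in
// each page when a page is closed before either limit would be exceeded.
inline std::vector<std::size_t> split_rows_into_pages(
  std::vector<std::size_t> const& row_sizes_bytes,
  std::size_t max_page_size_bytes = 512 * 1024,  // default quoted above
  std::size_t max_page_size_rows  = 20000)       // default quoted above
{
  std::vector<std::size_t> rows_per_page;
  std::size_t rows_in_page  = 0;
  std::size_t bytes_in_page = 0;

  for (auto const row_bytes : row_sizes_bytes) {
    bool const over_bytes = bytes_in_page + row_bytes > max_page_size_bytes;
    bool const over_rows  = rows_in_page + 1 > max_page_size_rows;
    // Close the current page first; a page always holds at least one row,
    // so a single oversized row still gets a page of its own.
    if (rows_in_page > 0 && (over_bytes || over_rows)) {
      rows_per_page.push_back(rows_in_page);
      rows_in_page  = 0;
      bytes_in_page = 0;
    }
    ++rows_in_page;
    bytes_in_page += row_bytes;
  }
  if (rows_in_page > 0) { rows_per_page.push_back(rows_in_page); }
  return rows_per_page;
}

// Example usage: small rows hit the row limit, large rows hit the byte limit.
inline void page_split_example()
{
  // 50000 rows of 8 bytes: the 20000-row limit applies first.
  auto const narrow = split_rows_into_pages(std::vector<std::size_t>(50000, 8));
  assert(narrow.size() == 3);

  // 10 rows of 256 KiB: two rows fill the 512 KiB byte limit exactly.
  auto const wide = split_rows_into_pages(std::vector<std::size_t>(10, 256 * 1024));
  assert(wide.size() == 5);
}
```

With narrow rows the row count is the binding limit, which is the case the linked rationale is concerned with; with wide rows the byte size takes over.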

~~This also removes the validation logic (and unit tests) from the row_group_size setters, since the page size is no longer fixed.  Perhaps validation could be moved to the build() function.~~

I've tried to follow the naming convention from #9677

Still needs help with Python bindings and unit tests.

Authors:
  - https://github.com/etseidl

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #10882
Development

Successfully merging this pull request may close these issues.

[FEA] Control rowgroup and page size when writing Parquet files