
Fix page size calculation in Parquet writer #12182

Merged
8 commits merged into rapidsai:branch-23.02 from etseidl:feature/page_size_fix on Nov 30, 2022

Conversation

etseidl
Contributor

@etseidl etseidl commented Nov 17, 2022

Description

When calculating page boundaries, the current Parquet writer does not take into account storage needed per page for repetition and definition level data. As a consequence pages may sometimes exceed the specified limit, which in turn impacts the ability to compress these pages with codecs that have a maximum buffer size. This PR fixes the page size calculation to take repetition and definition levels into account.
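For reviewers, a rough standalone sketch of the adjustment (illustrative names and a deliberately conservative all-bit-packed RLE bound, not the libcudf implementation; the actual bound is computed by max_RLE_page_size in the writer kernel):

#include <cstdint>

// Worst-case RLE/bit-packed-hybrid size of num_vals level values at bit_width bits
// each, assuming every run is bit-packed and charging one header byte per 8-value
// group (an over-estimate, since a single run header can cover many groups).
constexpr std::int64_t rle_level_size_bound(int bit_width, std::int64_t num_vals)
{
  if (bit_width == 0) { return 0; }                // no levels needed at this depth
  std::int64_t const groups = (num_vals + 7) / 8;  // bit-packed runs hold multiples of 8 values
  return 4                                         // 4-byte length prefix for V1 level data
         + groups * (1 + bit_width);               // header byte + 8 packed values per group
}

// The usable page budget shrinks by the space reserved for repetition and definition
// levels before deciding whether another fragment of values still fits in the page.
constexpr std::int64_t usable_page_budget(std::int64_t max_page_bytes,
                                          int def_level_bits,
                                          int rep_level_bits,
                                          std::int64_t num_vals)
{
  return max_page_bytes - rle_level_size_bound(def_level_bits, num_vals) -
         rle_level_size_bound(rep_level_bits, num_vals);
}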

This also incorporates the fragment size reduction from 5000 to 1000 that was suggested in #12130.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@GPUtester
Collaborator

Can one of the admins verify this patch?

Admins can comment `ok to test` to allow this one PR to run, or `add to allowlist` to allow all future PRs from the same author to run.

@rapids-bot

rapids-bot bot commented Nov 17, 2022

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions bot added the libcudf (Affects libcudf C++/CUDA code) label on Nov 17, 2022
@etseidl
Contributor Author

etseidl commented Nov 17, 2022

I ran the NDS conversion from CSV to Parquet/ZSTD at scale=100. The total output size was 27G, consistent with the output before this change. Individual file sizes are:

12K     income_band
12K     reason
12K     ship_mode
16K     warehouse
20K     call_center
20K     web_site
28K     household_demographics
36K     web_page
40K     store
44K     promotion
484K    time_dim
788K    catalog_page
956K    date_dim
5.0M    customer_demographics
15M     customer_address
26M     item
79M     customer
426M    web_returns
879M    catalog_returns
1.3G    store_returns
1.6G    inventory
3.8G    web_sales
7.8G    catalog_sales
11G     store_sales

compared to the following sizes provided by @jbrennan333:

16K	./gpu_parquet_sf100_zstd/income_band
16K	./gpu_parquet_sf100_zstd/reason
16K	./gpu_parquet_sf100_zstd/ship_mode
20K	./gpu_parquet_sf100_zstd/warehouse
24K	./gpu_parquet_sf100_zstd/call_center
24K	./gpu_parquet_sf100_zstd/web_site
32K	./gpu_parquet_sf100_zstd/household_demographics
40K	./gpu_parquet_sf100_zstd/web_page
44K	./gpu_parquet_sf100_zstd/store
48K	./gpu_parquet_sf100_zstd/promotion
488K	./gpu_parquet_sf100_zstd/time_dim
796K	./gpu_parquet_sf100_zstd/catalog_page
960K	./gpu_parquet_sf100_zstd/date_dim
5.0M	./gpu_parquet_sf100_zstd/customer_demographics
16M	./gpu_parquet_sf100_zstd/customer_address
27M	./gpu_parquet_sf100_zstd/item
82M	./gpu_parquet_sf100_zstd/customer
435M	./gpu_parquet_sf100_zstd/web_returns
887M	./gpu_parquet_sf100_zstd/catalog_returns
1.3G	./gpu_parquet_sf100_zstd/store_returns
1.6G	./gpu_parquet_sf100_zstd/inventory
3.8G	./gpu_parquet_sf100_zstd/web_sales
7.8G	./gpu_parquet_sf100_zstd/catalog_sales
11G	./gpu_parquet_sf100_zstd/store_sales

@etseidl
Contributor Author

etseidl commented Nov 17, 2022

@vuule the benchmarks really are a mixed bag. I'll have to test with 5000 again, but maybe we should hold off on the fragment size change in favor of trying the per-column approach.

@codecov

codecov bot commented Nov 17, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.02@124a8d5).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.02   #12182   +/-   ##
===============================================
  Coverage                ?   88.18%           
===============================================
  Files                   ?      137           
  Lines                   ?    22653           
  Branches                ?        0           
===============================================
  Hits                    ?    19977           
  Misses                  ?     2676           
  Partials                ?        0           



@vuule added the cuIO (cuIO issue), bug (Something isn't working), and non-breaking (Non-breaking change) labels on Nov 17, 2022
@vuule
Contributor

vuule commented Nov 17, 2022

This is my summary of results with just the fragment size change:

Performance impact:

  • 5% throughput increase on average in the writer benchmark. The benefit is very uneven: numeric types lose almost 10% throughput, while list types get up to 110% faster.
  • 15% throughput increase on average in the reader benchmark, with ~10% throughput loss for numeric types and up to 220% gain with lists.

Are you seeing something similar? This is without ZSTD (de)compression, so maybe you're seeing a different picture.

@etseidl
Contributor Author

etseidl commented Nov 17, 2022

My numbers are just a bit more extreme. I'm seeing more like a 13-15% slowdown on integral/time types and as much as a 50% increase for lists. This is on my workstation with an A6000, so I should probably try repeating on the A100s at work. I should also try benchmarking with some different datasets. (Also need to dump to CSV so I can get harder performance numbers.)

@vuule
Contributor

vuule commented Nov 17, 2022

As you suggested earlier, let's take the L on the fragment size and just fix the page size for now :)

@etseidl
Contributor Author

etseidl commented Nov 17, 2022

@vuule the fix is mostly ready for review then...do you think there are any unit tests to add? I can't think of anything beyond writing a file and then reading all the page headers to confirm that all uncompressed sizes are below the threshold. I don't know if there's any value there.

@etseidl etseidl marked this pull request as ready for review November 18, 2022 18:13
@etseidl etseidl requested a review from a team as a code owner November 18, 2022 18:13
Comment on lines 362 to 366
// subtract size of rep and def level vectors
auto num_vals = values_in_page + frag_g.num_values;
this_max_page_size -= max_RLE_page_size(col_g.num_def_level_bits(), num_vals) +
max_RLE_page_size(col_g.num_rep_level_bits(), num_vals);

Contributor

Please make sure to merge upstream with the latest changes in this file.

size_t this_max_page_size = (values_in_page * 2 >= ck_g.num_values) ? 256 * 1024
: (values_in_page * 3 >= ck_g.num_values) ? 384 * 1024
: 512 * 1024;
long this_max_page_size = (values_in_page * 2 >= ck_g.num_values) ? 256 * 1024
Contributor

@ttnghia Nov 29, 2022

You may also need to modify line 360 below to avoid comparing size_t with long:

if(this_max_page_size > static_cast<long>(max_page_size_bytes)) {
  this_max_page_size = static_cast<long>(max_page_size_bytes);
}
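For context, a minimal standalone illustration (hypothetical values, not writer code) of why the mixed comparison is risky and why the cast matters:

#include <cstddef>
#include <cstdio>

int main()
{
  long this_max_page_size         = -1;          // hypothetical: the budget went negative
  std::size_t max_page_size_bytes = 512 * 1024;

  // Mixed signed/unsigned comparison: the long operand is converted to size_t,
  // -1 wraps to SIZE_MAX, and the branch is taken even though -1 < 512 KiB.
  if (this_max_page_size > max_page_size_bytes) { std::puts("wrap-around: branch taken"); }

  // Casting the unsigned side, as suggested above, keeps the comparison signed.
  if (this_max_page_size > static_cast<long>(max_page_size_bytes)) { std::puts("not printed"); }
  return 0;
}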

Contributor

There should also be a similar issue with comparing this_max_page_size to other variables below, but I can't figure out an efficient way to fix them all.

Contributor

Huh. Doesn't the compiler complain about signed/unsigned comparison?

Contributor

I have no idea. Sometimes it is very aggressive; other times it stays silent.

Contributor Author

tbh I expected the compiler to yell at me about this...didn't realize min() was overloaded so much (unlike std::min). Fixed.
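For anyone following along: std::min refuses mixed-type arguments at compile time, while the overloaded device-side min() can quietly accept them, which (per the comment above) appears to be how the mismatch slipped through. A minimal host-side sketch with hypothetical values:

#include <algorithm>
#include <cstddef>

int main()
{
  std::size_t budget = 512 * 1024;
  long adjusted      = 400 * 1024;

  // auto page_size = std::min(budget, adjusted);  // does not compile: T cannot be
  //                                               // deduced as both size_t and long
  auto page_size = std::min<long>(static_cast<long>(budget), adjusted);  // explicit type required
  return page_size == 400 * 1024 ? 0 : 1;
}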

Contributor

Yeah, I've been working with CUDA code/nvcc long enough that I no longer trust its ability to yell at me about non-critical warnings.

Contributor

I believe that the remaining usage of this_max_page_size is at line 369. Can you please check for signed/unsigned comparison there (and fix if necessary)?

Contributor Author

I think the comparison at line 369 is ok since it's checking whether the sum of two 32-bit unsigneds is greater than a 64-bit signed.

Ofc that raises the question of whether using 32 bits for page size is sufficient. But since pages really should be in the kilobytes, it shouldn't be an issue. Maybe we should just change the max_page_size_bytes parameter to be a size_type instead? But that's getting a little out-of-scope maybe.

Contributor

size_type is meant to be used for the number of rows (I'm sure there are many places where it's misused). I prefer to use 64-bit types for byte counting, much as I agree that page size should never be over 2GB :D
Agreed that this would be out of scope for this very focused PR.

Contributor Author

I take that back...if the addition overflows, then the test at 369 might never evaluate to true 😮

Maybe there should be a follow-on PR to clean this up.
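A minimal standalone illustration of the overflow concern (values are hypothetical and chosen only to make the 32-bit sum wrap below the limit):

#include <cstdint>
#include <cstdio>

int main()
{
  std::uint32_t page_size     = 4'294'000'000u;  // close to UINT32_MAX
  std::uint32_t fragment_size = 1'000'000u;
  std::int64_t max_page_size  = 512 * 1024;

  // The 32-bit addition wraps to 32,704 before the comparison, so the
  // "page is full, start a new one" check never fires.
  if (page_size + fragment_size > max_page_size) { std::puts("flush page"); }

  // Widening one operand first keeps the sum exact and the check correct.
  if (std::int64_t{page_size} + fragment_size > max_page_size) { std::puts("flush page (safe)"); }
  return 0;
}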

@vuule
Contributor

vuule commented Nov 29, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f4bb574 into rapidsai:branch-23.02 Nov 30, 2022
@etseidl etseidl deleted the feature/page_size_fix branch November 30, 2022 03:06
rapids-bot bot pushed a commit that referenced this pull request Dec 13, 2022
Reverts a change made in #12182 which exposed some potential issues with calculating page boundaries.  Instead, checks are added to ensure that page sizes will not exceed what can be represented by a signed 32-bit integer.  Also fixes some bounds checking that is no longer correct given changes made to the max page fragment size, and changes the column_index_truncation_length from `size_type` (which was used incorrectly) to `int32_t`.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #12277
Labels
bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf C++/CUDA code), non-breaking (Non-breaking change)