
Fix page size calculation in Parquet writer #12182

Merged
8 commits merged into rapidsai:branch-23.02 from etseidl:feature/page_size_fix on Nov 30, 2022

Conversation

etseidl
Contributor

@etseidl etseidl commented Nov 17, 2022

Description

When calculating page boundaries, the current Parquet writer does not take into account storage needed per page for repetition and definition level data. As a consequence pages may sometimes exceed the specified limit, which in turn impacts the ability to compress these pages with codecs that have a maximum buffer size. This PR fixes the page size calculation to take repetition and definition levels into account.
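For reviewers, a rough standalone sketch of the adjustment (illustrative names and a deliberately conservative all-bit-packed RLE bound, not the libcudf implementation; the actual bound is computed by max_RLE_page_size in the writer kernel):

#include <cstdint>

// Worst-case RLE/bit-packed-hybrid size of num_vals level values at bit_width bits
// each, assuming every run is bit-packed and charging one header byte per 8-value
// group (an over-estimate, since a single run header can cover many groups).
constexpr std::int64_t rle_level_size_bound(int bit_width, std::int64_t num_vals)
{
  if (bit_width == 0) { return 0; }                // no levels needed at this depth
  std::int64_t const groups = (num_vals + 7) / 8;  // bit-packed runs hold multiples of 8 values
  return 4                                         // 4-byte length prefix for V1 level data
         + groups * (1 + bit_width);               // header byte + 8 packed values per group
}

// The usable page budget shrinks by the space reserved for repetition and definition
// levels before deciding whether another fragment of values still fits in the page.
constexpr std::int64_t usable_page_budget(std::int64_t max_page_bytes,
                                          int def_level_bits,
                                          int rep_level_bits,
                                          std::int64_t num_vals)
{
  return max_page_bytes - rle_level_size_bound(def_level_bits, num_vals) -
         rle_level_size_bound(rep_level_bits, num_vals);
}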

This also incorporates the fragment size reduction from 5000 to 1000 that was suggested in #12130.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@GPUtester
Collaborator

Can one of the admins verify this patch?

Admins can comment `ok to test` to allow this one PR to run, or `add to allowlist` to allow all future PRs from the same author to run.

@rapids-bot

rapids-bot bot commented Nov 17, 2022

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions bot added the libcudf (Affects libcudf C++/CUDA code) label on Nov 17, 2022
@etseidl
Contributor Author

etseidl commented Nov 17, 2022

I ran the NDS conversion from CSV to Parquet/ZSTD at scale=100. The total output size was 27G, consistent with the output before this change. Individual file sizes are:

12K     income_band
12K     reason
12K     ship_mode
16K     warehouse
20K     call_center
20K     web_site
28K     household_demographics
36K     web_page
40K     store
44K     promotion
484K    time_dim
788K    catalog_page
956K    date_dim
5.0M    customer_demographics
15M     customer_address
26M     item
79M     customer
426M    web_returns
879M    catalog_returns
1.3G    store_returns
1.6G    inventory
3.8G    web_sales
7.8G    catalog_sales
11G     store_sales

compared to the following sizes provided by @jbrennan333:

16K	./gpu_parquet_sf100_zstd/income_band
16K	./gpu_parquet_sf100_zstd/reason
16K	./gpu_parquet_sf100_zstd/ship_mode
20K	./gpu_parquet_sf100_zstd/warehouse
24K	./gpu_parquet_sf100_zstd/call_center
24K	./gpu_parquet_sf100_zstd/web_site
32K	./gpu_parquet_sf100_zstd/household_demographics
40K	./gpu_parquet_sf100_zstd/web_page
44K	./gpu_parquet_sf100_zstd/store
48K	./gpu_parquet_sf100_zstd/promotion
488K	./gpu_parquet_sf100_zstd/time_dim
796K	./gpu_parquet_sf100_zstd/catalog_page
960K	./gpu_parquet_sf100_zstd/date_dim
5.0M	./gpu_parquet_sf100_zstd/customer_demographics
16M	./gpu_parquet_sf100_zstd/customer_address
27M	./gpu_parquet_sf100_zstd/item
82M	./gpu_parquet_sf100_zstd/customer
435M	./gpu_parquet_sf100_zstd/web_returns
887M	./gpu_parquet_sf100_zstd/catalog_returns
1.3G	./gpu_parquet_sf100_zstd/store_returns
1.6G	./gpu_parquet_sf100_zstd/inventory
3.8G	./gpu_parquet_sf100_zstd/web_sales
7.8G	./gpu_parquet_sf100_zstd/catalog_sales
11G	./gpu_parquet_sf100_zstd/store_sales

@etseidl
Contributor Author

etseidl commented Nov 17, 2022

@vuule the benchmarks really are a mixed bag. I'll have to test with 5000 again, but maybe we should hold off on the fragment size change in favor of trying the per-column approach.

@codecov

codecov bot commented Nov 17, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.02@124a8d5).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.02   #12182   +/-   ##
===============================================
  Coverage                ?   88.18%           
===============================================
  Files                   ?      137           
  Lines                   ?    22653           
  Branches                ?        0           
===============================================
  Hits                    ?    19977           
  Misses                  ?     2676           
  Partials                ?        0           



@vuule added the cuIO (cuIO issue), bug (Something isn't working), and non-breaking (Non-breaking change) labels on Nov 17, 2022
@vuule
Contributor

vuule commented Nov 17, 2022

This is my summary of results with just the fragment size change:

Performance impact:

  • 5% throughput increase on average in the writer benchmark. The benefit is very uneven: numeric types lose almost 10% throughput, while list types get up to 110% faster.
  • 15% throughput increase on average in the reader benchmark, with ~10% throughput loss for numeric types and up to 220% gain with lists.

Are you seeing something similar? This is without ZSTD (de)compression, so maybe you're seeing a different picture.

@etseidl
Contributor Author

etseidl commented Nov 17, 2022

My numbers are just a bit more extreme. I'm seeing more like a 13-15% slowdown on integral/time types and as much as a 50% increase for lists. This is on my workstation with an A6000, so I should probably try repeating on the A100s at work. I should also try benchmarking with some different datasets. (Also need to dump to CSV so I can get harder performance numbers.)

@vuule
Contributor

vuule commented Nov 17, 2022

As you suggested earlier, let's take the L on the fragment size and just fix the page size for now :)

@etseidl
Contributor Author

etseidl commented Nov 17, 2022

@vuule the fix is mostly ready for review then...do you think there are any unit tests to add? I can't think of anything beyond writing a file and then reading all the page headers to confirm that all uncompressed sizes are below the threshold. I don't know if there's any value there.

@etseidl etseidl marked this pull request as ready for review November 18, 2022 18:13
@etseidl etseidl requested a review from a team as a code owner November 18, 2022 18:13
Comment on lines 362 to 366
// subtract size of rep and def level vectors
auto num_vals = values_in_page + frag_g.num_values;
this_max_page_size -= max_RLE_page_size(col_g.num_def_level_bits(), num_vals) +
max_RLE_page_size(col_g.num_rep_level_bits(), num_vals);

Contributor

Please make sure to merge upstream with the latest changes in this file.

size_t this_max_page_size = (values_in_page * 2 >= ck_g.num_values) ? 256 * 1024
: (values_in_page * 3 >= ck_g.num_values) ? 384 * 1024
: 512 * 1024;
long this_max_page_size = (values_in_page * 2 >= ck_g.num_values) ? 256 * 1024
Contributor

@ttnghia Nov 29, 2022

You may also need to modify line 360 below to avoid comparing size_t with long:

if(this_max_page_size > static_cast<long>(max_page_size_bytes)) {
  this_max_page_size = static_cast<long>(max_page_size_bytes);
}
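For context, a minimal standalone illustration (hypothetical values, not writer code) of why the mixed comparison is risky and why the cast matters:

#include <cstddef>
#include <cstdio>

int main()
{
  long this_max_page_size         = -1;          // hypothetical: the budget went negative
  std::size_t max_page_size_bytes = 512 * 1024;

  // Mixed signed/unsigned comparison: the long operand is converted to size_t,
  // -1 wraps to SIZE_MAX, and the branch is taken even though -1 < 512 KiB.
  if (this_max_page_size > max_page_size_bytes) { std::puts("wrap-around: branch taken"); }

  // Casting the unsigned side, as suggested above, keeps the comparison signed.
  if (this_max_page_size > static_cast<long>(max_page_size_bytes)) { std::puts("not printed"); }
  return 0;
}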

Contributor

There should also be a similar issue with comparing this_max_page_size to other variables below, but I can't figure out an efficient way to fix them all.

Contributor

Huh. Doesn't the compiler complain about signed/unsigned comparison?

Contributor

I have no idea. Sometimes it is very aggressive; other times it stays silent.

Contributor Author

tbh I expected the compiler to yell at me about this...didn't realize min() was overloaded so much (unlike std::min). Fixed.
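For anyone following along: std::min refuses mixed-type arguments at compile time, while the overloaded device-side min() can quietly accept them, which (per the comment above) appears to be how the mismatch slipped through. A minimal host-side sketch with hypothetical values:

#include <algorithm>
#include <cstddef>

int main()
{
  std::size_t budget = 512 * 1024;
  long adjusted      = 400 * 1024;

  // auto page_size = std::min(budget, adjusted);  // does not compile: T cannot be
  //                                               // deduced as both size_t and long
  auto page_size = std::min<long>(static_cast<long>(budget), adjusted);  // explicit type required
  return page_size == 400 * 1024 ? 0 : 1;
}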

Contributor

Yeah, I've been working with CUDA code/nvcc long enough that I no longer trust its ability to yell at me about non-critical warnings.

Contributor

I believe that the remaining usage of this_max_page_size is at line 369. Can you please check for signed/unsigned comparison there (and fix if necessary)?

Contributor Author

I think the comparison at line 369 is ok since it's checking whether the sum of two 32-bit unsigneds is greater than a 64-bit signed.

Ofc that raises the question of whether using 32 bits for page size is sufficient. But since pages really should be in the kilobytes, it shouldn't be an issue. Maybe we should just change the max_page_size_bytes parameter to be a size_type instead? But that's getting a little out-of-scope maybe.

Contributor

size_type is meant to be used for the number of rows (I'm sure there are many places where it's misused). I prefer to use 64-bit types for byte counting, much as I agree that page size should never be over 2GB :D
Agreed that this would be out of scope for this very focused PR.

Contributor Author

I take that back...if the addition overflows, then the test at 369 might never evaluate to true 😮

Maybe there should be a follow-on PR to clean this up.
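A minimal standalone illustration of the overflow concern (values are hypothetical and chosen only to make the 32-bit sum wrap below the limit):

#include <cstdint>
#include <cstdio>

int main()
{
  std::uint32_t page_size     = 4'294'000'000u;  // close to UINT32_MAX
  std::uint32_t fragment_size = 1'000'000u;
  std::int64_t max_page_size  = 512 * 1024;

  // The 32-bit addition wraps to 32,704 before the comparison, so the
  // "page is full, start a new one" check never fires.
  if (page_size + fragment_size > max_page_size) { std::puts("flush page"); }

  // Widening one operand first keeps the sum exact and the check correct.
  if (std::int64_t{page_size} + fragment_size > max_page_size) { std::puts("flush page (safe)"); }
  return 0;
}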

@vuule
Contributor

vuule commented Nov 29, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f4bb574 into rapidsai:branch-23.02 Nov 30, 2022
@etseidl etseidl deleted the feature/page_size_fix branch November 30, 2022 03:06
rapids-bot bot pushed a commit that referenced this pull request Dec 13, 2022
Reverts a change made in #12182 which exposed some potential issues with calculating page boundaries.  Instead, checks are added to ensure that page sizes will not exceed what can be represented by a signed 32-bit integer.  Also fixes some bounds checking that is no longer correct given changes made to the max page fragment size, and changes the column_index_truncation_length from `size_type` (which was used incorrectly) to `int32_t`.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #12277
Labels
bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf C++/CUDA code), non-breaking (Non-breaking change)