Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded #14392

etseidl · 2023-11-09T22:12:16Z

Description

The current implementation of the DELTA_BINARY_PACKED encoder can in certain circumstances use more bits for encoding than are present in the data. Specifically, for INT32 data, if the range of values is large enough, the encoder will use up to 33 bits. While not a violation of the Parquet specification, this does cause problems with certain Parquet readers (see arrow-cpp issue apache/arrow#20374, for instance). libcudf and parquet-mr have no issue with reading data encoded with 33 bits, but in the interest of a) greater interoperability and b) smaller files, this PR changes the DELTA_BINARY_PACKED encoder to use no more bits than are present in the physical type being encoded (32 for INT32, 64 for INT64).
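To make the 33-bit case concrete, here is a minimal sketch, not the libcudf code. It assumes a single block in which each delta is stored as its offset from the block's minimum delta, and it compares the packed width when deltas are computed in 64-bit arithmetic against the width when they wrap at 32 bits. The helper names `bits_required`, `width_with_64bit_math`, and `width_with_32bit_math` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Bits needed to hold an unsigned value (0 needs 0 bits).
inline int bits_required(std::uint64_t v)
{
  int n = 0;
  while (v != 0) {
    ++n;
    v >>= 1;
  }
  return n;
}

// Packed width when INT32 deltas are computed with 64-bit signed math.
// Assumes `values` holds at least two elements.
inline int width_with_64bit_math(std::vector<std::int32_t> const& values)
{
  std::vector<std::int64_t> deltas;
  for (std::size_t i = 1; i < values.size(); ++i) {
    deltas.push_back(std::int64_t{values[i]} - std::int64_t{values[i - 1]});
  }
  auto const mn = *std::min_element(deltas.begin(), deltas.end());
  auto const mx = *std::max_element(deltas.begin(), deltas.end());
  // mx - mn is at most 2^33 - 2, so it cannot overflow int64_t.
  return bits_required(static_cast<std::uint64_t>(mx - mn));
}

// Packed width when the same deltas wrap at 32 bits.
// Subtraction is done on uint32_t so that wrapping is well defined; the
// cast back to int32_t is modular on two's complement targets (and is
// guaranteed to be modular from C++20 on).
inline int width_with_32bit_math(std::vector<std::int32_t> const& values)
{
  std::vector<std::int32_t> deltas;
  for (std::size_t i = 1; i < values.size(); ++i) {
    auto const d = static_cast<std::uint32_t>(values[i]) -
                   static_cast<std::uint32_t>(values[i - 1]);
    deltas.push_back(static_cast<std::int32_t>(d));
  }
  auto const mn = *std::min_element(deltas.begin(), deltas.end());
  auto const mx = *std::max_element(deltas.begin(), deltas.end());
  auto const adjusted =
    static_cast<std::uint32_t>(mx) - static_cast<std::uint32_t>(mn);
  return bits_required(adjusted);
}
```

Under these assumptions, the input `{INT32_MAX, INT32_MIN, INT32_MAX}` needs 33 bits with 64-bit math but only 2 bits once the deltas wrap, and no input can exceed 32 bits in the wrapped version because the adjusted delta is itself a `uint32_t`.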

The actual change made here is to perform all the delta computations using 32-bit integers for INT32 data, and 64-bit integers for INT64 data. The prior encoder used 64-bit integers exclusively. To deal with possible overflow, all math is done on unsigned integers, whose arithmetic is defined to wrap modulo 2^N (signed integer overflow is undefined behavior in C++), with results cast back to signed afterwards. Because a reader that adds the deltas back with the same wrapping arithmetic recovers the original values, no information is lost. This is in line with the approach taken by parquet-mr, arrow-cpp, and arrow-rs.
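The wrapping arithmetic described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `delta_enc.cuh`: the helpers `wrapping_sub`, `wrapping_add`, `encode_deltas`, and `decode_deltas` are hypothetical names, and the sketch runs on the host and leaves out the min-delta adjustment and bit packing entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>
#include <vector>

// a - b computed in the unsigned counterpart of T, where wrapping modulo
// 2^N is well defined, then cast back to the signed type. The final cast
// is modular on two's complement targets (guaranteed from C++20 on).
template <typename T>
T wrapping_sub(T a, T b)
{
  using U = std::make_unsigned_t<T>;
  return static_cast<T>(static_cast<U>(static_cast<U>(a) - static_cast<U>(b)));
}

// a + b with the same unsigned round trip.
template <typename T>
T wrapping_add(T a, T b)
{
  using U = std::make_unsigned_t<T>;
  return static_cast<T>(static_cast<U>(static_cast<U>(a) + static_cast<U>(b)));
}

// Deltas between consecutive values, using the width of T itself.
template <typename T>
std::vector<T> encode_deltas(std::vector<T> const& values)
{
  std::vector<T> deltas;
  for (std::size_t i = 1; i < values.size(); ++i) {
    deltas.push_back(wrapping_sub(values[i], values[i - 1]));
  }
  return deltas;
}

// Inverse of encode_deltas: a wrapping running sum starting at `first`.
template <typename T>
std::vector<T> decode_deltas(T first, std::vector<T> const& deltas)
{
  std::vector<T> values{first};
  for (T d : deltas) {
    values.push_back(wrapping_add(values.back(), d));
  }
  return values;
}
```

With `T = std::int32_t`, the delta from `INT32_MAX` to `INT32_MIN` comes out as 1 rather than a 33-bit negative number, and decoding with `wrapping_add` restores the original sequence. The same templates cover INT64 by instantiating with `std::int64_t`.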

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2023-11-09T22:12:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

GregoryKimball · 2023-11-13T19:29:58Z

Thank you @etseidl for contributing this. (Part of #13501, to target 24.02 release)

vuule · 2023-11-17T23:35:11Z

/ok to test

cpp/src/io/parquet/delta_enc.cuh

vuule

Looks good, if a bit odd :D
Can you please add a note in the description about the 2s complement math going on with this change?

PointKernel

LGTM

cpp/src/io/parquet/delta_enc.cuh

PointKernel · 2023-12-05T20:40:42Z

/ok to test

hyperbolic2346

Thank you for doing this!

vuule · 2023-12-06T01:34:45Z

@ttnghia should we wait on your review before merging? I see that you self-requested a review.

ttnghia · 2023-12-06T01:37:31Z

I'm fine to merge it now. If it is still here then I can take a look later tonight.

etseidl · 2023-12-06T01:46:20Z

Don't we need a python review?

vuule · 2023-12-06T02:05:42Z

> Don't we need a python review?

Well, NOW we do! :D

Thanks for the reminder (I don't see github requiring the python review). I'll ping folks tomorrow :)

bdice

Python approval. 🐍 ✅

vuule · 2023-12-06T02:31:31Z

/merge

…hysical type being encoded (rapidsai#14392)

The current implementation of the DELTA_BINARY_PACKED encoder can in certain circumstances use more bits for encoding than are present in the data. Specifically, for INT32 data, if the range of values is large enough, the encoder will use up to 33 bits. While not a violation of the Parquet specification, this does cause problems with certain parquet readers (see [this](apache/arrow#20374) arrow-cpp issue, for instance). libcudf and parquet-mr have no issue with reading data encoded with 33 bits, but in the interest of a) greater interoperability and b) smaller files, this PR changes the DELTA_BINARY_PACKED encoder to use no more bits than are present in the physical type being encoded (32 for INT32, 64 for INT64).

The actual change made here is to perform all the delta computations using 32-bit integers for INT32 data, and 64-bit integers for INT64 data. The prior encoder used 64-bit integers exclusively. To deal with possible overflow, all math is done on unsigned integers to get defined wrapping behavior (overflow with signed integers is UB), with results cast back to signed afterwards. This is in line with the approach taken by parquet-mr, arrow-cpp, and arrow-rs.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)
- Mike Wilson (https://github.com/hyperbolic2346)
- Bradley Dice (https://github.com/bdice)

URL: rapidsai#14392

etseidl and others added 3 commits November 8, 2023 14:43

limit encoding bitwidth to size of physical type

7031f80

test delta encoder with more row counts

a020677

Merge branch 'rapidsai:branch-23.12' into delta_bitwidth

0717436

etseidl requested review from a team as code owners November 9, 2023 22:12

etseidl requested review from shwina, galipremsagar, harrism and bdice and removed request for a team November 9, 2023 22:12

github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and Python (Affects Python cuDF API) labels Nov 9, 2023

Merge branch 'branch-24.02' into delta_bitwidth

6275bc3

vuule self-requested a review November 15, 2023 19:01

etseidl and others added 4 commits November 15, 2023 19:01

Merge branch 'branch-24.02' into delta_bitwidth

0819b84

Merge branch 'branch-24.02' into delta_bitwidth

2ab0252

Merge remote-tracking branch 'origin/branch-24.02' into delta_bitwidth

1a7db26

Merge branch 'branch-24.02' into delta_bitwidth

2ee022e

vuule self-assigned this Nov 17, 2023

vuule added the cuIO (cuIO issue), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Nov 17, 2023

Merge branch 'branch-24.02' into delta_bitwidth

b2944d2

vuule reviewed Nov 21, 2023


harrism removed their request for review November 21, 2023 07:13

vuule approved these changes Nov 22, 2023


etseidl added 6 commits November 28, 2023 13:35

Merge branch 'branch-24.02' into delta_bitwidth

7fbd06c

Merge branch 'branch-24.02' into delta_bitwidth

2e039fc

Merge branch 'branch-24.02' into delta_bitwidth

7b1448c

Merge branch 'branch-24.02' into delta_bitwidth

550d37d

Merge branch 'branch-24.02' into delta_bitwidth

a449357

Merge branch 'rapidsai:branch-24.02' into delta_bitwidth

5908af6

PointKernel self-requested a review December 5, 2023 18:24

PointKernel approved these changes Dec 5, 2023


ttnghia self-requested a review December 5, 2023 23:05

hyperbolic2346 approved these changes Dec 5, 2023


bdice approved these changes Dec 6, 2023


rapids-bot bot merged commit d97b3e0 into rapidsai:branch-24.02 Dec 6, 2023
70 checks passed

etseidl deleted the delta_bitwidth branch December 6, 2023 17:09
