
[Parquet] DELTA_BINARY_PACKED constraint on num_bits is too restrictive? #20374

Closed
asfimport opened this issue Aug 18, 2022 · 8 comments

@asfimport
Collaborator

Consider the sequence of (int32) values

[863490391,-816295192,1613070492,-1166045478,1856530847]

This sequence can be encoded as a single block, single miniblock with a bit_width of 33.

However, we currently require [1] the bit_width of each miniblock to be no larger than the bit width of the type it encodes.

We could consider lifting this constraint: as the example above shows, the bit_width needed to represent the values can be smaller than the bit_width needed to represent their deltas (a short sketch follows the code reference below).

[1]

if (bit_width_data[i] > kMaxDeltaBitWidth) {
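
A minimal sketch (plain Python, not the Arrow C++ code path) of how the 33-bit width arises: DELTA_BINARY_PACKED stores the first value, then bit-packs each delta offset by the block's minimum delta, and for this sequence the largest offset delta needs 33 bits.

values = [863490391, -816295192, 1613070492, -1166045478, 1856530847]
deltas = [b - a for a, b in zip(values, values[1:])]
min_delta = min(deltas)
packed = [d - min_delta for d in deltas]   # what a miniblock actually bit-packs
print(deltas)                    # [-1679785583, 2429365684, -2779115970, 3022576325]
print(max(packed).bit_length())  # 33 -> wider than the 32-bit physical type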

Reporter: Jorge Leitão / @jorgecarleitao

Note: This issue was originally created as ARROW-17465. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
It would be useful to find out what parquet-mr (the Java Parquet implementation) does about this.
In particular, we don't want to produce files that wouldn't be readable by parquet-mr, I think.

@asfimport
Collaborator Author

Jorge Leitão / @jorgecarleitao:
Agree. The issue description was not very clear: we currently only support reading DELTA_BINARY_PACKED, and it is the read path that currently bails.

Note that pyspark can read it. Attached to this comment is a minimal parquet file reproducing the issue. The code example results in

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/18 21:14:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
spark reads it                                                                  
Traceback (most recent call last):
 File "bla.py", line 16, in <module>
   table = pyarrow.parquet.read_table("test.parquet")
 File "/home/azureuser/projects/arrow2/arrow-parquet-integration-testing/venv/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
   ...
OSError: delta bit width larger than integer bit width

which is how I concluded that this is something specific to Arrow :)

import pyarrow.parquet
import pyspark.sql


# Disable the vectorized reader so Spark falls back to parquet-mr's
# record-level Parquet reading.
spark = pyspark.sql.SparkSession.builder.config(
    # see https://stackoverflow.com/a/62024670/931303
    "spark.sql.parquet.enableVectorizedReader",
    "false",
).getOrCreate()

# parquet-mr (via Spark) decodes the file without complaint.
result = spark.read.parquet("test.parquet").collect()
assert [r["c1"] for r in result] == [863490391, -816295192, 1613070492, -1166045478, 1856530847]
print("spark reads it")

# Parquet C++ (via pyarrow) raises: OSError: delta bit width larger than integer bit width
table = pyarrow.parquet.read_table("test.parquet")
print("pyarrow reads it")

test.parquet

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Hmm, it seems it will be complicated for Parquet C++ to accept this without some rearchitecturing of the decoder.
(roughly, if we decode Int32 values, we first decode the deltas as a buffer of Int32 as well...)

I would suggest first asking on the Parquet ML (mailing list) whether this is intended to be supported.

(note that of course for such data, DELTA_BINARY_PACKED should not be used at all, as it produces an expansion... while being more CPU-intensive to decode as well)
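
To make the decoder concern concrete, a short illustration in plain Python (using the int32 sequence from the issue description): the raw deltas themselves do not fit in an int32 buffer, so a decoder that materializes deltas in the value type needs either wrapping semantics or a wider buffer.

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
deltas = [-1679785583, 2429365684, -2779115970, 3022576325]   # deltas of the example sequence
print([INT32_MIN <= d <= INT32_MAX for d in deltas])          # [True, False, False, False]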

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Also, for Int64 data, it would mean we would have to accept 65-bit deltas.
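
A hedged sketch of that Int64 worst case, using hypothetical values chosen to hit the int64 extremes: the packed deltas (delta minus the block's minimum delta) need 65 bits.

values = [0, 2**63 - 1, -2**63]   # hypothetical INT64 extremes
deltas = [b - a for a, b in zip(values, values[1:])]
min_delta = min(deltas)
packed = [d - min_delta for d in deltas]
print(max(packed).bit_length())   # 65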

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Dec 6, 2023
…hysical type being encoded (#14392)

The current implementation of the DELTA_BINARY_PACKED encoder can in certain circumstances use more bits for encoding than are present in the data. Specifically, for INT32 data, if the range of values is large enough, the encoder will use up to 33 bits. While not a violation of the Parquet specification, this does cause problems with certain parquet readers (see [this](apache/arrow#20374) arrow-cpp issue, for instance). libcudf and parquet-mr have no issue with reading data encoded with 33 bits, but in the interest of a) greater interoperability and b) smaller files, this PR changes the DELTA_BINARY_PACKED encoder to use no more bits than are present in the physical type being encoded (32 for INT32, 64 for INT64).

The actual change made here is to perform all the delta computations using 32-bit integers for INT32 data, and 64-bit integers for INT64 data. The prior encoder used 64-bit integers exclusively. To deal with possible overflow, all math is done on unsigned integers to get defined wrapping behavior (overflow with signed integers is UB), with results cast back to signed afterwards. This is in line with the approach taken by parquet-mr, arrow-cpp, and arrow-rs.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)

URL: #14392
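
A minimal sketch of the approach described in the commit message above (assumed here, not taken from any particular library's source): do the delta arithmetic modulo 2**32 for INT32 data so the packed deltas never need more bits than the physical type, and rely on the same wrapping when decoding. The same idea with a 64-bit mask covers INT64.

MASK32 = (1 << 32) - 1

def to_int32(u):
    # reinterpret an unsigned 32-bit value as a signed int32
    return u - (1 << 32) if u >= (1 << 31) else u

values = [863490391, -816295192, 1613070492, -1166045478, 1856530847]

# encode: wrapping (mod 2**32) deltas, then offset by the minimum delta
deltas = [to_int32((b - a) & MASK32) for a, b in zip(values, values[1:])]
min_delta = min(deltas)
packed = [(d - min_delta) & MASK32 for d in deltas]
assert max(packed).bit_length() <= 32   # fits the 32-bit physical type

# decode: the same wrapping addition recovers the original values
out = [values[0]]
for p in packed:
    out.append(to_int32((out[-1] + min_delta + p) & MASK32))
assert out == values
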
karthikeyann pushed a commit to karthikeyann/cudf that referenced this issue Dec 12, 2023
…hysical type being encoded (rapidsai#14392)

URL: rapidsai#14392
@mapleFU
Member

mapleFU commented Feb 19, 2024

@etseidl After #37940 is applied, would we prevent generating new 33/65-bit data? Though we can still read such legacy data.

@etseidl
Contributor

etseidl commented Feb 19, 2024

@mapleFU AFAIK the arrow writer does not produce 33/65-bit data either before or after #37940, and it still cannot read bit widths larger than the physical type. This reminds me that I would still like to add some clarification to the specification around this issue.

@etseidl
Contributor

etseidl commented Jun 20, 2024

This issue should be closed given the changes to the Parquet specification in PARQUET-2435.

cc @pitrou @mapleFU

@pitrou
Member

pitrou commented Jun 24, 2024

Thanks @etseidl ! Closing then.

@pitrou closed this as not planned on Jun 24, 2024