
[Parquet] DELTA_BINARY_PACKED constraint on num_bits is too restrictive? #20374

Closed
asfimport opened this issue Aug 18, 2022 · 8 comments

@asfimport
Collaborator

Consider the sequence of (int32) values

[863490391,-816295192,1613070492,-1166045478,1856530847]

This sequence can be encoded as a single block, single miniblock with a bit_width of 33.

However, we currently require [1] the bit_width of each miniblock to be no larger than the bit width of the type it encodes.

We could consider lifting this constraint: as the example above shows, the bit_width needed to represent the values can be smaller than the bit_width needed to represent their deltas (a short sketch follows the code reference below).

[1]

if (bit_width_data[i] > kMaxDeltaBitWidth) {
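
A minimal sketch (plain Python, not the Arrow C++ code path) of how the 33-bit width arises: DELTA_BINARY_PACKED stores the first value, then bit-packs each delta offset by the block's minimum delta, and for this sequence the largest offset delta needs 33 bits.

values = [863490391, -816295192, 1613070492, -1166045478, 1856530847]
deltas = [b - a for a, b in zip(values, values[1:])]
min_delta = min(deltas)
packed = [d - min_delta for d in deltas]   # what a miniblock actually bit-packs
print(deltas)                    # [-1679785583, 2429365684, -2779115970, 3022576325]
print(max(packed).bit_length())  # 33 -> wider than the 32-bit physical type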

Reporter: Jorge Leitão / @jorgecarleitao

Note: This issue was originally created as ARROW-17465. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
It would be useful to find out what parquet-mr (the Java Parquet implementation) does about this.
In particular, we don't want to produce files that wouldn't be readable by parquet-mr, I think.

@asfimport
Collaborator Author

Jorge Leitão / @jorgecarleitao:
Agree. The issue description was not very clear: we currently only support reading DELTA_BINARY_PACKED, and it is the read path that currently bails.

Note that pyspark can read it. Attached to this comment is a minimal parquet file reproducing the issue. The code example results in

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/18 21:14:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
spark reads it                                                                  
Traceback (most recent call last):
 File "bla.py", line 16, in <module>
   table = pyarrow.parquet.read_table("test.parquet")
 File "/home/azureuser/projects/arrow2/arrow-parquet-integration-testing/venv/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
   ...
OSError: delta bit width larger than integer bit width

which is how I concluded that this is something specific to Arrow :)

import pyarrow.parquet
import pyspark.sql


# Disable the vectorized reader so Spark falls back to parquet-mr's
# record-level Parquet reading.
spark = pyspark.sql.SparkSession.builder.config(
    # see https://stackoverflow.com/a/62024670/931303
    "spark.sql.parquet.enableVectorizedReader",
    "false",
).getOrCreate()

# parquet-mr (via Spark) decodes the file without complaint.
result = spark.read.parquet("test.parquet").collect()
assert [r["c1"] for r in result] == [863490391, -816295192, 1613070492, -1166045478, 1856530847]
print("spark reads it")

# Parquet C++ (via pyarrow) raises: OSError: delta bit width larger than integer bit width
table = pyarrow.parquet.read_table("test.parquet")
print("pyarrow reads it")

test.parquet

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Hmm, it seems it will be complicated for Parquet C++ to accept this without some rearchitecturing of the decoder.
(roughly, if we decode Int32 values, we first decode the deltas as a buffer of Int32 as well...)

I would suggest first asking on the Parquet ML (mailing list) whether this is intended to be supported.

(note that of course for such data, DELTA_BINARY_PACKED should not be used at all, as it produces an expansion... while being more CPU-intensive to decode as well)
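
To make the decoder concern concrete, a short illustration in plain Python (using the int32 sequence from the issue description): the raw deltas themselves do not fit in an int32 buffer, so a decoder that materializes deltas in the value type needs either wrapping semantics or a wider buffer.

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
deltas = [-1679785583, 2429365684, -2779115970, 3022576325]   # deltas of the example sequence
print([INT32_MIN <= d <= INT32_MAX for d in deltas])          # [True, False, False, False]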

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Also, for Int64 data, it would mean we would have to accept 65-bit deltas.
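
A hedged sketch of that Int64 worst case, using hypothetical values chosen to hit the int64 extremes: the packed deltas (delta minus the block's minimum delta) need 65 bits.

values = [0, 2**63 - 1, -2**63]   # hypothetical INT64 extremes
deltas = [b - a for a, b in zip(values, values[1:])]
min_delta = min(deltas)
packed = [d - min_delta for d in deltas]
print(max(packed).bit_length())   # 65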

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Dec 6, 2023
…hysical type being encoded (#14392)

The current implementation of the DELTA_BINARY_PACKED encoder can in certain circumstances use more bits for encoding than are present in the data. Specifically, for INT32 data, if the range of values is large enough, the encoder will use up to 33 bits. While not a violation of the Parquet specification, this does cause problems with certain parquet readers (see [this](apache/arrow#20374) arrow-cpp issue, for instance). libcudf and parquet-mr have no issue with reading data encoded with 33 bits, but in the interest of a) greater interoperability and b) smaller files, this PR changes the DELTA_BINARY_PACKED encoder to use no more bits than are present in the physical type being encoded (32 for INT32, 64 for INT64).

The actual change made here is to perform all the delta computations using 32-bit integers for INT32 data, and 64-bit integers for INT64 data. The prior encoder used 64-bit integers exclusively. To deal with possible overflow, all math is done on unsigned integers to get defined wrapping behavior (overflow with signed integers is UB), with results cast back to signed afterwards. This is in line with the approach taken by parquet-mr, arrow-cpp, and arrow-rs.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)

URL: #14392
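
A minimal sketch of the approach described in the commit message above (assumed here, not taken from any particular library's source): do the delta arithmetic modulo 2**32 for INT32 data so the packed deltas never need more bits than the physical type, and rely on the same wrapping when decoding. The same idea with a 64-bit mask covers INT64.

MASK32 = (1 << 32) - 1

def to_int32(u):
    # reinterpret an unsigned 32-bit value as a signed int32
    return u - (1 << 32) if u >= (1 << 31) else u

values = [863490391, -816295192, 1613070492, -1166045478, 1856530847]

# encode: wrapping (mod 2**32) deltas, then offset by the minimum delta
deltas = [to_int32((b - a) & MASK32) for a, b in zip(values, values[1:])]
min_delta = min(deltas)
packed = [(d - min_delta) & MASK32 for d in deltas]
assert max(packed).bit_length() <= 32   # fits the 32-bit physical type

# decode: the same wrapping addition recovers the original values
out = [values[0]]
for p in packed:
    out.append(to_int32((out[-1] + min_delta + p) & MASK32))
assert out == values
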
karthikeyann pushed a commit to karthikeyann/cudf that referenced this issue Dec 12, 2023
…hysical type being encoded (rapidsai#14392)

URL: rapidsai#14392
@mapleFU
Member

mapleFU commented Feb 19, 2024

@etseidl After #37940 is applied, would we prevent generating new 33/65-bit data? Though we can still read such legacy data.

@etseidl
Contributor

etseidl commented Feb 19, 2024

@mapleFU AFAIK the arrow writer does not produce 33/65-bit data either before or after #37940, and it still cannot read bit widths larger than the physical type. This reminds me that I would still like to add some clarification to the specification around this issue.

@etseidl
Contributor

etseidl commented Jun 20, 2024

This issue should be closed given the changes to the Parquet specification in PARQUET-2435.

cc @pitrou @mapleFU

@pitrou
Member

pitrou commented Jun 24, 2024

Thanks @etseidl ! Closing then.

@pitrou closed this as not planned on Jun 24, 2024