[Parquet] DELTA_BINARY_PACKED constraint on num_bits is too restrictive? #20374
Comments
Jorge Leitão / @jorgecarleitao: Note that pyspark can read it. Attached to this comment is a minimal parquet file reproducing the issue. The code example below results in:

```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/18 21:14:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
spark reads it
Traceback (most recent call last):
  File "bla.py", line 16, in <module>
    table = pyarrow.parquet.read_table("test.parquet")
  File "/home/azureuser/projects/arrow2/arrow-parquet-integration-testing/venv/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
  ...
OSError: delta bit width larger than integer bit width
```

which is how I concluded that this is something specific to arrow :)

```python
import pyarrow.parquet
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.config(
    # see https://stackoverflow.com/a/62024670/931303
    "spark.sql.parquet.enableVectorizedReader",
    "false",
).getOrCreate()

result = spark.read.parquet("test.parquet").collect()
assert [r["c1"] for r in result] == [863490391, -816295192, 1613070492, -1166045478, 1856530847]
print("spark reads it")

table = pyarrow.parquet.read_table("test.parquet")
print("pyarrow reads it")
```
Antoine Pitrou / @pitrou: I would suggest first asking on the Parquet ML whether this is intended to be supported. (Note that, of course, DELTA_BINARY_PACKED should not be used at all for such data, as it produces an expansion... while also being more CPU-intensive to decode.)
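For illustration, a rough sketch of that size comparison (an assumption-laden example, not from the thread: it assumes a recent pyarrow whose Parquet writer accepts `column_encoding="DELTA_BINARY_PACKED"` together with `use_dictionary=False`; the file names and exact sizes are incidental):

```python
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Random int32 values spanning the full type range, so deltas need up to 33 bits.
rng = np.random.default_rng(0)
info = np.iinfo(np.int32)
values = rng.integers(info.min, info.max, size=100_000, dtype=np.int32, endpoint=True)
table = pa.table({"c1": pa.array(values)})

# Write the same column PLAIN-encoded and delta-packed, uncompressed,
# so the comparison reflects the encoding alone.
for encoding in ("PLAIN", "DELTA_BINARY_PACKED"):
    path = f"{encoding.lower()}.parquet"
    pq.write_table(table, path, compression="NONE",
                   use_dictionary=False, column_encoding=encoding)
    print(encoding, os.path.getsize(path), "bytes")
```

Since each delta of full-range int32 data needs up to 33 bits, the delta-packed file should come out no smaller than the plain one, on top of the extra decoding cost.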
…hysical type being encoded (#14392)

The current implementation of the DELTA_BINARY_PACKED encoder can in certain circumstances use more bits for encoding than are present in the data. Specifically, for INT32 data, if the range of values is large enough, the encoder will use up to 33 bits. While not a violation of the Parquet specification, this does cause problems with certain parquet readers (see [this](apache/arrow#20374) arrow-cpp issue, for instance). libcudf and parquet-mr have no issue with reading data encoded with 33 bits, but in the interest of a) greater interoperability and b) smaller files, this PR changes the DELTA_BINARY_PACKED encoder to use no more bits than are present in the physical type being encoded (32 for INT32, 64 for INT64).

The actual change made here is to perform all the delta computations using 32-bit integers for INT32 data and 64-bit integers for INT64 data; the prior encoder used 64-bit integers exclusively. To deal with possible overflow, all math is done on unsigned integers to get defined wrapping behavior (overflow with signed integers is UB), with results cast back to signed afterwards. This is in line with the approach taken by parquet-mr, arrow-cpp, and arrow-rs.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)
- Mike Wilson (https://github.com/hyperbolic2346)
- Bradley Dice (https://github.com/bdice)

URL: #14392
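As a rough illustration of the wrapping arithmetic described above (a plain-Python sketch, not the libcudf implementation; `wrapping_delta_i32` is a hypothetical helper name):

```python
# Deltas of INT32 data are computed modulo 2**32 and reinterpreted as signed
# two's complement, so every delta fits in 32 bits; decoders that add deltas
# with the same wrapping arithmetic recover the original values.
MASK32 = 0xFFFFFFFF

def wrapping_delta_i32(a: int, b: int) -> int:
    """Return (b - a) with 32-bit two's-complement wraparound."""
    d = (b - a) & MASK32                            # unsigned subtraction modulo 2**32
    return d - (1 << 32) if d & 0x80000000 else d   # reinterpret as signed int32

values = [863490391, -816295192, 1613070492, -1166045478, 1856530847]
deltas = [wrapping_delta_i32(a, b) for a, b in zip(values, values[1:])]
print(deltas)  # every delta now fits in a signed 32-bit integer
```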
This issue should be closed given the changes to the Parquet specification in PARQUET-2435.
Thanks @etseidl! Closing then.
Consider the sequence of (int32) values
[863490391,-816295192,1613070492,-1166045478,1856530847]
This sequence can be encoded as a single block, single miniblock with a bit_width of 33.
However, we currently require [1] the bit_width of each miniblock to be no larger than the bit width of the type it encodes.
We could consider lifting this constraint since, as shown in the example above, the value representation's bit_width can be smaller than the delta representation's bit_width.

[1] arrow/cpp/src/parquet/encoding.cc, line 2173 in a376968
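To make the 33-bit figure concrete, here is a small Python sketch of the spec's delta-minus-min-delta step applied to the whole sequence (simplified: it ignores the block/miniblock layout):

```python
values = [863490391, -816295192, 1613070492, -1166045478, 1856530847]

# Consecutive deltas of the int32 values.
deltas = [b - a for a, b in zip(values, values[1:])]

# DELTA_BINARY_PACKED subtracts the block's minimum delta so every
# packed value is non-negative, then bit-packs the results.
min_delta = min(deltas)
adjusted = [d - min_delta for d in deltas]

# The miniblock bit width is the number of bits needed for the largest adjusted delta.
bit_width = max(adjusted).bit_length()
print(deltas)     # [-1679785583, 2429365684, -2779115970, 3022576325]
print(bit_width)  # 33 -- wider than the 32-bit physical type being encoded
```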
Reporter: Jorge Leitão / @jorgecarleitao
Note: This issue was originally created as ARROW-17465. Please see the migration documentation for further details.