[FEA] Parquet support for reading binary and repeated binary #10733

tgravescs · 2022-04-25T21:57:26Z

Is your feature request related to a problem? Please describe.
We have a customer that is using parquet files written by a system other than Spark. That system writes strings to the parquet files using the schema types binary and repeated binary. In the past, various other big data projects wrote strings this way as well.

Example of the parquet schema for a similar parquet file:

Schema:
message test {
  optional binary x;
  optional int32 y;
  optional int32 z;
  repeated binary net;
  optional binary locale;
}

Note that cudf seems to be able to read these files in some cases but fails in others. I'll attach 2 parquet files, goodbinary.parquet and badbinary.parquet. The only difference between them is that goodbinary.parquet drops the last column of the schema above (optional binary locale), while badbinary.parquet has all of the columns above.

A binary value is a list of bytes, for example [77 74 74] (hexadecimal), which is wtt when read as a string.
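
To make the byte/string relationship concrete, here is a minimal Python sketch. It assumes the bytes shown above are hexadecimal and that the strings are UTF-8 encoded; both are assumptions, and the example uses only the standard library, not cudf.

```python
# The raw value as it would be stored in a parquet `binary` column,
# written here as hexadecimal byte values.
raw = bytes([0x77, 0x74, 0x74])

# Reading the value "as binary" keeps the individual byte values.
as_binary = list(raw)

# Reading the value "as string" decodes the same bytes (UTF-8 assumed).
as_string = raw.decode("utf-8")

print(as_binary)  # [119, 116, 116]
print(as_string)  # wtt
```

The same three bytes are therefore valid under either interpretation, which is why the reader needs to be told which one the caller wants.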

Describe the solution you'd like
Even though this sometimes seems to work, we would like cudf to officially support reading these binary and repeated binary types. Ideally we could support reading binary both as binary and as strings. We could pass in a read schema so the reader would know what type to read each column as.
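
The read-schema idea could look roughly like the sketch below. This is not cudf's API: `apply_read_schema`, the `"binary"`/`"string"` type names, and the in-memory column layout are all hypothetical, chosen only to illustrate a per-column choice between the two interpretations, including for repeated binary columns.

```python
from typing import Any, Dict, List


def _convert(value: Any, as_string: bool) -> Any:
    """Convert one cell; lists (repeated binary) are converted element by element."""
    if value is None:
        return None
    if isinstance(value, list):
        return [_convert(v, as_string) for v in value]
    # UTF-8 is an assumption; the source does not state the encoding.
    return value.decode("utf-8") if as_string else value


def apply_read_schema(columns: Dict[str, List[Any]],
                      read_schema: Dict[str, str]) -> Dict[str, List[Any]]:
    """Hypothetical helper: interpret each binary column as the schema requests."""
    out = {}
    for name, values in columns.items():
        kind = read_schema.get(name, "binary")  # default: leave bytes untouched
        if kind not in ("binary", "string"):
            raise ValueError(f"unsupported read type for {name!r}: {kind}")
        out[name] = [_convert(v, kind == "string") for v in values]
    return out


# Toy data shaped like the schema above: `x` is optional binary,
# `net` is repeated binary (a list of byte strings per row).
raw_columns = {
    "x": [b"wtt", None],
    "net": [[b"a", b"b"], []],
}
result = apply_read_schema(raw_columns, {"x": "string", "net": "binary"})
print(result)
```

Defaulting to `"binary"` in this sketch means no data is reinterpreted unless the caller asks for it; a real reader might reasonably choose the opposite default.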


tgravescs · 2022-04-25T21:59:45Z

binaryparquet.zip

This zip file contains the 2 parquet files described above

nvdbaranec · 2022-04-26T20:36:59Z

This turns out to be a simple bug in the reader. Working on it.

Partially addresses: #10733

For a particular way of encoding list schemas (an old way that Spark seems to use sometimes), the parquet reader was accidentally propagating incorrect nesting information between columns. Just a simple bug of not popping an extra value off a stack.

Note: this is simply a fix so that the files read correctly, however the internal data in the file is actually of binary type and cudf converts these to string columns. This PR does not add support for binary as a real type in cudf.

Authors:
- https://github.com/nvdbaranec

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- MithunR (https://github.com/mythrocks)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10750
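
As a toy illustration of this class of bug (not the actual cudf reader code, whose internals are not shown in this thread), the Python sketch below walks a flat schema and tracks list nesting on a stack. The function name `nesting_depths` and the `pop_after_list` flag are hypothetical. When the pop is skipped, the nesting level opened by a repeated field leaks into the columns that follow it.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str]  # (column name, repetition: "optional" or "repeated")


def nesting_depths(schema: List[Field], pop_after_list: bool = True) -> Dict[str, int]:
    """Return the list-nesting depth recorded for each column of a flat schema."""
    stack: List[str] = []  # one entry per list level currently open
    depths: Dict[str, int] = {}
    for name, repetition in schema:
        if repetition == "repeated":
            stack.append(name)          # a one-level list opens a nesting level
            depths[name] = len(stack)
            if pop_after_list:
                stack.pop()             # close the level before the next column
        else:
            depths[name] = len(stack)   # scalar column: inherits the open levels
    return depths


bad_schema = [("x", "optional"), ("y", "optional"), ("z", "optional"),
              ("net", "repeated"), ("locale", "optional")]
good_schema = bad_schema[:-1]  # same schema without the trailing `locale` column

fixed = nesting_depths(bad_schema, pop_after_list=True)
buggy = nesting_depths(bad_schema, pop_after_list=False)
print(fixed["locale"], buggy["locale"])  # 0 1
```

In this toy model the missing pop only matters when another column follows the repeated one, which would be consistent with goodbinary.parquet reading correctly while badbinary.parquet did not.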

nvdbaranec · 2022-05-23T19:56:58Z

Closed with: #10750

tgravescs added the feature request (New feature or request), Needs Triage (Need team to review and classify), cuIO (cuIO issue), and Spark (Functionality that helps Spark RAPIDS) labels Apr 25, 2022

nvdbaranec self-assigned this Apr 26, 2022

nvdbaranec mentioned this issue Apr 27, 2022

Fix an issue with one_level_list schemas in parquet reader. #10750

Merged

tgravescs mentioned this issue May 3, 2022

[FEA] Support reading binary data types from Parquet as binary (not strings) NVIDIA/spark-rapids#5416

Closed

nvdbaranec closed this as completed May 23, 2022

tgravescs mentioned this issue Jun 3, 2022

[FEA] Parquet support for reading binary and repeated binary as binary not strings #11044

Closed

bdice removed the Needs Triage (Need team to review and classify) label Mar 4, 2024
