[FEA] Parquet support for reading binary and repeated binary #10733

Closed
tgravescs opened this issue Apr 25, 2022 · 3 comments
Labels: cuIO, feature request, Spark

Comments

@tgravescs
Contributor

Is your feature request related to a problem? Please describe.
We have a customer whose Parquet files are written by a system other than Spark. That system writes strings to Parquet using the schema types binary and repeated binary. Various other big data projects have written strings this way in the past as well.

Example of the parquet schema for a similar parquet file:

Schema:
message test {
  optional binary x;
  optional int32 y;
  optional int32 z;
  repeated binary net;
  optional binary locale;
}

Note that cudf seems to be able to read these files in some cases but fails in others. I'll attach two Parquet files, goodbinary.parquet and badbinary.parquet. The only difference between them is that goodbinary.parquet drops the last column of the schema above (optional binary locale), while badbinary.parquet has all of the columns.

A binary value is a list of bytes, e.g. [77 74 74] in hexadecimal, which reads as the string "wtt".
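
As a quick check of the example, here is a minimal sketch using only the Python standard library. It assumes the byte values are written in hexadecimal and that the text is UTF-8 encoded; the variable names are illustrative only.

```python
# The three bytes from the example, written in hexadecimal.
raw = bytes.fromhex("77 74 74")

# Interpreting the same bytes as UTF-8 text gives the string form.
text = raw.decode("utf-8")

print(list(raw))  # [119, 116, 116]
print(text)       # wtt
```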

Describe the solution you'd like
Even though this sometimes appears to work, we would like cudf to officially support reading these binary and repeated binary types. Ideally cudf could read binary data both as binary and as strings. We could pass in a read schema so the reader knows which type to read each column as.
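
To make the read-schema idea concrete, here is a rough pure-Python sketch. It is not cudf's API: `apply_read_schema`, `convert_cell`, and the "binary"/"string" type names are hypothetical, and the sample values are made up. It only illustrates a per-column choice between raw bytes and decoded text, including for repeated binary.

```python
from typing import Dict, List, Optional, Union

# A cell is either null, one binary value, or a list of them (repeated binary).
Cell = Optional[Union[bytes, List[bytes]]]


def convert_cell(cell, as_type):
    """Surface one cell either as raw bytes or as decoded UTF-8 text."""
    if cell is None:
        return None
    if isinstance(cell, list):
        return [convert_cell(item, as_type) for item in cell]
    return cell.decode("utf-8") if as_type == "string" else cell


def apply_read_schema(columns: Dict[str, List[Cell]], read_schema: Dict[str, str]):
    """Hypothetical helper: columns absent from read_schema stay binary."""
    for requested in read_schema.values():
        if requested not in ("binary", "string"):
            raise ValueError(f"unsupported read type: {requested}")
    return {
        name: [convert_cell(cell, read_schema.get(name, "binary")) for cell in cells]
        for name, cells in columns.items()
    }


# Made-up sample data shaped like the binary columns of the schema above.
raw_columns = {
    "x": [b"wtt", None],
    "net": [[b"a", b"b"], []],
    "locale": [b"en_US", b"fr_FR"],
}

# Ask for x and net as strings; locale is left as binary.
decoded = apply_read_schema(raw_columns, {"x": "string", "net": "string"})
print(decoded)
```

Defaulting unlisted columns to binary is a design choice of this sketch; a real reader could just as well default to strings.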

@tgravescs added the feature request, Needs Triage, cuIO, and Spark labels Apr 25, 2022
@tgravescs
Contributor Author

binaryparquet.zip

This zip file contains the 2 parquet files described above

@nvdbaranec
Contributor

This turns out to be a simple bug in the reader. Working on it.

@nvdbaranec nvdbaranec self-assigned this Apr 26, 2022
rapids-bot bot pushed a commit that referenced this issue Apr 29, 2022
Partially addresses: #10733

For a particular way of encoding list schemas (an old way that Spark seems to use sometimes), the parquet reader was accidentally propagating incorrect nesting information between columns. The cause was a simple bug: an extra value was not being popped off a stack.
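
The class of bug described above can be illustrated with a toy sketch. This is not the cudf reader's code: the `nesting_depths` function, its `pop_after_repeated` flag, and the flat `(name, repetition)` schema representation are all hypothetical, chosen only to show how a missing pop lets one column's nesting leak into the next.

```python
from typing import Dict, List, Tuple

# Leaf fields of the example schema as (name, repetition) pairs.
SCHEMA: List[Tuple[str, str]] = [
    ("x", "optional"),
    ("y", "optional"),
    ("z", "optional"),
    ("net", "repeated"),
    ("locale", "optional"),
]


def nesting_depths(schema: List[Tuple[str, str]], pop_after_repeated: bool) -> Dict[str, int]:
    """Toy schema walk: the stack holds one entry per enclosing list level."""
    stack: List[str] = []
    depths: Dict[str, int] = {}
    for name, repetition in schema:
        if repetition == "repeated":
            # A bare repeated field acts as its own list level.
            stack.append(name)
        depths[name] = len(stack)
        if repetition == "repeated" and pop_after_repeated:
            # Leaving the field must also leave its list level.
            stack.pop()
    return depths


buggy = nesting_depths(SCHEMA, pop_after_repeated=False)
fixed = nesting_depths(SCHEMA, pop_after_repeated=True)

print(buggy["locale"])  # 1: stale list level inherited from "net"
print(fixed["locale"])  # 0: "locale" is a plain column again
```

In this toy model the stale entry only matters when another column follows the repeated one, which is consistent with the reported difference between goodbinary.parquet and badbinary.parquet.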

Note: this is simply a fix so that the files read correctly. The internal data in the file is actually of binary type, and cudf converts it to string columns. This PR does not add support for binary as a real type in cudf.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - MithunR (https://github.com/mythrocks)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10750
@nvdbaranec
Contributor

Closed with: #10750
