Is your feature request related to a problem? Please describe.
We have a customer using parquet files written by a system other than Spark. That system writes strings to parquet files using the schema types binary and repeated binary. In the past, various other big data projects wrote strings this way as well.
Example of the parquet schema for a similar parquet file:
Note that cudf seems to be able to read these files in some cases but fails in others. I'll attach two parquet files, goodbinary.parquet and badbinary.parquet. The only difference between them is that goodbinary.parquet drops the last column of the schema above, optional binary locale, while badbinary.parquet has all of the columns above.
A binary value is a list of bytes, [77 74 74], which is wtt when read as a string.
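To make the two interpretations concrete, here is a minimal sketch in plain Python (not cudf). It assumes the bytes shown above are hexadecimal and UTF-8 encoded, which is consistent with them reading as wtt.

```python
# The same three bytes viewed as raw binary and as a string.
# Assumption: the values 77 74 74 are hexadecimal and UTF-8 encoded.
raw = bytes([0x77, 0x74, 0x74])

as_binary = list(raw)            # binary view: one integer per byte
as_string = raw.decode("utf-8")  # string view of the same bytes

print(as_binary)
print(as_string)
```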
Describe the solution you'd like
Even though reading sometimes seems to work, we would like cudf to officially support reading these binary and repeated binary types. Ideally cudf would support reading binary both as binary and as strings. We could pass in a read schema so the reader would know which type to read each column as.
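As a rough illustration of the read-schema idea, the sketch below uses plain Python. The function `apply_read_schema`, the `"binary"`/`"string"` labels, and the column names are all hypothetical and are not part of the cudf API; it only shows how a per-column schema could select the interpretation of the same stored bytes.

```python
# Hypothetical sketch: a read schema maps column name -> "binary" or "string".
# None of these names exist in cudf; they are made up for illustration.
def apply_read_schema(raw_columns, read_schema):
    """Return columns converted according to read_schema.

    raw_columns: dict of column name -> list of bytes values (or None for nulls)
    read_schema: dict of column name -> "binary" or "string"
    Columns missing from read_schema are left as binary.
    """
    out = {}
    for name, values in raw_columns.items():
        kind = read_schema.get(name, "binary")
        if kind == "string":
            # Decode each non-null value; assumes UTF-8 encoded data.
            out[name] = [None if v is None else v.decode("utf-8") for v in values]
        elif kind == "binary":
            out[name] = list(values)
        else:
            raise ValueError(f"unsupported read type {kind!r} for column {name!r}")
    return out


raw = {"locale": [b"wtt", None], "payload": [b"\x00\xff"]}
result = apply_read_schema(raw, {"locale": "string", "payload": "binary"})
print(result)
```

Defaulting unlisted columns to binary keeps the stored bytes untouched unless the caller explicitly asks for strings.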
Partially addresses: #10733
For a particular way of encoding list schemas (an old way that Spark seems to use sometimes), the parquet reader was accidentally propagating incorrect nesting information between columns. Just a simple bug of not popping an extra value off a stack.
Note: this is only a fix so that the files read correctly. The internal data in the file is actually of binary type, and cudf converts it to string columns. This PR does not add support for binary as a real type in cudf.
Authors:
- https://github.com/nvdbaranec
Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- MithunR (https://github.com/mythrocks)
- GALI PREM SAGAR (https://github.com/galipremsagar)
URL: #10750