[BUG] Reading some lists in parquet produces incorrect results #9240
Comments
Why can't I reproduce it? Test data frame: data_array_of_primitive = [ …
21/09/24 02:27:08 WARN GpuOverrides: +------------+ …
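The snippet above is truncated, but a reproduction attempt of this shape in PySpark might look like the sketch below (values and paths are illustrative, not the original test data). As the next comment explains, Spark cannot write the problematic encoding, so a file produced this way reads back correctly:

```python
# Illustrative sketch of a reproduction attempt like the one above
# (hypothetical values; the original snippet is truncated). Spark writes
# lists with the standard three-level encoding, so this does NOT
# trigger the bug.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data_array_of_primitive = [([1, 2],), ([1, 2, 3],)]
df = spark.createDataFrame(data_array_of_primitive, ["a"])
df.write.mode("overwrite").parquet("/tmp/array-of-int32-spark.parquet")

# Reading this file back (with Spark or cudf) returns the original rows.
```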
@smone123 you cannot produce this type of parquet file with Spark, as I said before in the "Additional context" section. To make this work we had to use the low-level parquet Java API and do some very specific optimizations. Spark has a unit test that does this, and that is how we found it.
Some updates on this issue: there could be kernel errors as well, but first of all, the schema is not properly interpreted when parsing Parquet's Thrift Compact Protocol encoded metadata. More specifically, the converted type returned by the read schema should contain …
Closes #9240

This PR adds [one-level list encoding](https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.h#L43-L77) support to the parquet reader. It also includes cleanups like removing an unused stream argument and fixing typos in docs/comments.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Devavret Makkar (https://github.com/devavret)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9848
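For context on what the fix addresses: the Parquet spec allows a legacy "one-level" list encoding in which a repeated field is itself the list, instead of the usual three-level group / repeated group / element layout. A sketch, assuming the attached file is in the current directory, that dumps the file's schema with pyarrow, where the difference and the converted-type annotation mentioned in the earlier comment are visible:

```python
# Sketch: inspect how a parquet file encodes its lists. Assumes the
# attached array-of-int32.parquet from this issue is in the current
# directory; any parquet file works.
import pyarrow.parquet as pq

schema = pq.ParquetFile("array-of-int32.parquet").schema
print(schema)  # prints the raw parquet schema tree, including
               # converted-type annotations such as (List)

# Per the parquet spec, a standard three-level list (what Spark
# normally writes) looks like:
#   optional group f (List) {
#     repeated group list {
#       optional int32 element;
#     }
#   }
# The legacy one-level encoding this issue is about collapses that to:
#   repeated int32 f;
# i.e. the repeated field itself is the list element.
```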
Describe the bug
When testing parquet list support we were able to produce a file that Spark can read in correctly, but cudf cannot. Thanks @res-life for finding this. The data is simple.
But when cudf tries to read it in, it gets confused and reads 6 values instead:
1, 2, 1, 2, 3, 2
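A minimal sketch of the mismatch, assuming the attached file has been downloaded and that both cudf and pyarrow are installed (pyarrow is used here only as a reference reader, on the assumption that it handles this file the way Spark does):

```python
# Sketch of the misread described above. Assumes the attached
# array-of-int32.parquet is in the current directory.
import cudf
import pyarrow.parquet as pq

reference = pq.read_table("array-of-int32.parquet")
print(reference)  # reference reading of the file

gdf = cudf.read_parquet("array-of-int32.parquet")
print(gdf)  # on affected versions, cudf returns 6 values: 1, 2, 1, 2, 3, 2
```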
Steps/Code to reproduce bug
Apply the following patch
Then download and unzip array-of-int32.parquet.zip, place it in the directory you are going to run the tests from, and build and run just the new test.
Expected behavior
cudf should read the same values that Spark does.
Environment overview (please complete the following information)
I reproduced this on 21.10, but we see the same behavior in 21.08 too.
Additional context
Spark does not usually write out files like this. If I read the file in with Spark and write it out again, all of the tests pass using that file instead of the attached one. The attached file came from a modified version of a Spark test; this layout evidently occurs often enough that Spark wants to be sure it works.
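A sketch of that round trip, assuming a PySpark session; paths are illustrative:

```python
# Sketch: rewriting the attached file with Spark. Spark writes lists
# using the standard three-level encoding, so the rewritten copy no
# longer uses the one-level layout and cudf reads it correctly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("array-of-int32.parquet")      # one-level input
df.write.mode("overwrite").parquet("/tmp/rewritten")   # three-level output
```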