[BUG] Parquet list schema interpretation bug. #13664
Thank you @nvdbaranec for raising this issue. It seems like another strange-file case: if we rewrite the data with Arrow or cuDF, cuDF produces the correct table on read.
Note, cuDF-python doesn't support …
Changing this to read it properly as a list reveals another layer to this onion: the file header reports the number of rows as 0, but the row groups inside have valid row counts. We need to determine how to reconcile the two when they disagree.
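The disagreement is easy to see with pyarrow, which exposes both the footer-level count and the per-row-group counts (a minimal inspection sketch; the local file name is assumed):

```python
import pyarrow.parquet as pq

# Footer metadata of the problematic file (downloaded locally).
md = pq.ParquetFile("repeated_no_annotation.parquet").metadata

# The footer-level row count (0 in this file) vs. the counts
# stored in each row group.
print("footer num_rows:", md.num_rows)
for i in range(md.num_row_groups):
    print(f"row group {i} num_rows:", md.row_group(i).num_rows)
```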
If this is a list column, the number of rows in the page data is just the number of values in the data stream; the number of rows (in the cudf sense) can only be determined by examining the repetition levels. What code is getting deceived here? Some sort of early-out condition?
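For context on why the repetition levels are the only reliable row count for a list column, here is a toy sketch (hypothetical values, not data from this file): a repetition level of 0 starts a new top-level row, so the row count is the number of zeros, not the number of values.

```python
# Toy list<int> column: 5 values, but only 2 rows.
values = [1, 2, 3, 4, 5]
rep_levels = [0, 1, 0, 1, 1]  # 0 starts a new row; 1 continues the list

# The cudf-sense row count comes from the repetition levels,
# not from the length of the data stream.
num_rows = sum(1 for r in rep_levels if r == 0)
print(num_rows)  # 2 -> rows [[1, 2], [3, 4, 5]]
```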
When investigating [this issue](#13664) I noticed that the file provided has 0 rows in the header. This caused cudf's parquet reader to fail to read the file, while other tools such as `parq` and `parquet-tools` had no issues reading it. This change counts up the number of rows in the file's row groups and complains loudly if the numbers differ, but not if the main header is 0. This allows us to properly read the data inside this file. Note that it will not yet parse the data as a list of structs; that will be fixed in another PR. I didn't add a test since this is the only file I have seen with this issue and we can't fully read it yet in cudf. A test for reading this file, which will exercise this change as well, will be added with the PR for that issue.

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Karthikeyan (https://github.com/karthikeyann)

URL: #13712
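A rough sketch of the reconciliation logic this PR describes (hypothetical function and names; the actual change lives in cudf's C++ reader):

```python
def reconcile_num_rows(footer_num_rows: int, row_group_num_rows: list[int]) -> int:
    """Trust the row groups when the footer says 0; complain loudly
    when a nonzero footer disagrees with the row-group total."""
    total = sum(row_group_num_rows)
    if footer_num_rows == 0:
        return total
    if footer_num_rows != total:
        raise ValueError(
            f"footer reports {footer_num_rows} rows, row groups contain {total}"
        )
    return footer_num_rows
```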
Note that Apache Spark thinks the schema of this file is not `INT, LIST<STRUCT<INT64,STRING>>` but rather `INT, STRUCT<LIST<STRUCT<INT64,STRING>>>`.
Can you show the table that Spark creates? Does it have one row?
There are 6 rows. The curly braces indicate a struct, the square brackets indicate an array (list).
Here's the same data collected back to the Spark driver, with each row printed on its own line. In this case a square bracket indicates a structure level; the top-level structure is the entire row.
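For reference, the Spark-side view can be reproduced with a short pyspark session (the local file path is assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark infers the schema straight from the file; per the comments above,
# it reads the un-annotated group as a struct wrapping the repeated group.
df = spark.read.parquet("repeated_no_annotation.parquet")
df.printSchema()

# Collect back to the driver and print one row per line.
for row in df.collect():
    print(row)
```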
This change alters how we interpret non-annotated data in a parquet file. Most modern parquet writers would produce something like:

```
message spark_schema {
  required int32 id;
  optional group phoneNumbers (LIST) {
    repeated group phone {
      required int64 number;
      optional binary kind (STRING);
    }
  }
}
```

But the list annotation isn't required. If it didn't exist, we would incorrectly interpret this schema as a struct of structs and not a list of structs. This change alters the code to look at the child and see if it is repeated; if it is, this indicates a list.

closes #13664

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)
- Vukasin Milovanovic (https://github.com/vuule)
- Mark Harris (https://github.com/harrism)

Approvers:
- Mark Harris (https://github.com/harrism)
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #13715
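The detection rule the PR describes can be sketched as follows (a hypothetical schema-node structure for illustration; cudf's actual implementation walks its thrift schema in C++):

```python
from dataclasses import dataclass, field

@dataclass
class SchemaNode:
    """Hypothetical stand-in for a parquet schema element."""
    name: str
    repetition: str                     # "required" | "optional" | "repeated"
    converted_type: str | None = None   # e.g. "LIST"
    children: list["SchemaNode"] = field(default_factory=list)

def is_list(group: SchemaNode) -> bool:
    # Annotated case: the LIST converted type is authoritative.
    if group.converted_type == "LIST":
        return True
    # Un-annotated case (this issue): a group whose child is repeated
    # should be read as a list, not as a struct of structs.
    return any(child.repetition == "repeated" for child in group.children)

# The "phoneNumbers" column from the file, minus the LIST annotation:
phone = SchemaNode("phone", "repeated", children=[
    SchemaNode("number", "required"),
    SchemaNode("kind", "optional"),
])
print(is_list(SchemaNode("phoneNumbers", "optional", children=[phone])))  # True
```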
The following file loads incorrectly using the cudf parquet reader:
https://github.com/apache/parquet-testing/blob/master/data/repeated_no_annotation.parquet
Internally there are 2 columns:
- `int`
- `list<struct<int64, string>>`

The reader appears to interpret the latter as `struct<struct<int64, string>>`. Ultimately this results in unpredictable/broken decoding. The actual schema in the file is below ("phoneNumbers" is the column in question).
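A minimal reproduction sketch (assuming the file above has been downloaded locally; pyarrow is used only as a reference reader that handles the un-annotated list correctly):

```python
import cudf
import pyarrow.parquet as pq

path = "repeated_no_annotation.parquet"

# pyarrow reads the un-annotated repeated group as a list column.
print(pq.read_table(path).schema)

# Before the fixes above, cudf misread the same column as a nested
# struct and produced unpredictable output.
print(cudf.read_parquet(path))
```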