[BUG] Parquet list schema interpretation bug. #13664
Thank you @nvdbaranec for raising this issue. It seems like another strange-file case: if we rewrite the data with Arrow or cuDF, cuDF produces the correct table on read.
Note, cuDF-python doesn't support …
Changing this to read it properly as a list reveals another layer to this onion: the file header reports the number of rows as 0, but the row groups inside have valid row counts. We need to determine how to reconcile the two when they disagree.
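The disagreement is easy to see with pyarrow, which exposes both the footer-level count and the per-row-group counts (a minimal inspection sketch; the local file name is assumed):

```python
import pyarrow.parquet as pq

# Footer metadata of the problematic file (downloaded locally).
md = pq.ParquetFile("repeated_no_annotation.parquet").metadata

# The footer-level row count (0 in this file) vs. the counts
# stored in each row group.
print("footer num_rows:", md.num_rows)
for i in range(md.num_row_groups):
    print(f"row group {i} num_rows:", md.row_group(i).num_rows)
```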
If this is a list column, the number of rows in the page data is just the number of values in the data stream; the number of rows (in the cudf sense) can only be determined by examining the repetition levels. What code is getting deceived here? Some sort of early-out condition?
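For context on why the repetition levels are the only reliable row count for a list column, here is a toy sketch (hypothetical values, not data from this file): a repetition level of 0 starts a new top-level row, so the row count is the number of zeros, not the number of values.

```python
# Toy list<int> column: 5 values, but only 2 rows.
values = [1, 2, 3, 4, 5]
rep_levels = [0, 1, 0, 1, 1]  # 0 starts a new row; 1 continues the list

# The cudf-sense row count comes from the repetition levels,
# not from the length of the data stream.
num_rows = sum(1 for r in rep_levels if r == 0)
print(num_rows)  # 2 -> rows [[1, 2], [3, 4, 5]]
```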
When investigating [this issue](#13664) I noticed that the file provided has 0 rows in the header. This caused cudf's parquet reader to fail to read the file, while other tools such as `parq` and `parquet-tools` had no issues reading it. This change counts up the number of rows in the file's row groups and complains loudly if the numbers differ, but not if the main header is 0. This allows us to properly read the data inside this file. Note that it will not yet parse the data as a list of structs; that will be fixed in another PR. I didn't add a test since this is the only file I have seen with this issue and we can't fully read it yet in cudf. A test for reading this file, which will exercise this change as well, will be added with the PR for that issue.

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Karthikeyan (https://github.com/karthikeyann)

URL: #13712
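A rough sketch of the reconciliation logic this PR describes (hypothetical function and names; the actual change lives in cudf's C++ reader):

```python
def reconcile_num_rows(footer_num_rows: int, row_group_num_rows: list[int]) -> int:
    """Trust the row groups when the footer says 0; complain loudly
    when a nonzero footer disagrees with the row-group total."""
    total = sum(row_group_num_rows)
    if footer_num_rows == 0:
        return total
    if footer_num_rows != total:
        raise ValueError(
            f"footer reports {footer_num_rows} rows, row groups contain {total}"
        )
    return footer_num_rows
```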
Note that Apache Spark thinks the schema of this file is not `INT, LIST<STRUCT<INT64,STRING>>` but rather `INT, STRUCT<LIST<STRUCT<INT64,STRING>>>`.
Can you show the table that Spark creates? Does it have one row?
There are 6 rows. The curly braces indicate a struct, the square brackets indicate an array (list).
Here's the same data collected back to the Spark driver, with each row printed on its own line. In this case a square bracket indicates a structure level; the top-level structure is the entire row.
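For reference, the Spark-side view can be reproduced with a short pyspark session (the local file path is assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark infers the schema straight from the file; per the comments above,
# it reads the un-annotated group as a struct wrapping the repeated group.
df = spark.read.parquet("repeated_no_annotation.parquet")
df.printSchema()

# Collect back to the driver and print one row per line.
for row in df.collect():
    print(row)
```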
This change alters how we interpret non-annotated data in a parquet file. Most modern parquet writers would produce something like:

```
message spark_schema {
  required int32 id;
  optional group phoneNumbers (LIST) {
    repeated group phone {
      required int64 number;
      optional binary kind (STRING);
    }
  }
}
```

But the list annotation isn't required. If it didn't exist, we would incorrectly interpret this schema as a struct of structs and not a list of structs. This change alters the code to look at the child and see if it is repeated; if it is, this indicates a list.

closes #13664

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)
- Vukasin Milovanovic (https://github.com/vuule)
- Mark Harris (https://github.com/harrism)

Approvers:
- Mark Harris (https://github.com/harrism)
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #13715
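The detection rule the PR describes can be sketched as follows (a hypothetical schema-node structure for illustration; cudf's actual implementation walks its thrift schema in C++):

```python
from dataclasses import dataclass, field

@dataclass
class SchemaNode:
    """Hypothetical stand-in for a parquet schema element."""
    name: str
    repetition: str                     # "required" | "optional" | "repeated"
    converted_type: str | None = None   # e.g. "LIST"
    children: list["SchemaNode"] = field(default_factory=list)

def is_list(group: SchemaNode) -> bool:
    # Annotated case: the LIST converted type is authoritative.
    if group.converted_type == "LIST":
        return True
    # Un-annotated case (this issue): a group whose child is repeated
    # should be read as a list, not as a struct of structs.
    return any(child.repetition == "repeated" for child in group.children)

# The "phoneNumbers" column from the file, minus the LIST annotation:
phone = SchemaNode("phone", "repeated", children=[
    SchemaNode("number", "required"),
    SchemaNode("kind", "optional"),
])
print(is_list(SchemaNode("phoneNumbers", "optional", children=[phone])))  # True
```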
The following file loads incorrectly using the cudf parquet reader:
https://github.com/apache/parquet-testing/blob/master/data/repeated_no_annotation.parquet
Internally there are 2 columns:
- `int`
- `list<struct<int64, string>>`

The reader appears to interpret the latter as `struct<struct<int64, string>>`. Ultimately this results in unpredictable/broken decoding. The actual schema in the file is below ("phoneNumbers" is the column in question).
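A minimal reproduction sketch (assuming the file above has been downloaded locally; pyarrow is used only as a reference reader that handles the un-annotated list correctly):

```python
import cudf
import pyarrow.parquet as pq

path = "repeated_no_annotation.parquet"

# pyarrow reads the un-annotated repeated group as a list column.
print(pq.read_table(path).schema)

# Before the fixes above, cudf misread the same column as a nested
# struct and produced unpredictable output.
print(cudf.read_parquet(path))
```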