[BUG] Parquet list schema interpretation bug. #13664

Closed
nvdbaranec opened this issue Jul 5, 2023 · 6 comments · Fixed by #13715

Labels: 0 - Backlog (In queue waiting for assignment), bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

@nvdbaranec (Contributor)

The following file loads incorrectly using the cudf parquet reader:

https://github.com/apache/parquet-testing/blob/master/data/repeated_no_annotation.parquet

Internally there are 2 columns:

  • int
  • list<struct<int64, string>>

The reader appears to interpret the latter as struct<struct<int64, string>>. Ultimately this results in unpredictable/broken decoding.

The actual schema in the file is below ("phoneNumbers" is the column in question):

```
message spark_schema {
  required int32 id;
  optional group phoneNumbers {
    repeated group phone {
      required int64 number;
      optional binary kind (STRING);
    }
  }
}
```
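One way to see the misinterpretation is to compare cudf's view of the file against the raw Parquet schema and Arrow's reading of it. A minimal sketch, assuming the file above has been downloaded to the working directory and that pyarrow is available for comparison:

```
# Sketch: compare interpretations of the unannotated repeated group.
# Assumes repeated_no_annotation.parquet has been downloaded locally.
import pyarrow.parquet as pq
import cudf

path = "repeated_no_annotation.parquet"

# Raw Parquet schema stored in the file (shows the repeated group without a LIST annotation).
print(pq.ParquetFile(path).schema)

# Arrow's interpretation of the same schema.
print(pq.read_table(path).schema)

# cudf's interpretation; per this report, phoneNumbers comes back as a
# struct-of-struct rather than a list of structs.
print(cudf.read_parquet(path).dtypes)
```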
@nvdbaranec added the bug, Needs Triage, and cuIO labels on Jul 5, 2023
@jlowe added the Spark label on Jul 6, 2023
@GregoryKimball moved this to "Needs owner" in libcudf on Jul 10, 2023
@GregoryKimball added the 0 - Backlog and libcudf labels and removed the Needs Triage label on Jul 10, 2023
@GregoryKimball (Contributor)

Thank you @nvdbaranec for raising this issue. This looks like another strange file case: if we rewrite the data with pandas/arrow or with cudf, cudf produces the correct table on read.

```
import pandas as pd
import cudf

df = pd.read_parquet('repeated_no_annotation.parquet')
df.to_parquet('pd_write.pq')
df2 = cudf.read_parquet('pd_write.pq')
assert cudf.DataFrame.from_pandas(df).to_pandas().equals(df2.to_pandas())
# rewrite with pandas - OK!

df = pd.read_parquet('repeated_no_annotation.parquet')
df2 = cudf.DataFrame.from_pandas(df)
df2.to_parquet('cudf_write.pq')
df3 = cudf.read_parquet('cudf_write.pq')
assert df2.to_pandas().equals(df3.to_pandas())
# rewrite with cudf - OK!
```

Note: cuDF-python doesn't support `equals` on dataframes with complex columns, which is why the comparisons above go through `to_pandas()`.

@hyperbolic2346 (Contributor)

Changing the reader to properly interpret this column as a list reveals another layer to this onion: the file header reports the number of rows as 0, but the row groups inside have a valid row count. We need to determine how to reconcile the two when they disagree.
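For reference, the header/row-group discrepancy can be inspected from the file metadata with another reader. A minimal sketch, assuming pyarrow is available; this only inspects the file and is not the cudf code path in question:

```
# Sketch: compare the file-level row count with the sum of the row-group row counts.
import pyarrow.parquet as pq

meta = pq.ParquetFile("repeated_no_annotation.parquet").metadata
header_rows = meta.num_rows
row_group_rows = sum(meta.row_group(i).num_rows for i in range(meta.num_row_groups))
# Per the comment above, the header reports 0 while the row groups carry the real count.
print(header_rows, row_group_rows)
```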

@nvdbaranec (Contributor, Author)

If this is a list column, the number of rows in the page data is just the number of values in the data stream; the number of rows (in the cudf sense) can only be determined by examining the repetition levels. What code is getting deceived here? Some sort of early-out condition?
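For context, a repetition level of 0 marks the start of a new top-level row, so the row count of a list column is the number of rep-level-0 entries rather than the number of values in the stream. A toy illustration (not the cudf kernel):

```
# Sketch: derive the top-level row count of a list column from its repetition levels.
def rows_from_rep_levels(rep_levels):
    # A repetition level of 0 begins a new row; any other level continues the current row.
    return sum(1 for r in rep_levels if r == 0)

# Four leaf values with rep levels [0, 1, 0, 1] encode two rows, e.g. [a, b] and [c, d].
assert rows_from_rep_levels([0, 1, 0, 1]) == 2
```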

rapids-bot pushed a commit that referenced this issue on Jul 18, 2023
When investigating [this issue](#13664) I noticed that the file provided has 0 rows in the header. This caused cudf's parquet reader to fail at reading the file, while other tools such as `parq` and `parquet-tools` had no issues reading it. This change counts up the number of rows in the row groups of the file and complains loudly if the numbers differ, but not if the main header is 0. This allows us to properly read the data inside this file. Note that it will not properly parse the column as a list of structs yet; that will be fixed in another PR. I didn't add a test since this is the only file I have seen with this issue and we can't fully read it yet in cudf. A test for reading this file, which will also exercise this change, will be added with the PR for that issue.
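A rough sketch of the reconciliation described above; the real change lives in libcudf's C++ reader, so the names here are purely illustrative:

```
# Sketch: reconcile the header row count with the per-row-group counts.
def reconcile_num_rows(header_num_rows, row_group_num_rows):
    total = sum(row_group_num_rows)
    if header_num_rows == 0:
        # A zero header with non-empty row groups: trust the row groups.
        return total
    if header_num_rows != total:
        # Counts disagree and the header is non-zero: complain loudly.
        raise ValueError(
            f"row count mismatch: header={header_num_rows}, row groups={total}"
        )
    return header_num_rows
```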

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13712
@jlowe (Member) commented Jul 19, 2023

Note that Apache Spark thinks the schema of this file is not INT, LIST<STRUCT<INT64,STRING>> but rather INT, STRUCT<LIST<STRUCT<INT64,STRING>>>.

```
scala> spark.read.parquet("repeated_no_annotation.parquet").printSchema
root
 |-- id: integer (nullable = true)
 |-- phoneNumbers: struct (nullable = true)
 |    |-- phone: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- kind: string (nullable = true)
```

@hyperbolic2346 (Contributor)

Can you show the table that Spark creates? Does it have one row?

@jlowe (Member) commented Jul 20, 2023

There are 6 rows. The curly braces indicate a struct, the square brackets indicate an array (list).

```
scala> spark.read.parquet("repeated_no_annotation.parquet").show(truncate=false)
+---+----------------------------------------------------------------+
|id |phoneNumbers                                                    |
+---+----------------------------------------------------------------+
|1  |null                                                            |
|2  |null                                                            |
|3  |{[]}                                                            |
|4  |{[{5555555555, null}]}                                          |
|5  |{[{1111111111, home}]}                                          |
|6  |{[{1111111111, home}, {2222222222, null}, {3333333333, mobile}]}|
+---+----------------------------------------------------------------+
```

Here's the same data collected back to the Spark driver, with each row printed on its own line. In this case, square brackets indicate a structure level; the top-level structure is the entire row.

```
scala> spark.read.parquet("repeated_no_annotation.parquet").collect.foreach(println)
[1,null]
[2,null]
[3,[WrappedArray()]]
[4,[WrappedArray([5555555555,null])]]
[5,[WrappedArray([1111111111,home])]]
[6,[WrappedArray([1111111111,home], [2222222222,null], [3333333333,mobile])]]
```

@GregoryKimball removed the status in libcudf on Sep 25, 2023
rapids-bot pushed a commit that referenced this issue on Oct 6, 2023
This change alters how we interpret non-annotated data in a parquet file. Most modern parquet writers would produce something like:
```
message spark_schema {
  required int32 id;
  optional group phoneNumbers (LIST) {
    repeated group phone {
      required int64 number;
      optional binary kind (STRING);
    }
  }
}
```

But the LIST annotation isn't required. When it is absent, as in the file above, we would incorrectly interpret this schema as a struct of structs rather than a list of structs. This change alters the code to look at the group's child and check whether it is repeated; if it is, the group is treated as a list.
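A rough Python sketch of that heuristic; the actual change is in libcudf's C++ schema handling, and the node fields below are assumptions for illustration only:

```
# Sketch: treat an unannotated group as a list if its single child is a repeated group.
# 'node' is assumed to expose: converted_type (e.g. "LIST" or None),
# children (list of child nodes), and repetition ("required"/"optional"/"repeated").
def is_list_group(node):
    if node.converted_type == "LIST":
        return True
    # No LIST annotation: fall back to checking the child, as in the schema above
    # where the optional group phoneNumbers contains the repeated group phone.
    return len(node.children) == 1 and node.children[0].repetition == "repeated"
```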

closes #13664

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #13715
@GregoryKimball removed this from libcudf on Oct 26, 2023