Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] libcudf fails to load old Parquet array encoding properly #13237

Closed
jlowe opened this issue Apr 27, 2023 · 2 comments · Fixed by #13277
Closed

[BUG] libcudf fails to load old Parquet array encoding properly #13237

jlowe opened this issue Apr 27, 2023 · 2 comments · Fixed by #13277
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@jlowe
Copy link
Member

jlowe commented Apr 27, 2023

Describe the bug
Older Parquet files may have arrays encoded in different ways, including a group followed immediately by a repeated primitive rather than the typical group->group list->primitive. libcudf is failing to load this older encoding properly, loading the data as LIST of LIST of INT32 rather than LIST of INT32.

Steps/Code to reproduce bug
Attached is a sample file which can be loaded with the following test program:

#include <iostream>

#include <cudf/io/parquet.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/types.hpp>

int main(int argc, char** argv) {
  auto table = cudf::io::read_parquet(
    cudf::io::parquet_reader_options::builder(cudf::io::source_info("pq891392009.parquet")));
  if (table.tbl->num_columns() != 1) {
    std::cerr << "Incorrect column count" << std::endl;
  }
  auto c0 = table.tbl->get_column(0);
  if (c0.type().id() != cudf::type_id::LIST) {
    std::cerr << "Top level column type mismatch, expected LIST" << std::endl;
    return 1;
  }
  auto lc = cudf::lists_column_view(c0);
  auto child = lc.child();
  if (child.type().id() != cudf::type_id::INT32) {
    std::cerr << "child column type mismatch, expected INT32" << std::endl;
    return 1;
  }
  return 0;
}

pq891392009.parquet.gz

Expected behavior
libcudf should load the file as a LIST of INT32 child column. Both the parquet-cli and Spark CPU can load this file properly.

From parquet-cli:

$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main meta /tmp/dbg/pq891392009.parquet

File path:  /tmp/dbg/pq891392009.parquet
Created by: RAPIDS Spark Plugin
Properties: (none)
Schema:
message spark_schema {
  required group f (LIST) {
    repeated int32 array;
  }
}


Row group 0:  count: 1  43.00 B records  start: 4  total: 43 B
--------------------------------------------------------------------------------
         type      encodings count     avg size   nulls   min / max
f.array  INT32     _   _     2         21.50 B    0       "0" / "1"

$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main cat /tmp/dbg/pq891392009.parquet
{"f": [0, 1]}

and from Spark CPU:

scala> spark.read.parquet("/tmp/dbg/pq891392009.parquet").printSchema
root                                                                            
 |-- f: array (nullable = true)
 |    |-- element: integer (containsNull = true)


scala> spark.read.parquet("/tmp/dbg/pq891392009.parquet").show
+------+
|     f|
+------+
|[0, 1]|
+------+
@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Apr 27, 2023
@GregoryKimball
Copy link
Contributor

Thank you @jlowe for raising this. Would you please share a bit more about the origin of this older Parquet file?

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment and removed Needs Triage Need team to review and classify labels May 1, 2023
@jlowe
Copy link
Member Author

jlowe commented May 1, 2023

The file is derived from a disabled unit test in the NVIDIA/spark-rapids repository, but it's behavior in the plugin is similar to an issue we ran into with a user who was working with older Parquet files. When I started fixing the Spark plugin issues raised by the unit test, I eventually ran into these issues on the cudf side.

rapids-bot bot pushed a commit that referenced this issue May 8, 2023
There is a bug reading parquet files that have a specific encoding. The typical list is 
```
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
```
but it can also be
```
<list-repetition> group <name> (LIST) {
  repeated group list {
    repeated int32 name
  }
}
```
This second case was failing and this fixes that. The issue was the `one_level_list` was returning true for the name, which resulted in another nesting level. The fix is to only return true if the immediate parent is not a list.

closes #13237
closes #13239

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - https://github.com/nvdbaranec
  - MithunR (https://github.com/mythrocks)
  - Ashwin Srinath (https://github.com/shwina)

URL: #13277
@GregoryKimball GregoryKimball removed this from libcudf Jun 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants