[BUG] libcudf fails to load old Parquet array encoding properly #13237
Labels: 0 - Backlog, bug, cuIO, libcudf, Spark
Describe the bug
Older Parquet files may encode arrays in different ways, including a legacy two-level form in which a group is followed immediately by a repeated primitive, rather than the typical three-level form (group -> repeated group named list -> primitive). libcudf fails to load this older encoding properly: it loads the data as LIST of LIST of INT32 rather than LIST of INT32.
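To make the two encodings concrete, here is a minimal Python sketch of how a reader could resolve the element type of a LIST-annotated group. It is not libcudf code and does not describe libcudf's internals. The `Node` class, the `list_element` and `describe` helpers, and the field names (`values`, `array`, `list`, `element`) are all hypothetical and chosen for illustration. It handles only the two forms described above: a repeated primitive directly under the LIST group, and a repeated group wrapping a single element field. The other legacy forms covered by the Parquet specification's backward-compatibility rules are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified Parquet schema node (illustration only)."""
    name: str
    repetition: str                       # "required", "optional", or "repeated"
    physical_type: Optional[str] = None   # e.g. "INT32"; None means a group
    logical_type: Optional[str] = None    # e.g. "LIST"
    children: List["Node"] = field(default_factory=list)


def list_element(list_group: Node) -> Node:
    """Return the schema node that describes the elements of a LIST group."""
    if len(list_group.children) != 1 or list_group.children[0].repetition != "repeated":
        raise ValueError("a LIST group must contain exactly one repeated field")
    repeated = list_group.children[0]
    if repeated.physical_type is not None:
        # Legacy two-level form: the repeated primitive itself is the element.
        return repeated
    if len(repeated.children) == 1:
        # Typical three-level form: the repeated group wraps the element field.
        return repeated.children[0]
    raise NotImplementedError("other legacy list forms are not covered by this sketch")


def describe(node: Node) -> str:
    """Render the logical column type of a schema node as a string."""
    if node.physical_type is not None:
        return node.physical_type
    if node.logical_type == "LIST":
        return f"LIST<{describe(list_element(node))}>"
    return "STRUCT<" + ", ".join(describe(c) for c in node.children) + ">"


# Legacy two-level encoding: group -> repeated primitive.
legacy = Node("values", "optional", logical_type="LIST", children=[
    Node("array", "repeated", physical_type="INT32"),
])

# Typical three-level encoding: group -> repeated group "list" -> primitive.
standard = Node("values", "optional", logical_type="LIST", children=[
    Node("list", "repeated", children=[
        Node("element", "optional", physical_type="INT32"),
    ]),
])

print(describe(legacy))    # LIST<INT32>
print(describe(standard))  # LIST<INT32>
```

Both schemas resolve to the same logical type, which matches the expectation in this report that the legacy encoding yields LIST of INT32 rather than an extra level of nesting.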
Steps/Code to reproduce bug
Attached is a sample file which can be loaded with the following test program:
pq891392009.parquet.gz
Expected behavior
libcudf should load the file as a LIST column with an INT32 child column. Both parquet-cli and Spark on the CPU load this file correctly.
From parquet-cli:
and from Spark CPU: