[BUG] Reading some lists in parquet produces incorrect results #9240
Comments
Why can't I reproduce it? Test data frame: data_array_of_primitive = [ …
21/09/24 02:27:08 WARN GpuOverrides: +------------+ …
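The snippet above is truncated, but a reproduction attempt of this shape in PySpark might look like the sketch below (values and paths are illustrative, not the original test data). As the next comment explains, Spark cannot write the problematic encoding, so a file produced this way reads back correctly:

```python
# Illustrative sketch of a reproduction attempt like the one above
# (hypothetical values; the original snippet is truncated). Spark writes
# lists with the standard three-level encoding, so this does NOT
# trigger the bug.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data_array_of_primitive = [([1, 2],), ([1, 2, 3],)]
df = spark.createDataFrame(data_array_of_primitive, ["a"])
df.write.mode("overwrite").parquet("/tmp/array-of-int32-spark.parquet")

# Reading this file back (with Spark or cudf) returns the original rows.
```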
@smone123 you cannot produce this type of parquet file with Spark, as I said before in the "Additional context" section. To make this work we had to use the low-level parquet Java API and do some very specific optimizations. Spark has a unit test that does this, and that is how we found it.
Some updates on this issue: there could be kernel errors as well, but first of all, the schema is not properly interpreted when parsing Parquet's Thrift Compact Protocol encoded metadata. More specifically, the converted type returned by the read schema should contain …
Closes #9240

This PR adds [one-level list encoding](https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.h#L43-L77) support to the parquet reader. It also includes cleanups like removing an unused stream argument and fixing typos in docs/comments.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Devavret Makkar (https://github.com/devavret)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9848
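For context on what the fix addresses: the Parquet spec allows a legacy "one-level" list encoding in which a repeated field is itself the list, instead of the usual three-level group / repeated group / element layout. A sketch, assuming the attached file is in the current directory, that dumps the file's schema with pyarrow, where the difference and the converted-type annotation mentioned in the earlier comment are visible:

```python
# Sketch: inspect how a parquet file encodes its lists. Assumes the
# attached array-of-int32.parquet from this issue is in the current
# directory; any parquet file works.
import pyarrow.parquet as pq

schema = pq.ParquetFile("array-of-int32.parquet").schema
print(schema)  # prints the raw parquet schema tree, including
               # converted-type annotations such as (List)

# Per the parquet spec, a standard three-level list (what Spark
# normally writes) looks like:
#   optional group f (List) {
#     repeated group list {
#       optional int32 element;
#     }
#   }
# The legacy one-level encoding this issue is about collapses that to:
#   repeated int32 f;
# i.e. the repeated field itself is the list element.
```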
Describe the bug
When testing parquet list support we were able to produce a file that Spark can read in correctly, but cudf cannot. Thanks @res-life for finding this. The data is simple.
But when cudf tries to read it in, it gets confused and reads 6 values instead:
1, 2, 1, 2, 3, 2
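A minimal sketch of the mismatch, assuming the attached file has been downloaded and that both cudf and pyarrow are installed (pyarrow is used here only as a reference reader, on the assumption that it handles this file the way Spark does):

```python
# Sketch of the misread described above. Assumes the attached
# array-of-int32.parquet is in the current directory.
import cudf
import pyarrow.parquet as pq

reference = pq.read_table("array-of-int32.parquet")
print(reference)  # reference reading of the file

gdf = cudf.read_parquet("array-of-int32.parquet")
print(gdf)  # on affected versions, cudf returns 6 values: 1, 2, 1, 2, 3, 2
```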
Steps/Code to reproduce bug
Apply the following patch
Then download and unzip array-of-int32.parquet.zip, place it in the directory you are going to run the tests from, and build and run just the new test.
Expected behavior
cudf should read the same values that Spark does.
Environment overview (please complete the following information)
I reproduced this on 21.10, but we see the same behavior in 21.08 too.
Additional context
Spark does not usually write out files like this. If I read the file in with Spark and write it out again, all of the tests pass using that file instead of the attached one. The attached file came from a modified version of a Spark test; this layout evidently occurs often enough that Spark wants to be sure it works.
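A sketch of that round trip, assuming a PySpark session; paths are illustrative:

```python
# Sketch: rewriting the attached file with Spark. Spark writes lists
# using the standard three-level encoding, so the rewritten copy no
# longer uses the one-level layout and cudf reads it correctly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("array-of-int32.parquet")      # one-level input
df.write.mode("overwrite").parquet("/tmp/rewritten")   # three-level output
```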