parquet -> arrow utf8-8 column conversion yields too many values #790

danburkert · 2022-01-26T20:58:14Z

We're seeing an issue decoding parquet row groups containing UTF-8 string columns with many null values. Occasionally these conversions yield more values than specified in the parquet file metadata. I traced this as far as arrow2::io::parquet::read::binary::iter_to_array with the problematic data set. After modifying that function to add some sanity-check asserts, multiple test cases fail in the arrow2 test suite.

diff --git a/src/io/parquet/read/binary/mod.rs b/src/io/parquet/read/binary/mod.rs
index a0422f1a9..67812e2b1 100644
--- a/src/io/parquet/read/binary/mod.rs
+++ b/src/io/parquet/read/binary/mod.rs
@@ -53,6 +53,8 @@ where
             )?
         }
     }
+    debug_assert_eq!(validity.len(), capacity);
+    debug_assert_eq!(values.len(), capacity);
     Ok(utils::finish_array(data_type, values, validity))
 }

yields

---- io::parquet::write::list_large_binary_optional_v1 stdout ----
thread 'io::parquet::write::list_large_binary_optional_v1' panicked at 'assertion failed: `(left == right)`
  left: `12`,
 right: `15`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::list_large_binary_optional_v2 stdout ----
thread 'io::parquet::write::list_large_binary_optional_v2' panicked at 'assertion failed: `(left == right)`
  left: `12`,
 right: `15`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::list_utf8_optional_v1 stdout ----
thread 'io::parquet::write::list_utf8_optional_v1' panicked at 'assertion failed: `(left == right)`
  left: `12`,
 right: `15`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::list_utf8_optional_v2 stdout ----
thread 'io::parquet::write::list_utf8_optional_v2' panicked at 'assertion failed: `(left == right)`
  left: `12`,
 right: `15`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::utf8_required_v1 stdout ----
thread 'io::parquet::write::utf8_required_v1' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `10`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::utf8_required_v2_compressed stdout ----
thread 'io::parquet::write::utf8_required_v2_compressed' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `10`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::utf8_required_v2 stdout ----
thread 'io::parquet::write::utf8_required_v2' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `10`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

I'm not entirely sure these debug asserts are correct, but I suspect they are. I'm going to keep digging and see if I can isolate the issue further, but figured I'd throw up a bug report sooner rather than later. Thanks!
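For readers who want the invariant behind those asserts spelled out, here is a minimal standalone sketch. It does not use arrow2's types: `check_lengths` is a hypothetical helper, and the layout it assumes (an offsets buffer with one more entry than there are slots, plus one validity flag per slot) is the usual Arrow variable-length binary layout rather than anything confirmed about the internals of `iter_to_array`.

```rust
/// Hypothetical helper (not part of arrow2): checks that a decoded
/// variable-length column has exactly `expected` slots.
///
/// Assumed layout: `offsets` holds one entry per slot plus a leading 0,
/// and `validity` holds one flag per slot (nulls included).
fn check_lengths(offsets: &[i32], validity: &[bool], expected: usize) -> Result<(), String> {
    // An offsets buffer of length n + 1 describes n slots.
    let slots = offsets.len().saturating_sub(1);
    if slots != expected {
        return Err(format!("values: got {} slots, expected {}", slots, expected));
    }
    if validity.len() != expected {
        return Err(format!(
            "validity: got {} flags, expected {}",
            validity.len(),
            expected
        ));
    }
    Ok(())
}
```

With offsets `[0, 2, 2, 3]` and three validity flags, `check_lengths(.., 3)` passes, while an expected count of 4 (or a validity buffer that is one flag short) returns an error instead of silently producing a malformed array.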


danburkert · 2022-01-26T21:02:22Z

That said, these test failures may not share a root cause with what we're seeing in the problematic data set: we see too many values, whereas those tests are failing with too few. Here's the output from our data:

thread ... panicked at 'assertion failed: `(left == right)`
  left: `41795`,
 right: `41784`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

danburkert · 2022-01-26T21:57:09Z

I'm narrowing in on a fix; I've identified some bugs in the nullable dictionary read code. Iterating on a fix here: danburkert@8d7cb88
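The thread does not describe the specific bugs, so the following is only an illustrative sketch of what a nullable dictionary read has to get right, not the arrow2 code or the fix in the linked commit. It assumes the common parquet convention that dictionary indices are stored only for non-null slots; `decode_nullable_dict` and its parameters are hypothetical names.

```rust
/// Hypothetical decoder (not arrow2's): expands dictionary indices into
/// one output slot per validity flag.
///
/// Assumption: `indices` contains an entry only for each valid (non-null)
/// slot, so an index must be consumed only when the flag is `true`.
fn decode_nullable_dict(
    dict: &[&str],
    indices: &[u32],
    validity: &[bool],
) -> Result<Vec<Option<String>>, String> {
    let mut indices = indices.iter();
    let mut out = Vec::with_capacity(validity.len());
    for &is_valid in validity {
        if is_valid {
            // Consume exactly one index per valid slot.
            let idx = *indices
                .next()
                .ok_or_else(|| "ran out of dictionary indices".to_string())?
                as usize;
            let value = dict
                .get(idx)
                .ok_or_else(|| format!("index {} out of range", idx))?;
            out.push(Some(value.to_string()));
        } else {
            // Null slots still occupy a position in the output.
            out.push(None);
        }
    }
    // Leftover indices would mean more values than the page declared.
    if indices.next().is_some() {
        return Err("more dictionary indices than valid slots".to_string());
    }
    Ok(out)
}
```

The output length always equals `validity.len()`, so a mismatch between the index count and the number of valid slots surfaces as an error rather than as extra or missing values.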

jorgecarleitao · 2022-01-27T17:12:38Z

Closed by #791. Thanks a lot for the report and the fix, @danburkert !

danburkert mentioned this issue Jan 26, 2022

Fixed reading parquet binary dict page #791

Merged

jorgecarleitao closed this as completed Jan 27, 2022

jorgecarleitao added the no-changelog label (issues whose changes are covered by a PR and thus should not be shown in the changelog) Jan 27, 2022
