This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

parquet -> arrow utf8-8 column conversion yields too many values #790

Closed
danburkert opened this issue Jan 26, 2022 · 3 comments
Labels
no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@danburkert
Contributor

We're seeing an issue decoding parquet row groups containing UTF-8 string columns with many null values. Occasionally these conversions yield more values than specified in the parquet file metadata. I traced this as far as arrow2::io::parquet::read::binary::iter_to_array with the problematic data set. After I modified that function to add some sanity-check assertions, multiple test cases in the arrow2 test suite fail.

diff --git a/src/io/parquet/read/binary/mod.rs b/src/io/parquet/read/binary/mod.rs
index a0422f1a9..67812e2b1 100644
--- a/src/io/parquet/read/binary/mod.rs
+++ b/src/io/parquet/read/binary/mod.rs
@@ -53,6 +53,8 @@ where
             )?
         }
     }
+    debug_assert_eq!(validity.len(), capacity);
+    debug_assert_eq!(values.len(), capacity);
     Ok(utils::finish_array(data_type, values, validity))
 }

yields

---- io::parquet::write::list_large_binary_optional_v1 stdout ----
thread 'io::parquet::write::list_large_binary_optional_v1' panicked at 'assertion failed: `(left == right)`
  left: `12`,
 right: `15`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::list_large_binary_optional_v2 stdout ----
thread 'io::parquet::write::list_large_binary_optional_v2' panicked at 'assertion failed: `(left == right)`
  left: `12`,
 right: `15`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::list_utf8_optional_v1 stdout ----
thread 'io::parquet::write::list_utf8_optional_v1' panicked at 'assertion failed: `(left == right)`
  left: `12`,
 right: `15`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::list_utf8_optional_v2 stdout ----
thread 'io::parquet::write::list_utf8_optional_v2' panicked at 'assertion failed: `(left == right)`
  left: `12`,
 right: `15`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::utf8_required_v1 stdout ----
thread 'io::parquet::write::utf8_required_v1' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `10`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::utf8_required_v2_compressed stdout ----
thread 'io::parquet::write::utf8_required_v2_compressed' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `10`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

---- io::parquet::write::utf8_required_v2 stdout ----
thread 'io::parquet::write::utf8_required_v2' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `10`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

I'm not entirely sure these debug asserts are correct, but I suspect they are. I'm going to keep digging to see if I can isolate the issue further, but figured I'd file a bug report sooner rather than later. Thanks!
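For readers following along: by the hunk's line numbering, the failing assertion at mod.rs:56:5 is the first added line, the `validity.len()` check. In the `utf8_required_*` tests its left-hand side is `0`, which would be consistent with required (non-nullable) columns carrying no validity bitmap at all. That reading is an assumption and is not confirmed in this thread. A minimal standalone sketch of a length check that tolerates that case, using a hypothetical helper `check_lengths` that is not part of arrow2:

```rust
/// Hypothetical helper (not part of arrow2): compare decoded lengths against
/// the number of rows the caller expects.
///
/// Assumption: a non-nullable column may legitimately have no validity
/// entries, so the validity length is only checked for nullable columns.
fn check_lengths(
    values_len: usize,
    validity_len: usize,
    capacity: usize,
    is_nullable: bool,
) -> Result<(), String> {
    if values_len != capacity {
        return Err(format!(
            "values: expected {} entries, got {}",
            capacity, values_len
        ));
    }
    if is_nullable && validity_len != capacity {
        return Err(format!(
            "validity: expected {} entries, got {}",
            capacity, validity_len
        ));
    }
    Ok(())
}
```

With this shape, a call such as `check_lengths(10, 0, 10, false)` passes, while a nullable column with 12 validity entries for 15 expected rows is still reported as a mismatch.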

@danburkert
Contributor Author

That said, these test failures may not share a root cause with what we're seeing in the problematic data set: we see too many values, whereas those tests fail with too few. Here's the output from our data:

thread ... panicked at 'assertion failed: `(left == right)`
  left: `41795`,
 right: `41784`', /home/ec2-user/src/rust/arrow2/src/io/parquet/read/binary/mod.rs:56:5

@danburkert
Contributor Author

I'm narrowing in on a fix: I've identified some bugs in the nullable dictionary read code. Iterating on a fix here: danburkert@8d7cb88
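As background on why nullable dictionary decoding is easy to get wrong: a dictionary-encoded page stores one dictionary index per non-null row only, so the decoder has to walk the validity information and the indices in lockstep. The sketch below is an illustration of that invariant using plain slices. It is not the arrow2 implementation or the fix in the linked commit, and the function name `decode_nullable_dict` and its signature are hypothetical.

```rust
/// Illustrative sketch (not arrow2 code): decode a nullable,
/// dictionary-encoded string column.
///
/// - `dict` is the dictionary of distinct values.
/// - `validity[i]` is true when row `i` is non-null.
/// - `indices` holds one dictionary index per non-null row only.
///
/// The output always has exactly `validity.len()` rows; an index is consumed
/// only for valid rows, and any leftover or missing index is an error.
fn decode_nullable_dict(
    dict: &[&str],
    validity: &[bool],
    indices: &[u32],
) -> Result<Vec<Option<String>>, String> {
    let mut remaining = indices.iter();
    let mut out = Vec::with_capacity(validity.len());

    for &is_valid in validity {
        if is_valid {
            // Consume exactly one index for each non-null row.
            let index = *remaining.next().ok_or("ran out of indices")? as usize;
            let value = dict.get(index).ok_or("dictionary index out of range")?;
            out.push(Some(value.to_string()));
        } else {
            // Null rows consume no index.
            out.push(None);
        }
    }

    // Leftover indices would mean more values than rows.
    if remaining.next().is_some() {
        return Err("more indices than non-null rows".to_string());
    }
    Ok(out)
}
```

For example, with `dict = ["a", "b"]`, validity `[true, false, true]` and indices `[1, 0]`, the function returns three rows: `Some("b")`, `None`, `Some("a")`. Supplying two indices for a single non-null row is rejected instead of silently producing extra values.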

@jorgecarleitao
Owner

Closed by #791. Thanks a lot for the report and the fix, @danburkert!

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Jan 27, 2022