Fixed reading parquet binary dict page #791
Conversation
```diff
-let length = std::cmp::min(pack_size, pack_remaining);
-let additional = remaining.min(length);
+let additional = pack_size.min(remaining);
```
Using the full page length instead of the remaining length is one of the bugs.
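For context, here is a minimal sketch of the invariant the fix restores; all names (`decode_packs`, `num_values`, `pack_size`) are hypothetical and this is not arrow2's actual code. Each bitpacked pack carries a fixed number of slots, so the decoder has to clamp each pack to the values still owed, otherwise the final pack's padding leaks into the output:

```rust
// Hypothetical sketch: decode fixed-size packs into at most `num_values` items.
fn decode_packs(num_values: usize, pack_size: usize) -> Vec<usize> {
    let mut decoded = Vec::with_capacity(num_values);
    let mut remaining = num_values;
    while remaining > 0 {
        // A pack always holds `pack_size` slots, but only `additional`
        // of them are real values; the rest is padding.
        let additional = pack_size.min(remaining);
        decoded.extend(0..additional); // stand-in for unpacking indices
        remaining -= additional;
    }
    decoded
}

fn main() {
    // 10 values in packs of 8: the last pack contributes only 2 values.
    assert_eq!(decode_packs(10, 8).len(), 10);
}
```

Clamping against the full page length instead would let the final pack claim all `pack_size` slots and overshoot.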
```diff
 let values_iterator = values_iter(indices_buffer, dict, additional);

 let mut validity_iterator = hybrid_rle::Decoder::new(validity_buffer, 1);

 extend_from_decoder(
     validity,
     &mut validity_iterator,
-    length,
+    additional,
```
Parameter mixup: `extend_from_decoder` expects the additional length, but this was passing the total length.
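To make the contract concrete, here is a hedged sketch; `extend_validity`, the plain `Vec<bool>`, and the iterator are simplified stand-ins for arrow2's `extend_from_decoder`, `MutableBitmap`, and `hybrid_rle::Decoder`, not their real signatures. The helper must be handed the number of new bits to append, not the running total:

```rust
// Hypothetical helper: append exactly `additional` validity bits.
fn extend_validity(
    validity: &mut Vec<bool>,
    bits: &mut impl Iterator<Item = bool>,
    additional: usize,
) {
    validity.extend(bits.take(additional));
}

fn main() {
    let mut validity = vec![true; 6]; // bits decoded from earlier pages
    let mut decoder = std::iter::repeat(true); // stand-in for the RLE decoder
    let total = 10; // total length of the array after this page
    let additional = total - validity.len(); // only 4 new bits are owed

    extend_validity(&mut validity, &mut decoder, additional);
    // Passing `total` here instead would have appended 10 extra bits.
    assert_eq!(validity.len(), total);
}
```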
FWIW, on the surface this sounds similar to apache/arrow-rs#1111.
Great fix! Thanks!

My understanding is that for nested types the length passed to these functions refers to the number of rows, not the number of values, which overshoots the number of values on the primitive array. This has shown up some times before.

I suggest we move the `debug_assert!`s to the non-nested branch.
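A rough sketch of what scoping the checks to the flat branch could look like; every name here is invented for illustration and this is not the PR's actual code. The assert only holds where one decoded slot corresponds to one leaf value:

```rust
// Hypothetical: extend leaf values from a page, asserting only when flat.
fn extend_page(
    values: &mut Vec<u32>,
    page_values: &[u32],
    additional: usize,
    is_nested: bool,
) {
    let before = values.len();
    values.extend_from_slice(&page_values[..additional.min(page_values.len())]);
    if !is_nested {
        // Only valid for non-nested columns: one slot == one value.
        debug_assert_eq!(values.len(), before + additional);
    }
}

fn main() {
    let mut values = Vec::new();
    extend_page(&mut values, &[1, 2, 3, 4], 4, false);
    assert_eq!(values, vec![1, 2, 3, 4]);
}
```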
Codecov Report

```diff
@@            Coverage Diff             @@
##             main     #791      +/-   ##
==========================================
+ Coverage   71.05%   71.07%   +0.02%
==========================================
  Files         319      319
  Lines       16667    16672       +5
==========================================
+ Hits        11843    11850       +7
+ Misses       4824     4822       -2
```

Continue to review the full report at Codecov.
Got it, thanks for the explanation. I've gone ahead and moved the checks to within the non-nested branch.
@jorgecarleitao one thing that occurred to me after having some time to digest your explanation: if the length/capacity as used in those methods isn't accurate for nested columns, then the capacity reservations for the data and validity buffers are going to be too large. Obviously won't cause bugs or errors, but perhaps something that could be optimized.
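A toy illustration of that over-reservation, with made-up numbers (not measured from arrow2):

```rust
fn main() {
    let levels = 1_000usize; // length reported at the nested level
    let leaf_values = 600usize; // actual values on the primitive array
    // Reserving by the level count over-allocates the leaf buffer.
    let mut buf: Vec<u32> = Vec::with_capacity(levels);
    buf.extend(0..leaf_values as u32);
    assert!(buf.capacity() >= levels && buf.len() == leaf_values);
}
```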
And thanks for the quick help on this! 🙏
As described in #790, there is a bug in parquet binary dict page decoding whereby too many values are yielded due to a mixup in the length arguments passed to various decoding functions, ultimately leading to extra bitpacked values being included in the output array. This PR fixes the bug, and adds some `debug_assert!`s in key places to ensure that tests would catch the issue in the future. I went ahead and added similar debug asserts to the other page type read functions, which exposed a few places where validity buffers are being allocated for non-nullable columns.

Unfortunately, these new debug asserts are firing for the nested page type tests, but I don't currently understand how the encoding works for these page types (it looks significantly more complicated than RLE & bitpack). I suspect that these test failures are a true positive, but I can't be sure without understanding how nested encoding works.
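For illustration, here is a hedged sketch of the kind of checks described above; the function, its signature, and the buffer types are invented for this example, and arrow2's actual call sites differ:

```rust
// Hypothetical page reader: decode `additional` dictionary indices and
// check that values and validity stay in lockstep.
fn read_dict_page(
    indices: &[u32],
    additional: usize,
    values: &mut Vec<u32>,
    validity: &mut Vec<bool>,
) {
    let start = values.len();
    for &idx in indices.iter().take(additional) {
        values.push(idx); // stand-in for the dictionary lookup
        validity.push(true);
    }
    // The kind of invariants the new debug_assert!s enforce:
    debug_assert_eq!(values.len() - start, additional);
    debug_assert_eq!(values.len(), validity.len());
}

fn main() {
    let (mut values, mut validity) = (Vec::new(), Vec::new());
    read_dict_page(&[7, 8, 9], 2, &mut values, &mut validity);
    assert_eq!(values, vec![7, 8]);
}
```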