Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArrowArrayReader Reads Too Many Values From Bit-Packed Runs #1111

Closed
tustvold opened this issue Dec 30, 2021 · 8 comments · Fixed by #1130
Closed

ArrowArrayReader Reads Too Many Values From Bit-Packed Runs #1111

tustvold opened this issue Dec 30, 2021 · 8 comments · Fixed by #1130
Labels

Comments

@tustvold
Copy link
Contributor

tustvold commented Dec 30, 2021

Describe the bug

Originally reported in apache/datafusion#1441 and encountered again in #1110, ParquetFileArrowReader appears to read incorrect data for string columns that contain nulls.

In particular the conditions required are for the column to be nullable, contain nulls, and multiple row groups.

To Reproduce

Read simple_strings.parquet.zip with the following code

#[test]
    fn test_read_strings() {
        let testdata = arrow::util::test_util::parquet_test_data();
        let path = format!("{}/simple_strings.parquet", testdata);
        let parquet_file_reader =
            SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap();
        let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(parquet_file_reader));
        let record_batch_reader = arrow_reader
            .get_record_reader(60)
            .expect("Failed to read into array!");

        let batches = record_batch_reader
            .collect::<arrow::error::Result<Vec<_>>>()
            .unwrap();

        assert_eq!(batches.len(), 1);
        let batch = batches.into_iter().next().unwrap();
        assert_eq!(batch.num_rows(), 6);

        let strings = batch
            .column(0)
            .as_any()
            .downcast_ref::<StringArray>()
            .unwrap();

        let strings: Vec<_> = strings.iter().collect();

        assert_eq!(
            &strings,
            &[
                None,
                Some("-1685637712"),
                Some("512814980"),
                Some("868743207"),
                None,
                Some("-1001940778")
            ]
        )
    }

Fails with

thread 'arrow::arrow_reader::tests::test_read_strings' panicked at 'assertion failed: `(left == right)`
  left: `[None, Some("-1685637712"), Some("512814980"), Some("-1685637712"), None, Some("868743207")]`,
 right: `[None, Some("-1685637712"), Some("512814980"), Some("868743207"), None, Some("-1001940778")]`', parquet/src/arrow/arrow_reader.rs:715:9

For comparison

$ python
> import duckdb
> duckdb.query("select * from 'simple_strings.parquet'").fetchall()
[(None,), ('-1685637712',), ('512814980',), ('868743207',), (None,), ('-1001940778',)]

The file consists of two row groups, each with 3 rows and was generated using #1110

Expected behavior

The test should pass

@tustvold tustvold added the bug label Dec 30, 2021
@tustvold
Copy link
Contributor Author

tustvold commented Dec 30, 2021

So adding a print statement to VariableLenDictionaryDecoder::new it is being created twice with num_values of 3, i.e. the number of rows in the row group.

This is the num_values field from Page, which confusingly is the number of values including nulls. This value is then used to determine how many values to read from RleDecoder for this page.

Now a somewhat strange quirk of the hybrid encoding is packed "runs" are always multiples of 8 in length. This means if the final run of a page is packed encoded, as opposed to RLE, it will zero-padded to length. Unfortunately the parquet designers opted to not store the actual length for a packed run, but the length / 8. This means the length of the final packed run of a page is not actually knowable...

This is where the issue arises. VariableLenDictionaryDecoder thinks it has more actual values than it does, as it is being fed the value_count for the page which counts nulls which aren't encoded. This means it asks RleDecoder for more keys than should actually be present. As RleDecoder contains a zero-padded final run, it returns too many values, which has the effect of "shifting" the string values in the final result.

The fix should be a case of making whatever calls ValueDecoder::read_value_bytes only request a number of values that the page should be expected to yield. This is what ColumnReaderImpl handles for the non-ArrowArrayReader ArrayReader implementations. I need to do some digging to see how feasible this is with the design of ArrowArrayReader.

@tustvold
Copy link
Contributor Author

tustvold commented Dec 30, 2021

So I'm not sure there is an easy way to fix this... ArrowArrayReader flattens all the pages from all the column chunks into iterators and then feeds these to CompositeValueDecoder which decode the levels and values independently. This makes it a non-trivial change to decode the levels and corresponding values from a given page in lock-step, which I believe is necessary in order to decode the correct number.

Rather than spending time re-working ArrowArrayReader in order to fix this bug, I'm personally going to focus on getting #1082 and the PRs it builds on polished up. This provides an alternative implementation for reading byte arrays, that builds on the existing ColumnReaderImpl and RecordReader logic and so, much like PrimitiveArrayReader, does not run into this bug. My hope is that by being both faster, and duplicating less code, it will make sense to swap out ArrowArrayReader and therefore fix this bug for anything not using ArrowArrayReader explicitly.

If someone else wishes to work on fixing ArrowArrayReader that would be brilliant, but I'm going to focus my efforts elsewhere.

FYI @yordan-pavlov @alamb

Edit: In the short-term switching back to ComplexObjectArrayReader does fix the bug, but represents a non-trivial performance regression (up to 6x) and so I'm somewhat loathe to suggest it

@tustvold tustvold changed the title ArrowArrayReader Incorrect Data ArrowArrayReader Reads Too Many Values From Bit-Packed Runs Dec 30, 2021
@yordan-pavlov
Copy link
Contributor

yordan-pavlov commented Dec 30, 2021

@tustvold 's work on making the "old" arrow reader faster looks promising, plus I've hit a wall in making the ArrowArrayReader faster in some cases, so I agree that it doesn't make sense to try to fix the ArrowArrayReader and instead it makes more sense to focus on finishing the work in #1082 .

ArrowArrayReader does use def levels to calculate how many values to read, see here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L576 , but reading def levels is done independently of reading of values (one of the main characteristics of ArrowArrayReader) and so if the value of num_values field from Page is incorrect and does indeed include NULL values, then the entire idea of ArrowArrayReader falls apart and it won't work correctly. It makes me wonder how the unit tests pass though, I will have a look at this in the next few days.

One reason why I implemented ArrowArrayReader in a new struct so that it can easily be swapped out if was necessary, so I guess that time has come. Yes - it will result in a significant reduction in performance of reading string arrays, but correctness is more important than speed and also it would be only until #1082 is finished.

@yordan-pavlov
Copy link
Contributor

here is what I've found so far:

it's getting pretty late now, but tomorrow I will try to write the missing test (that doesn't rely on an external parquet file) to reproduce the issue with VariableLenDictionaryDecoder / RleDecoder and also think on a short-term fix

@alamb
Copy link
Contributor

alamb commented Dec 31, 2021

it's getting pretty late now, but tomorrow I will try to write the missing test (that doesn't rely on an external parquet file) to reproduce the issue with VariableLenDictionaryDecoder / RleDecoder and also think on a short-term fix

A short term fix would be nice so that we can get correct answers from the existing code, while @tustvold works on the longer term / better fix

@yordan-pavlov
Copy link
Contributor

I have been able to reproduce the issue where RleDecoder returns more keys than values (as explained by @tustvold above) by adding a test very similar to the existing test_arrow_array_reader_string but using dictionary encoding instead of plain. Next, I will looking for a short-term fix.

Here is some sample output from the test:

running 1 test
page num_values: 100, values.len(): 33
page num_values: 100, values.len(): 38
VariableLenPlainDecoder::new, num_values: 9

---------- reading a batch of 50 values ----------
VariableLenDictionaryDecoder::new, num_values: 100
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 14
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 86
// ok so far, 33 actual values - 14 values read = 19 values still left in first page

---------- reading a batch of 100 values ----------
VariableLenPlainDecoder::new, num_values: 10
VariableLenDictionaryDecoder::new, num_values: 100
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 86, num_values: 37
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 26, self.num_values: 0
// this is a problem - only 19 values were left in the first page, but 26 values have been read

VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 89
thread 'arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string' panicked at 'assertion failed: (left == right)
left: "H",
right: "He"', parquet\src\arrow\arrow_array_reader.rs:1745:17

@yordan-pavlov
Copy link
Contributor

UPDATE: for the short-term fix, the only option I can think of is (when def levels are present) to count the number of actual values here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L393 before creating the value reader and using this instead of num_values.

This then makes the new test (using dictionary encoded pages) pass - notice how in the test output below the value of num_values in the VariableLenDictionaryDecoder is the actual number of values instead of including null-values:

running 1 test
page num_values: 100, values.len(): 25
page num_values: 100, values.len(): 31
VariableLenPlainDecoder::new, num_values: 10
---------- reading a batch of 50 values ----------
VariableLenDictionaryDecoder::new, num_values: 25
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 25, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 14
---------- reading a batch of 100 values ----------
VariableLenPlainDecoder::new, num_values: 10
VariableLenDictionaryDecoder::new, num_values: 31
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 31
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 17
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 31, num_values: 17
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 17, self.num_values: 14
---------- reading a batch of 100 values ----------
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 14
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0
test arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 471 filtered out; finished in 0.01s

Tomorrow I will be checking the impact on performance and possibly create a pull request for the new test plus short-term fix.

@yordan-pavlov
Copy link
Contributor

UPDATE: performance degradation after the fix is actually not so bad - only between 3% and 8%

arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - old
                        time:   [1.4393 ms 1.4485 ms 1.4585 ms]
                        change: [-9.2499% -3.6008% +1.8053%] (p = 0.23 > 0.05)
arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - new
                        time:   [594.37 us 604.91 us 618.75 us]
                        change: [+5.1998% +8.7402% +12.498%] (p = 0.00 < 0.05)

arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - old
                        time:   [1.1766 ms 1.1857 ms 1.1950 ms]
                        change: [-4.5337% -2.8033% -1.0591%] (p = 0.00 < 0.05)
arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - new
                        time:   [467.90 us 471.83 us 476.11 us]
                        change: [-0.8465% +3.1251% +7.0623%] (p = 0.12 > 0.05)

I will try to submit a PR for this later today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants