Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Parquet reader for null lists #1448

Merged
merged 4 commits into from
Mar 22, 2022
Merged

Fix Parquet reader for null lists #1448

merged 4 commits into from
Mar 22, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Mar 15, 2022

Which issue does this PR close?

Closes #1399.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 15, 2022
#[test]
fn test_read_null_list() {
let testdata = arrow::util::test_util::parquet_test_data();
let path = format!("{}/null_list.parquet", testdata);
Copy link
Member Author

@viirya viirya Mar 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I downloaded the test parquet file from @novemberkilo's parquet-testing fork to test it locally. Not sure how we can put the file there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could point to my parquet-testing fork like I have done here novemberkilo@e5952ae#diff-fe7afb5c9c916e521401d3fcfb4277d5071798c3baf83baf11d6071742823584

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think it works for CI purpose in the PR. Not sure if we can merge into master like that. Anyway, let me point to your fork first to test it. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@novemberkilo could you open a PR against parquet-testing and add your test file there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I think it is good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya fwiw I did something like:

  • delete the parquet-testing submodule
  • git submodule add -b topic/null-list-reader [email protected]:novemberkilo/parquet-testing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Thanks. I added the file manually to test it before.

And seems updating .gitmodules doesn't work for CI here. We may need to wait for the test file being merged in parquet-testing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have a play with creating a test that generates a parquet file, so that we can get this PR in without waiting for parquet-testing. I will also file a ticket to start a discussion on a faster way to get parquet test files checked in, without relying on an upstream repo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tustvold. The parquet file was merged. It'd be great to get testing parquet test files checked in faster.

Copy link
Member Author

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand correctly Parquet's behavior for this case. It looks like next should return None. That's what I do in this fix.

cc @sunchao

@@ -924,7 +927,7 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> {
&& (rep_levels.len() == next_batch_array.len()))
{
return Err(ArrowError(
"Expected item_reader def_levels and rep_levels to be same length as batch".to_string(),
format!("Expected item_reader def_levels {} and rep_levels {} to be same length as batch {}", def_levels.len(), rep_levels.len(), next_batch_array.len()),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purely for debugging purpose. I feel it is clear.

@novemberkilo
Copy link
Contributor

Thanks for taking this on @viirya -- I'm following along with interest :)

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks - I will see if I can't create some minimal test cases so that we can get this one in (and to check I fully understand the problems)

@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
field = self.arrow_schema.field_with_name(part).ok();
} else if let Some(f) = field {
if let ArrowType::Struct(fields) = f.data_type() {
field = fields.iter().find(|f| f.name() == part)
field = fields.iter().find(|f| f.name() == part);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a match block might be clearer?

@@ -885,6 +885,9 @@ fn remove_indices(
Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
}
}
ArrowType::Null => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably beyond the scope of this PR, but it occurs to me that this remove_indices function is basically an inverse of the arrow take kernel. I wonder if we could flip the construction of null_list_indices to be instead the non-null positions - and just use a regular take kernel?

I might have a play with this 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, seems so.

@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
field = self.arrow_schema.field_with_name(part).ok();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of this PR, but I wonder why this doesn't just initialize field to be self.arrow_schema.field_with_name(parts[1]).ok() and then do skip(2) given it has already done the length check above...

@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
field = self.arrow_schema.field_with_name(part).ok();
} else if let Some(f) = field {
if let ArrowType::Struct(fields) = f.data_type() {
field = fields.iter().find(|f| f.name() == part)
field = fields.iter().find(|f| f.name() == part);
} else if let ArrowType::List(list_field) = f.data_type() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken this change is necessary to support a StructArray containing a ListArray in general, without being specific to nulls?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this is not specific to nulls.

#[test]
fn test_read_null_list() {
let testdata = arrow::util::test_util::parquet_test_data();
let path = format!("{}/null_list.parquet", testdata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have a play with creating a test that generates a parquet file, so that we can get this PR in without waiting for parquet-testing. I will also file a ticket to start a discussion on a faster way to get parquet test files checked in, without relying on an upstream repo

.expect("Failed to read into array!");

let batch = record_batch_reader.next();
assert!(batch.is_none());
Copy link
Contributor

@tustvold tustvold Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct, locally I ran

> duckdb.query(f"select count(*) from 'null_list.parquet'").fetchall()
[(1,)]
> duckdb.query(f"select * from 'null_list.parquet'").fetchall()
[(None,)]

Which would imply that the data contains a single row, with a single null value

Perhaps it should be

let batch = record_batch_reader.next().unwrap().unwrap();
assert_eq!(batch.num_rows(), 1);
assert_eq!(batch.num_columns(), 1);
assert_eq!(batch.column(0).len(), 1);

let list = batch.column(0).as_any().downcast_ref::<ListArray>().unwrap();
assert_eq!(list.len(), 1);
assert_eq!(list.is_valid(0), true);

let val = list.value(0);
assert_eq!(val.len(), 0);

@@ -964,7 +967,10 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> {
cur_offset += OffsetSize::one();
}
});
offsets.push(cur_offset);

if !cur_offset.is_zero() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is correct, I think it will convert an array containing no non-null values, into an empty array?

@codecov-commenter
Copy link

Codecov Report

Merging #1448 (acd8d47) into master (717216f) will increase coverage by 0.04%.
The diff coverage is 82.75%.

@@            Coverage Diff             @@
##           master    #1448      +/-   ##
==========================================
+ Coverage   82.67%   82.72%   +0.04%     
==========================================
  Files         185      187       +2     
  Lines       53866    54186     +320     
==========================================
+ Hits        44535    44824     +289     
- Misses       9331     9362      +31     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader.rs 78.72% <64.28%> (+0.45%) ⬆️
parquet/src/arrow/arrow_reader.rs 93.79% <100.00%> (+0.22%) ⬆️
arrow/src/array/data.rs 82.95% <0.00%> (-0.21%) ⬇️
arrow/src/array/array_list.rs 95.52% <0.00%> (ø)
integration-testing/src/lib.rs 0.00% <0.00%> (ø)
arrow/src/compute/kernels/window.rs 100.00% <0.00%> (ø)
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (ø)
arrow/src/array/transform/union.rs 75.00% <0.00%> (ø)
arrow/src/array/transform/fixed_size_list.rs 73.68% <0.00%> (ø)
arrow/src/array/transform/mod.rs 86.46% <0.00%> (+0.03%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 717216f...acd8d47. Read the comment docs.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this, unless anybody else chimes in with a review I'll get this merged this afternoon 👍

@alamb
Copy link
Contributor

alamb commented Mar 22, 2022

Merging while we get the permissions for @tustvold sorted out. Thanks all!

@alamb alamb merged commit e778c10 into apache:master Mar 22, 2022
@viirya
Copy link
Member Author

viirya commented Mar 22, 2022

Thank you @alamb @tustvold @novemberkilo @sunchao !

@alamb alamb changed the title Fix Parquet reader for null list Fix Parquet reader for null lists Mar 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet reader fails to read null list.
6 participants