Read nested Parquet 2-level lists correctly #6757

Merged 5 commits on Nov 27, 2024

@etseidl etseidl commented Nov 19, 2024

Which issue does this PR close?

Closes #6756.

Rationale for this change

See issue.

What changes are included in this PR?

Modifies both the arrow and record readers to check for nested lists before triggering the rule that repeated groups named "array" are treated as list<OneTuple>.

Are there any user-facing changes?

Changes the interpretation of some legacy schemas.

@github-actions bot added the parquet label (Changes to the parquet crate) on Nov 19, 2024

/// Returns `true` if this type is annotated as a list.
pub fn is_list(&self) -> bool {
    match self.is_group() {

Since .is_group() returns a boolean wouldn't it be better semantics to put it inside a branch conditional just using if?

Contributor Author

I think Rust has warped me...I feel like I'm doing it wrong if I use if 😅. But you're right that it would be more succinct.
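
For illustration, the two styles discussed in this thread can be put side by side. This is a minimal sketch using a hypothetical `Node` struct with made-up fields (`is_group`, `list_annotated`); it is not the parquet crate's `Type` and not the code in this PR.

```rust
/// Hypothetical stand-in for a schema node (an assumption for this sketch).
pub struct Node {
    pub is_group: bool,
    pub list_annotated: bool,
}

impl Node {
    /// Style questioned in review: matching on a boolean.
    pub fn is_list_match(&self) -> bool {
        match self.is_group {
            true => self.list_annotated,
            false => false,
        }
    }

    /// Style suggested in review: a plain `if` branch.
    pub fn is_list_if(&self) -> bool {
        if self.is_group {
            self.list_annotated
        } else {
            false
        }
    }
}
```

Both methods return the same value for every input; the `if` form simply states the condition more directly.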

use arrow::datatypes::ToByteSlice;

let testdata = arrow::util::test_util::parquet_test_data();
// message my_record {

Nice, thank you for putting a comment with the test data. I like this :)

@devanbenz devanbenz left a comment

Please fix CI 👍

Contributor Author

etseidl commented Nov 21, 2024

> Please fix CI 👍

I wish I could 😅. It appears that test is borked for everyone due to some upstream shenanigans.

And now fixed by #6745

Contributor

@tustvold tustvold left a comment

Looks good, mostly just some minor nits

|| repeated_field.name() == "array"
|| repeated_field.name() == format!("{}_tuple", list_type.name())
|| (!repeated_field.is_list()
    && !repeated_field.has_single_repeated_child()
Contributor

Suggested change
- && !repeated_field.has_single_repeated_child()
+ && items[0].get_basic_info().repetition() != Repetition::REPEATED

Contributor Author

@etseidl etseidl Nov 22, 2024

I didn't do this because repetition() can theoretically panic, even though by this point in the schema every node should have a repetition. Perhaps panicking here is warranted if the schema is invalid. I'll change it if you'd prefer the panic (or returning an error à la #6738).
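
The trade-off described in this comment, a panicking accessor versus a check that tolerates a missing repetition, might look like the following. `Field`, `try_repetition`, and the `Repetition` enum here are hypothetical stand-ins written for this sketch. Only the helper name `has_single_repeated_child` comes from the diff, and its body here is an assumption rather than the crate's implementation.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Hypothetical schema node whose repetition may be unset (an assumption).
pub struct Field {
    pub repetition: Option<Repetition>,
    pub children: Vec<Field>,
}

impl Field {
    /// Panicking accessor: aborts if the repetition was never set.
    pub fn repetition(&self) -> Repetition {
        self.repetition.expect("repetition not set")
    }

    /// Fallible accessor: reports the problem as an error instead.
    pub fn try_repetition(&self) -> Result<Repetition, String> {
        self.repetition
            .ok_or_else(|| "repetition not set".to_string())
    }

    /// Non-panicking check: true only when there is exactly one child
    /// and that child is explicitly marked as repeated.
    pub fn has_single_repeated_child(&self) -> bool {
        self.children.len() == 1
            && self.children[0].repetition == Some(Repetition::Repeated)
    }
}
```

Comparing against `Some(Repetition::Repeated)` means a node with no repetition simply fails the check, so an invalid schema falls through to the other rules instead of panicking.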

let items = repeated_field.get_fields();
if items.len() != 1
    || repeated_field.name() == "array"
    || repeated_field.name() == format!("{}_tuple", list_type.name())
    || (!repeated_field.is_list()
Contributor

Why is this check necessary, given we're in visit_list?

Contributor Author

We're testing if the child of the list we're processing is also LIST annotated.

optional group my_list (LIST) {  <---- this is `list_type`
    repeated group array (LIST) { <---- this is `repeated_field`
        repeated int32 array;
    };
}
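
As a rough model of how the schema above avoids the `list<OneTuple>` rule, the sketch below applies a name-based legacy rule only after checking for a nested list. The `Group` struct, its fields, and the function `treat_as_struct_element` are hypothetical and written for this sketch. The exact way the conditions are combined is an assumption based on the PR description, not a copy of the merged code.

```rust
/// Hypothetical, simplified schema node (not the parquet crate's types).
pub struct Group {
    pub name: String,
    pub list_annotated: bool,
    pub repeated: bool,
    pub children: Vec<Group>, // empty for a primitive leaf
}

impl Group {
    fn has_single_repeated_child(&self) -> bool {
        self.children.len() == 1 && self.children[0].repeated
    }
}

/// Returns true when the repeated group itself would be read as a struct
/// element (the multi-field or one-tuple case). Assumed ordering: the
/// name-based rule is considered only if the group is not LIST-annotated
/// and does not wrap a single repeated child.
pub fn treat_as_struct_element(list_name: &str, repeated_field: &Group) -> bool {
    let legacy_name = repeated_field.name == "array"
        || repeated_field.name == format!("{}_tuple", list_name);
    repeated_field.children.len() != 1
        || (!repeated_field.list_annotated
            && !repeated_field.has_single_repeated_child()
            && legacy_name)
}
```

For the schema in this comment, `repeated_field` is LIST-annotated, so the function returns false even though the group is named "array". A non-annotated `repeated group array` with a single non-repeated child would still return true.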

Contributor

alamb commented Nov 27, 2024

Looks like an improvement to me -- thank you @etseidl, @tustvold and @devanbenz

@alamb alamb merged commit c8fda9a into apache:main Nov 27, 2024
16 checks passed
@etseidl etseidl deleted the read_old_lists branch November 27, 2024 19:45
richox pushed a commit to blaze-init/arrow-rs that referenced this pull request Dec 20, 2024
* read nested 2-level lists correctly

* address review comments

* Apply suggestions from code review

Co-authored-by: Raphael Taylor-Davies <[email protected]>

---------

Co-authored-by: Raphael Taylor-Davies <[email protected]>
Development

Successfully merging this pull request may close these issues.

Parquet readers incorrectly interpret legacy nested lists
4 participants