Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parquet schema interpretation issue #13277

Merged

Conversation

hyperbolic2346
Copy link
Contributor

Description

There is a bug reading parquet files that have a specific encoding. The typical list is

<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}

but it can also be

<list-repetition> group <name> (LIST) {
  repeated group list {
    repeated int32 name
  }
}

This second case was failing and this fixes that. The issue was the one_level_list was returning true for the name, which resulted in another nesting level. The fix is to only return true if the immediate parent is not a list.

closes #13237
closes #13239

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@hyperbolic2346 hyperbolic2346 added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change python labels May 2, 2023
@hyperbolic2346 hyperbolic2346 self-assigned this May 2, 2023
@hyperbolic2346 hyperbolic2346 requested review from a team as code owners May 2, 2023 19:29
@github-actions github-actions bot added the Python Affects Python cuDF API. label May 2, 2023
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for the fix

@mythrocks
Copy link
Contributor

mythrocks commented May 2, 2023

I've been trying to wrap my head around Parquet's handling for legacy LIST representations as well.

I think we are agreed on the first one:

<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}

This conforms to the first section of the LogicalType spec:

It is required that the repeated group of elements is named list and that its element field is named element.

So this should be interpreted as LIST<element_type>.

Could I please have clarification on the second one?

<list-repetition> group <name> (LIST) {
  repeated group list {
    repeated int32 name
  }
}

The backward compatibility rules are:

1. If the repeated field is not a group, then its type is the element type and elements are required.
2. If the repeated field is a group with multiple fields, then its type is the element type and elements are required.
3. If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.
4. Otherwise, the repeated field's type is the element type with the repeated field's repetition.

Wouldn't this case fall into the fourth category? In which case, should it not be interpreted as before the change? (i.e. LIST<LIST<INT32>>?)
Edit: I interpreted this incorrectly. This does indeed fall into the fourth category, which uses the element type, not the inner group's type. So this should really be LIST<INT32>.
I can verify this in Spark with the sample file used to test this change.

@hyperbolic2346 hyperbolic2346 changed the title Mwilson/parquet schema fix Fix parquet schema interpretation issue May 3, 2023
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Thanks for the fix, @hyperbolic2346. LGTM!

@mythrocks
Copy link
Contributor

I've re-kicked the failing test. Once it passes, if there aren't any objections, shall we merge this one?

@mythrocks
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 0a5065f into rapidsai:branch-23.06 May 8, 2023
@mythrocks
Copy link
Contributor

Thank you for this fix, @hyperbolic2346!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
5 participants