Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing parquet list of struct interpretation #13715

Merged

Conversation

hyperbolic2346
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 commented Jul 18, 2023

Description

This change alters how we interpret non-annotated data in a parquet file. Most modern parquet writers would produce something like:

message spark_schema {
  required int32 id;
  optional group phoneNumbers (LIST) {
    repeated group phone {
      required int64 number;
      optional binary kind (STRING);
    }
  }
}

But the list annotation isn't required. If it didn't exist, we would incorrectly interpret this schema as a struct of struct and not a list of struct. This change alters the code to look at the child and see if it is repeated. If it is, this indicates a list.

closes #13664

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@hyperbolic2346 hyperbolic2346 added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Jul 18, 2023
@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner July 18, 2023 01:51
@hyperbolic2346 hyperbolic2346 self-assigned this Jul 18, 2023
@GregoryKimball GregoryKimball requested a review from vuule July 18, 2023 16:14
@hyperbolic2346 hyperbolic2346 added 2 - In Progress Currently a work in progress and removed 3 - Ready for Review Ready for review by team labels Jul 18, 2023
@jlowe
Copy link
Member

jlowe commented Jul 19, 2023

I tried testing this PR with the Spark plugin, but Apache Spark thinks the schema of the repeated_no_annotation.parquet file is:

root                                                                            
 |-- id: integer (nullable = true)
 |-- phoneNumbers: struct (nullable = true)
 |    |-- phone: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- kind: string (nullable = true)

I think the issue is that there's no logical type annotation on phoneNumbers so we can't assume it's a list even though the type pattern underneath it looks list-like.

@hyperbolic2346 hyperbolic2346 changed the base branch from branch-23.08 to branch-23.10 July 24, 2023 23:05
@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 21, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@hyperbolic2346
Copy link
Contributor Author

This is really close now, but not 100%.

expected:

cudf::struct_view<cudf::list_view<cudf::struct_view<int64_t,cudf::string_view,>>,>:
Length : 6:
Null count: 2
111100
   cudf::list_view<cudf::struct_view<int64_t,cudf::string_view,>>:
   Length : 6
   Offsets(0x7f253c000600) : 0, 0, 0, 0, 1, 2, 5
   Null count: 2
   111100
      cudf::struct_view<int64_t,cudf::string_view,>:
      Length : 5:
         5555555555, 1111111111, 1111111111, 2222222222, 3333333333
         10110
         NULL, home, home, NULL, mobile

produced:

cudf::struct_view<cudf::list_view<cudf::struct_view<int64_t,cudf::string_view,>>,>:
Length : 6:
Null count: 2
111100
   cudf::list_view<cudf::struct_view<int64_t,cudf::string_view,>>:
   Length : 6
   Offsets(0x7f253c007000) : 0, 0, 0, 0, 1, 2, 5
   Null count: 3
   111000
      cudf::struct_view<int64_t,cudf::string_view,>:
      Length : 5:
         5555555555, 1111111111, 1111111111, 2222222222, 3333333333
         10110
         NULL, home, home, NULL, mobile

Note the null difference on the inner list.

@hyperbolic2346
Copy link
Contributor Author

Spark integration tests show a few errors like java.lang.IllegalStateException: Logical error: no valid casts are found STRUCT to ByteType, Type conversion is not allowed from Table{columns=[ColumnVector{rows=1, type=LIST, nullCount=Optional.empty, offHeap=(ID: 435757 7f52640e7fa0)}], cudfTable=139991842717840, rows=1} to [MapType(StructType(StructField(first,StringType,true),StructField(middle,StringType,true),StructField(last,StringType,true)),StringType,true)] columns 0 to 1, and java.lang.IllegalStateException: Logical error: no valid casts are found STRUCT to BooleanType which make me think there may be some changes to how parquet normal files are being interpreted by cudf and not just the very old format.

More investigation is warranted here.

@hyperbolic2346 hyperbolic2346 removed request for a team, shwina and charlesbluca October 3, 2023 16:39
@hyperbolic2346 hyperbolic2346 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress Python Affects Python cuDF API. CMake CMake build issue conda Java Affects Java cuDF API. labels Oct 3, 2023
@hyperbolic2346
Copy link
Contributor Author

Integration tests for spark are passing now and this is ready for review.

@hyperbolic2346 hyperbolic2346 requested a review from ttnghia October 3, 2023 16:41
@harrism
Copy link
Member

harrism commented Oct 3, 2023

/ok to test

@github-actions github-actions bot removed the ci label Oct 3, 2023
@hyperbolic2346
Copy link
Contributor Author

/ok to test

@hyperbolic2346
Copy link
Contributor Author

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; few questions.

cpp/src/io/parquet/parquet.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Show resolved Hide resolved
@@ -175,6 +175,81 @@ type_id to_type_id(SchemaElement const& schema,
return type_id::EMPTY;
}

void metadata::sanitize_schema()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat approach, localizes the madness 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could ultimately fix things like single-level list as well. I like this approach too.

@hyperbolic2346
Copy link
Contributor Author

/ok to test

@hyperbolic2346
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit fc36947 into rapidsai:branch-23.12 Oct 6, 2023
57 checks passed
@hyperbolic2346 hyperbolic2346 deleted the mwilson/parquet_list_of_struct branch October 6, 2023 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet list schema interpretation bug.
5 participants