-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Struct and map support for Parquet reader #6318
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Filtering a table that contains struct-columns fails because struct columns cannot yet be deep-copied from a column-view. This commit fixes the problem.
…dd-struct-columns-python
Added Struct<List> test. Removed errant prints, extra whitespace.
Added tests for cloning Struct<List> and List<Struct<List>> columns. Code formatting has been fixed, also.
…dd-struct-columns-python
…dd-struct-columns-python
…truct-columns-python
nvdbaranec
added
3 - Ready for Review
Ready for review by team
4 - Needs cuIO Reviewer
Spark
Functionality that helps Spark RAPIDS
labels
Sep 24, 2020
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
kkraus14
reviewed
Oct 6, 2020
kkraus14
reviewed
Oct 6, 2020
kkraus14
reviewed
Oct 6, 2020
kkraus14
reviewed
Oct 6, 2020
isVoid
reviewed
Oct 6, 2020
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor stuff.
vuule
approved these changes
Oct 6, 2020
kkraus14
approved these changes
Oct 6, 2020
kkraus14
added
libcudf
Affects libcudf (C++/CUDA) code.
Python
Affects Python cuDF API.
labels
Oct 6, 2020
OlivierNV
approved these changes
Oct 6, 2020
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
kkraus14
added
5 - Ready to Merge
Testing and reviews complete, ready to merge
and removed
3 - Ready for Review
Ready for review by team
4 - Needs cuIO Reviewer
labels
Oct 6, 2020
isVoid
approved these changes
Oct 6, 2020
kkraus14
reviewed
Oct 6, 2020
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
5 - Ready to Merge
Testing and reviews complete, ready to merge
libcudf
Affects libcudf (C++/CUDA) code.
Python
Affects Python cuDF API.
Spark
Functionality that helps Spark RAPIDS
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Added tests. PR now includes a new glue layer for massaging parquet output -> python structs, graciously provided by @shwina . Needs python reviewer.
Adds support for reading structs to the parquet reader. This include various combinations of nested list/struct, struct/list, etc. Should work arbitrarily. Also handles "maps" which are simply an annotation in the format for a specific List<Struct<>> type.
Highlights include:
Seperating "selected columns" into "input columns" and "output columns". Previously they were the same thing, however with structs multiple input columns can contribute to a single output column (each field at each nesting level corresponds to a unique input column in the file).
The differentiation between "schema" and "leaf schema" (also introduced during list work) has been removed. That is more comprehensively handled by the input/output column split.
Changed how we do nesting level bounds calculating (mapping R/D levels to nesting depths) to take structs into account.
There were a few places in the code where I was using "max_depth" as an inclusive value as opposed to the more idiomatic "size", so I've adjusted a number of places to make that more consistent with what people expect.
The code was making some weak assumptions that "nesting" == "lists" which didn't hold up with structs. Various bits have been refactored to accomodate this.
Added a "user_data" uint32 field to the
column_buffer
struct to allow users to stash custom info. With the refactor into input and output columns, it turned out that thecolumn_buffer
struct was perfect for the output columns. I just needed a little extra place to stash some info.