Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Struct and map support for Parquet reader #6318

Merged
merged 54 commits into from
Oct 7, 2020

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Sep 24, 2020

Added tests. PR now includes a new glue layer for massaging parquet output -> python structs, graciously provided by @shwina . Needs python reviewer.

Adds support for reading structs to the parquet reader. This include various combinations of nested list/struct, struct/list, etc. Should work arbitrarily. Also handles "maps" which are simply an annotation in the format for a specific List<Struct<>> type.

Highlights include:

  • Seperating "selected columns" into "input columns" and "output columns". Previously they were the same thing, however with structs multiple input columns can contribute to a single output column (each field at each nesting level corresponds to a unique input column in the file).

  • The differentiation between "schema" and "leaf schema" (also introduced during list work) has been removed. That is more comprehensively handled by the input/output column split.

  • Changed how we do nesting level bounds calculating (mapping R/D levels to nesting depths) to take structs into account.

  • There were a few places in the code where I was using "max_depth" as an inclusive value as opposed to the more idiomatic "size", so I've adjusted a number of places to make that more consistent with what people expect.

  • The code was making some weak assumptions that "nesting" == "lists" which didn't hold up with structs. Various bits have been refactored to accomodate this.

  • Added a "user_data" uint32 field to the column_buffer struct to allow users to stash custom info. With the refactor into input and output columns, it turned out that the column_buffer struct was perfect for the output columns. I just needed a little extra place to stash some info.

mythrocks and others added 22 commits September 18, 2020 17:09
Filtering a table that contains struct-columns fails
because struct columns cannot yet be deep-copied from
a column-view.
This commit fixes the problem.
Added Struct<List> test.
Removed errant prints, extra whitespace.
Added tests for cloning Struct<List> and List<Struct<List>> columns.
Code formatting has been fixed, also.
@nvdbaranec nvdbaranec added 3 - Ready for Review Ready for review by team 4 - Needs cuIO Reviewer Spark Functionality that helps Spark RAPIDS labels Sep 24, 2020
@nvdbaranec nvdbaranec requested a review from a team as a code owner September 24, 2020 21:09
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@rgsl888prabhu rgsl888prabhu self-requested a review September 24, 2020 21:31
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor stuff.

python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_parquet.py Show resolved Hide resolved
python/cudf/cudf/core/column/lists.py Outdated Show resolved Hide resolved
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Oct 6, 2020
Copy link
Contributor

@OlivierNV OlivierNV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuIO Reviewer labels Oct 6, 2020
@nvdbaranec nvdbaranec merged commit 84557ea into rapidsai:branch-0.16 Oct 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.