[REVIEW] Struct and map support for Parquet reader #6318

nvdbaranec · 2020-09-24T21:09:50Z

Added tests. The PR now includes a new glue layer for converting Parquet output to Python structs, graciously provided by @shwina. Needs a Python reviewer.

Adds support for reading structs to the Parquet reader. This includes various combinations of nested list/struct, struct/list, etc., and should work to arbitrary depth. Also handles "maps", which are simply an annotation in the format for a specific List<Struct<>> type.
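To illustrate the remark about maps: each row of a map column is logically a list of key/value structs, so a reader that handles List<Struct<>> can surface maps in that shape. The sketch below uses plain Python objects only. The helper names `map_to_list_of_structs` and `list_of_structs_to_map` are hypothetical and are not part of cuDF or libcudf.

```python
def map_to_list_of_structs(m):
    """Represent one map value as a list of {"key", "value"} structs."""
    if m is None:
        return None  # a null map stays null
    return [{"key": k, "value": v} for k, v in m.items()]


def list_of_structs_to_map(entries):
    """Inverse of map_to_list_of_structs for a single row."""
    if entries is None:
        return None
    return {e["key"]: e["value"] for e in entries}


# Three rows of a hypothetical map<string, int> column: populated, null, empty.
rows = [{"a": 1, "b": 2}, None, {}]
as_lists = [map_to_list_of_structs(r) for r in rows]
round_tripped = [list_of_structs_to_map(r) for r in as_lists]
```

Note that a null map and an empty map remain distinct in the list-of-structs form (`None` versus `[]`).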

Highlights include:

Separating "selected columns" into "input columns" and "output columns". Previously they were the same thing; with structs, however, multiple input columns can contribute to a single output column (each field at each nesting level corresponds to a unique input column in the file).
The distinction between "schema" and "leaf schema" (also introduced during the list work) has been removed. It is handled more comprehensively by the input/output column split.
Changed how nesting level bounds are calculated (mapping R/D levels to nesting depths) to take structs into account.
There were a few places in the code where I was using "max_depth" as an inclusive value as opposed to the more idiomatic "size", so I've adjusted them to be consistent with what people expect.
The code was making some weak assumptions that "nesting" == "lists", which didn't hold up with structs. Various bits have been refactored to accommodate this.
Added a "user_data" uint32 field to the column_buffer struct to allow users to stash custom info. With the refactor into input and output columns, it turned out that the column_buffer struct was perfect for the output columns; I just needed a little extra place to stash some info.
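The input/output column split and the R/D level bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the libcudf implementation: `SchemaNode`, `InputColumn`, and `collect_input_columns` are hypothetical names, and the level rules used are the standard Parquet ones (the maximum definition level counts the non-required fields on the path to a leaf, the maximum repetition level counts the repeated ones).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SchemaNode:
    name: str
    repetition: str = "optional"  # "required", "optional", or "repeated"
    children: List["SchemaNode"] = field(default_factory=list)


@dataclass
class InputColumn:
    path: Tuple[str, ...]  # leaf path as stored in the file
    output_index: int      # top-level output column this leaf feeds
    max_def_level: int
    max_rep_level: int


def collect_input_columns(top_level):
    """Walk the schema; every leaf becomes one input column."""
    inputs = []

    def walk(node, path, out_idx, def_level, rep_level):
        # Each non-required field adds a definition level;
        # each repeated field also adds a repetition level.
        def_level += 1 if node.repetition != "required" else 0
        rep_level += 1 if node.repetition == "repeated" else 0
        path = path + (node.name,)
        if not node.children:
            inputs.append(InputColumn(path, out_idx, def_level, rep_level))
            return
        for child in node.children:
            walk(child, path, out_idx, def_level, rep_level)

    for out_idx, node in enumerate(top_level):
        walk(node, (), out_idx, 0, 0)
    return inputs


# Hypothetical schema: a required "id" and a struct "person" holding
# a "name" field and a list-typed "scores" field.
schema = [
    SchemaNode("id", "required"),
    SchemaNode("person", "optional", [
        SchemaNode("name", "optional"),
        SchemaNode("scores", "optional", [
            SchemaNode("list", "repeated", [SchemaNode("element", "optional")]),
        ]),
    ]),
]
input_columns = collect_input_columns(schema)
```

Here two output columns (`id` and `person`) are fed by three input columns, which is the many-to-one relationship that motivated the split.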

GPUtester · 2020-09-24T21:11:52Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

isVoid

A few minor things.

OlivierNV

Looks good to me

mythrocks and others added 22 commits September 18, 2020 17:09

[struct] Fix struct filtering.

0281d87

Filtering a table that contains struct-columns fails because struct columns cannot yet be deep-copied from a column-view. This commit fixes the problem.

Merge remote-tracking branch 'origin/branch-0.16' into fix-struct-filter

970e856

Initial struct dtype

6e90840

Merge branch 'branch-0.16' of https://github.com/rapidsai/cudf into a…

cd6b491

…dd-struct-columns-python

Merge remote-tracking branch 'origin/branch-0.16' into fix-struct-filter

336eda5

[struct] Fix struct filtering.

a1833f2

Added Struct<List> test. Removed errant prints, extra whitespace.

[struct] Fix struct filtering.

84bf750

Added tests for cloning Struct<List> and List<Struct<List>> columns. Code formatting has been fixed, also.

Merge branch 'branch-0.16' of https://github.com/rapidsai/cudf into a…

299df24

…dd-struct-columns-python

Add a __repr__ for struct dtype

d5f8f51

Merge branch 'branch-0.16' of https://github.com/rapidsai/cudf into a…

2ddd892

…dd-struct-columns-python

Merge remote-tracking branch 'origin/branch-0.16' into fix-struct-filter

6cce5e7

Merge remote-tracking branch 'mythrocks/fix-struct-filter' into add-s…

7bc0436

…truct-columns-python

Initial struct column support

a4430f7

Post-process to fix struct names

9238fc0

Copy struct field names over to libcudf result

e63dcf2

Fix typo

015654d

Handle all null child in struct

a3874ed

Mask handling in StructColumn.from_arrow. Add tests

e6beb03

Struct dtype equality tests

d4920f8

Fields ordering test

4ae7fa5

Struct and map support for Parquet reader.

d25fbd0

Changelog

6fda4cc

nvdbaranec added the "3 - Ready for Review", "4 - Needs cuIO Reviewer", and "Spark" labels Sep 24, 2020

nvdbaranec requested a review from a team as a code owner September 24, 2020 21:09

nvdbaranec requested review from jrhemstad and codereport September 24, 2020 21:09

rgsl888prabhu self-requested a review September 24, 2020 21:31

nvdbaranec and others added 3 commits October 6, 2020 12:30

Another round of cpp PR changes. Fixed merge conflicts in python.

972300f

Merge branch 'branch-0.16' of https://github.com/rapidsai/cudf into p…

93f1445

…q_structs

Merge branch 'parquet_structs' of github.com:nvdbaranec/cudf into pq_…

49f3450

…structs

nvdbaranec requested a review from vuule October 6, 2020 17:33

nvdbaranec removed the 5 - Merge After Dependencies label Oct 6, 2020

Fix base size of StructColumn

0c9811c

kkraus14 reviewed Oct 6, 2020

CHANGELOG.md (outdated, resolved)

kkraus14 reviewed Oct 6, 2020

CHANGELOG.md (outdated, resolved)

kkraus14 reviewed Oct 6, 2020

python/cudf/cudf/tests/test_parquet.py (outdated, resolved)

kkraus14 reviewed Oct 6, 2020

python/cudf/cudf/core/column/lists.py (outdated, resolved)

isVoid reviewed Oct 6, 2020

python/cudf/cudf/tests/test_parquet.py (outdated, resolved)

python/cudf/cudf/tests/test_parquet.py (resolved)

python/cudf/cudf/core/column/lists.py (outdated, resolved)

shwina and others added 4 commits October 6, 2020 15:56

Fixing up logic for generating elements in ListColumn.to_arrow

2db1445

Undo string compare change in is_list_dtype

98c4013

Remove duplicates in CHANGELOG. Parameterized struct tests.

1e5cc7c

Merge branch 'parquet_structs' of github.com:nvdbaranec/cudf into par…

0b76032

…quet_structs

vuule approved these changes Oct 6, 2020

kkraus14 approved these changes Oct 6, 2020

kkraus14 added the "libcudf" and "Python" labels Oct 6, 2020

OlivierNV approved these changes Oct 6, 2020

kkraus14 added the "5 - Ready to Merge" label and removed the "3 - Ready for Review" and "4 - Needs cuIO Reviewer" labels Oct 6, 2020

isVoid approved these changes Oct 6, 2020

kkraus14 reviewed Oct 6, 2020

python/cudf/cudf/tests/test_parquet.py (outdated, resolved)

Make sure to assert on expect_eq

48de1f9

nvdbaranec merged commit 84557ea into rapidsai:branch-0.16 Oct 7, 2020
