Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support get_element from LIST column #8071

Merged
merged 24 commits into from
May 7, 2021

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Apr 26, 2021

Part1 of #8032

This PR adds retrieval of row data from a LIST type column, through adding support to list_view specialization of get_element. The row data is stored in a scalar object.

Use example:

// non-nested LIST column
col = [{1, 2, 3}, {4}]
s = get_element(col, 1); // s is a type erased list_scalar, s._data == int_column{4}

// nested LIST column
col = [[{1, 2}, {3}], [{4}, {}]]
s = get_element(col, 1); // s is a type erased list_scalar, s._data == list_column{{4}, {}}

Implementation note:
Depends on lists::detail::copy_slice under the hood. Also adds a new list_scalar constructor that supports moving external row data to construct a new scalar.

Other included in this PR:

  • is_element_valid_sync(column, i), helper function that returns true if ith row of column is valid.
  • list_scalar factory functions
  • Developer guide for list_scalar

@isVoid isVoid requested a review from a team as a code owner April 26, 2021 23:43
@isVoid isVoid requested review from robertmaynard and harrism April 26, 2021 23:43
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 26, 2021
@isVoid isVoid added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change labels Apr 26, 2021
@codecov
Copy link

codecov bot commented Apr 27, 2021

Codecov Report

Merging #8071 (aa38f7f) into branch-0.20 (51336df) will decrease coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.20    #8071      +/-   ##
===============================================
- Coverage        82.88%   82.88%   -0.01%     
===============================================
  Files              103      104       +1     
  Lines            17668    17899     +231     
===============================================
+ Hits             14645    14836     +191     
- Misses            3023     3063      +40     
Impacted Files Coverage Δ
python/cudf/cudf/core/tools/datetimes.py 80.42% <0.00%> (-4.11%) ⬇️
python/cudf/cudf/core/column/decimal.py 90.64% <0.00%> (-2.29%) ⬇️
python/cudf/cudf/core/column/datetime.py 88.03% <0.00%> (-1.88%) ⬇️
python/cudf/cudf/core/column/struct.py 94.73% <0.00%> (-1.56%) ⬇️
python/cudf/cudf/utils/dtypes.py 82.20% <0.00%> (-1.24%) ⬇️
python/dask_cudf/dask_cudf/groupby.py 91.28% <0.00%> (-0.88%) ⬇️
python/cudf/cudf/core/series.py 91.17% <0.00%> (-0.56%) ⬇️
python/cudf/cudf/core/index.py 92.52% <0.00%> (-0.55%) ⬇️
python/cudf/cudf/core/column/column.py 88.20% <0.00%> (-0.44%) ⬇️
python/cudf/cudf/core/column/lists.py 86.98% <0.00%> (-0.43%) ⬇️
... and 28 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4853dbc...aa38f7f. Read the comment docs.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, just suggest refactoring the single-element is_valid to a utility.

cpp/src/copying/get_element.cu Outdated Show resolved Hide resolved
@ttnghia
Copy link
Contributor

ttnghia commented Apr 27, 2021

The behavior seems not right to me. You have a lists column. You access a row in that list column. That row is always a list. Thus, the return list scalar should always have the _data member is a list column of one row.

@isVoid
Copy link
Contributor Author

isVoid commented Apr 27, 2021

@ttnghia currently list_scalar is defined with the top-level list wrapper removed.

For example, in a non-nested list column:

// Lists<int>
row0: [1, 2, 3]
row1: [4]
row2: NULL

get_element(0) should return a `list_scalar` with underlying data of INT column {1, 2, 3}

For a nested list column:

// List<List<int>>
row0:
    row[0][0]: [1, 2, 3]
    row[0][1]: [4, 5]
row1:
    row[1][0]: []
    row[1][1]: NULL

get_element(0) should return `list_scalar` with underlying data of LIST column, which has offset {0, 3, 5} and underlying data of INT column {1, 2, 3, 4, 5}

See discussion here: #5887 (comment)
I guess this saves the overhead of storing an offset vector for the 1-row LIST column, which is always [0, size_of_child] and removes the need of recursion.

@isVoid
Copy link
Contributor Author

isVoid commented Apr 27, 2021

I should probably add the definition of list_scalar to the developer guide:
https://github.com/rapidsai/cudf/blob/branch-0.20/cpp/docs/DEVELOPER_GUIDE.md#cudfscalar

@isVoid isVoid added 2 - In Progress Currently a work in progress and removed 3 - Ready for Review Ready for review by team labels Apr 28, 2021
@isVoid isVoid requested a review from a team as a code owner April 28, 2021 23:37
@github-actions github-actions bot added the CMake CMake build issue label Apr 28, 2021
cpp/src/bitmask/get_valid.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/get_valid.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/get_valid.hpp Outdated Show resolved Hide resolved
Co-authored-by: Mark Harris <[email protected]>
@isVoid isVoid requested a review from nvdbaranec May 4, 2021 01:21
Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment here still stands. #8071 (comment) We should be using gather.

@isVoid
Copy link
Contributor Author

isVoid commented May 4, 2021

Added scalar factory methods, and updated developer guide accordingly.

cc @nvdbaranec @harrism

@isVoid
Copy link
Contributor Author

isVoid commented May 4, 2021

My comment here still stands. #8071 (comment) We should be using gather.

Discussed offline and confirmed current implementation supports structs.

@nvdbaranec nvdbaranec self-requested a review May 5, 2021 14:38
@harrism harrism added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 6, 2021
@ttnghia
Copy link
Contributor

ttnghia commented May 6, 2021

It seems that there are some failed tests.

@kkraus14
Copy link
Collaborator

kkraus14 commented May 7, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit bb62cf1 into rapidsai:branch-0.20 May 7, 2021
rapids-bot bot pushed a commit that referenced this pull request May 24, 2021
…8206)

This PR fixes a bug introduced in #8071, when `get_element` retrieves a NULL row in a nested column, the scalar returned not only should be `is_valid() == false`, but also should preserve the column hierarchy of the row-data, even they are invalid. Because depending libraries may use the column hierarchy to deduce the nested type of the column.

This PR also reverts `make_default_constructed_scalar` API for `LIST` type. A `LIST` type scalar should have complete column hierarchy as part of its type information. There isn't enough information provided to the API to construct that.

Another tiny addition: instead of hard coding the position of child column, use `list_column_view::child_column_index` intead.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Paul Taylor (https://github.com/trxcllnt)

URL: #8206
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants