[REVIEW] Nested list column types, phase 2 #4990

nvdbaranec · 2020-04-22T18:28:26Z

new column view type : lists_column_view
new test wrapper type : lists_column_wrapper
list support for cudf::concatenate + tests
make_lists_column() factory
tests for lists_column_wrapper
added indentation support to test::print() code

Notable other things:

Both the underlying columns and the parent/child structure of lists columns are very slippery to get your head wrapped around. The test::print() functionality is the most useful way of understanding the structure. Some examples:

test::lists_column_wrapper list { {2, 3} };
test::print(list);

List<int32_t>:
Length : 1
Offsets : 0, 2
Children : 
      2, 3

test::lists_column_wrapper list { {2, 3}, {4, 5}, {6, 7, 8} };
test::print(list);

List<int32_t>:
Length : 3
Offsets : 0, 2, 4, 7
Children :
   2, 3, 4, 5, 6, 7, 8
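
To make the Offsets/Children layout above concrete, here is a minimal sketch in plain C++. It does not use cudf: `ToyListColumn` and `make_toy_list_column` are hypothetical names invented for this illustration, and the sketch ignores null masks entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a single-level list column (not a cudf type).
struct ToyListColumn {
  std::vector<int32_t> offsets;   // one entry per row, plus a trailing end offset
  std::vector<int32_t> children;  // all row elements stored back to back
};

// Flatten rows into offsets + children. Row i occupies
// children[offsets[i]] .. children[offsets[i + 1] - 1].
ToyListColumn make_toy_list_column(std::vector<std::vector<int32_t>> const& rows)
{
  ToyListColumn col;
  col.offsets.push_back(0);
  for (auto const& row : rows) {
    col.children.insert(col.children.end(), row.begin(), row.end());
    col.offsets.push_back(static_cast<int32_t>(col.children.size()));
  }
  return col;
}
```

Feeding it the rows `{2, 3}, {4, 5}, {6, 7, 8}` yields the same offsets and children as the printed output above.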

test::lists_column_wrapper list { {{1, 2}, {3, 4}}, {{5, 6, 7}, {0}, {8}}, {{9, 10}} };
test::print(list);

List<List<int32_t>>:
Length : 3
Offsets : 0, 2, 5, 6
Children :
    List<int32_t>:
    Length : 6
    Offsets : 0, 2, 4, 7, 8, 9, 11
    Children :
        1, 2, 3, 4, 5, 6, 7, 0, 8, 9, 10
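
The two-level case can be sketched the same way. This is a plain C++ illustration, not cudf code: `ToyNestedListColumn`, `NestedRows`, and `make_toy_nested_list_column` are hypothetical names. The point it tries to show is that the outer offsets index into the inner lists, while the inner offsets index into the leaf values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a List<List<int32_t>> column (not a cudf type).
struct ToyNestedListColumn {
  std::vector<int32_t> outer_offsets;  // index into the inner lists
  std::vector<int32_t> inner_offsets;  // index into the leaf values
  std::vector<int32_t> leaves;         // all leaf values back to back
};

using NestedRows = std::vector<std::vector<std::vector<int32_t>>>;

ToyNestedListColumn make_toy_nested_list_column(NestedRows const& rows)
{
  ToyNestedListColumn col;
  col.outer_offsets.push_back(0);
  col.inner_offsets.push_back(0);
  for (auto const& row : rows) {
    for (auto const& inner : row) {
      col.leaves.insert(col.leaves.end(), inner.begin(), inner.end());
      col.inner_offsets.push_back(static_cast<int32_t>(col.leaves.size()));
    }
    // Number of inner lists emitted so far is inner_offsets.size() - 1.
    col.outer_offsets.push_back(static_cast<int32_t>(col.inner_offsets.size() - 1));
  }
  return col;
}
```

With the three rows from the example above, the outer offsets end at 6 (the number of inner lists) and the inner offsets end at 11 (the number of leaf values).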

I think it would probably also be useful to have it reproduce the original {} notation as well.
See also : https://arrow.apache.org/docs/format/Columnar.html
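
As a rough idea of what reproducing the {} notation could look like, here is a sketch for the single-level case only. It is plain C++ rather than a change to test::print(); `to_brace_notation` is a hypothetical helper, and nulls and nesting are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Render a single-level list column, given as offsets + children,
// back into brace notation such as "{{2, 3}, {4, 5}}".
std::string to_brace_notation(std::vector<int32_t> const& offsets,
                              std::vector<int32_t> const& children)
{
  std::ostringstream out;
  out << "{";
  for (std::size_t row = 0; row + 1 < offsets.size(); ++row) {
    if (row > 0) { out << ", "; }
    out << "{";
    for (int32_t i = offsets[row]; i < offsets[row + 1]; ++i) {
      if (i > offsets[row]) { out << ", "; }
      out << children[static_cast<std::size_t>(i)];
    }
    out << "}";
  }
  out << "}";
  return out.str();
}
```

An empty row shows up as `{}`, which is exactly the case where the printed Offsets alone are hardest to read.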

GPUtester · 2020-04-22T18:29:01Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

codecov · 2020-04-24T17:44:47Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.15@eb896a0). Click here to learn what that means.
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.15    #4990   +/-   ##
==============================================
  Coverage               ?   88.70%           
==============================================
  Files                  ?       57           
  Lines                  ?    10773           
  Branches               ?        0           
==============================================
  Hits                   ?     9556           
  Misses                 ?     1217           
  Partials               ?        0

Impacted Files	Coverage Δ
python/dask_cudf/dask_cudf/backends.py	`90.97% <0.00%> (ø)`
python/cudf/cudf/utils/gpu_utils.py	`54.54% <0.00%> (ø)`
python/cudf/cudf/core/column/datetime.py	`84.41% <0.00%> (ø)`
python/cudf/cudf/utils/docutils.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/core/column_accessor.py	`93.12% <0.00%> (ø)`
python/cudf/cudf/core/window/rolling.py	`85.26% <0.00%> (ø)`
python/cudf/cudf/io/dlpack.py	`95.00% <0.00%> (ø)`
python/cudf/cudf/utils/applyutils.py	`98.73% <0.00%> (ø)`
python/cudf/cudf/_lib/nvtx/_lib/__init__.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`91.17% <0.00%> (ø)`
... and 47 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb896a0...779d63f. Read the comment docs.

nvdbaranec · 2020-05-19T14:12:47Z

rerun tests

harrism · 2020-05-21T04:04:50Z

> new column view type : lists_column_view
> new test wrapper type : lists_column_wrapper
> list support for cudf::concatenate + tests
> make_lists_column() factory
> tests for lists_column_wrapper
> added indentation support to test::print() code

Given that this is a 2K-line PR (== 2-3 hours to review) and several of the items above are independent, is there any chance it can be broken out into separate PRs?

e.g:
PR 1:

new column view type : lists_column_view and any tests

PR 2:

new test wrapper type : lists_column_wrapper
tests for lists_column_wrapper
added indentation support to test::print() code

PR 3:

list support for cudf::concatenate + tests

nvdbaranec · 2020-05-21T15:02:33Z

Unfortunately, there's not a lot of splitting that can be done. lists_column_wrapper requires concatenate (which in turn requires lists_column_view), and the two of them make up pretty much all of the functionality in the PR. All the tests require expect_columns_equal.
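
For readers unfamiliar with what concatenating list columns involves, here is a minimal sketch on a toy offsets/children representation. It is not the cudf::concatenate implementation: `ToyListColumn` and `concatenate_toy` are hypothetical names, both inputs are assumed to be unsliced (offsets start at 0), and null masks and nested children, which the real code has to handle, are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a single-level list column (not a cudf type).
struct ToyListColumn {
  std::vector<int32_t> offsets;   // starts at 0, one entry per row plus an end offset
  std::vector<int32_t> children;  // all row elements stored back to back
};

// Append b's rows after a's rows. b's offsets must be shifted by the
// number of child elements already present in a.
ToyListColumn concatenate_toy(ToyListColumn const& a, ToyListColumn const& b)
{
  ToyListColumn out = a;
  auto const shift  = static_cast<int32_t>(a.children.size());
  // Skip b.offsets[0]; it is the 0 that a's final offset already covers.
  for (std::size_t i = 1; i < b.offsets.size(); ++i) {
    out.offsets.push_back(b.offsets[i] + shift);
  }
  out.children.insert(out.children.end(), b.children.begin(), b.children.end());
  return out;
}
```

Concatenating a one-row column `{2, 3}` with a two-row column `{4, 5}, {6, 7, 8}` gives offsets 0, 2, 4, 7, matching the three-row example in the PR description.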

I could split out lists_column_view but there's really nothing to that one (no associated tests).

For what it's worth, the overwhelming bulk of the lines of code is the tests for lists_column_wrapper and concatenate. If you read through those lightly it shouldn't be too awful.

This should be the last big PR for all this stuff, I promise :)

davidwendt

Just a comment about "L" type.
I can be convinced or overruled that this is proper style.

davidwendt · 2020-06-04T18:56:36Z

cpp/tests/utilities_tests/lists_column_wrapper_tests.cu

+  using T = TypeParam;
+
+  // to disambiguiate between {} == 0 and {} == List{0}
+  using L = test::lists_column_wrapper<T>;


Could this be "LIST" or "List" perhaps? The "L" was throwing me.
Back in my Unicode days you could make a Unicode string using L"abc".
Also, 'L' can be used to identify long literal values like 2L, so I thought this was making long integers.

nvdbaranec · 2020-06-04

Right. The thinking was "keep it short" to try to avoid cluttering the already eyeball-bending pile of brackets.

nvdbaranec · 2020-06-05

Dave suggested getting an opinion from @codereport, and that maybe LCW{} is a good clarification.

nvdbaranec · 2020-06-05

Trying it out, it looks good to me. Actually maybe even better than L, since the L can get lost:

{{{L{}}}}
vs
{{{LCW{}}}}
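
The `{}` ambiguity mentioned in the diff comment can be reproduced with standard containers. The sketch below uses `std::vector` instead of lists_column_wrapper, so it only illustrates the language rule, not the wrapper's actual constructors; the alias `L` and the three function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Short alias playing the role that L / LCW plays in the tests.
using L = std::vector<int32_t>;

// Inside a list of integers, an empty {} value-initializes an element: it is 0.
std::vector<int32_t> braces_as_zero() { return {1, {}, 3}; }

// Inside a list of lists, the same token is an empty list.
std::vector<L> braces_as_empty_list() { return {{1, 2}, {}}; }

// Spelling the type out removes any doubt about which one is meant.
std::vector<L> explicit_empty_list() { return {L{1, 2}, L{}}; }
```

The first function returns three integers with a 0 in the middle, while the other two return two rows, the second of which is empty.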

rnyak · 2020-06-19T19:57:07Z

@jrhemstad is that functionality available now for rapids 0.15 nightly? Thanks.

kkraus14 · 2020-06-19T20:32:50Z

> @jrhemstad is that functionality available now for rapids 0.15 nightly? Thanks.

This is pure C++ code, is that where you're looking to use it? Nothing has been exposed to Python yet.

rnyak · 2020-06-19T21:38:50Z

@kkraus14 we are looking for a Python API that we can use; @benfred, please correct me if I am wrong. Support for nested list column types in a cudf dataframe is critical for NVTabular use cases.

EvenOldridge · 2020-06-19T23:27:37Z

We need the API but we can contribute it if that's helpful.

jrhemstad · 2020-06-19T23:38:12Z

List types are still extremely experimental and supported by almost no features. We're continuing to add more support. What operations are highest priority for you when working with list types?

EvenOldridge · 2020-06-19T23:51:23Z

Thanks Jake. Our top priorities are:

Parquet read and write
count (elementwise across the set) -> we want to build a categorical encoder that works on the elements in the list

@rnyak @benfred @alecgunny Any others that are needed for multi-hot categorical support?
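
One possible reading of the elementwise count request, sketched in plain C++ on the flattened children of a list column. `count_elements` is a hypothetical helper, not a cudf API, and treating the count as ignoring row boundaries is an assumption about what "across the set" means.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Count how often each value occurs across all lists. Row boundaries
// (the offsets) do not matter here, so only the children are needed.
std::map<int32_t, int> count_elements(std::vector<int32_t> const& children)
{
  std::map<int32_t, int> counts;
  for (int32_t value : children) { ++counts[value]; }
  return counts;
}
```

A categorical encoder could then assign an id to each distinct key of the returned map.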

rnyak · 2020-06-20T00:20:20Z

@jrhemstad @EvenOldridge

I'd say join operation of two gdfs with list type columns is important. Also, we should be able to do some filtering ops on these columns.

jrhemstad · 2020-06-20T00:24:20Z

> @jrhemstad @EvenOldridge
>
> I'd say join operation of two gdfs with list type columns is important. Also, we should be able to do some filtering ops on these columns.

Do you mean join on the list columns? Or join on different columns, and list columns just come along for the ride?

alecgunny · 2020-06-21T00:55:10Z

For my part, a big one would be hash_values hashing the values in the lists, not the lists themselves. Though I understand other applications may prefer the inverse.

rnyak · 2020-06-21T20:53:04Z

> > @jrhemstad @EvenOldridge
> > I'd say join operation of two gdfs with list type columns is important. Also, we should be able to do some filtering ops on these columns.
>
> Do you mean join on the list columns? Or join on different columns, and list columns just come along for the ride?

@jrhemstad Thanks for the quick response.
For our current use case, it is the latter. Join on different columns, and list columns just come along for the ride. What Alec stated, hash_values for the values in the lists, is also important.
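
When list columns "come along for the ride" in a join, the join itself produces row indices from the key columns and the list rows are then gathered by those indices. The following is a sketch of that gather step on a toy representation. `ToyListColumn` and `gather_rows` are hypothetical names, the indices are assumed to be in range, and this is not the cudf gather implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a single-level list column (not a cudf type).
struct ToyListColumn {
  std::vector<int32_t> offsets;
  std::vector<int32_t> children;
};

// Build a new list column containing the requested rows, in order.
// Rows may repeat, as they would for duplicate join matches.
ToyListColumn gather_rows(ToyListColumn const& col, std::vector<std::size_t> const& row_indices)
{
  ToyListColumn out;
  out.offsets.push_back(0);
  for (std::size_t row : row_indices) {
    auto const first = col.children.begin() + col.offsets[row];
    auto const last  = col.children.begin() + col.offsets[row + 1];
    out.children.insert(out.children.end(), first, last);
    out.offsets.push_back(static_cast<int32_t>(out.children.size()));
  }
  return out;
}
```

Because rows have different lengths, the output offsets have to be rebuilt rather than copied.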

harrism · 2020-06-21T23:45:15Z

When you say "hash the values in the lists", do you mean you want a column of lists of hash values as output? Or do you mean you want to hash a list of values so the output is a column of a single hash value per list?

alecgunny · 2020-06-22T15:03:49Z

@harrism the former
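
The two interpretations in the question above can be contrasted with a small sketch. It is plain C++ using `std::hash`, not the cudf hashing API; `ToyHashedLists`, `hash_elements`, and `hash_rows` are hypothetical names, and the per-row combiner is an arbitrary choice made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// "The former": a column of lists of hash values. The offsets are
// unchanged and every child value is replaced by its hash.
struct ToyHashedLists {
  std::vector<int32_t> offsets;
  std::vector<std::size_t> hashes;
};

ToyHashedLists hash_elements(std::vector<int32_t> const& offsets,
                             std::vector<int32_t> const& children)
{
  ToyHashedLists out{offsets, {}};
  for (int32_t value : children) { out.hashes.push_back(std::hash<int32_t>{}(value)); }
  return out;
}

// "The latter": one hash per list, so the output has one entry per row.
std::vector<std::size_t> hash_rows(std::vector<int32_t> const& offsets,
                                   std::vector<int32_t> const& children)
{
  std::vector<std::size_t> out;
  for (std::size_t row = 0; row + 1 < offsets.size(); ++row) {
    std::size_t h = 0;
    for (int32_t i = offsets[row]; i < offsets[row + 1]; ++i) {
      // Simple illustrative combiner; unsigned overflow wraps by definition.
      h = h * 31 + std::hash<int32_t>{}(children[static_cast<std::size_t>(i)]);
    }
    out.push_back(h);
  }
  return out;
}
```

The first form keeps the list structure, which is what an element-level categorical encoder would need.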

rnyak · 2020-06-25T20:48:41Z

@harrism @kkraus14 @jrhemstad I'd like to introduce Kyle Kranen (@kkranen), a DL SW Eng from DL-algo, to you. Kyle is working with us on implementing the JoC W&D model with NVTabular. He is willing to help with the development of the Python API for nested list column types.

kkranen · 2020-06-25T20:53:45Z

@harrism @kkraus14 @jrhemstad As @rnyak mentioned, I'd love to help accelerate the inclusion of nested list support into the next release of CUDF. I'd love to sync with you to discuss next steps and how I can help.

rnyak · 2020-06-29T23:29:31Z

@harrism @kkraus14 @jrhemstad we prepared a requirements doc for nested list columns and wanted to share it with you.

harrism · 2020-06-30T04:50:34Z

@nvdbaranec See above. Should really be a GitHub issue, not a Google doc.

rnyak · 2020-06-30T18:13:33Z

@harrism we can create the GH FEAs accordingly.

Add lists_column_view, list_column_wrapper, list support for concatenate, test::print changes.

15048c8

nvdbaranec added the libcudf (Affects libcudf (C++/CUDA) code.) and Spark (Functionality that helps Spark RAPIDS) labels Apr 22, 2020

nvdbaranec requested review from jrhemstad and davidwendt April 22, 2020 18:28

nvdbaranec requested review from a team as code owners April 22, 2020 18:28

Changelog for 4990

efc4309

nvdbaranec mentioned this pull request Apr 22, 2020

[REVIEW] Nested types, phase 1 #4972

Merged

nvdbaranec added 3 commits April 22, 2020 15:09

Merge branch 'nested_types_phase1' into nested_types_phase2

c76d51b

Merge branch 'nested_types_phase1' into nested_types_phase2

09e2a80

Re-applying clang-format.

c8e8f0c

nvdbaranec mentioned this pull request May 1, 2020

[REVIEW] gather() support for LIST column type. #5073

Merged

nvdbaranec added 3 commits May 6, 2020 13:40

Merge branch 'nested_types_phase1' into nested_types_phase2

d5c5b3f

Merge branch 'nested_types_phase1' into nested_types_phase2

fb936fa

Merge branch 'branch-0.14' into nested_types_phase2

2cd4183

Cleanup for flipping over to REVIEW status

dcce63e

nvdbaranec changed the title ~~[WIP] Nested types, phase 2~~ [REVIEW] Nested types, phase 2 May 20, 2020

nvdbaranec added the 3 - Ready for Review (Ready for review by team) label May 20, 2020

nvdbaranec added 2 commits May 20, 2020 13:32

Merge branch 'branch-0.14' into nested_types_phase2

ed4674a

Add headers to yaml file. Run clang format.

a29e654

nvdbaranec requested a review from a team as a code owner May 20, 2020 18:55

nvdbaranec added 2 commits May 20, 2020 14:53

Cleaned up some incomplete/incorrect docs. Couple of code clarifications

03f2aeb

Properly concatenate null masks in list concatenate. Improved test::print() for lists to display the bitmask. Added more tests.

ef0e638

raydouglass approved these changes May 21, 2020

nvdbaranec requested a review from davidwendt June 4, 2020 16:04

davidwendt requested changes Jun 4, 2020

Change L{} to LCW{} for representing empty lists in lists tests.

779d63f

davidwendt approved these changes Jun 5, 2020

EvenOldridge mentioned this pull request Jun 8, 2020

[FEA] Multi-hot categorical support NVIDIA-Merlin/NVTabular#43

Closed

9 tasks

harrism assigned nvdbaranec Jun 10, 2020

jrhemstad approved these changes Jun 11, 2020

jrhemstad merged commit ad9dea1 into rapidsai:branch-0.15 Jun 11, 2020

nvdbaranec mentioned this pull request Jun 11, 2020

[REVIEW] Build compile error fix in column_utilities.cu #5446

Merged

nvdbaranec mentioned this pull request Aug 6, 2020

[FEA] Nested types : Test framework functionality #4832

Closed


