Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Nested list column types, phase 2 #4990

Merged
merged 22 commits into from
Jun 11, 2020

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Apr 22, 2020

  • new column view type : lists_column_view
  • new test wrapper type : lists_column_wrapper
  • list support for cudf::concatenate + tests
  • make_lists_column() factory
  • tests for lists_column_wrapper
  • added indentation support to test::print() code

Notable other things:

  • Both the underlying columns and the parent/child structure of lists columns are very slippery to get your head wrapped around. The test::print() functionality is the most useful way of understanding the structure. Some examples:

    test::list_column_wrapper list { {2, 3} };
    test::print(list);
    
    List<int32_t>:
    Length : 1
    Offsets : 0, 2
    Children : 
          2, 3
    
    test::list_column_wrapper list { {2, 3}, {4, 5}, {6, 7, 8} };
    test::print(list);
    
    List<int32_t>:
    Length : 3
    Offsets : 0, 2, 4, 7
    Children :
       2, 3, 4, 5, 6, 7, 8
    
     test::list_column_wrapper list { {{1, 2}, {3, 4}}, {{5, 6, 7}, {0}, {8}}, {{9, 10}} };    
     test::print(list);
    
     List<List<int32_t>>:
     Length : 3
     Offsets : 0, 2, 5, 6
     Children :
         List<int32_t>:
         Length : 6
         Offsets : 0, 2, 4, 7, 8, 9, 11
         Children :
             1, 2, 3, 4, 5, 6, 7, 0, 8, 9, 10
    

    I think it would probably also be useful to have it reproduce the original {} notation as well.
    See also : https://arrow.apache.org/docs/format/Columnar.html

@nvdbaranec nvdbaranec added libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Apr 22, 2020
@nvdbaranec nvdbaranec requested review from a team as code owners April 22, 2020 18:28
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@codecov
Copy link

codecov bot commented Apr 24, 2020

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.15@eb896a0). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff               @@
##             branch-0.15    #4990   +/-   ##
==============================================
  Coverage               ?   88.70%           
==============================================
  Files                  ?       57           
  Lines                  ?    10773           
  Branches               ?        0           
==============================================
  Hits                   ?     9556           
  Misses                 ?     1217           
  Partials               ?        0           
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/backends.py 90.97% <0.00%> (ø)
python/cudf/cudf/utils/gpu_utils.py 54.54% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 84.41% <0.00%> (ø)
python/cudf/cudf/utils/docutils.py 100.00% <0.00%> (ø)
python/cudf/cudf/core/column_accessor.py 93.12% <0.00%> (ø)
python/cudf/cudf/core/window/rolling.py 85.26% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 95.00% <0.00%> (ø)
python/cudf/cudf/utils/applyutils.py 98.73% <0.00%> (ø)
python/cudf/cudf/_lib/nvtx/_lib/__init__.py 100.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 91.17% <0.00%> (ø)
... and 47 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb896a0...779d63f. Read the comment docs.

@nvdbaranec
Copy link
Contributor Author

rerun tests

@nvdbaranec nvdbaranec changed the title [WIP] Nested types, phase 2 [REVIEW] Nested types, phase 2 May 20, 2020
@nvdbaranec nvdbaranec added the 3 - Ready for Review Ready for review by team label May 20, 2020
@nvdbaranec nvdbaranec requested a review from a team as a code owner May 20, 2020 18:55
@harrism
Copy link
Member

harrism commented May 21, 2020

  • new column view type : lists_column_view

  • new test wrapper type : lists_column_wrapper

  • list support for cudf::concatenate + tests

  • make_lists_column() factory

  • tests for lists_column_wrapper

  • added indentation support to test::print() code

Given that this is a 2K line PR (== 2-3 hours to review), and multiple of the things above are independent, any chance this can be broken out into separate PRs?

e.g:
PR 1:

  • new column view type : lists_column_view and any tests

PR 2:

  • new test wrapper type : lists_column_wrapper
  • tests for lists_column_wrapper
  • added indentation support to test::print() code

PR3:

  • list support for cudf::concatenate + tests

@nvdbaranec
Copy link
Contributor Author

nvdbaranec commented May 21, 2020

Unfortunately, there's not a lot of splitting that can be done. lists_column_wrapper requires concatenate (which then requires lists_column_view) and the two of them make up pretty much all of the functionality in the PR. All the tests require expect_columns_equal

I could split out lists_column_view but there's really nothing to that one (no associated tests).

For what it's worth, the overwhelming bulk in terms of lines of code are the tests for list_column_wrapper and concatenate. If you read through those lightly it shouldn't be too awful.

This should be the last big PR for all this stuff, I promise :)

@nvdbaranec nvdbaranec requested a review from davidwendt June 4, 2020 16:04
Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment about "L" type.
I can be convinced or overruled that this is proper style.

using T = TypeParam;

// to disambiguiate between {} == 0 and {} == List{0}
using L = test::lists_column_wrapper<T>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be "LIST" or "List" perhaps? The "L" was throwing me.
Back in my unicode days you could make a unicode string using L"abc"
Also, 'L' can be used to identify long literals values like 2L so I thought this was make long integers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. The thinking was "keep it short" to try and avoid cluttering the already eyeball-bending pile of brackets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dave suggested an opinion from @codereport and that maybe
LCW{}
is a good clarification

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying it out, it looks good to me. Actually maybe even better than L, since the L can get lost:

{{{L{}}}}
vs
{{{LCW{}}}}

@jrhemstad jrhemstad merged commit ad9dea1 into rapidsai:branch-0.15 Jun 11, 2020
@rnyak
Copy link
Contributor

rnyak commented Jun 19, 2020

@jrhemstad is that functionality available now for rapids 0.15 nightly? Thanks.

@kkraus14
Copy link
Collaborator

@jrhemstad is that functionality available now for rapids 0.15 nightly? Thanks.

This is pure C++ code, is that where you're looking to use it? Nothing has been exposed to Python yet.

@rnyak
Copy link
Contributor

rnyak commented Jun 19, 2020

@kkraus14 we are looking for Python api that we can use, @benfred please correct me if I am wrong. The support of nested list column types in a cudf dataframe is critical for NVTabular use cases.

@EvenOldridge
Copy link

EvenOldridge commented Jun 19, 2020

We need the API but we can contribute it if that's helpful.

@jrhemstad
Copy link
Contributor

List types are still extremely experimental and supported by almost no features. We're continuing to add more support. What operations are highest priority for you when working with list types?

@EvenOldridge
Copy link

Thanks Jake. Our top priorities are:

  • Parquet read and write
  • count (elementwise across the set) -> we want to build a categorical encoder that works on the elements in the list

@rnyak @benfred @alecgunny Any others that are needed for multi-hot categorical support?

@rnyak
Copy link
Contributor

rnyak commented Jun 20, 2020

@jrhemstad @EvenOldridge

I'd say join operation of two gdfs with list type columns is important. Also, we should be able to do some filtering ops on these columns.

@jrhemstad
Copy link
Contributor

@jrhemstad @EvenOldridge

I'd say join operation of two gdfs with list type columns is important. Also, we should be able to do some filtering ops on these columns.

Do you mean join on the list columns? Or join on different columns, and list columns just come along for the ride?

@alecgunny
Copy link

For my part, a big one would be hash_values hashing the values in the lists, not the lists themselves. Though I understand other applications may prefer in the inverse.

@rnyak
Copy link
Contributor

rnyak commented Jun 21, 2020

@jrhemstad @EvenOldridge
I'd say join operation of two gdfs with list type columns is important. Also, we should be able to do some filtering ops on these columns.

Do you mean join on the list columns? Or join on different columns, and list columns just come along for the ride?

@jrhemstad Thanks for the quick response.
For our current use case, it is the latter. Join on different columns, and list columns just come along for the ride. What Alec stated, hash_values for the values in the lists, is also important.

@harrism
Copy link
Member

harrism commented Jun 21, 2020

When you say "has the values in the lists", do you mean you want a column of lists of hash values as output? Or do you mean you want to hash a list of values so the output is a column of a single hash value per list?

@alecgunny
Copy link

@harrism the former

@rnyak
Copy link
Contributor

rnyak commented Jun 25, 2020

@harrism @kkraus14 @jrhemstad I'd like to introduce Kyle Kranen (@kkranen), a DL SW Eng from DL-algo, to you. Kyle is working with us for implementation of JoC W&D model with NVTabular. He is willing to help with the development of Python API for nested list column types.

@kkranen
Copy link

kkranen commented Jun 25, 2020

@harrism @kkraus14 @jrhemstad As @rnyak mentioned, I'd love to help accelerate the inclusion of nested list support into the next release of CUDF. I'd love to sync with you to discuss next steps and how I can help.

@rnyak
Copy link
Contributor

rnyak commented Jun 29, 2020

@harrism @kkraus14 @jrhemstad we prepared a requirements doc for nested list columns. wanted to share with you.

@harrism
Copy link
Member

harrism commented Jun 30, 2020

@nvdbaranec See above. Should really be a github issue, not a google doc.

@rnyak
Copy link
Contributor

rnyak commented Jun 30, 2020

@harrism we can create the GH FEAs accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.