[REVIEW] Nested list column types, phase 2 #4990
…ate, test::print changes.
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Codecov Report
@@ Coverage Diff @@
## branch-0.15 #4990 +/- ##
==============================================
Coverage ? 88.70%
==============================================
Files ? 57
Lines ? 10773
Branches ? 0
==============================================
Hits ? 9556
Misses ? 1217
Partials ? 0
Continue to review full report at Codecov.
rerun tests
…rint() for lists to display the bitmask. Added more tests.
Given that this is a 2K line PR (== 2-3 hours to review), and multiple of the things above are independent, any chance this can be broken out into separate PRs? e.g:
PR 2:
PR3:
Unfortunately, there's not a lot of splitting that can be done. I could split out For what it's worth, the overwhelming bulk in terms of lines of code are the tests for This should be the last big PR for all this stuff, I promise :)
Just a comment about "L" type.
I can be convinced or overruled that this is proper style.
using T = TypeParam;

// to disambiguate between {} == 0 and {} == List{0}
using L = test::lists_column_wrapper<T>;
Could this be "LIST" or "List" perhaps? The "L" was throwing me.
Back in my Unicode days you could make a wide-character string using L"abc".
Also, 'L' can be used to identify long literal values like 2L,
so I thought this was making long integers.
Right. The thinking was "keep it short" to try and avoid cluttering the already eyeball-bending pile of brackets.
Dave suggested getting an opinion from @codereport, and that maybe
LCW{}
would be a good clarification.
Trying it out, it looks good to me. Actually maybe even better than L, since the L can get lost:
{{{L{}}}}
vs
{{{LCW{}}}}
@jrhemstad is that functionality available now for rapids 0.15 nightly? Thanks.
This is pure C++ code, is that where you're looking to use it? Nothing has been exposed to Python yet.
We need the API but we can contribute it if that's helpful.
List types are still extremely experimental and supported by almost no features. We're continuing to add more support. What operations are highest priority for you when working with list types?
Thanks Jake. Our top priorities are:
@rnyak @benfred @alecgunny Any others that are needed for multi-hot categorical support?
I'd say join operation of two gdfs with list type columns is important. Also, we should be able to do some filtering ops on these columns.
Do you mean join on the list columns? Or join on different columns, and list columns just come along for the ride?
For my part, a big one would be
@jrhemstad Thanks for the quick response.
When you say "has the values in the lists", do you mean you want a column of lists of hash values as output? Or do you mean you want to hash a list of values so the output is a column of a single hash value per list?
@harrism the former
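For readers following the exchange above, the two readings of "hash the values in the lists" can be sketched in plain standard C++. This is only an illustration on host-side std::vector data, not cudf code: the names Row, hash_elements and hash_rows are hypothetical, and the hash-combining formula in hash_rows is an arbitrary choice made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for one row of a list column (not a cudf type).
using Row = std::vector<int32_t>;

// Reading 1 ("the former"): hash every element, keep the list shape.
// Output is a list of hash values per input row.
std::vector<std::vector<std::size_t>> hash_elements(const std::vector<Row>& rows) {
  std::vector<std::vector<std::size_t>> out;
  out.reserve(rows.size());
  for (const auto& row : rows) {
    std::vector<std::size_t> hashed;
    hashed.reserve(row.size());
    for (int32_t v : row) hashed.push_back(std::hash<int32_t>{}(v));
    out.push_back(std::move(hashed));
  }
  assert(out.size() == rows.size());
  return out;
}

// Reading 2: collapse each row into a single hash value.
// The combining step is an arbitrary mixing formula chosen for illustration.
std::vector<std::size_t> hash_rows(const std::vector<Row>& rows) {
  std::vector<std::size_t> out;
  out.reserve(rows.size());
  for (const auto& row : rows) {
    std::size_t seed = row.size();
    for (int32_t v : row) {
      seed ^= std::hash<int32_t>{}(v) + 0x9e3779b9u + (seed << 6) + (seed >> 2);
    }
    out.push_back(seed);
  }
  assert(out.size() == rows.size());
  return out;
}
```

With rows {{1, 2, 3}, {}, {7}}, hash_elements returns three lists of sizes 3, 0 and 1, while hash_rows returns three scalar values, one per row.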
@harrism @kkraus14 @jrhemstad I'd like to introduce Kyle Kranen (@kkranen), a DL SW Eng from DL-algo, to you. Kyle is working with us for implementation of JoC W&D model with NVTabular. He is willing to help with the development of Python API for nested list column types.
@harrism @kkraus14 @jrhemstad As @rnyak mentioned, I'd love to help accelerate the inclusion of nested list support into the next release of CUDF. I'd love to sync with you to discuss next steps and how I can help.
@harrism @kkraus14 @jrhemstad we prepared a requirements doc for nested list columns. wanted to share with you.
@nvdbaranec See above. Should really be a github issue, not a google doc.
@harrism we can create the GH FEAs accordingly.
lists_column_view
lists_column_wrapper
cudf::concatenate + tests
make_lists_column() factory
lists_column_wrapper
test::print() code
Notable other things:
Both the underlying columns and the parent/child structure of lists columns are very slippery to get your head wrapped around. The test::print() functionality is the most useful way of understanding the structure. Some examples:
I think it would probably also be useful to have it reproduce the original {} notation as well.
See also : https://arrow.apache.org/docs/format/Columnar.html
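As a rough illustration of the parent/child structure mentioned above, the Arrow columnar format linked here stores a list column as an offsets buffer plus a flattened child holding the values. The sketch below models a single nesting level in plain standard C++; ToyListColumn, make_toy_list_column and to_brace_string are hypothetical names for this example and are not part of cudf. Null bitmasks and deeper nesting are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy type (not the cudf API): one level of an Arrow-style list column.
struct ToyListColumn {
  std::vector<int32_t> offsets;  // number of rows + 1 entries, starting at 0
  std::vector<int32_t> child;    // every leaf value, flattened in row order
};

// Flatten rows such as {{1, 2}, {}, {3}} into offsets + child.
ToyListColumn make_toy_list_column(const std::vector<std::vector<int32_t>>& rows) {
  ToyListColumn col;
  col.offsets.push_back(0);
  for (const auto& row : rows) {
    col.child.insert(col.child.end(), row.begin(), row.end());
    col.offsets.push_back(static_cast<int32_t>(col.child.size()));
  }
  return col;
}

// Rebuild the brace notation from the flattened form.
// Row i spans child[offsets[i]] up to, but not including, child[offsets[i + 1]].
std::string to_brace_string(const ToyListColumn& col) {
  assert(!col.offsets.empty());
  std::string out = "{";
  for (std::size_t row = 0; row + 1 < col.offsets.size(); ++row) {
    if (row > 0) out += ", ";
    out += "{";
    for (int32_t i = col.offsets[row]; i < col.offsets[row + 1]; ++i) {
      if (i > col.offsets[row]) out += ", ";
      out += std::to_string(col.child[static_cast<std::size_t>(i)]);
    }
    out += "}";
  }
  return out + "}";
}
```

For the rows {{1, 2}, {}, {3}}, the offsets are 0, 2, 2, 3 and the child is 1, 2, 3. The empty row shows up only as two equal consecutive offsets, which is part of why the flattened form is hard to read without a printing helper.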