Add Python bindings for lists::concatenate_list_elements and expose them as .list.concat() #8006
@@ -451,3 +457,56 @@ def sort_values(
            sort_lists(self._column, ascending, na_position),
            retain_index=not ignore_index,
        )

    def ravel(self) -> ParentType:
Could use some help with the naming and docstring here. Unlike np.ravel, this function only removes one level of nesting from each row. What's a better name for that?
What's a better name for that?
unpack, unbox? (Along with a parameter n for how many levels of nesting to remove? But maybe this is something we can do in a future PR.)
I like unbox, and I also like the suggestion for an n parameter.
What about flatten? That's what Spark uses: https://spark.apache.org/docs/latest/api/sql/index.html#flatten
Sounds good to me.
Do we still want the n parameter with flatten?
Looks like Spark's flatten only flattens one level, whereas numpy's flatten behaves like ravel and flattens all the way down to 1d: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html#numpy.ndarray.flatten
Any thoughts on what makes the most sense for us?
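For reference, the two behaviours being compared can be sketched in plain Python. The helper names `flatten_one_level` and `flatten_fully` are illustrative only; they are not part of cudf, Spark, or NumPy, and they only model the semantics on ordinary Python lists.

```python
def flatten_one_level(row):
    """Remove exactly one level of nesting (the Spark-style behaviour)."""
    return [item for inner in row for item in inner]


def flatten_fully(row):
    """Recursively flatten down to a 1-D list (the numpy-style behaviour)."""
    out = []
    for item in row:
        if isinstance(item, list):
            out.extend(flatten_fully(item))
        else:
            out.append(item)
    return out


nested = [[[1, 2], [3, 4]], [[5, 6, 7], [8, 9]]]
one_level = flatten_one_level(nested)  # inner lists survive
fully = flatten_fully(nested)          # only scalars survive
```

With a single level removed, the result is still a list of lists; the fully flattened result contains only scalars.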
I went with .list.concat() as that's really what we're doing (concatenating the inner lists). If we only have a single inner list, nesting is removed:
In [15]: s
Out[15]:
0 [[1, 2], [3, 4]]
dtype: list
In [16]: s.list.concat()
Out[16]:
0 [1, 2, 3, 4]
dtype: list
In [18]: s
Out[18]:
0 [[1, 2]]
dtype: list
In [19]: s.list.concat()
Out[19]:
0 [1, 2]
dtype: list
1    [6.0, nan, 7.0, 8.0, 9.0]
dtype: list

Null values at the top-level in each row are dropped:
Is this desirable? If not, what should our behaviour be?
Seems ok to me... This may introduce ambiguity between
flatten([[1, 2, 3], None, [4, 5]])
flatten([[1, 2, 3], [], [4, 5]])
Even so, when unwrapping nested lists it sounds reasonable to assume that neither an empty list nor a null item contributes any concrete elements.
I exposed a dropna param which will drop null values by default. If set to False, the corresponding row in the result is null.
(These are the options available at the libcudf level.)
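The two null-handling options described above can be modelled on plain Python lists. This is only a sketch of the semantics as stated in this thread, with `None` standing in for a null; `concat_row` is a hypothetical helper, not the cudf or libcudf implementation.

```python
def concat_row(row, dropna=True):
    """Concatenate the inner lists of one row; None stands in for a null."""
    if row is None:
        return None
    result = []
    for inner in row:
        if inner is None:
            if dropna:
                continue  # ignore the null inner list and keep concatenating
            return None   # a null inner list nullifies the whole row
        result.extend(inner)  # nulls *inside* an inner list are kept as-is
    return result


with_null = [[1, 2, 3], None, [4, 5]]
with_empty = [[1, 2, 3], [], [4, 5]]

dropped = concat_row(with_null)                   # null inner list is skipped
nullified = concat_row(with_null, dropna=False)   # whole row becomes null
from_empty = concat_row(with_empty, dropna=False) # empty list adds nothing
```

Under `dropna=True` the null and empty-list inputs give the same result, which is the ambiguity raised earlier; `dropna=False` distinguishes them.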
Codecov Report
@@ Coverage Diff @@
## branch-21.08 #8006 +/- ##
===============================================
Coverage ? 10.62%
===============================================
Files ? 109
Lines ? 18635
Branches ? 0
===============================================
Hits ? 1980
Misses ? 16655
Partials ? 0
Continue to review full report at Codecov.
@@ -11,6 +11,8 @@
if TYPE_CHECKING:
    from cudf.core.column import ColumnBase

ParentType = Union["cudf.Series", "cudf.Index"]
Can this become SingleColumnFrame from #8115?
I'm changing this in #8306.
python/cudf/cudf/tests/test_list.py
Outdated
[[1, 2], [3, 4, 5]],
[[1, 2, None], [3, 4, 5]],
[[[1, 2], [3, 4]], [[5, 6, 7], [8, 9]]],
[[["a", "c", "de", None], None, ["fg"]], [["abc", "de"], None]],
Try a few empty list items in here as well?
Done
Hi @shwina! FYI, I'm working on another list concatenation API (
I'm not sure if this can be applied to this PR, or more general list flattening usages. I imagine that if we want to flatten every N lists (concatenate every N contiguous lists into one list), then just generate the same key for every N indices and call
@ttnghia That does sound useful, although there's the unnecessary overhead of allocating a list column of keys. Could libcudf include a less general API that simply concatenates all the lists in each row? In any case, what is the behaviour with nulls, i.e., what if an index corresponds to a
Yes, of course we can have that API. In For a null list element, there is an option to choose: either to ignore the null and continue concatenating the remaining lists, or nullify the entire result (concatenation involving a null list will result in a null list).
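The "same key for every N indices" idea mentioned above can be illustrated on plain Python lists. This is a rough sketch only: the libcudf API being referred to is not named in the surviving text, so `concat_every_n` is a hypothetical helper and does not model any specific libcudf call.

```python
from itertools import groupby


def concat_every_n(lists, n):
    """Concatenate every n contiguous lists into one list."""
    # Assign the same key to each run of n consecutive indices.
    keys = [i // n for i in range(len(lists))]
    out = []
    for _, group in groupby(zip(keys, lists), key=lambda kv: kv[0]):
        merged = []
        for _, lst in group:
            merged.extend(lst)
        out.append(merged)
    return out


grouped = concat_every_n([[1], [2, 3], [4], [5, 6], [7]], 2)
```

The explicit `keys` list here corresponds to the extra key column whose allocation overhead is pointed out above; an API that concatenates all lists within a row would not need it.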
if not isinstance(result_dtype, ListDtype):
    return self._return_or_inplace(self._column)
May want to mention in the docstring that this API is designed to work on Lists of Lists and that if you only give it one level it doesn't return an integral type.
Or we could decide to allow it to return an integral type.
Co-authored-by: Nghia Truong <[email protected]>
rerun tests
@gpucibot merge
Adds a method to concatenate the lists in a nested list Series: