[FEA] Support collect_set #2973
Comments
Not being a pandas or Spark programmer, I don't really understand what this is doing. I'm trying to understand the non-groupby example. Why is it dropping duplicates?
Also, can you explain the difference between collect_list and collect_set?
Yep. The difference between collect_set and collect_list is whether duplicate values are removed or preserved (set is just the unique values, while list is all the values). From my perspective, the non-groupby example is primarily there for completeness. In my experience, the primary times I've used or seen these in Spark-SQL have been in combination with groupby. Additionally, I've found the collect_list/set pattern to be quite common in Spark-SQL, but less so in pandas. It would be good to hear from @randerzander, @efajardo-nv, and @BartleyR on that topic as well.
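A minimal PySpark sketch of the list-vs-set distinction (added for illustration; not from the original thread):

```
# collect_list preserves duplicates; collect_set keeps only the unique values.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 1), ("a", 2)], ["key", "val"])

df.groupBy("key").agg(
    F.collect_list("val").alias("as_list"),  # [1, 1, 2] -- duplicates kept
    F.collect_set("val").alias("as_set"),    # [1, 2] -- duplicates dropped (order not guaranteed)
).show()
```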
Right, duh -- set vs. list. So on a column/series, collect_set could either be done with our […]
On Spark it's not a no-op. […] Agree with @beckernick that I do not expect […]
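For context (my illustration, not from the thread): in Spark, collect_list over a bare column is not a no-op because it reduces the whole column to a single array-valued row.

```
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (1,), (2,)], ["val"])

# Aggregates the entire column into one row containing an array.
df.agg(F.collect_list("val")).show()
# +-----------------+
# |collect_list(val)|
# +-----------------+
# |        [1, 1, 2]|
# +-----------------+
```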
So if collect_list removes duplicates, how is it different from collect_set (which removes duplicates)?
Sorry, I misspoke earlier: collect_list preserves duplicates, while collect_set removes them.
Agree that […]
I believe the more Pandas-friendly way of doing this would be to use […]
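The comment above is cut off; a plausible completion (my guess -- drop_duplicates/unique are assumptions, not necessarily what the author meant) would be the built-in pandas de-duplication methods:

```
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])
s.drop_duplicates()  # Series of unique values (index preserved)
s.unique()           # array([1, 2, 3])
```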
This partially addresses #2973. This PR implements the groupby `collect_set` aggregation. The idea is to simply apply `drop_list_duplicates` (#7528) to the result generated by groupby `collect_list`, obtaining collected lists without duplicate entries. Example:

```
keys = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3};
vals = {10, 11, 10, 10, 20, 21, 21, 20, 30, 33, 32, 31};

keys_output = {1, 2, 3};
vals_output = {{10, 11}, {20, 21}, {30, 31, 32, 33}};
```

In this PR, a simple, incomplete Python binding for `collect_set` has been added; no Java binding is implemented yet. Complete bindings for the Python and Java sides will need to be implemented later in separate PRs.

Authors:
- Nghia Truong (@ttnghia)

Approvers:
- AJ Schmidt (@ajschmidt8)
- Karthikeyan (@karthikeyann)
- Keith Kraus (@kkraus14)
- Jason Lowe (@jlowe)
- Ashwin Srinath (@shwina)

URL: #7420
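For readers unfamiliar with the libcudf notation above, a rough pandas equivalent of that example (my sketch; this is not the cuDF API):

```
import pandas as pd

keys = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
vals = [10, 11, 10, 10, 20, 21, 21, 20, 30, 33, 32, 31]

df = pd.DataFrame({"key": keys, "val": vals})
df.groupby("key")["val"].apply(lambda s: sorted(s.unique()))
# key
# 1            [10, 11]
# 2            [20, 21]
# 3    [30, 31, 32, 33]
```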
The libcudf side is implemented; all that's left is the Python side to expose it via […]
Original issue:

I'd like to be able to `collect_set` like I would in Spark-SQL or in pandas using a lambda function (though actually doing it with a lambda function in Python isn't too important). This could be used on a column, but it is also particularly useful for groupby operations. Spark API doc. See also #2974, as they will likely be able to share a significant portion of the implementation.
Groupby examples:
Pandas:
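(The original snippet was lost in extraction; below is a representative sketch assuming a toy key/value frame.)

```
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1, 1, 2, 3]})
df.groupby("key")["val"].apply(lambda s: set(s))
# key
# a       {1}
# b    {2, 3}
```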
Pyspark:
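(Again, the original snippet was lost; this is a representative sketch.)

```
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("a", 1), ("a", 1), ("b", 2), ("b", 3)], ["key", "val"])

sdf.groupBy("key").agg(F.collect_set("val")).show()
# +---+----------------+
# |key|collect_set(val)|
# +---+----------------+
# |  a|             [1]|
# |  b|          [2, 3]|
# +---+----------------+
```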
Non-groupby examples:
Pyspark:
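(Representative sketch of the lost snippet: collect_set over a whole column, no groupby.)

```
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1,), (1,), (2,), (3,)], ["val"])

# Reduces the column to a single row holding the unique values (order not guaranteed).
sdf.agg(F.collect_set("val")).show()
```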
Pandas:
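(Representative sketch of the lost snippet: the pandas analogue is just the unique values of a Series.)

```
import pandas as pd

s = pd.Series([1, 1, 2, 3])
set(s)       # {1, 2, 3}
s.unique()   # array([1, 2, 3])
```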