
[FEA] Support collect_set #2973

Closed · beckernick opened this issue Oct 4, 2019 · 10 comments · Fixed by #7726

Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

beckernick (Member) commented Oct 4, 2019

I'd like to be able to `collect_set` as I can in Spark-SQL, or in pandas via a lambda function (though support through an actual Python lambda isn't important). This could be used on a single column, but it's particularly useful in groupby operations. See the Spark API doc.

See also #2974, as the two features will likely share a significant portion of the implementation.

Groupby examples:

Pandas:

```
import pandas as pd

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

res = df.groupby("col1").agg({
    "col2": lambda x: set(x)
})
print(res)
          col2
col1
0     {12, 38}
1     {42, 93}
```

Pyspark:

```
import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

sdf = sqlContext.createDataFrame(df)
res = sdf.groupby("col1").agg(F.collect_set("col2"))
res.show()
+----+-----------------+
|col1|collect_set(col2)|
+----+-----------------+
|   0|         [12, 38]|
|   1|         [42, 93]|
+----+-----------------+
```

Non-groupby examples:

Pyspark:

```
import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

sdf = sqlContext.createDataFrame(df)
sdf.agg(F.collect_set("col2")).show()
+-----------------+
|collect_set(col2)|
+-----------------+
| [12, 38, 42, 93]|
+-----------------+
```

Pandas:

```
import pandas as pd

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

df[['col2']].apply(lambda x: set(x))
col2    {42, 12, 93, 38}
dtype: object
```
beckernick added the feature request (New feature or request) and libcudf (Affects libcudf C++/CUDA code) labels on Oct 4, 2019
harrism (Member) commented Dec 12, 2019

Not being a pandas or spark programmer, I don't really understand what this is doing. Trying to understand the non-groupby example. Why is it dropping 12?

harrism (Member) commented Dec 12, 2019

Also can you explain the difference between collect_list and collect_set?

beckernick (Member, Author) commented Dec 12, 2019

Yep. The difference between collect_set and collect_list is whether duplicate values are removed or preserved (set is just the unique values, while list is all the values). One of the 12s is dropped in this example as it returns the unique values as a nested data structure. In the non-groupby example, it's basically saying "give me all the unique values in col2 as an array inside a new column, with one single row."

From my perspective, the non-groupby example is primarily there for completeness. In my experience, these aggregations are mostly used in Spark-SQL in combination with groupby. Additionally, I've found the collect_list/set pattern to be quite common in Spark-SQL, but less so in pandas. It would be good to hear from @randerzander, @efajardo-nv, and @BartleyR on that topic as well.
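To make the distinction concrete, here is a minimal pandas sketch (not from the thread) contrasting the two aggregations on the toy frame from the examples above:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 0, 1, 0],
                   'col2': [12, 42, 12, 93, 38]})

# collect_list keeps duplicates; collect_set keeps only the unique values.
print(df.groupby('col1')['col2'].agg(list))  # 0 -> [12, 12, 38], 1 -> [42, 93]
print(df.groupby('col1')['col2'].agg(set))   # 0 -> {12, 38},     1 -> {42, 93}
```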

harrism (Member) commented Dec 13, 2019

Right, duh -- set vs. list. So on a column / series, collect_list is a no-op?

collect_set could either be done with our concurrent_unordered_map (hash table) or by reusing some of the dictionary-building machinery that @davidwendt is working on.

Also CC @jlowe @revans2 for Spark requirements.

jlowe (Member) commented Dec 13, 2019

> So on a column / series, collect_list is a no-op?

On Spark it's not a no-op. A collect_list on a column puts all the values of the column into an array, removing duplicates. You end up with an array column type with only one row in the column, and that row value is an array containing all the values of the original column. It's like transposing into a single column value. With collect_list or collect_set, you always end up with an array of values, which means it needs nested type support.

Agree with @beckernick that I do not expect collect_set to be used much outside of an aggregation within a groupby. From the Spark perspective, the priority would be to implement this as another aggregation method once list/array column types are supported.

@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Dec 13, 2019
harrism (Member) commented Dec 15, 2019

So if collect_list removes duplicates, how is it different from collect_set (which removes duplicates)?

jlowe (Member) commented Dec 16, 2019

Sorry, I misspoke earlier. collect_list does not remove duplicates.
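To see the corrected distinction side by side, here is a small PySpark sketch (not from the thread; it reuses the sdf defined in the examples above):

```python
from pyspark.sql import functions as F

# collect_list keeps duplicates; collect_set drops them.
# Element order within the resulting arrays is not guaranteed.
sdf.agg(F.collect_list("col2"), F.collect_set("col2")).show(truncate=False)
# collect_list(col2): [12, 42, 12, 93, 38]
# collect_set(col2):  [12, 42, 93, 38]  (unique values only)
```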

BartleyR (Member) commented

> Would be good to hear from @randerzander @efajardo-nv and @BartleyR as well on that topic.

Agree that collect_set is useful in the groupby, and it's much more common in Spark than pandas (though that may just reflect my experience, which is primarily with larger datasets). We use it when we want to bin by something (equivalent to col1 in the example) and then take the unique set of values for each bin (e.g., communication ports), as in the sketch below. Not that this can't be done other ways, but the ability to roll it into one call is nice.
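A minimal pandas sketch of that binning use case (hypothetical column names and data, not from the thread):

```python
import pandas as pd

# Hypothetical network log: bin by source host, then take the
# unique set of destination ports seen for each host.
logs = pd.DataFrame({'src_host': ['a', 'a', 'b', 'a', 'b'],
                     'dst_port': [443, 80, 22, 443, 22]})
print(logs.groupby('src_host')['dst_port'].unique())
# a -> [443, 80]
# b -> [22]
```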

kkraus14 (Collaborator) commented

I believe the more pandas-friendly way of doing this would be to use `groupby.unique` as opposed to `apply(lambda x: set(x))`:

```
import pandas as pd

pdf = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 2, 1, 3, 4, 3]})
pdf.groupby(['a'])['b'].unique()
a
1    [1, 2]
2    [3, 4]
Name: b, dtype: object
```

rapids-bot bot pushed a commit that referenced this issue Mar 23, 2021
This partially addresses #2973.

This PR implements groupby `collect_set` aggregation. The idea of this PR is to simply apply `drop_list_duplicates` (#7528) to the result generated by groupby `collect_list`, obtaining collect lists without duplicate entries.

Examples:
```
keys = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3};
vals = {10, 11, 10, 10, 20, 21, 21, 20, 30, 33, 32, 31};

keys_output = {1, 2, 3};
vals_output = {{10, 11}, {20, 21}, {30, 31, 32, 33}};
```

This PR adds a simple, incomplete Python binding for `collect_set`; no Java binding is implemented yet. Complete Python and Java bindings will need to follow in separate PRs.
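Conceptually, the approach is equivalent to the following pandas sketch (not from the PR; it collects the list per key, then drops duplicates within each list):

```python
import pandas as pd

keys = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
vals = [10, 11, 10, 10, 20, 21, 21, 20, 30, 33, 32, 31]
df = pd.DataFrame({'key': keys, 'val': vals})

# Step 1: collect_list per key; step 2: drop duplicates within each list.
lists = df.groupby('key')['val'].agg(list)
sets = lists.apply(lambda xs: sorted(set(xs)))
print(sets)  # 1 -> [10, 11], 2 -> [20, 21], 3 -> [30, 31, 32, 33]
```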

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - Karthikeyan (@karthikeyann)
  - Keith Kraus (@kkraus14)
  - Jason Lowe (@jlowe)
  - Ashwin Srinath (@shwina)

URL: #7420
kkraus14 added the Python (Affects Python cuDF API) label on Mar 24, 2021
kkraus14 removed the Spark (Functionality that helps Spark RAPIDS) and libcudf (Affects libcudf C++/CUDA code) labels on Mar 24, 2021
kkraus14 (Collaborator) commented

The libcudf side is implemented; all that's left is the Python side to expose it via `SeriesGroupBy.unique`: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.unique.html

@shwina shwina self-assigned this Mar 25, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 26, 2021
Adds support for `SeriesGroupBy.unique()`. Also adds support for `DataFrameGroupBy.unique()` but that's not tested, as Pandas doesn't support it (yet?).

Resolves #2973

Authors:
  - Ashwin Srinath (@shwina)

Approvers:
  - Keith Kraus (@kkraus14)

URL: #7726
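With that change merged, the usage mirrors pandas (a minimal sketch, assuming a cuDF build that includes #7726):

```python
import cudf

gdf = cudf.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 2, 1, 3, 4, 3]})
# collect_set equivalent: the unique values of b for each group of a,
# returned as a list-like column with one row per group.
print(gdf.groupby('a')['b'].unique())
```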