
[FEA] Support collect_set #2973

Closed · beckernick opened this issue Oct 4, 2019 · 10 comments · Fixed by #7726

Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

beckernick (Member) commented Oct 4, 2019

I'd like to be able to `collect_set` as I can in Spark-SQL, or in pandas via a lambda function (though support through an actual Python lambda isn't important). This could be used on a single column, but it's particularly useful in groupby operations. See the Spark API doc.

See also #2974, as the two features will likely share a significant portion of the implementation.

Groupby examples:

Pandas:

```
import pandas as pd

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

res = df.groupby("col1").agg({
    "col2": lambda x: set(x)
})
print(res)
          col2
col1
0     {12, 38}
1     {42, 93}
```

Pyspark:

```
import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

sdf = sqlContext.createDataFrame(df)
res = sdf.groupby("col1").agg(F.collect_set("col2"))
res.show()
+----+-----------------+
|col1|collect_set(col2)|
+----+-----------------+
|   0|         [12, 38]|
|   1|         [42, 93]|
+----+-----------------+
```

Non-groupby examples:

Pyspark:

```
import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

sdf = sqlContext.createDataFrame(df)
sdf.agg(F.collect_set("col2")).show()
+-----------------+
|collect_set(col2)|
+-----------------+
| [12, 38, 42, 93]|
+-----------------+
```

Pandas:

```
import pandas as pd

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

df[['col2']].apply(lambda x: set(x))
col2    {42, 12, 93, 38}
dtype: object
```
beckernick added the feature request (New feature or request) and libcudf (Affects libcudf C++/CUDA code) labels on Oct 4, 2019
harrism (Member) commented Dec 12, 2019

Not being a pandas or spark programmer, I don't really understand what this is doing. Trying to understand the non-groupby example. Why is it dropping 12?

harrism (Member) commented Dec 12, 2019

Also can you explain the difference between collect_list and collect_set?

beckernick (Member, Author) commented Dec 12, 2019

Yep. The difference between collect_set and collect_list is whether duplicate values are removed or preserved (set is just the unique values, while list is all the values). One of the 12s is dropped in this example as it returns the unique values as a nested data structure. In the non-groupby example, it's basically saying "give me all the unique values in col2 as an array inside a new column, with one single row."

From my perspective, the non-groupby example is primarily there for completeness. In my experience, these aggregations are mostly used in Spark-SQL in combination with groupby. Additionally, I've found the collect_list/set pattern to be quite common in Spark-SQL, but less so in pandas. It would be good to hear from @randerzander, @efajardo-nv, and @BartleyR on that topic as well.
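To make the distinction concrete, here is a minimal pandas sketch (not from the thread) contrasting the two aggregations on the toy frame from the examples above:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 0, 1, 0],
                   'col2': [12, 42, 12, 93, 38]})

# collect_list keeps duplicates; collect_set keeps only the unique values.
print(df.groupby('col1')['col2'].agg(list))  # 0 -> [12, 12, 38], 1 -> [42, 93]
print(df.groupby('col1')['col2'].agg(set))   # 0 -> {12, 38},     1 -> {42, 93}
```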

harrism (Member) commented Dec 13, 2019

Right, duh -- set vs. list. So on a column / series, collect_list is a no-op?

collect_set could either be done with our concurrent_unordered_map (hash table) or by reusing some of the dictionary-building machinery that @davidwendt is working on.

Also CC @jlowe @revans2 for Spark requirements.

jlowe (Member) commented Dec 13, 2019

> So on a column / series, collect_list is a no-op?

On Spark it's not a no-op. A collect_list on a column puts all the values of the column into an array, removing duplicates. You end up with an array column type with only one row in the column, and that row value is an array containing all the values of the original column. It's like transposing into a single column value. With collect_list or collect_set, you always end up with an array of values, which means it needs nested type support.

Agree with @beckernick that I do not expect collect_set to be used much outside of an aggregation within a groupby. From the Spark perspective, the priority would be to implement this as another aggregation method once list/array column types are supported.

@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Dec 13, 2019
harrism (Member) commented Dec 15, 2019

So if collect_list removes duplicates, how is it different from collect_set (which removes duplicates)?

jlowe (Member) commented Dec 16, 2019

Sorry, I misspoke earlier. collect_list does not remove duplicates.
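To see the corrected distinction side by side, here is a small PySpark sketch (not from the thread; it reuses the sdf defined in the examples above):

```python
from pyspark.sql import functions as F

# collect_list keeps duplicates; collect_set drops them.
# Element order within the resulting arrays is not guaranteed.
sdf.agg(F.collect_list("col2"), F.collect_set("col2")).show(truncate=False)
# collect_list(col2): [12, 42, 12, 93, 38]
# collect_set(col2):  [12, 42, 93, 38]  (unique values only)
```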

BartleyR (Member) commented

> Would be good to hear from @randerzander @efajardo-nv and @BartleyR as well on that topic.

Agree that collect_set is useful in the groupby, and it's much more common in Spark than pandas (though that may just reflect my experience, which is primarily with larger datasets). We use it when we want to bin by something (equivalent to col1 in the example) and then take the unique set of values for each bin (e.g., communication ports), as in the sketch below. Not that this can't be done other ways, but the ability to roll it into one call is nice.
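A minimal pandas sketch of that binning use case (hypothetical column names and data, not from the thread):

```python
import pandas as pd

# Hypothetical network log: bin by source host, then take the
# unique set of destination ports seen for each host.
logs = pd.DataFrame({'src_host': ['a', 'a', 'b', 'a', 'b'],
                     'dst_port': [443, 80, 22, 443, 22]})
print(logs.groupby('src_host')['dst_port'].unique())
# a -> [443, 80]
# b -> [22]
```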

kkraus14 (Collaborator) commented

I believe the more pandas-friendly way of doing this would be to use `groupby.unique` as opposed to `apply(lambda x: set(x))`:

```
import pandas as pd

pdf = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 2, 1, 3, 4, 3]})
pdf.groupby(['a'])['b'].unique()
a
1    [1, 2]
2    [3, 4]
Name: b, dtype: object
```

rapids-bot bot pushed a commit that referenced this issue Mar 23, 2021
This partially addresses #2973.

This PR implements groupby `collect_set` aggregation. The idea of this PR is to simply apply `drop_list_duplicates` (#7528) to the result generated by groupby `collect_list`, obtaining collect lists without duplicate entries.

Examples:
```
keys = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3};
vals = {10, 11, 10, 10, 20, 21, 21, 20, 30, 33, 32, 31};

keys_output = {1, 2, 3};
vals_output = {{10, 11}, {20, 21}, {30, 31, 32, 33}};
```

This PR adds a simple, incomplete Python binding for `collect_set`; no Java binding is implemented yet. Complete Python and Java bindings will need to follow in separate PRs.
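Conceptually, the approach is equivalent to the following pandas sketch (not from the PR; it collects the list per key, then drops duplicates within each list):

```python
import pandas as pd

keys = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
vals = [10, 11, 10, 10, 20, 21, 21, 20, 30, 33, 32, 31]
df = pd.DataFrame({'key': keys, 'val': vals})

# Step 1: collect_list per key; step 2: drop duplicates within each list.
lists = df.groupby('key')['val'].agg(list)
sets = lists.apply(lambda xs: sorted(set(xs)))
print(sets)  # 1 -> [10, 11], 2 -> [20, 21], 3 -> [30, 31, 32, 33]
```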

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - Karthikeyan (@karthikeyann)
  - Keith Kraus (@kkraus14)
  - Jason Lowe (@jlowe)
  - Ashwin Srinath (@shwina)

URL: #7420
kkraus14 added the Python (Affects Python cuDF API) label on Mar 24, 2021
kkraus14 removed the Spark (Functionality that helps Spark RAPIDS) and libcudf (Affects libcudf C++/CUDA code) labels on Mar 24, 2021
kkraus14 (Collaborator) commented

The libcudf side is implemented; all that's left is the Python side to expose it via `SeriesGroupBy.unique`: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.unique.html

@shwina shwina self-assigned this Mar 25, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 26, 2021
Adds support for `SeriesGroupBy.unique()`. Also adds support for `DataFrameGroupBy.unique()` but that's not tested, as Pandas doesn't support it (yet?).

Resolves #2973

Authors:
  - Ashwin Srinath (@shwina)

Approvers:
  - Keith Kraus (@kkraus14)

URL: #7726
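With that change merged, the usage mirrors pandas (a minimal sketch, assuming a cuDF build that includes #7726):

```python
import cudf

gdf = cudf.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 2, 1, 3, 4, 3]})
# collect_set equivalent: the unique values of b for each group of a,
# returned as a list-like column with one row per group.
print(gdf.groupby('a')['b'].unique())
```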