
[FEA] Support collect_list #2974

Closed

beckernick opened this issue Oct 4, 2019 · 3 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments


beckernick commented Oct 4, 2019

Similar to #2973, I'd like to be able to collect_list on grouped objects, as in Spark SQL or in pandas with a lambda function. Spark API doc.

Groupby examples:

Pandas:

import pandas as pd

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

res = df.groupby("col1").agg({
    "col2": lambda x: list(x)
})
print(res)
              col2
col1              
0     [12, 12, 38]
1         [42, 93]
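As an aside, the lambda in the example above is not required: pandas accepts the built-in `list` directly as the aggregation function, which reads a little closer to Spark's collect_list. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 1, 0, 1, 0],
                   "col2": [12, 42, 12, 93, 38]})

# Passing the built-in `list` collects each group's values into a
# Python list, equivalent to the lambda version above.
res = df.groupby("col1").agg({"col2": list})
print(res)
```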

Pyspark:

import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

sdf = sqlContext.createDataFrame(df)
res = sdf.groupby("col1").agg(F.collect_list("col2"))
res.show()
+----+------------------+
|col1|collect_list(col2)|
+----+------------------+
|   0|      [12, 12, 38]|
|   1|          [42, 93]|
+----+------------------+

Non-groupby example:

Pyspark:

import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

sdf = sqlContext.createDataFrame(df)
res = sdf.agg(F.collect_list("col2"))
res.show()
+--------------------+
|  collect_list(col2)|
+--------------------+
|[12, 42, 12, 93, 38]|
+--------------------+
@beckernick beckernick added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Oct 4, 2019
@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Dec 13, 2019
beckernick (author) commented

@shwina can we now close this issue after #5874? I believe #2973 cannot be closed yet.


shwina commented Oct 26, 2020

Yes - we support collecting into lists via `list`, but not collecting into sets. That will require libcudf support. cc: @harrism @kkraus14
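To illustrate the distinction above, here is a sketch in pandas (whose groupby API cuDF mirrors) of both aggregations: collecting into lists, which this thread says is supported, versus collecting into sets (Spark's collect_set), which still needs libcudf support. Shown with pandas so it runs anywhere; the cuDF behavior for the set case is an assumption based on this comment.

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 1, 0, 1, 0],
                   "col2": [12, 42, 12, 93, 38]})

# Collect each group's values into a list (Spark's collect_list);
# duplicates are kept.
as_list = df.groupby("col1")["col2"].agg(list)

# Collect each group's values into a set (Spark's collect_set);
# duplicates are dropped.
as_set = df.groupby("col1")["col2"].agg(set)

print(as_list)
print(as_set)
```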

kkraus14 commented
Yup, closing this as it is implemented.
