
[FEA] Support collect_list #2974

Closed

beckernick opened this issue Oct 4, 2019 · 3 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments


beckernick commented Oct 4, 2019

Similar to #2973, I'd like to be able to collect_list on grouped objects, as in Spark SQL or in pandas with a lambda function. Spark API doc.

Groupby examples:

Pandas:

import pandas as pd

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

res = df.groupby("col1").agg({
    "col2": lambda x: list(x)
})
print(res)
              col2
col1              
0     [12, 12, 38]
1         [42, 93]
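As an aside, the lambda in the example above is not required: pandas accepts the built-in `list` directly as the aggregation function, which reads a little closer to Spark's collect_list. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 1, 0, 1, 0],
                   "col2": [12, 42, 12, 93, 38]})

# Passing the built-in `list` collects each group's values into a
# Python list, equivalent to the lambda version above.
res = df.groupby("col1").agg({"col2": list})
print(res)
```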

Pyspark:

import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

sdf = sqlContext.createDataFrame(df)
res = sdf.groupby("col1").agg(F.collect_list("col2"))
res.show()
+----+------------------+
|col1|collect_list(col2)|
+----+------------------+
|   0|      [12, 12, 38]|
|   1|          [42, 93]|
+----+------------------+

Non-groupby example:

Pyspark:

import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0,1,0,1,0]
df['col2'] = [12,42,12,93,38]

sdf = sqlContext.createDataFrame(df)
res = sdf.agg(F.collect_list("col2"))
res.show()
+--------------------+
|  collect_list(col2)|
+--------------------+
|[12, 42, 12, 93, 38]|
+--------------------+
@beckernick beckernick added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Oct 4, 2019
@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Dec 13, 2019
beckernick (author) commented

@shwina can we now close this issue after #5874? I believe #2973 cannot be closed yet.


shwina commented Oct 26, 2020

Yes - we support collecting into lists via `list`, but not collecting into sets. That will require libcudf support. cc: @harrism @kkraus14
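To illustrate the distinction above, here is a sketch in pandas (whose groupby API cuDF mirrors) of both aggregations: collecting into lists, which this thread says is supported, versus collecting into sets (Spark's collect_set), which still needs libcudf support. Shown with pandas so it runs anywhere; the cuDF behavior for the set case is an assumption based on this comment.

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 1, 0, 1, 0],
                   "col2": [12, 42, 12, 93, 38]})

# Collect each group's values into a list (Spark's collect_list);
# duplicates are kept.
as_list = df.groupby("col1")["col2"].agg(list)

# Collect each group's values into a set (Spark's collect_set);
# duplicates are dropped.
as_set = df.groupby("col1")["col2"].agg(set)

print(as_list)
print(as_set)
```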

kkraus14 commented
Yup, closing this as it is implemented.
