
[FEA] Support arrays_zip #5229

Closed
viadea opened this issue Apr 12, 2022 · 1 comment · Fixed by #5317
viadea commented Apr 12, 2022

I wish we could support arrays_zip.

E.g.:

from pyspark.sql.functions import *
df = spark.createDataFrame([(["a", "b", "a"], ["b", "c"]), (["a", "a"], ["b", "c"]), (["aa"], ["b", "c"])], ['x', 'y'])
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
df = spark.read.parquet("/tmp/testparquet")
df.select(arrays_zip(df.x, df.y).alias("zip")).collect()
    ! <ArraysZip> arrays_zip(x#72, y#73, x, y) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.ArraysZip
@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 12, 2022
revans2 commented Apr 12, 2022

arrays_zip produces an array of structs as output. If all of the input arrays shared the same offsets column, it would just be a matter of reordering columns so that the data columns sit under a struct column, which in turn sits under an array column with the corresponding offsets; but because of nulls and different-length arrays it does not work out of the box.
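
As an illustration (not part of the original comment), the expected CPU result for the repro above is an array of structs, padded with nulls to the length of the longest input array:

from pyspark.sql.functions import arrays_zip

# Expected CPU behavior for one row of the repro:
# x = ["a", "b", "a"], y = ["b", "c"]  ->  [{a, b}, {b, c}, {a, null}]
df.select(arrays_zip(df.x, df.y).alias("zip")).show(truncate=False)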

We can probably make this work by finding the maximum array length in each row, creating a segmented gather list to insert the nulls where needed, then gathering the child arrays and doing the column manipulation. We probably don't need cudf for this, but until we really start to write it and see all of the corner cases I don't know for sure.
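
A minimal CPU-side sketch of the semantics described above (pad every input list of a row to the row's maximum length with nulls, then zip element-wise); this is only an illustration in plain Python, not the GPU implementation:

from itertools import zip_longest

def arrays_zip_reference(*lists_per_row):
    # zip_longest pads the shorter lists with None, matching Spark's null
    # padding up to the longest array in the row.
    return [tuple(elems) for elems in zip_longest(*lists_per_row)]

print(arrays_zip_reference(["a", "b", "a"], ["b", "c"]))
# [('a', 'b'), ('b', 'c'), ('a', None)]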

@sperlingxx sperlingxx self-assigned this Apr 14, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Apr 19, 2022
This PR enables the cuDF API `segmented_gather` in the Java package. `segmented_gather` is essential for implementing Spark array functions like `arrays_zip` (NVIDIA/spark-rapids#5229).

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #10669
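
A hedged sketch of the idea behind using a segmented gather here, written in plain Python rather than the cuDF Java API (the function name below is illustrative only): gathering each row with an index list where an out-of-range index produces a null naturally pads shorter child arrays to the row's maximum length.

def segmented_gather_pad(rows, gather_maps):
    # For each row and its gather map, pick row[idx]; an out-of-range index
    # yields None, which is how shorter arrays get null-padded.
    return [[row[idx] if 0 <= idx < len(row) else None for idx in gmap]
            for row, gmap in zip(rows, gather_maps)]

# Pad y = ["b", "c"] up to the row's maximum length 3 (taken from x = ["a", "b", "a"]).
print(segmented_gather_pad([["b", "c"]], [[0, 1, 2]]))  # [['b', 'c', None]]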
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Apr 19, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Apr 26, 2022
Add a `generateListOffsets` API, converting list lengths to list offsets, which is useful in the development of spark-rapids.

For example, support for [array_repeat](NVIDIA/spark-rapids#5226) and [arrays_zip](NVIDIA/spark-rapids#5229) relies on this API.

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Liangcai Li (https://github.com/firestarman)

URL: #10683
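
The lengths-to-offsets conversion described in this commit is an exclusive prefix sum; a small illustrative sketch in plain Python (not the actual cuDF API):

from itertools import accumulate

def list_lengths_to_offsets(lengths):
    # offsets[0] = 0 and offsets[i + 1] = offsets[i] + lengths[i]
    return [0] + list(accumulate(lengths))

print(list_lengths_to_offsets([3, 2, 1]))  # [0, 3, 5, 6]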