
[FEA] Support arrays_zip #5229

Closed
viadea opened this issue Apr 12, 2022 · 1 comment · Fixed by #5317
viadea commented Apr 12, 2022

I wish we could support arrays_zip.

E.g.:

from pyspark.sql.functions import *
df = spark.createDataFrame([(["a", "b", "a"], ["b", "c"]), (["a", "a"], ["b", "c"]), (["aa"], ["b", "c"])], ['x', 'y'])
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
df = spark.read.parquet("/tmp/testparquet")
df.select(arrays_zip(df.x, df.y).alias("zip")).collect()
    ! <ArraysZip> arrays_zip(x#72, y#73, x, y) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.ArraysZip
@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 12, 2022
revans2 commented Apr 12, 2022

arrays_zip produces an array of structs as output. If all of the input arrays shared the same offsets column, it would just be a matter of reordering columns so that the data columns sit under a struct column, which in turn sits under an array column with the corresponding offsets; but because of nulls and different-length arrays it does not work out of the box.
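
As an illustration (not part of the original comment), the expected CPU result for the repro above is an array of structs, padded with nulls to the length of the longest input array:

from pyspark.sql.functions import arrays_zip

# Expected CPU behavior for one row of the repro:
# x = ["a", "b", "a"], y = ["b", "c"]  ->  [{a, b}, {b, c}, {a, null}]
df.select(arrays_zip(df.x, df.y).alias("zip")).show(truncate=False)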

We can probably make this work by finding the maximum array length in each row, creating a segmented gather list to insert the nulls where needed, then gathering the child arrays and doing the column manipulation. We probably don't need cudf for this, but until we really start to write it and see all of the corner cases I don't know for sure.
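
A minimal CPU-side sketch of the semantics described above (pad every input list of a row to the row's maximum length with nulls, then zip element-wise); this is only an illustration in plain Python, not the GPU implementation:

from itertools import zip_longest

def arrays_zip_reference(*lists_per_row):
    # zip_longest pads the shorter lists with None, matching Spark's null
    # padding up to the longest array in the row.
    return [tuple(elems) for elems in zip_longest(*lists_per_row)]

print(arrays_zip_reference(["a", "b", "a"], ["b", "c"]))
# [('a', 'b'), ('b', 'c'), ('a', None)]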

@sperlingxx sperlingxx self-assigned this Apr 14, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Apr 19, 2022
This PR enables the cuDF API `segmented_gather` in the Java package. `segmented_gather` is essential for implementing Spark array functions like `arrays_zip` (NVIDIA/spark-rapids#5229).

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #10669
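
A hedged sketch of the idea behind using a segmented gather here, written in plain Python rather than the cuDF Java API (the function name below is illustrative only): gathering each row with an index list where an out-of-range index produces a null naturally pads shorter child arrays to the row's maximum length.

def segmented_gather_pad(rows, gather_maps):
    # For each row and its gather map, pick row[idx]; an out-of-range index
    # yields None, which is how shorter arrays get null-padded.
    return [[row[idx] if 0 <= idx < len(row) else None for idx in gmap]
            for row, gmap in zip(rows, gather_maps)]

# Pad y = ["b", "c"] up to the row's maximum length 3 (taken from x = ["a", "b", "a"]).
print(segmented_gather_pad([["b", "c"]], [[0, 1, 2]]))  # [['b', 'c', None]]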
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Apr 19, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Apr 26, 2022
Add a `generateListOffsets` API, converting list lengths to list offsets, which is useful in the development of spark-rapids.

For example, support for [array_repeat](NVIDIA/spark-rapids#5226) and [arrays_zip](NVIDIA/spark-rapids#5229) relies on this API.

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Liangcai Li (https://github.com/firestarman)

URL: #10683
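
The lengths-to-offsets conversion described in this commit is an exclusive prefix sum; a small illustrative sketch in plain Python (not the actual cuDF API):

from itertools import accumulate

def list_lengths_to_offsets(lengths):
    # offsets[0] = 0 and offsets[i + 1] = offsets[i] + lengths[i]
    return [0] + list(accumulate(lengths))

print(list_lengths_to_offsets([3, 2, 1]))  # [0, 3, 5, 6]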