[FEA] Let `Scalar Pandas UDF` support array of struct type. #1912

firestarman · 2021-03-11T08:16:13Z

Is your feature request related to a problem? Please describe.
I wish the RAPIDS Accelerator for Apache Spark supported running a Scalar Pandas UDF with array type as input, so that it can work with collect_list, as in the application code below.

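import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

@pandas_udf(returnType=IntegerType())
def pandas_udf_func(windows: pd.Series) -> pd.Series:
    return pd.Series([1]).repeat(windows.size)

spark.udf.register("pandas_udf_func", pandas_udf_func)  # make the UDF callable from SQL
spark.sql("SELECT pandas_udf_func( collect_list( struct(a, b)) ) AS ret_list FROM a_table")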

firestarman · 2021-03-11T08:27:29Z

cuDF now raises the error below when transferring data to Python. It looks like cuDF does not handle struct data well.

Caused by: ai.rapids.cudf.CudfException: cuDF failure at: /home/liangcail/work/projects/on_github/cudf/cpp/src/interop/to_arrow.cpp:207: Number of field names and number of children doesn't match

	at ai.rapids.cudf.Table.convertCudfToArrowTable(Native Method)
	at ai.rapids.cudf.Table.access$1500(Table.java:46)
	at ai.rapids.cudf.Table$ArrowIPCTableWriter.write(Table.java:1010)
	at org.apache.spark.sql.rapids.execution.python.GpuArrowPythonRunner$$anon$2.$anonfun$writeIteratorToStream$5(GpuArrowEvalPythonExec.scala:455)
	at org.apache.spark.sql.rapids.execution.python.GpuArrowPythonRunner$$anon$2.$anonfun$writeIteratorToStream$5$adapted(GpuArrowEvalPythonExec.scala:453)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)

firestarman · 2021-03-11T08:59:55Z

Filed an issue rapidsai/cudf#7570 for the exception above.

This PR adds support for running scalar pandas UDFs with array type. It adds the array type signature for the related expressions and plans, and flattens the names of nested struct columns from the schema, which is also required by the cuDF Arrow IPC writer.

This PR depends on rapidsai/cudf#7598.

closes #1912

Signed-off-by: Firestarman <[email protected]>
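The name flattening mentioned in the PR description can be illustrated with a minimal sketch. This is not the code from the PR: the classes `Atomic`, `Array`, `Field`, and `Struct` and the function `flatten_names` are hypothetical stand-ins for Spark's schema types, and the sketch assumes a depth-first, pre-order walk in which arrays contribute no name of their own. The `int` type of `a` and `b` is also an assumption. The idea is that a writer which expects one name per nested child needs the struct field names in addition to the top-level column names.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical stand-ins for a Spark-like schema; not the real Spark classes.
@dataclass
class Atomic:
    name: str  # e.g. "int" or "string"


@dataclass
class Array:
    element: "DataType"


@dataclass
class Field:
    name: str
    dtype: "DataType"


@dataclass
class Struct:
    fields: List[Field]


DataType = Union[Atomic, Array, Struct]


def flatten_names(dtype: DataType) -> List[str]:
    """Collect field names of nested structs in depth-first, pre-order."""
    if isinstance(dtype, Struct):
        names: List[str] = []
        for f in dtype.fields:
            names.append(f.name)                  # the field itself
            names.extend(flatten_names(f.dtype))  # then anything nested in it
        return names
    if isinstance(dtype, Array):
        # Assumption: an array adds no name; only its element type is visited.
        return flatten_names(dtype.element)
    return []  # atomic types have no nested names


# Schema for: SELECT collect_list(struct(a, b)) AS ret_list
schema = Struct([
    Field("ret_list", Array(Struct([
        Field("a", Atomic("int")),   # column types are assumed for illustration
        Field("b", Atomic("int")),
    ]))),
])

flat = flatten_names(schema)
print(flat)  # ['ret_list', 'a', 'b']
```

Passing only the top-level name (`ret_list`) would leave the struct children `a` and `b` unnamed, which is the kind of mismatch the error message "Number of field names and number of children doesn't match" describes.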

firestarman added the `feature request` (New feature or request) and `? - Needs Triage` (Need team to review and classify) labels Mar 11, 2021

firestarman self-assigned this Mar 11, 2021

firestarman changed the title ~~[FEA] Let Scalar Pandas UDF support array type.~~ [FEA] Let Scalar Pandas UDF support array of struct type. Mar 11, 2021

firestarman mentioned this issue Mar 16, 2021

Support running scalar pandas UDF with array type. #1944

Merged

firestarman closed this as completed in #1944 Mar 24, 2021

sameerz removed the `? - Needs Triage` (Need team to review and classify) label Apr 8, 2021
