Support running scalar pandas UDF with array type. #1944

firestarman · 2021-03-16T08:08:05Z

This PR adds support for running scalar pandas UDFs with array type.

Add array type signatures for the related expressions and plans.
Flatten the names of nested struct columns from the schema, which is also required by the cudf Arrow IPC writer.

closes #1912
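
To illustrate what "flatten the names of nested struct columns" could mean, here is a minimal sketch in plain Python. It is not the plugin's code: the `Primitive`, `Array`, `Field`, and `Struct` classes and the `flatten_names` helper are hypothetical stand-ins for a schema, and the depth-first ordering is an assumption about what a writer that takes a flat list of column names would need.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


# Hypothetical, simplified schema model (not Spark's or cudf's real classes).
@dataclass
class Primitive:
    name: str  # e.g. "long", "double"


@dataclass
class Array:
    element: "DataType"


@dataclass
class Field:
    name: str
    dtype: "DataType"
    nullable: bool = True


@dataclass
class Struct:
    fields: List[Field]


DataType = Union[Primitive, Array, Struct]


def flatten_names(dtype: DataType) -> List[Tuple[str, bool]]:
    """Return (name, nullable) for every struct field, depth-first.

    Assumption: only struct fields carry names; arrays are traversed
    so that structs nested inside them are still visited.
    """
    if isinstance(dtype, Struct):
        out: List[Tuple[str, bool]] = []
        for field in dtype.fields:
            out.append((field.name, field.nullable))
            out.extend(flatten_names(field.dtype))
        return out
    if isinstance(dtype, Array):
        return flatten_names(dtype.element)
    return []  # primitives contribute no names of their own


# A top-level schema with one array-of-struct column.
schema = Struct([
    Field("id", Primitive("long"), nullable=False),
    Field("points", Array(Struct([
        Field("x", Primitive("double")),
        Field("y", Primitive("double")),
    ]))),
])

flat = flatten_names(schema)
print(flat)
```

For the schema above, the helper yields the top-level names followed by the names of the struct nested inside the array, each paired with its nullability.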


firestarman · 2021-03-20T01:28:47Z

rapidsai/cudf#7598 was merged. Mark as 'ready for review'.

firestarman · 2021-03-20T03:01:32Z

build


firestarman · 2021-03-22T01:50:57Z

build


firestarman · 2021-03-22T03:05:25Z

build

integration_tests/src/main/python/data_gen.py

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

...gin/src/main/scala/org/apache/spark/sql/rapids/execution/python/GpuArrowEvalPythonExec.scala


firestarman · 2021-03-23T02:54:39Z

build

revans2

The changes look good. I would like to see some follow-on work in CUDF to rethink how we deal with nested types for writes, now that parquet also supports this.

This PR is to support running scalar pandas UDF with array type.

Add array type signature for related expressions and plans.
Flatten the names of nested struct columns from schema, which is also required by the cudf Arrow IPC writer.

This PR depends on rapidsai/cudf#7598

closes NVIDIA#1912

Signed-off-by: Firestarman <[email protected]>

firestarman added 12 commits March 16, 2021 12:51

Allow array of struct type for GpuArrowEvalPython

6fa1008

and the input of the expression `GpuPythonUDF`

Signed-off-by: Firestarman <[email protected]>

Use column metadata instead of column names

21b5fe2

when creating an Arrow IPC writer to support transferring data with nested types, e.g. array of struct type. This is required by cudf native.

Signed-off-by: Firestarman <[email protected]>

Add integration tests for array of struct

7413b76

Signed-off-by: Firestarman <[email protected]>

comment update

f6fae65

Signed-off-by: Firestarman <[email protected]>

Flatten the nested column names

8ea0687

Signed-off-by: Firestarman <[email protected]>

Merge branch 'branch-0.5' into dev-adobe-pandas

b0c0ad1

Use different name for array child

995e234

Signed-off-by: Firestarman <[email protected]>

Doc update

0b50ff2

Signed-off-by: Firestarman <[email protected]>

Support nullable when flattening column names

6d513d1

Signed-off-by: Firestarman <[email protected]>

correct the indent

7a5d3f3

Signed-off-by: Firestarman <[email protected]>

Only flatten names for nested struct columns

97f9034

Signed-off-by: Firestarman <[email protected]>

Add an argument indicating if to add me

b8ea548

Signed-off-by: Firestarman <[email protected]>

firestarman marked this pull request as ready for review March 20, 2021 01:14

firestarman marked this pull request as draft March 20, 2021 01:17

firestarman marked this pull request as ready for review March 20, 2021 01:18

firestarman requested a review from revans2 March 20, 2021 01:22

name update

fe17ad0

Signed-off-by: Firestarman <[email protected]>

add a comment

0f7eff6

Signed-off-by: Firestarman <[email protected]>

jlowe requested changes Mar 22, 2021


firestarman added 3 commits March 23, 2021 09:23

Merge branch 'branch-0.5' into dev-adobe-pandas

37c6cf4

Address some comments

a9c9fb5

Signed-off-by: Firestarman <[email protected]>

doc update

e47b7c9

Signed-off-by: Firestarman <[email protected]>

jlowe approved these changes Mar 23, 2021


revans2 approved these changes Mar 23, 2021


firestarman merged commit 892b4c5 into NVIDIA:branch-0.5 Mar 24, 2021

firestarman deleted the scalar_udf_array branch March 24, 2021 00:17

sameerz added the feature request New feature or request label Mar 24, 2021

sameerz added this to the Mar 15 - March 26 milestone Mar 24, 2021
