Support running scalar pandas UDF with array type. #1944

Merged: 17 commits into NVIDIA:branch-0.5 on Mar 24, 2021

@firestarman firestarman commented Mar 16, 2021

This PR adds support for running scalar pandas UDFs with array types.

  1. Add array type signatures for the related expressions and plans.
  2. Flatten the names of nested struct columns from the schema, which is also required by the cudf Arrow IPC writer.
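
The name flattening in item 2 can be illustrated with a minimal sketch. It is not the code from this PR: the `Primitive`, `Array`, `Field`, and `Struct` classes and the `flatten_names` helper are hypothetical stand-ins for Spark's schema types, and the depth-first, parent-before-children ordering is an assumption about what the writer expects.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical, simplified stand-ins for Spark SQL data types.
@dataclass
class Primitive:
    name: str  # e.g. "long", "double"


@dataclass
class Array:
    element: "DataType"


@dataclass
class Field:
    name: str
    dtype: "DataType"


@dataclass
class Struct:
    fields: List[Field]


DataType = Union[Primitive, Array, Struct]


def flatten_names(dtype: DataType) -> List[str]:
    """Collect struct field names depth-first, parent before children.

    Arrays contribute no name of their own; only the names nested
    inside their element type are collected.
    """
    if isinstance(dtype, Struct):
        names: List[str] = []
        for field in dtype.fields:
            names.append(field.name)
            names.extend(flatten_names(field.dtype))
        return names
    if isinstance(dtype, Array):
        return flatten_names(dtype.element)
    return []  # primitives have no nested names


# A schema with an array-of-struct column, the case this PR targets.
schema = Struct([
    Field("id", Primitive("long")),
    Field("points", Array(Struct([
        Field("x", Primitive("double")),
        Field("y", Primitive("double")),
    ]))),
])

print(flatten_names(schema))  # ['id', 'points', 'x', 'y']
```

The flat list carries one name per struct field at every nesting level, so a writer that takes column names as a single sequence can still label the children of an array-of-struct column.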

This PR depends on rapidsai/cudf#7598

closes #1912

and the input of the expression `GpuPythonUDF`

Signed-off-by: Firestarman <[email protected]>
when creating an Arrow IPC writer to support transferring data
with nested types, e.g. array of struct.

This is required by cudf native.

Signed-off-by: Firestarman <[email protected]>
@firestarman firestarman marked this pull request as ready for review March 20, 2021 01:14
@firestarman firestarman marked this pull request as draft March 20, 2021 01:17
@firestarman firestarman marked this pull request as ready for review March 20, 2021 01:18
@firestarman firestarman requested a review from revans2 March 20, 2021 01:22
@firestarman

rapidsai/cudf#7598 has been merged, so this PR is now marked as ready for review.

@firestarman

build

Signed-off-by: Firestarman <[email protected]>
@firestarman

build

Signed-off-by: Firestarman <[email protected]>
@firestarman

build

@firestarman

build

@revans2 revans2 left a comment

The changes look good. I would like to see some follow-on work in cudf to rethink how we deal with nested types for writes, now that Parquet also supports this.

@firestarman firestarman merged commit 892b4c5 into NVIDIA:branch-0.5 Mar 24, 2021
@firestarman firestarman deleted the scalar_udf_array branch March 24, 2021 00:17
@sameerz sameerz added the feature request New feature or request label Mar 24, 2021
@sameerz sameerz added this to the Mar 15 - March 26 milestone Mar 24, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
This PR adds support for running scalar pandas UDFs with array types.

  1. Add array type signatures for the related expressions and plans.
  2. Flatten the names of nested struct columns from the schema, which is also required by the cudf Arrow IPC writer.

This PR depends on rapidsai/cudf#7598

closes NVIDIA#1912

Signed-off-by: Firestarman <[email protected]>
Linked issue (may be closed by this PR): [FEA] Let Scalar Pandas UDF support array of struct type.