Support columnar processing for FlatMapCoGroupInPandas[databricks] #6751

Merged
merged 4 commits into from
Oct 14, 2022

@firestarman firestarman commented Oct 11, 2022

closes #305

This PR implements columnar support for FlatMapCoGroupsInPandasExec, along with some refactoring:

  • Moved the GPU Arrow Python runner and its related classes into a separate file.
  • Cleaned up some boilerplate code.
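
For readers unfamiliar with the operator, here is a minimal pure-Python sketch of the cogroup semantics it provides, as exposed in PySpark through `groupby(...).cogroup(...).applyInPandas(...)`. It is not the plugin's code and does not use Spark, Arrow, or pandas. The helper names (`cogroup`, `flat_map_cogroups`, `count_sides`) and the sample rows are made up for illustration, and the sketch assumes that a key present on only one side still reaches the user function, with an empty group for the other side.

```python
from collections import defaultdict


def cogroup(left_rows, right_rows, key):
    """Pair up the rows of two tables that share the same key value."""
    groups = defaultdict(lambda: ([], []))
    for row in left_rows:
        groups[row[key]][0].append(row)
    for row in right_rows:
        groups[row[key]][1].append(row)
    return dict(groups)


def flat_map_cogroups(left_rows, right_rows, key, func):
    """Call func once per key with both groups and concatenate the outputs."""
    out = []
    for k, (left_group, right_group) in sorted(cogroup(left_rows, right_rows, key).items()):
        out.extend(func(k, left_group, right_group))
    return out


def count_sides(k, left_group, right_group):
    """Hypothetical user function: report how many rows each side has for a key."""
    return [{"id": k, "left_count": len(left_group), "right_count": len(right_group)}]


# Illustrative data only, not taken from the PR.
left = [{"id": 1, "v": 10}, {"id": 1, "v": 11}, {"id": 2, "v": 20}]
right = [{"id": 1, "w": "a"}, {"id": 3, "w": "c"}]

result = flat_map_cogroups(left, right, "id", count_sides)
for row in result:
    print(row)
```

In the real operator the two groups arrive in the Python worker as Arrow batches and the user function receives pandas DataFrames. The sketch only shows the grouping contract.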

Performance

  • About 6.8 GB of Parquet data in local files, containing 100 groups.
  • A 12-core CPU and one GPU (Titan V with 12 GB of memory).
| CPU Read + CPU CoGroups | GPU Read + CPU CoGroups | GPU Read + GPU CoGroups |
| 103.57                  | 89.87                   | 81.01                   |
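
The relative gains can be worked out from the three numbers above. The PR does not state the unit, so this sketch assumes they are elapsed times where lower is better. The variable names are illustrative.

```python
# Benchmark numbers copied from the table above; the unit is not stated in the PR.
# Assumption: these are elapsed times, so a smaller number is better.
timings = {
    "CPU Read + CPU CoGroups": 103.57,
    "GPU Read + CPU CoGroups": 89.87,
    "GPU Read + GPU CoGroups": 81.01,
}

baseline = timings["CPU Read + CPU CoGroups"]

# Speedup of each configuration relative to the all-CPU run.
speedups = {name: round(baseline / t, 2) for name, t in timings.items()}

# Percentage reduction relative to the all-CPU run.
reductions = {name: round(100 * (baseline - t) / baseline, 1) for name, t in timings.items()}

for name in timings:
    print(f"{name}: {speedups[name]}x, {reductions[name]}% lower than the baseline")
```

Under that assumption, GPU read alone comes out at about 1.15x over the all-CPU run, and GPU read plus GPU cogroups at about 1.28x.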

Signed-off-by: firestarman [email protected]

@firestarman firestarman marked this pull request as draft October 11, 2022 01:20
@firestarman

Marked as draft because the benchmark numbers are still missing; working on them now.

Signed-off-by: Firestarman <[email protected]>
@firestarman firestarman changed the title from "Support columnar processing for FlatMapCoGroupInPandas" to "Support columnar processing for FlatMapCoGroupInPandas[databricks]" Oct 11, 2022
@firestarman

build

@firestarman

firestarman commented Oct 11, 2022

This depends on rapidsai/cudf#11883, which was just merged. Waiting for the latest JNI jar before running the premerge CI.

@sameerz sameerz added the "task" label (Work required that improves the product but is not user facing) Oct 11, 2022
@firestarman firestarman marked this pull request as ready for review October 11, 2022 08:29
@firestarman

build

Signed-off-by: Firestarman <[email protected]>
@firestarman

build

@firestarman

build

Signed-off-by: Firestarman <[email protected]>
@firestarman

build


@jlowe jlowe left a comment

Small typo but otherwise lgtm.

/**
* Python UDF Runner for cogrouped UDFs, designed for `GpuFlatMapCoGroupsInPandasExec` only.
*
* It sends Arrow bathes from two different DataFrames, groups them in Python,

Suggested change
- * It sends Arrow bathes from two different DataFrames, groups them in Python,
+ * It sends Arrow batches from two different DataFrames, groups them in Python,

@firestarman

Nice catch. I will fix this in a follow-up PR for issue #6778.

@firestarman firestarman merged commit 03b1164 into NVIDIA:branch-22.12 Oct 14, 2022
@firestarman firestarman deleted the cogroup-py-udf branch October 14, 2022 01:44
Successfully merging this pull request may close these issues.

[FEA] Support efficient data transfers for Python UDFs