Support running Pandas UDFs on GPUs in Python processes. #640

firestarman · 2020-09-02T08:37:36Z

This PR adds support for running Pandas UDFs on GPUs. It consists mainly of two things:

- Overriding all 6 related plans to build the GPU context (device and memory) for Python processes.
- Introducing 2 new Python modules, `rapids.worker` and `rapids.daemon`, which initialize GPU memory using the RMM Python APIs.
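A rough sketch of what GPU memory initialization in a Python worker might look like, assuming the RMM Python package is installed and a GPU is visible. `build_rmm_options` and `initialize_gpu_memory` are hypothetical helper names, not the code in `rapids.worker`; only `rmm.reinitialize` and its keyword arguments come from RMM itself.

```python
def build_rmm_options(pool_enabled, initial_pool_size=None, max_pool_size=None):
    """Translate hypothetical worker settings into keyword arguments for rmm.reinitialize."""
    options = {"pool_allocator": pool_enabled, "managed_memory": False}
    # Pool sizes only matter when the pool allocator is in use.
    if pool_enabled:
        if initial_pool_size is not None:
            options["initial_pool_size"] = initial_pool_size
        if max_pool_size is not None:
            options["maximum_pool_size"] = max_pool_size
    return options


def initialize_gpu_memory(options):
    """Initialize RMM in the current process (requires a GPU and the rmm package)."""
    import rmm  # imported lazily so build_rmm_options stays usable without a GPU

    rmm.reinitialize(**options)
```

On a GPU machine the two helpers would be chained, for example `initialize_gpu_memory(build_rmm_options(True, 1 << 30, 2 << 30))` for a 1 GiB pool capped at 2 GiB.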

Fixing an EOFException by creating a new file object on the same socket as the output of the Python worker process.

- Add a new object `PythonWorkerSemaphore`.
- Add a new conf `spark.rapids.python.concurrentPythonWorkers`.
- Change class `GpuSemaphore` from `private` to `private[rapids]`.
- Let `GpuSemaphore` optionally skip initializing the GPU.
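The idea of capping concurrent Python workers can be illustrated with a plain counting semaphore. This is a minimal Python sketch of the concept only: `PythonWorkerLimiter` and `worker_slot` are hypothetical names, and the plugin's `PythonWorkerSemaphore` may well behave differently.

```python
import threading
from contextlib import contextmanager


class PythonWorkerLimiter:
    """Hypothetical stand-in: caps how many Python workers use the GPU at once."""

    def __init__(self, max_concurrent_workers):
        if max_concurrent_workers < 1:
            raise ValueError("max_concurrent_workers must be at least 1")
        self._permits = threading.BoundedSemaphore(max_concurrent_workers)

    @contextmanager
    def worker_slot(self, timeout=None):
        """Hold one permit for the duration of the with-block."""
        if not self._permits.acquire(timeout=timeout):
            raise TimeoutError("no Python worker slot became available")
        try:
            yield
        finally:
            # Always return the permit, even if the worker body raises.
            self._permits.release()
```

A caller would wrap the worker's GPU section in `with limiter.worker_slot(): ...`, with the limit taken from a setting such as `spark.rapids.python.concurrentPythonWorkers`.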

Currently the limitation only works when pool memory is enabled.

- Separate the configs for Python.
- Add `OptionalConfEntry` for Python configs.

including 5 types:

- SQL_MAP_PANDAS_ITER_UDF
- SQL_GROUPED_AGG_PANDAS_UDF
- SQL_GROUPED_MAP_PANDAS_UDF
- SQL_SCALAR_PANDAS_ITER_UDF
- SQL_SCALAR_PANDAS_UDF

by 4 physical plans:

- GpuMapInPandasExec
- GpuAggregateInPandasExec
- GpuFlatMapGroupsInPandasExec
- GpuArrowEvalPythonExec

along with an API update in python/worker.

These two types are SQL_COGROUPED_MAP_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF, mapping to GpuFlatMapCoGroupsInPandasExec and GpuWindowInPandasExec respectively.
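For quick reference, the eval types and plans named above can be collected into one lookup table. The last two pairings are stated in the text. The first five are not spelled out there and are an assumption based on which Spark operator each GPU plan replaces. `gpu_plan_for` is a hypothetical helper, not part of the plugin.

```python
# Pandas UDF eval types named in this PR, keyed to the GPU plan that runs them.
PANDAS_UDF_GPU_PLANS = {
    # Assumed pairings (inferred from the Spark operators the plans replace).
    "SQL_MAP_PANDAS_ITER_UDF": "GpuMapInPandasExec",
    "SQL_GROUPED_AGG_PANDAS_UDF": "GpuAggregateInPandasExec",
    "SQL_GROUPED_MAP_PANDAS_UDF": "GpuFlatMapGroupsInPandasExec",
    "SQL_SCALAR_PANDAS_ITER_UDF": "GpuArrowEvalPythonExec",
    "SQL_SCALAR_PANDAS_UDF": "GpuArrowEvalPythonExec",
    # Pairings stated explicitly in the commit message.
    "SQL_COGROUPED_MAP_PANDAS_UDF": "GpuFlatMapCoGroupsInPandasExec",
    "SQL_WINDOW_AGG_PANDAS_UDF": "GpuWindowInPandasExec",
}


def gpu_plan_for(eval_type):
    """Return the GPU plan name for a Pandas UDF eval type, or raise ValueError."""
    try:
        return PANDAS_UDF_GPU_PLANS[eval_type]
    except KeyError:
        raise ValueError(f"unsupported Pandas UDF eval type: {eval_type}") from None
```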


revans2 · 2020-09-10T14:09:08Z

build

firestarman · 2020-09-10T15:11:12Z

build


firestarman · 2020-09-10T16:13:11Z

build

revans2 · 2020-09-10T18:32:20Z

This looks good to me. If you don't have any other things you want to get in, feel free to merge it.

jenkins/spark-premerge-build.sh

Add support to run Pandas UDFs on GPUs, mainly consisting of two things:

- Overriding all the 6 related plans to build GPU context of device and memory for Python processes.
- Introducing 2 new python modules rapids.worker and rapids.daemon to execute the GPU memory initialization by leveraging RMM Python APIs.

Signed-off-by: Firestarman <[email protected]>
Co-authored-by: Liangcai Li <[email protected]>
Co-authored-by: Robert (Bobby) Evans <[email protected]>
Co-authored-by: shotai <[email protected]>

Signed-off-by: Peixin Li <[email protected]> Signed-off-by: Peixin Li <[email protected]>

* Update submodule cudf to f817d96d8bdc47da9fb2725d0e5a7b18586a29ee (NVIDIA#635)
  Signed-off-by: spark-rapids automation <[email protected]>
  Signed-off-by: spark-rapids automation <[email protected]>
* Fixing empty columns when casting to integer or decimal crashing (NVIDIA#633)
* fixing empty columns
  Signed-off-by: Mike Wilson <[email protected]>
* cudf submodule commit to v22.10.00 (NVIDIA#640)
  Signed-off-by: Peixin Li <[email protected]>
  Signed-off-by: Peixin Li <[email protected]>
* try use new token to fix automerge permission
* verify automerge fix of Token permission (NVIDIA#643)
  Signed-off-by: Peixin Li <[email protected]>
  Signed-off-by: Peixin Li <[email protected]>
* Revert not working automerge fix [skip ci] (NVIDIA#644)
* Revert "verify automerge fix of Token permission (NVIDIA#643)"
  This reverts commit 8261117.
* Revert "try use new token to fix automerge permission"
  This reverts commit 2a9acde.
  Signed-off-by: Peixin Li <[email protected]>
  Signed-off-by: Peixin Li <[email protected]>
* Auto-merge use submodule in BASE ref

Signed-off-by: Peixin Li <[email protected]>
Signed-off-by: spark-rapids automation <[email protected]>
Signed-off-by: Mike Wilson <[email protected]>
Signed-off-by: Peixin Li <[email protected]>
Co-authored-by: Jenkins Automation <[email protected]>
Co-authored-by: Mike Wilson <[email protected]>

firestarman and others added 26 commits August 6, 2020 15:38

Support Pandas UDF on GPU

4063857

Fix an error when running rapids.worker.

94f3b22

Fixing an EOFException by creating a new file object on the same socket as the output of the Python worker process.

Pack python files

96e35aa

Add API to init GPU context in python process

dccf977

Support limiting the number of python workers

ef36d5e

- Add a new object `PythonWorkerSemaphore`.
- Add a new conf `spark.rapids.python.concurrentPythonWorkers`.
- Change class `GpuSemaphore` from `private` to `private[rapids]`.
- Let `GpuSemaphore` optionally skip initializing the GPU.

Support memory limitation for Python processes

93966b4

Currently the limitation only works when pool memory is enabled.

- Separate the configs for Python.
- Add `OptionalConfEntry` for Python configs.

Improve the memory computation for Python workers

3b2e527

Support setting max size of RMM pool

1e4895a

Use maxsize for max pool size when not specified.

d0d15c2

Support two more types of Pandas UDF

61b3589

These two types are SQL_COGROUPED_MAP_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF, mapping to GpuFlatMapCoGroupsInPandasExec and GpuWindowInPandasExec respectively.

Add tests for udfs and basic support for accelerated arrow exchange with python

65e497d

Signed-off-by: Robert (Bobby) Evans <[email protected]>

Support Pandas UDF on GPU

ec2cec6

Fix an error when running rapids.worker.

d18ff4e

Fixing an EOFException by creating a new file object on the same socket as the output of the Python worker process.

Pack python files

616d1da

Add API to init GPU context in python process

c5557ca

Support limiting the number of python workers

2fcc2aa

- Add a new object `PythonWorkerSemaphore`.
- Add a new conf `spark.rapids.python.concurrentPythonWorkers`.
- Change class `GpuSemaphore` from `private` to `private[rapids]`.
- Let `GpuSemaphore` optionally skip initializing the GPU.

Support memory limitation for Python processes

31a9fb3

Currently the limitation only works when pool memory is enabled.

- Separate the configs for Python.
- Add `OptionalConfEntry` for Python configs.

Improve the memory computation for Python workers

f71f4de

Support setting max size of RMM pool

d4791af

Use maxsize for max pool size when not specified.

d5d156e

Support two more types of Pandas UDF

37dc856

These two types are SQL_COGROUPED_MAP_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF, mapping to GpuFlatMapCoGroupsInPandasExec and GpuWindowInPandasExec respectively.

Use the columnar version rule for Scalar Pandas UDF

29e39ea

Updates the RapidsMeta of plans for Pandas UDF

1033f23

Remove the unnecessary env variable

5e28772

firestarman requested review from GaryShen2008, jlowe, NvTimLiu and revans2 as code owners September 2, 2020 08:37

shotai and others added 7 commits September 10, 2020 18:18

update comment in test start script

7eba830

remove old config

f838ae0

Not init gpu memory when python on gpu is disabled

803fcf4

Also always check python module configs

remove old config

984082b

Merge branch 'pandas-udf' of https://github.com/firestarman/spark-rapids into pandas-udf

127ab08

import cudf lib normally

6156298

update import cudf

beabf8b

revans2 previously approved these changes Sep 10, 2020


Check python module conf only when python gpu enabled

47ffc98

firestarman dismissed revans2’s stale review via 47ffc98 September 10, 2020 15:09

shotai added 2 commits September 10, 2020 23:54

update dynamic config for udf enable

b1c9be5

Merge branch 'pandas-udf' of https://github.com/firestarman/spark-rapids into pandas-udf

9860ee6

revans2 approved these changes Sep 10, 2020


pxLi reviewed Sep 11, 2020

jenkins/spark-premerge-build.sh

firestarman merged commit ade7a5f into NVIDIA:branch-0.2 Sep 11, 2020

firestarman deleted the pandas-udf branch September 17, 2020 06:21

kuhushukla mentioned this pull request Sep 18, 2020

[BUG] UDF Integration tests fail if pandas is not installed #810

Closed

jlowe mentioned this pull request Mar 23, 2021

Use Spark's HybridRowQueue to avoid MemoryConsumer API shim #2000

Merged

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023

cudf submodule commit to v22.10.00 (NVIDIA#640)

22cc1fb

Signed-off-by: Peixin Li <[email protected]> Signed-off-by: Peixin Li <[email protected]>
