[FEA] Support mapInArrow introduced by pyspark 3.3.0+ #6313

wbo4958 · 2022-08-15T02:46:57Z

Spark 3.3.0 introduced a new PySpark DataFrame API, mapInArrow; see SPARK-37228 and PR apache/spark#34505. mapInArrow is very similar to mapInPandas. The only difference is the input type: Iterable[pa.RecordBatch] for mapInArrow, versus Iterator[pd.DataFrame] for mapInPandas.

PyArrow already supports CUDA integration (see https://arrow.apache.org/docs/python/integration/cuda.html) and, potentially, CUDA IPC. This means the RAPIDS Accelerator may be able to support zero-copy data transfer between the JVM process and the Python process, which would improve performance.

I hope this can be supported in the spark-rapids 22.12 release.

wbo4958 added the feature request and ? - Needs Triage labels Aug 15, 2022

wbo4958 self-assigned this Aug 15, 2022

wbo4958 changed the title ~~[FEA] Support mapInArrow in pyspark 3.3.0+~~ [FEA] Support mapInArrow introduced by pyspark 3.3.0+ Aug 15, 2022

amahussein added the audit_3.3.0 label Aug 16, 2022

sameerz removed the ? - Needs Triage label Aug 16, 2022

This was referenced Oct 18, 2022

[FEA] GPU support for mapInArrow #6778

Closed

Support columnar processing for mapInArrow[databricks] #6823

Merged

GaryShen2008 assigned firestarman and unassigned wbo4958 Oct 18, 2022

firestarman closed this as completed in #6823 Oct 20, 2022
