[SPARK-50238][PYTHON] Add Variant Support in PySpark UDFs/UDTFs/UDAFs #48770
Conversation
@@ -169,6 +148,23 @@ case class PythonUDAF(

  override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): PythonUDAF =
    copy(children = newChildren)

  override def checkInputDataTypes(): TypeCheckResult = {
@HyukjinKwon I haven't tested support in UDAFs yet, so I kept it disabled in this scenario. Can you point me to examples where Python UDAFs are tested?
import pandas as pd
from pyspark.sql.functions import pandas_udf
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))
# Declare the function and create the UDF
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
return v.mean()
df.select(mean_udf(df['v'])).show()
@harshmotw-db Thanks for this feature! I left a few questions.
scalar_f = pandas_udf(lambda u: str(u), StringType())
scalar_f = pandas_udf(lambda u: u.apply(str), StringType(), PandasUDFType.SCALAR)
Does pandas_udf go through the same path as the arrow UDF path?
Yes, for the most part. I recall that for pandas UDFs to work, I also had to add changes in arrow_to_pandas and _create_batch, because they treat struct types in a special way. Example: https://github.com/apache/spark/pull/48770/files#r1831583273
case dataType =>
  val fieldType =
    new FieldType(nullable, toArrowType(dataType, timeZoneId, largeVarTypes), null)
  new Field(name, fieldType, Seq.empty[Field].asJava)
}
}

def isVariantField(field: Field): Boolean = {
  assert(field.getType.isInstanceOf[ArrowType.Struct])
Should this assert, or should it just return false when it is not a struct?
It should be an assert, since the call site already checks for struct.
Seems mostly fine to me
@harshmotw-db Thanks for this new ability!
LGTM
…_input_type and unsupported_udf_output_type errors
@gene-db @HyukjinKwon @ueshin I have added support for Variant in Python UDAFs in my latest commit.
@HyukjinKwon @ueshin I'm not sure why SparkConnectSessionHolderSuite is failing on this PR. I am not able to reproduce the failure locally. Can you look into this?
Noting that the tests that were failing earlier passed on the latest commit. It seems like a broken test on old versions.
Merged to master.
What changes were proposed in this pull request?
This PR adds support for the Variant type in PySpark UDFs/UDTFs/UDAFs. Support is added in both execution modes, arrow and pickle, as well as in pandas UDFs.
Why are the changes needed?
After this change, users will be able to use the new Variant data type with UDFs, which is currently prohibited.
Does this PR introduce any user-facing change?
Yes, users should now be able to use Variants with Python UDFs.
How was this patch tested?
Unit tests covering all scenarios: arrow, pickle, and pandas.
Was this patch authored or co-authored using generative AI tooling?
No.