
[SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic #48677

Closed
wants to merge 15 commits

Conversation

xinrong-meng (Member) commented Oct 28, 2024

What changes were proposed in this pull request?

The PR targets Spark Classic only. Spark Connect will be handled in a follow-up PR.

The verifySchema parameter of createDataFrame decides whether to verify the data types of every row against the schema.

Currently, it only takes effect for createDataFrame with

  • regular Python instances

The PR proposes to make it also take effect for createDataFrame with

  • pyarrow.Table
  • pandas.DataFrame with Arrow optimization
  • pandas.DataFrame without Arrow optimization

By default, the verifySchema parameter is pyspark._NoValue; if it is not provided, the default depends on the input to createDataFrame (a rough sketch of this resolution follows the list):

  • pyarrow.Table, verifySchema = False
  • pandas.DataFrame with Arrow optimization, verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely
  • pandas.DataFrame without Arrow optimization, verifySchema = True
  • regular Python instances, verifySchema = True (existing behavior)
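
A rough sketch of how this default resolution could look (illustrative only, not the actual implementation; the _resolve_verify_schema name and the input_kind argument are hypothetical, and the real logic lives inside createDataFrame):

from pyspark._globals import _NoValue  # the pyspark._NoValue sentinel mentioned above

def _resolve_verify_schema(verifySchema, input_kind, spark):
    # Hypothetical helper: map _NoValue to the per-input default described above.
    if verifySchema is not _NoValue:
        return verifySchema  # an explicit user choice always wins
    if input_kind == "arrow_table":          # pyarrow.Table
        return False
    if input_kind == "pandas_with_arrow":    # pandas.DataFrame, Arrow optimization on
        # falls back to the existing safe-conversion conf
        return spark.conf.get("spark.sql.execution.pandas.convertToArrowArraySafely") == "true"
    # pandas.DataFrame without Arrow and regular Python instances keep the existing default
    return True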

Why are the changes needed?

The change makes schema validation consistent across all formats, improving data integrity and helping prevent errors.
It also enhances flexibility by allowing users to choose schema verification regardless of the input type.

Part of SPARK-50146.

Does this PR introduce any user-facing change?

Setup:

>>> import pyarrow as pa
>>> import pandas as pd
>>> from pyspark.sql.types import *
>>> 
>>> data = {
...     "id": [1, 2, 3],
...     "value": [100000000000, 200000000000, 300000000000]
... }
>>> schema = StructType([StructField("id", IntegerType(), True), StructField("value", IntegerType(), True)])
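
For context, the values in the value column exceed the signed 32-bit range of IntegerType, which is what the examples below exercise:

>>> INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
>>> [INT32_MIN <= v <= INT32_MAX for v in data["value"]]
[False, False, False]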

Usage - createDataFrame with pyarrow.Table

>>> table = pa.table(data)
>>> spark.createDataFrame(table, schema=schema).show()  # verifySchema defaults to False
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+

>>> spark.createDataFrame(table, schema=schema, verifySchema=True).show()
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647
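
With verifySchema=False, the numbers shown above are simply the original values reinterpreted as signed 32-bit integers:

>>> def as_int32(v):
...     return (v + 2**31) % 2**32 - 2**31  # wrap to signed 32-bit
...
>>> [as_int32(v) for v in data["value"]]
[1215752192, -1863462912, -647710720]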

Usage - createDataFrame with pandas.DataFrame without Arrow optimization

>>> pdf = pd.DataFrame(data)
>>> spark.createDataFrame(pdf, schema=schema).show()  # verifySchema defaults to True
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] Value for `obj` must be between -2147483648 and 2147483647 (inclusive), got 100000000000
>>> spark.createDataFrame(table, schema=schema, verifySchema=False).show()
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+
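
For illustration, the per-row check that verifySchema=True performs on the non-Arrow path can be reproduced with the internal _make_type_verifier helper (an internal API, so its exact behavior and error text may differ between versions):

>>> from pyspark.sql.types import _make_type_verifier
>>> verify = _make_type_verifier(schema)
>>> verify((1, 100000000000))
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] ...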

Usage - createDataFrame with pandas.DataFrame with Arrow optimization

>>> pdf = pd.DataFrame(data)
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
>>> spark.conf.get("spark.sql.execution.pandas.convertToArrowArraySafely")
'false'
>>> spark.createDataFrame(pdf, schema=schema).show()  # verifySchema defaults to "spark.sql.execution.pandas.convertToArrowArraySafely"
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+

>>> spark.conf.set("spark.sql.execution.pandas.convertToArrowArraySafely", True)
>>> spark.createDataFrame(pdf, schema=schema).show()
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] Value for `obj` must be between -2147483648 and 2147483647 (inclusive), got 100000000000

>>> spark.createDataFrame(table, schema=schema, verifySchema=True).show()
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647
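
For illustration (not Spark's exact code path), the Arrow-optimized behavior above corresponds to a pyarrow cast, where safe=True is the checked variant that raises the ArrowInvalid seen above:

>>> arr = pa.array(data["value"])
>>> arr.cast(pa.int32(), safe=False).to_pylist()  # unchecked cast wraps silently
[1215752192, -1863462912, -647710720]
>>> arr.cast(pa.int32(), safe=True)  # checked cast
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647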

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title [WIP][SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables [SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables Oct 31, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review October 31, 2024 07:04
@xinrong-meng xinrong-meng marked this pull request as draft November 1, 2024 07:31
@xinrong-meng xinrong-meng changed the title [SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables [WIP][SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables Nov 1, 2024
@xinrong-meng xinrong-meng changed the title [WIP][SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables [WIP][SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic Nov 12, 2024
@xinrong-meng xinrong-meng changed the title [WIP][SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic [SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic Nov 13, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review November 13, 2024 07:23
@@ -137,6 +137,10 @@ def test_toPandas_udt(self):
    def test_create_dataframe_namedtuples(self):
        self.check_create_dataframe_namedtuples(True)

    @unittest.skip("Spark Connect does not support verifySchema.")
    def test_createDataFrame_verifySchema(self):
        super().test_createDataFrame_verifySchema()


.. versionadded:: 2.1.0
.. versionchanged:: 4.0.0
Adjusts default value to pyspark._NoValue.
Member:
I think we don't need to mention this.

Member:
Maybe say that this parameter is now respected in Spark Connect and with Arrow optimization

Member Author:
Makes sense! Removed.

xinrong-meng (Member Author) commented Nov 14, 2024

Type hint checks failed oddly, as in https://github.com/xinrong-meng/spark/actions/runs/11812734311/job/32908478871, ignoring # type: ignore comments.
Rebased onto master.

HyukjinKwon (Member)

Merged to master.

xinrong-meng added a commit that referenced this pull request Nov 19, 2024
…eateDataFrame in Spark Connect

### What changes were proposed in this pull request?
The PR targets Spark Connect only. Spark Classic has been handled in #48677.

The `verifySchema` parameter of createDataFrame on Spark Classic decides whether to verify the data types of every row against the schema.

Currently, it is not supported on Spark Connect.

The PR proposes to support `verifySchema` on Spark Connect.

By default, the `verifySchema` parameter is `pyspark._NoValue`; if it is not provided, the default depends on the input to createDataFrame:
- `pyarrow.Table`,  **verifySchema = False**
- `pandas.DataFrame` with Arrow optimization,  **verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely**
-  regular Python instances, **verifySchema = True**

The unexpected schema enforcement for numpy ndarray input will be resolved as a follow-up: https://issues.apache.org/jira/browse/SPARK-50323.

### Why are the changes needed?
Parity with Spark Classic.

### Does this PR introduce _any_ user-facing change?
Yes, the `verifySchema` parameter of createDataFrame is now supported in Spark Connect.

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48841 from xinrong-meng/verifySchemaConnect.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>
HyukjinKwon (Member)

Actually, we had an offline discussion. I think we should evaluate the performance impact, and consider deprecating this if it isn't really useful, instead of propagating it.

Let me revert #48677 and #48841 for now.
