forked from apache/spark
[SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic

### What changes were proposed in this pull request?

This PR targets Spark Classic only; Spark Connect will be handled in a follow-up PR.

The `verifySchema` parameter of `createDataFrame` decides whether to verify the data types of every row against the schema. Currently it only takes effect when `createDataFrame` is called with:
- regular Python instances

This PR proposes to make it also work when `createDataFrame` is called with:
- `pyarrow.Table`
- `pandas.DataFrame` with Arrow optimization
- `pandas.DataFrame` without Arrow optimization

By default, `verifySchema` is `pyspark._NoValue`. If it is not provided, `createDataFrame` uses:
- `pyarrow.Table`: **verifySchema = False**
- `pandas.DataFrame` with Arrow optimization: **verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely**
- `pandas.DataFrame` without Arrow optimization: **verifySchema = True**
- regular Python instances: **verifySchema = True** (existing behavior)

### Why are the changes needed?

The change makes schema validation consistent across all input formats, improving data integrity and helping prevent errors. It also enhances flexibility by allowing users to choose schema verification regardless of the input type.

Part of [SPARK-50146](https://issues.apache.org/jira/browse/SPARK-50146).

### Does this PR introduce _any_ user-facing change?

Setup:
```py
>>> import pyarrow as pa
>>> import pandas as pd
>>> from pyspark.sql.types import *
>>>
>>> data = {
...     "id": [1, 2, 3],
...     "value": [100000000000, 200000000000, 300000000000]
... }
>>> schema = StructType([StructField("id", IntegerType(), True), StructField("value", IntegerType(), True)])
```

Usage - createDataFrame with `pyarrow.Table`:
```py
>>> table = pa.table(data)
>>> spark.createDataFrame(table, schema=schema).show()  # verifySchema defaults to False
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+

>>> spark.createDataFrame(table, schema=schema, verifySchema=True).show()
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647
```

Usage - createDataFrame with `pandas.DataFrame` without Arrow optimization:
```py
>>> pdf = pd.DataFrame(data)
>>> spark.createDataFrame(pdf, schema=schema).show()  # verifySchema defaults to True
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] Value for `obj` must be between -2147483648 and 2147483647 (inclusive), got 100000000000

>>> spark.createDataFrame(pdf, schema=schema, verifySchema=False).show()
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+
```

Usage - createDataFrame with `pandas.DataFrame` with Arrow optimization:
```py
>>> pdf = pd.DataFrame(data)
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
>>> spark.conf.get("spark.sql.execution.pandas.convertToArrowArraySafely")
'false'
>>> spark.createDataFrame(pdf, schema=schema).show()  # verifySchema defaults to "spark.sql.execution.pandas.convertToArrowArraySafely"
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+

>>> spark.conf.set("spark.sql.execution.pandas.convertToArrowArraySafely", True)
>>> spark.createDataFrame(pdf, schema=schema).show()
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] Value for `obj` must be between -2147483648 and 2147483647 (inclusive), got 100000000000

>>> spark.createDataFrame(pdf, schema=schema, verifySchema=True).show()
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647
```

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48677 from xinrong-meng/arrowSafe.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
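The garbled values in the tables above (e.g. `100000000000` becoming `1215752192`) are what an unverified 64-bit-to-32-bit conversion produces: only the low 32 bits survive, reinterpreted as a signed integer. A minimal stdlib-only sketch of the two behaviors — `to_int32_unsafe` and `verify_int32` are illustrative names, not PySpark APIs:

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def to_int32_unsafe(v: int) -> int:
    # Truncating conversion: pack as a little-endian 64-bit int, keep the low
    # 4 bytes, and reinterpret them as a signed 32-bit int. This mirrors the
    # wraparound seen in the examples when verification is off.
    return struct.unpack("<i", struct.pack("<q", v)[:4])[0]

def verify_int32(v: int) -> int:
    # Range check analogous to the error raised when verification is on.
    if not (INT32_MIN <= v <= INT32_MAX):
        raise ValueError(
            f"Value must be between {INT32_MIN} and {INT32_MAX} "
            f"(inclusive), got {v}"
        )
    return v

for v in [100000000000, 200000000000, 300000000000]:
    print(to_int32_unsafe(v))  # 1215752192, -1863462912, -647710720
```

The printed values match the wrapped `value` column in the tables above, which is why silently skipping verification can corrupt data rather than fail fast.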
1 parent 0b1b676 · commit aea9e87 · 5 changed files with 99 additions and 29 deletions.