
[SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic #48677

Closed
wants to merge 15 commits

Conversation

xinrong-meng (Member) commented Oct 28, 2024

What changes were proposed in this pull request?

The PR targets Spark Classic only. Spark Connect will be handled in a follow-up PR.

The verifySchema parameter of createDataFrame decides whether to verify the data types of every row against the schema.

Currently, it only takes effect for createDataFrame with

  • regular Python instances

The PR proposes to make it also take effect for createDataFrame with

  • pyarrow.Table
  • pandas.DataFrame with Arrow optimization
  • pandas.DataFrame without Arrow optimization

By default, the verifySchema parameter is pyspark._NoValue; if it is not provided, the default depends on the input to createDataFrame (a rough sketch of this resolution follows the list):

  • pyarrow.Table, verifySchema = False
  • pandas.DataFrame with Arrow optimization, verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely
  • pandas.DataFrame without Arrow optimization, verifySchema = True
  • regular Python instances, verifySchema = True (existing behavior)
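
A rough sketch of how this default resolution could look (illustrative only, not the actual implementation; the _resolve_verify_schema name and the input_kind argument are hypothetical, and the real logic lives inside createDataFrame):

from pyspark._globals import _NoValue  # the pyspark._NoValue sentinel mentioned above

def _resolve_verify_schema(verifySchema, input_kind, spark):
    # Hypothetical helper: map _NoValue to the per-input default described above.
    if verifySchema is not _NoValue:
        return verifySchema  # an explicit user choice always wins
    if input_kind == "arrow_table":          # pyarrow.Table
        return False
    if input_kind == "pandas_with_arrow":    # pandas.DataFrame, Arrow optimization on
        # falls back to the existing safe-conversion conf
        return spark.conf.get("spark.sql.execution.pandas.convertToArrowArraySafely") == "true"
    # pandas.DataFrame without Arrow and regular Python instances keep the existing default
    return True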

Why are the changes needed?

The change makes schema validation consistent across all formats, improving data integrity and helping prevent errors.
It also enhances flexibility by allowing users to choose schema verification regardless of the input type.

Part of SPARK-50146.

Does this PR introduce any user-facing change?

Setup:

>>> import pyarrow as pa
>>> import pandas as pd
>>> from pyspark.sql.types import *
>>> 
>>> data = {
...     "id": [1, 2, 3],
...     "value": [100000000000, 200000000000, 300000000000]
... }
>>> schema = StructType([StructField("id", IntegerType(), True), StructField("value", IntegerType(), True)])
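
For context, the values in the value column exceed the signed 32-bit range of IntegerType, which is what the examples below exercise:

>>> INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
>>> [INT32_MIN <= v <= INT32_MAX for v in data["value"]]
[False, False, False]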

Usage - createDataFrame with pyarrow.Table

>>> table = pa.table(data)
>>> spark.createDataFrame(table, schema=schema).show()  # verifySchema defaults to False
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+

>>> spark.createDataFrame(table, schema=schema, verifySchema=True).show()
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647
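
With verifySchema=False, the numbers shown above are simply the original values reinterpreted as signed 32-bit integers:

>>> def as_int32(v):
...     return (v + 2**31) % 2**32 - 2**31  # wrap to signed 32-bit
...
>>> [as_int32(v) for v in data["value"]]
[1215752192, -1863462912, -647710720]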

Usage - createDataFrame with pandas.DataFrame without Arrow optimization

>>> pdf = pd.DataFrame(data)
>>> spark.createDataFrame(pdf, schema=schema).show()  # verifySchema defaults to True
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] Value for `obj` must be between -2147483648 and 2147483647 (inclusive), got 100000000000
>>> spark.createDataFrame(table, schema=schema, verifySchema=False).show()
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+
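
For illustration, the per-row check that verifySchema=True performs on the non-Arrow path can be reproduced with the internal _make_type_verifier helper (an internal API, so its exact behavior and error text may differ between versions):

>>> from pyspark.sql.types import _make_type_verifier
>>> verify = _make_type_verifier(schema)
>>> verify((1, 100000000000))
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] ...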

Usage - createDataFrame with pandas.DataFrame with Arrow optimization

>>> pdf = pd.DataFrame(data)
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
>>> spark.conf.get("spark.sql.execution.pandas.convertToArrowArraySafely")
'false'
>>> spark.createDataFrame(pdf, schema=schema).show()  # verifySchema defaults to "spark.sql.execution.pandas.convertToArrowArraySafely"
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+

>>> spark.conf.set("spark.sql.execution.pandas.convertToArrowArraySafely", True)
>>> spark.createDataFrame(pdf, schema=schema).show()
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] Value for `obj` must be between -2147483648 and 2147483647 (inclusive), got 100000000000

>>> spark.createDataFrame(table, schema=schema, verifySchema=True).show()
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647
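
For illustration (not Spark's exact code path), the Arrow-optimized behavior above corresponds to a pyarrow cast, where safe=True is the checked variant that raises the ArrowInvalid seen above:

>>> arr = pa.array(data["value"])
>>> arr.cast(pa.int32(), safe=False).to_pylist()  # unchecked cast wraps silently
[1215752192, -1863462912, -647710720]
>>> arr.cast(pa.int32(), safe=True)  # checked cast
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647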

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title [WIP][SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables [SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables Oct 31, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review October 31, 2024 07:04
@xinrong-meng xinrong-meng marked this pull request as draft November 1, 2024 07:31
@xinrong-meng xinrong-meng changed the title [SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables [WIP][SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables Nov 1, 2024
@xinrong-meng xinrong-meng changed the title [WIP][SPARK-50146][PYTHON][CONNECT] Configurable schema validation when creating DataFrames from Arrow tables [WIP][SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic Nov 12, 2024
@xinrong-meng xinrong-meng changed the title [WIP][SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic [SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic Nov 13, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review November 13, 2024 07:23
@@ -137,6 +137,10 @@ def test_toPandas_udt(self):
    def test_create_dataframe_namedtuples(self):
        self.check_create_dataframe_namedtuples(True)

    @unittest.skip("Spark Connect does not support verifySchema.")
    def test_createDataFrame_verifySchema(self):
        super().test_createDataFrame_verifySchema()


.. versionadded:: 2.1.0
.. versionchanged:: 4.0.0
Adjusts default value to pyspark._NoValue.
Member:
I think we don't need to mention this.

Member:
Maybe say that this parameter is now respected in Spark Connect and with Arrow optimization

Member Author:
Makes sense! Removed.

xinrong-meng (Member Author) commented Nov 14, 2024

Type hint checks failed oddly, as in https://github.com/xinrong-meng/spark/actions/runs/11812734311/job/32908478871, ignoring # type: ignore comments.
Rebased onto master.

HyukjinKwon (Member)

Merged to master.

xinrong-meng added a commit that referenced this pull request Nov 19, 2024
…eateDataFrame in Spark Connect

### What changes were proposed in this pull request?
The PR targets Spark Connect only. Spark Classic has been handled in #48677.

The `verifySchema` parameter of createDataFrame on Spark Classic decides whether to verify the data types of every row against the schema.

Currently, it is not supported on Spark Connect.

The PR proposes to support `verifySchema` on Spark Connect.

By default, the `verifySchema` parameter is `pyspark._NoValue`; if it is not provided, the default depends on the input to createDataFrame:
- `pyarrow.Table`,  **verifySchema = False**
- `pandas.DataFrame` with Arrow optimization,  **verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely**
-  regular Python instances, **verifySchema = True**

The unexpected schema enforcement for numpy ndarray input will be resolved as a follow-up: https://issues.apache.org/jira/browse/SPARK-50323.

### Why are the changes needed?
Parity with Spark Classic.

### Does this PR introduce _any_ user-facing change?
Yes, the `verifySchema` parameter of createDataFrame is now supported in Spark Connect.

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48841 from xinrong-meng/verifySchemaConnect.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>
HyukjinKwon (Member)

Actually, we had an offline discussion. I think we should evaluate the performance impact, and consider deprecating this if it isn't really useful, instead of propagating it.

Let me revert #48677 and #48841 for now.
