[SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic #48677
Conversation
```diff
@@ -137,6 +137,10 @@ def test_toPandas_udt(self):
     def test_create_dataframe_namedtuples(self):
         self.check_create_dataframe_namedtuples(True)

+    @unittest.skip("Spark Connect does not support verifySchema.")
+    def test_createDataFrame_verifySchema(self):
+        super().test_createDataFrame_verifySchema()
```
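The skip in the diff above follows the usual parity-test pattern: the Connect test class inherits the Classic suite and opts out of cases the backend does not support yet. A minimal, self-contained sketch of that pattern (the class names here are illustrative, not the actual pyspark test classes):

```python
import unittest

# Hypothetical stand-ins for the real parity test classes: the Connect
# suite inherits the Classic suite and skips unsupported cases.

class ClassicDataFrameTests(unittest.TestCase):
    def test_createDataFrame_verifySchema(self):
        # stands in for the real Classic-side verifySchema check
        self.assertTrue(True)

class ConnectDataFrameTests(ClassicDataFrameTests):
    @unittest.skip("Spark Connect does not support verifySchema.")
    def test_createDataFrame_verifySchema(self):
        super().test_createDataFrame_verifySchema()
```

Running `ConnectDataFrameTests` reports the case as skipped rather than failed, which keeps the two suites structurally identical while documenting the gap.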
python/pyspark/sql/session.py (Outdated)

```rst
.. versionadded:: 2.1.0
.. versionchanged:: 4.0.0
    Adjusts default value to pyspark._NoValue.
```
I think we don't need to mention this.
Maybe say that this parameter is now respected in Spark Connect and with Arrow optimization
Makes sense! Removed.
Force-pushed from 6f411f9 to b707d2a.
Type hints failed weirdly, as in https://github.com/xinrong-meng/spark/actions/runs/11812734311/job/32908478871; ignoring.
Merged to master.
…eateDataFrame in Spark Connect

### What changes were proposed in this pull request?
The PR targets Spark Connect only. Spark Classic has been handled in #48677.

The `verifySchema` parameter of createDataFrame on Spark Classic decides whether to verify the data types of every row against the schema. It is currently not supported on Spark Connect. The PR proposes to support `verifySchema` on Spark Connect.

By default, the `verifySchema` parameter is `pyspark._NoValue`; if not provided, createDataFrame with
- `pyarrow.Table`: **verifySchema = False**
- `pandas.DataFrame` with Arrow optimization: **verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely**
- regular Python instances: **verifySchema = True**

The schema enforcement of numpy ndarray input is unexpected and will be resolved as a follow-up, https://issues.apache.org/jira/browse/SPARK-50323.

### Why are the changes needed?
Parity with Spark Classic.

### Does this PR introduce _any_ user-facing change?
Yes, the `verifySchema` parameter of createDataFrame is supported in Spark Connect.

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48841 from xinrong-meng/verifySchemaConnect.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>
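What "verify data types of every row against schema" means can be sketched in plain Python. This is an illustrative toy, not pyspark's actual verifier; the `verify_row` helper and the `(field name, type)` schema representation are invented for the sketch:

```python
# Illustrative toy of row-level schema verification, not pyspark's
# actual implementation. A "schema" here is just (field name, type) pairs.

schema = [("name", str), ("age", int)]

def verify_row(row, schema):
    """Raise if a row's values do not match the schema's types."""
    if len(row) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(row)}")
    for value, (field, typ) in zip(row, schema):
        # None is accepted everywhere in this sketch; real schemas
        # track nullability per field.
        if value is not None and not isinstance(value, typ):
            raise TypeError(
                f"field {field!r}: expected {typ.__name__}, "
                f"got {type(value).__name__}"
            )

verify_row(("Alice", 30), schema)  # a matching row passes silently
```

With verification off, a mismatched value would instead flow through and surface later (for example as a null or an executor-side error), which is why the default differs by input path.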
### What changes were proposed in this pull request?
The PR targets Spark Classic only. Spark Connect will be handled in a follow-up PR.

The `verifySchema` parameter of createDataFrame decides whether to verify the data types of every row against the schema. Now it only takes effect for createDataFrame with regular Python instances.

The PR proposes to make it work with createDataFrame with
- `pyarrow.Table`
- `pandas.DataFrame` with Arrow optimization
- `pandas.DataFrame` without Arrow optimization

By default, the `verifySchema` parameter is `pyspark._NoValue`; if not provided, createDataFrame with
- `pyarrow.Table`: verifySchema = False
- `pandas.DataFrame` with Arrow optimization: verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely
- `pandas.DataFrame`
without Arrow optimization: verifySchema = True

### Why are the changes needed?
The change makes schema validation consistent across all input formats, improving data integrity and helping prevent errors. It also gives users the flexibility to choose schema verification regardless of the input type.
Part of SPARK-50146.
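The per-input defaults listed above can be summarized as a small resolution function. This is a hypothetical sketch of the dispatch logic, not pyspark's code; `_NO_VALUE`, `resolve_verify_schema`, and `input_kind` are invented names (pyspark uses the `pyspark._NoValue` sentinel internally):

```python
# Hypothetical sketch of how createDataFrame resolves the effective
# verifySchema flag per input type. Names are invented for illustration.

_NO_VALUE = object()  # stand-in for the pyspark._NoValue sentinel

def resolve_verify_schema(verify_schema=_NO_VALUE,
                          input_kind="pandas_no_arrow",
                          arrow_safe_conf=False):
    """Return the effective verifySchema flag.

    input_kind: "arrow_table", "pandas_arrow", or "pandas_no_arrow"
    arrow_safe_conf: value of the SQL conf
        spark.sql.execution.pandas.convertToArrowArraySafely
    """
    if verify_schema is not _NO_VALUE:
        return verify_schema       # an explicit argument always wins
    if input_kind == "arrow_table":
        return False               # pyarrow.Table: skip verification
    if input_kind == "pandas_arrow":
        return arrow_safe_conf     # follow the Arrow-safety SQL conf
    return True                    # pandas without Arrow: verify by default
```

Passing `verifySchema` explicitly overrides every default, which is the user-facing flexibility the PR description mentions.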
### Does this PR introduce _any_ user-facing change?
Setup:
Usage - createDataFrame with `pyarrow.Table`
Usage - createDataFrame with `pandas.DataFrame` without Arrow optimization
Usage - createDataFrame with `pandas.DataFrame`
with Arrow optimization

### How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.