[SPARK-50298][PYTHON][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect #48841

Closed
wants to merge 4 commits into from

@xinrong-meng xinrong-meng commented Nov 14, 2024

What changes were proposed in this pull request?

This PR targets Spark Connect only. Spark Classic was handled in #48677.

On Spark Classic, the verifySchema parameter of createDataFrame decides whether the data types of every row are verified against the schema.
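
To illustrate what per-row verification means, here is a minimal pure-Python sketch. It is not PySpark's implementation: the schema representation (a list of name/type pairs) and the names `verify_row` and `create_rows` are hypothetical, chosen only to show the idea of checking each row against a schema and of skipping that check when the flag is off.

```python
from typing import Any, List, Sequence, Tuple

# Hypothetical schema representation: (column name, expected Python type).
Schema = Sequence[Tuple[str, type]]


def verify_row(row: Sequence[Any], schema: Schema) -> None:
    """Raise TypeError if a non-null value does not match its column type."""
    if len(row) != len(schema):
        raise TypeError(f"expected {len(schema)} fields, got {len(row)}")
    for value, (name, expected) in zip(row, schema):
        # None is treated as a null and accepted for every column here.
        if value is not None and not isinstance(value, expected):
            raise TypeError(
                f"field {name}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )


def create_rows(
    data: Sequence[Sequence[Any]], schema: Schema, verify_schema: bool = True
) -> List[tuple]:
    """Hypothetical stand-in for createDataFrame: optionally verify each row."""
    rows = [tuple(r) for r in data]
    if verify_schema:
        for r in rows:
            verify_row(r, schema)
    return rows
```

With verification on, a mismatched row fails up front; with it off, the row passes through unchecked and the cost of checking every row is avoided.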

It is currently not supported on Spark Connect.

This PR proposes to support verifySchema on Spark Connect.

The verifySchema parameter defaults to pyspark._NoValue. If it is not provided, createDataFrame resolves it according to the input type:

  • pyarrow.Table: verifySchema = False
  • pandas.DataFrame with Arrow optimization: verifySchema = the value of spark.sql.execution.pandas.convertToArrowArraySafely
  • regular Python instances: verifySchema = True

Schema enforcement for numpy ndarray input currently behaves unexpectedly; this will be resolved in a follow-up: https://issues.apache.org/jira/browse/SPARK-50323.

Why are the changes needed?

Parity with Spark Classic.

Does this PR introduce any user-facing change?

Yes, the verifySchema parameter of createDataFrame is now supported in Spark Connect.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title [WIP][SPARK-50298][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect [SPARK-50298][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect Nov 15, 2024
@xinrong-meng xinrong-meng changed the title [SPARK-50298][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect [SPARK-50298][PYTHON}[CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect Nov 15, 2024
@xinrong-meng xinrong-meng changed the title [SPARK-50298][PYTHON}[CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect [SPARK-50298][PYTHON][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect Nov 15, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review November 15, 2024 04:19
@xinrong-meng

Merged to master, thank you!

@HyukjinKwon

Actually, we had an offline discussion. I think we should evaluate the performance impact and consider deprecating this parameter if it isn't really useful, instead of propagating it.

Let me revert #48677 and #48841 for now.
