[SPARK-47725][INFRA] Set up the CI for pyspark-connect package #45870

Closed
wants to merge 1 commit into from


HyukjinKwon

@HyukjinKwon HyukjinKwon commented Apr 4, 2024

What changes were proposed in this pull request?

This PR proposes to set up a scheduled job for the pyspark-connect package. The CI job:

  1. Builds Spark.
  2. Packages pyspark-connect together with its test cases.
  3. Removes python/lib/pyspark.zip and python/lib/py4j.zip to make sure the tests do not use the JVM.
  4. Runs the test cases packaged within pyspark-connect.
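
Step 3 could be sketched roughly as follows. This is only an illustration, not the actual workflow definition: the helper name `remove_jvm_archives` and its `spark_home` argument are assumptions, and only the two archive paths come from the description above.

```python
from pathlib import Path


def remove_jvm_archives(spark_home):
    """Delete the bundled archives so tests cannot fall back to JVM-based PySpark.

    Returns the list of paths that were actually removed.
    """
    # Paths as named in the PR description, relative to the Spark checkout.
    candidates = ["python/lib/pyspark.zip", "python/lib/py4j.zip"]
    removed = []
    for rel in candidates:
        path = Path(spark_home) / rel
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed
```

Missing archives are skipped rather than treated as errors, so the helper can be run repeatedly on the same checkout.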

Why are the changes needed?

To verify the feature coverage of pyspark-connect.

Does this PR introduce any user-facing change?

No, test-only.

How was this patch tested?

Manually tested in my fork, https://github.com/HyukjinKwon/spark/actions/runs/8598881063

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon HyukjinKwon marked this pull request as draft April 4, 2024 03:17
@HyukjinKwon HyukjinKwon force-pushed the do-not-merge-ci branch 2 times, most recently from e4b2f73 to 0f4a33d Compare April 4, 2024 07:15
HyukjinKwon added a commit that referenced this pull request Apr 5, 2024
…ckage

### What changes were proposed in this pull request?

This PR is a followup of #45150 that adds the new `shell` module to the PyPI package.

### Why are the changes needed?

So that the PyPI package contains the `shell` module.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

The test case will be added in #45870. The issue was found while working on that PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45882 from HyukjinKwon/SPARK-47081-followup.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Apr 5, 2024
…ible with pyspark-connect

### What changes were proposed in this pull request?

This PR proposes to make `pyspark.testing.connectutils` compatible with `pyspark-connect`.

### Why are the changes needed?

This is groundwork for setting up the CI for pyspark-connect.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Tested in #45870.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45887 from HyukjinKwon/SPARK-47735.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Apr 5, 2024
…ts SPARK_CONNECT_TESTING_REMOTE env

### What changes were proposed in this pull request?

This PR is a followup of #45868 that makes the testing script inherit the `SPARK_CONNECT_TESTING_REMOTE` environment variable.

### Why are the changes needed?

So that the testing script can set `SPARK_CONNECT_TESTING_REMOTE` and have it take effect in the tests it launches.
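
As a rough sketch of the idea, assuming the testing script launches tests in a child process with a restricted environment: the allow-list `INHERITED_VARS` and the helper `build_child_env` below are hypothetical names for illustration, not the script's actual code.

```python
import os

# Hypothetical allow-list of variables the child test process may inherit.
INHERITED_VARS = ("PATH", "SPARK_CONNECT_TESTING_REMOTE")


def build_child_env(parent_env=None):
    """Copy only the allow-listed variables that are set in the parent environment."""
    parent_env = os.environ if parent_env is None else parent_env
    return {name: parent_env[name] for name in INHERITED_VARS if name in parent_env}
```

The resulting dictionary can be passed as `env=` to `subprocess.run`, so a value exported before starting the script reaches the tests.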

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested at #45870

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45886 from HyukjinKwon/SPARK-47724-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@HyukjinKwon HyukjinKwon force-pushed the do-not-merge-ci branch 2 times, most recently from cf5128a to 3db677f Compare April 6, 2024 23:32
@github-actions github-actions bot removed the BUILD label Apr 6, 2024
@apache apache deleted a comment Apr 7, 2024
HyukjinKwon added a commit that referenced this pull request Apr 7, 2024
…sts finished

### What changes were proposed in this pull request?

This PR proposes to drop the tables after the tests have finished.

### Why are the changes needed?

- To clean up resources properly.
- Leftover tables can affect other test cases when a single session is shared across tests.
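
A minimal sketch of this kind of cleanup, assuming a session object that exposes `sql()` as `SparkSession` does. The context manager name `dropping_tables` is hypothetical and is not the helper used in the Spark test suite.

```python
from contextlib import contextmanager


@contextmanager
def dropping_tables(session, *table_names):
    """Run the test body, then drop the given tables even if the body raises."""
    try:
        yield
    finally:
        for name in table_names:
            # IF EXISTS keeps cleanup safe when the test failed before creating the table.
            session.sql(f"DROP TABLE IF EXISTS {name}")
```

Putting the drop in `finally` means a failing test still leaves the shared session clean for the tests that follow.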

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Tested in #45870.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45913 from HyukjinKwon/SPARK-46722-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Apr 7, 2024
…ith pyspark-connect

### What changes were proposed in this pull request?

This PR proposes to make `pyspark.worker_utils` compatible with `pyspark-connect`.

### Why are the changes needed?

In order for `pyspark-connect` to work without classic PySpark packages and dependencies.
Spark Connect does not support `Broadcast` and `Accumulator`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Yes, in #45870. Once the CI is set up there, it will be tested properly.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45914 from HyukjinKwon/SPARK-47751.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Apr 7, 2024
… with pyspark-connect

### What changes were proposed in this pull request?

This PR proposes to make `pyspark.testing` compatible with `pyspark-connect` by using the no-op context manager `contextlib.nullcontext` instead of `QuietTest`, which requires JVM access.

### Why are the changes needed?

In order for `pyspark-connect` to work without classic PySpark packages and dependencies. Also, the logs are already hidden because they are written to a separate file, so the output is effectively quiet anyway.
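
The substitution can be illustrated with a small sketch. The function `quiet` and its parameters are hypothetical and only show how `contextlib.nullcontext` can stand in for a context manager that needs the JVM; this is not the code changed in `pyspark.testing`.

```python
import contextlib


def quiet(jvm_available, make_quiet_context=None):
    """Return a log-silencing context manager if a JVM is available, else a no-op."""
    if jvm_available and make_quiet_context is not None:
        return make_quiet_context()
    # nullcontext() does nothing on enter or exit, so callers can keep their `with` block.
    return contextlib.nullcontext()
```

Test code written as `with quiet(...):` then runs unchanged whether or not a JVM is present.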

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Yes, in #45870. Once the CI is set up there, it will be tested properly.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45916 from HyukjinKwon/SPARK-47753.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Apr 7, 2024
…k-connect

### What changes were proposed in this pull request?

This PR proposes to make `pyspark.pandas` compatible with `pyspark-connect`.

### Why are the changes needed?

In order for `pyspark-connect` to work without classic PySpark packages and dependencies.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Yes, in #45870. Once the CI is set up there, it will be tested properly.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45915 from HyukjinKwon/SPARK-47752.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon force-pushed the do-not-merge-ci branch 3 times, most recently from e7b9db5 to 0a2670b Compare April 7, 2024 13:09
HyukjinKwon added a commit that referenced this pull request Apr 8, 2024
…streaming_foreach_batch`

### What changes were proposed in this pull request?

This PR proposes to drop the tables after the tests have finished.

### Why are the changes needed?

- To clean up resources properly.
- Leftover tables can affect other test cases when a single session is shared across tests.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Tested in #45870

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45920 from HyukjinKwon/minor-cleanup-table.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon force-pushed the do-not-merge-ci branch 5 times, most recently from d3924ee to 9730bd0 Compare April 8, 2024 04:34
HyukjinKwon added a commit that referenced this pull request Apr 8, 2024
…setup.py

### What changes were proposed in this pull request?

This PR is a followup of #42563 (but using a new JIRA because that change has already been released), which adds `pyspark.sql.connect.protobuf` to `setup.py`.

### Why are the changes needed?

So that PySpark packaged for PyPI supports the protobuf functions when Spark Connect is enabled.

### Does this PR introduce _any_ user-facing change?

Yes. The feature is now available with Spark Connect enabled when users install Spark Connect with `pip`.

### How was this patch tested?

Being tested in #45870

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45924 from HyukjinKwon/SPARK-47762.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Apr 8, 2024
…setup.py

This PR is a followup of #42563 (but using a new JIRA because that change has already been released), which adds `pyspark.sql.connect.protobuf` to `setup.py`.

So that PySpark packaged for PyPI supports the protobuf functions when Spark Connect is enabled.

Yes. The feature is now available with Spark Connect enabled when users install Spark Connect with `pip`.

Being tested in #45870

No.

Closes #45924 from HyukjinKwon/SPARK-47762.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit f94d95d)
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE][SPARK-47725][INFRA] Set up the CI for pyspark-connect package [SPARK-47725][INFRA] Set up the CI for pyspark-connect package Apr 8, 2024
@HyukjinKwon HyukjinKwon marked this pull request as ready for review April 8, 2024 11:44

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Nice. Thank you, @HyukjinKwon .

@dongjoon-hyun

Let me merge this so we can start from here.

HyukjinKwon added a commit that referenced this pull request Apr 11, 2024
…pository

### What changes were proposed in this pull request?

This is a followup of #45870 that skips the run in forked repositories.

### Why are the changes needed?

For consistency, and to save resources in forked repositories by default.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

This should be tested in individual forked repositories.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45992 from HyukjinKwon/SPARK-47725-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request May 3, 2024
…5 client <> 4.0 server

### What changes were proposed in this pull request?

This PR proposes to skip the tests that fail with a 3.5 client and a 4.0 server in Spark Connect (by adding `SPARK_SKIP_CONNECT_COMPAT_TESTS`). This is groundwork for #46298. This partially backports #45870.

This PR also adds `SPARK_CONNECT_TESTING_REMOTE` environment variable so developers can run PySpark unittests against a Spark Connect server.
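
A minimal sketch of how an environment variable can gate such tests, using the standard `unittest.skipIf`. The helper `skip_in_compat_mode`, the test class, and the skip message are hypothetical; only the variable name `SPARK_SKIP_CONNECT_COMPAT_TESTS` comes from the description above.

```python
import os
import unittest


def skip_in_compat_mode(environ=None):
    """Build a decorator that skips a test when SPARK_SKIP_CONNECT_COMPAT_TESTS is set."""
    environ = os.environ if environ is None else environ
    return unittest.skipIf(
        "SPARK_SKIP_CONNECT_COMPAT_TESTS" in environ,
        "Fails with a 3.5 client against a 4.0 server",
    )


class ExampleTests(unittest.TestCase):  # hypothetical test case
    @skip_in_compat_mode()
    def test_new_server_feature(self):
        self.assertTrue(True)
```

With the variable unset the test runs as usual; a CI job that pairs an older client with a newer server would export the variable so that known-incompatible tests are reported as skipped rather than failed.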

### Why are the changes needed?

In order to set up the CI that tests 3.5 client and 4.0 server in Spark Connect.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Tested it in my fork, see #46298

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46334 from HyukjinKwon/SPARK-48088.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

2 participants