Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Spark Connect Tests - CI & Test Suite Update #244

Conversation

nijanthanvijayakumar
Copy link
Contributor

@nijanthanvijayakumar nijanthanvijayakumar commented Jul 15, 2024

Proposed changes

This PR is regarding the open issue #241 to add a Spark-Connect test to the CI workflows and introduces the following changes:

Updates to existing files

  • Updates to the ci.yml file:

    • Creates a new step under the test job to test a file using Spark-Connect server's SparkSession instance.
    • Passes the Spark and Hadoop versions as environment variables.
    • Uses a conditional SPARK_CONNECT_MODE_ENABLE to switch between the Spark-Connect session and the Local Spark session.
  • Updates the spark.py to create remote Spark-Connect instance:

    • Uses a conditional to check if the environment configuration set at the CI is true or not to decide which Spark instance to be created/returned.
  • Updates the pyproject.toml:

    • Adds the libraries: pandas = "^1.5.3", pyarrow = "16.1.0" numpy = "^1.21.0" grpcio = "^1.48.1" grpcio-status = "^1.64.1"
  • Updates the Makefile:

    • Adds a line to skip the test_spark_connect.py from being PyTested using the regular tests.
    • Adds a section to test_spark_connect.py separately.

New files:

  • Creates a new test_spark_connect.py file:

    • This file contains a very basic test of one of the existing Quinn functions.
  • Creates a new shell script for running Spark-Connect server:

    • Downloads the Spark binaries, installs and starts the Spark-Connect server.

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Bugfix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Further comments

  • This PR closes the issue: feat: Add spark-connect tests suite and run existing tests with SC #241
  • Please kindly review the versions of the dependencies used (pandas, pyarrow) and let me know if those versions are correct?
  • I have left a TODO comment in the run_spark_connect_server.sh to check if the spark-connect server is running or not. I tried netstat, which didn't work. I will work on that using a different PR.
  • Also, please let me know if the pr.yml file should also be updated for this Spark-Connect test case?

@SemyonSinchenko
Copy link
Collaborator

image

@nijanthanvijayakumar
Copy link
Contributor Author

image

Amazing thank you! @SemyonSinchenko .

Looks like they haven't locked down pandas and numpy versions though. Are we okay with that?

pyproject.toml Outdated Show resolved Hide resolved
@SemyonSinchenko
Copy link
Collaborator

Are we okay with that?

This dependencies should be optional for us (or be in the "provided" scope). Quinn itself should depends of python itself only.

Answering the question: we are mostly OK with that except numpy. The release of 2.0 was after the release of spark and Im more than sure that spark won't work with 2.0; so, I would like to add numpy < 2

@nijanthanvijayakumar
Copy link
Contributor Author

Understood; thanks! This is an interesting PR/issue. I have been testing this spark-connect on the other test cases as well and have noted which ones might need to be worked on (for a later PR). Thanks for your guidance @SemyonSinchenko .

As per the review comment, the recently added dependencies such as
Pyarrow, Pandas etc., are optional and not required for Spark-Classic.

Update the pyproject.toml to reflect that and lock the poetry file
pyproject.toml Outdated Show resolved Hide resolved
@nijanthanvijayakumar
Copy link
Contributor Author

Most likely fixes issue #247 as well. Please review and let me know.

Copy link
Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments

tests/spark.py Outdated Show resolved Hide resolved
scripts/run_spark_connect_server.sh Outdated Show resolved Hide resolved
tests/test_spark_connect.py Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
.github/workflows/ci.yml Outdated Show resolved Hide resolved
Copy link
Contributor Author

@nijanthanvijayakumar nijanthanvijayakumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted on the changes. Have responded to a few comments for clarifications.

* Remove the test_spark_connect.py file as it is not relevant anymore.
* Update the Makefile to remove spark-connect test
* Hardcode the hadoop version to 3 as 2 is EOL.
@nijanthanvijayakumar
Copy link
Contributor Author

nijanthanvijayakumar commented Jul 15, 2024

@SemyonSinchenko - final go I reckon 🤞🏽 I have made all the changes. Happy for those comments to be resolved and the PR to be merged IF you are happy with it and if the tests pass.

Copy link
Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LFTM

@SemyonSinchenko
Copy link
Collaborator

@nijanthanvijayakumar Thank you for the work!

@SemyonSinchenko SemyonSinchenko merged commit cfcb720 into mrpowers-io:planning-1.0-release Jul 15, 2024
4 checks passed
@nijanthanvijayakumar
Copy link
Contributor Author

@nijanthanvijayakumar Thank you for the work!

My pleasure. Thanks for the guidance and support.

@nijanthanvijayakumar nijanthanvijayakumar deleted the feature/issue-241-add-spark-connect-tests branch July 15, 2024 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants