Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Spark Connect Tests - CI & Test Suite Update #244

Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
17 commits
Select commit Hold shift + click to select a range
a4457ec
Create the first-version of files for Spark-Connect tests
nijanthanvijayakumar Jul 14, 2024
e1f0487
Address the fixtures issue in the test file
nijanthanvijayakumar Jul 15, 2024
c196228
Update the CI workflow to initiate the sparkconnect test on the 1.0
nijanthanvijayakumar Jul 15, 2024
1e4a439
Update the poetry & pyproject with the dependencies for Spark-Connect
nijanthanvijayakumar Jul 15, 2024
5cdc03b
Update the CI workflow to run Spark-Connect tests only for v3.4+
nijanthanvijayakumar Jul 15, 2024
37c271b
Update the script to check if Spark-Connect server is running or not
nijanthanvijayakumar Jul 15, 2024
b5f6749
Remove the spark-connect server run check
nijanthanvijayakumar Jul 15, 2024
6cdd1d2
Update workflows & pytest to choose the Sparksession instance based o…
nijanthanvijayakumar Jul 15, 2024
3ed7ee1
Add a TODO statement so that the spark-connect server check can be ad…
nijanthanvijayakumar Jul 15, 2024
8879d59
Remove the 1.0 planning branch for the CI file
nijanthanvijayakumar Jul 15, 2024
92ced5e
Attribute the original script that inspired this
nijanthanvijayakumar Jul 15, 2024
1d8a0de
Mark recently added deps as optional for Spark-Classic
nijanthanvijayakumar Jul 15, 2024
b44a025
Rename the spark-classic to connect & update makefile to install thes…
nijanthanvijayakumar Jul 15, 2024
5cf63f1
Resolve the incoming commit with the makefile and import changes from…
nijanthanvijayakumar Jul 15, 2024
2f2f1a8
Fix the linting issues in the linting CI workflow
nijanthanvijayakumar Jul 15, 2024
4b829b6
Update the files according to the review comments
nijanthanvijayakumar Jul 15, 2024
0da1804
Merge branch 'planning-1.0-release' into feature/issue-241-add-spark-…
nijanthanvijayakumar Jul 15, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,19 @@ jobs:
- name: Run tests with pytest against PySpark ${{ matrix.pyspark-version }}
run: make test

- name: Run tests using Spark-Connect against PySpark ${{ matrix.pyspark-version }}
env:
HADOOP_VERSION: 3
SPARK_VERSION: ${{ matrix.pyspark-version }}
SPARK_CONNECT_MODE_ENABLE: 1
run: |
if [[ "${SPARK_VERSION}" > "3.4" ]]; then
sh scripts/run_spark_connect_server.sh
make test_spark_connect
else
echo "Skipping Spark-Connect tests for Spark version <= 3.4"
fi

nijanthanvijayakumar marked this conversation as resolved.
Show resolved Hide resolved
check-license-headers:
runs-on: ubuntu-latest
steps:
Expand Down
8 changes: 6 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,11 @@ update_deps:

.PHONY: test
test:
@poetry run pytest tests
@poetry run pytest tests -k "not test_spark_connect.py"

.PHONY: test
test_spark_connect:
@poetry run pytest tests/test_spark_connect.py

.PHONY: lint
lint:
Expand All @@ -28,7 +32,7 @@ format:

.PHONY: help
help:
nijanthanvijayakumar marked this conversation as resolved.
Show resolved Hide resolved
@echo '................... Quin ..........................'
@echo '................... Quinn ..........................'
@echo 'help - print that message'
@echo 'lint - run linter'
@echo 'format - reformat the code'
Expand Down
311 changes: 287 additions & 24 deletions poetry.lock

Large diffs are not rendered by default.

8 changes: 5 additions & 3 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,11 @@ build-backend = "poetry.masonry.api"

[tool.poetry.dependencies]
python = ">=3.9,<4.0"

pandas = "^1.5.3"
pyarrow = "16.1.0"
numpy = "^1.21.0"
grpcio = "^1.48.1"
grpcio-status = "^1.64.1"
nijanthanvijayakumar marked this conversation as resolved.
Show resolved Hide resolved

###########################################################################
# DEPENDENCY GROUPS
Expand Down Expand Up @@ -104,8 +108,6 @@ ignore = [
]

[tool.ruff.lint.per-file-ignores]
"quinn/extensions/column_ext.py" = ["FBT003", "N802"]
"quinn/extensions/__init__.py" = ["F401", "F403"]
"quinn/__init__.py" = ["F401", "F403"]
"quinn/functions.py" = ["FBT003"]
"quinn/keyword_finder.py" = ["A002"]
25 changes: 25 additions & 0 deletions scripts/run_spark_connect_server.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
#!/usr/bin/bash

# This script was inspired by https://github.com/pyspark-ai/pyspark-ai/blob/master/run_spark_connect.sh

# The Hadoop and Spark versions are set as environment variables for this script.
echo "The HADOOP_VERSION is $HADOOP_VERSION"
SemyonSinchenko marked this conversation as resolved.
Show resolved Hide resolved
echo "The SPARK_VERSION is $SPARK_VERSION"

# Download the spark binaries. If the download fails, throw an error message
if ! wget -q https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz; then
echo "Error: Unable to download Spark binaries"
exit 1
fi

# Extract the downloaded spark binaries and check if the extraction is successful or not
if ! tar -xzf spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz; then
echo "Error: Unable to extract Spark binaries"
exit 1
fi

# Start the Spark server
echo "Starting the Spark-Connect server"
./spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:$SPARK_VERSION

# TODO: Check if the server is running or not (maybe using netstat) and throw an error message if it is not running
6 changes: 5 additions & 1 deletion tests/spark.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("chispa").getOrCreate()
if "SPARK_CONNECT_MODE_ENABLE" in os.environ:
SemyonSinchenko marked this conversation as resolved.
Show resolved Hide resolved
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
else:
spark = SparkSession.builder.master("local").appName("chispa").getOrCreate()
20 changes: 20 additions & 0 deletions tests/test_spark_connect.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
import chispa
nijanthanvijayakumar marked this conversation as resolved.
Show resolved Hide resolved
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

import quinn
from .spark import spark


def test_create_df():
rows_data = [("abc", 1), ("lu", 2), ("torrence", 3)]
col_specs = [("name", StringType()), ("age", IntegerType())]

expected_schema = StructType(
[
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
],
)
actual = quinn.create_df(spark, rows_data, col_specs)
expected = spark.createDataFrame(rows_data, expected_schema)
chispa.assert_df_equality(actual, expected)