Add Spark Connect Tests - CI & Test Suite Update #244

Merged
Changes from all commits (17 commits):
a4457ec: Create the first-version of files for Spark-Connect tests (nijanthanvijayakumar, Jul 14, 2024)
e1f0487: Address the fixtures issue in the test file (nijanthanvijayakumar, Jul 15, 2024)
c196228: Update the CI workflow to initiate the sparkconnect test on the 1.0 (nijanthanvijayakumar, Jul 15, 2024)
1e4a439: Update the poetry & pyproject with the dependencies for Spark-Connect (nijanthanvijayakumar, Jul 15, 2024)
5cdc03b: Update the CI workflow to run Spark-Connect tests only for v3.4+ (nijanthanvijayakumar, Jul 15, 2024)
37c271b: Update the script to check if Spark-Connect server is running or not (nijanthanvijayakumar, Jul 15, 2024)
b5f6749: Remove the spark-connect server run check (nijanthanvijayakumar, Jul 15, 2024)
6cdd1d2: Update workflows & pytest to choose the Sparksession instance based o… (nijanthanvijayakumar, Jul 15, 2024)
3ed7ee1: Add a TODO statement so that the spark-connect server check can be ad… (nijanthanvijayakumar, Jul 15, 2024)
8879d59: Remove the 1.0 planning branch for the CI file (nijanthanvijayakumar, Jul 15, 2024)
92ced5e: Attribute the original script that inspired this (nijanthanvijayakumar, Jul 15, 2024)
1d8a0de: Mark recently added deps as optional for Spark-Classic (nijanthanvijayakumar, Jul 15, 2024)
b44a025: Rename the spark-classic to connect & update makefile to install thes… (nijanthanvijayakumar, Jul 15, 2024)
5cf63f1: Resolve the incoming commit with the makefile and import changes from… (nijanthanvijayakumar, Jul 15, 2024)
2f2f1a8: Fix the linting issues in the linting CI workflow (nijanthanvijayakumar, Jul 15, 2024)
4b829b6: Update the files according to the review comments (nijanthanvijayakumar, Jul 15, 2024)
0da1804: Merge branch 'planning-1.0-release' into feature/issue-241-add-spark-… (nijanthanvijayakumar, Jul 15, 2024)
13 changes: 13 additions & 0 deletions .github/workflows/ci.yml
@@ -7,6 +7,7 @@ on:
  pull_request:
    branches:
      - main
+      - planning-1.0-release
  workflow_dispatch:

jobs:
@@ -66,6 +67,18 @@ jobs:
      - name: Run tests with pytest against PySpark ${{ matrix.pyspark-version }}
        run: make test

+      - name: Run tests using Spark-Connect against PySpark ${{ matrix.pyspark-version }}
+        env:
+          SPARK_VERSION: ${{ matrix.pyspark-version }}
+          SPARK_CONNECT_MODE_ENABLED: 1
+        run: |
+          if [[ "${SPARK_VERSION}" > "3.4" ]]; then
+            sh scripts/run_spark_connect_server.sh
+            # The tests should be called from here.
+          else
+            echo "Skipping Spark-Connect tests for Spark version <= 3.4"
+          fi

  check-license-headers:
    runs-on: ubuntu-latest
    steps:
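A note on the version gate in the Spark-Connect step above (an observation, not part of the PR): [[ "${SPARK_VERSION}" > "3.4" ]] is a lexicographic string comparison in bash, which orders the same way as Python string comparison. It works for the single-digit minor versions in this matrix, but it would misjudge a two-digit minor such as "3.10", and it excludes "3.4" itself, which matches the "<= 3.4" message on the else branch:

    # Illustration only (not from the PR): lexicographic version comparison,
    # shown with Python string comparisons, which order the same way as
    # bash's [[ "$a" > "$b" ]].
    assert "3.5" > "3.4"            # single-digit minors compare as expected
    assert not ("3.10" > "3.4")     # "1" < "4", so "3.10" sorts before "3.4"
    assert not ("3.4" > "3.4")      # strict inequality: exactly 3.4 is skipped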
8 changes: 4 additions & 4 deletions Makefile
@@ -3,8 +3,8 @@
all: help

.PHONY: install_test
-install_test: ## Install test dependencies
-	@poetry install --with=development,testing
+install_test: ## Install the 'dev, test and extras' dependencies
+	@poetry install --with=development,testing --extras connect

.PHONY: install_deps
install_deps: ## Install all dependencies
@@ -15,7 +15,7 @@ update_deps: ## Update dependencies
	@poetry update --with=development,linting,testing,docs

.PHONY: test
-test: ## Run the unit tests
+test: ## Run all tests
	@poetry run pytest tests

.PHONY: lint
@@ -31,4 +31,4 @@ format: ## Format the code
help: ## Show help for the commands
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'

-.DEFAULT_GOAL := help
\ No newline at end of file
+.DEFAULT_GOAL := help
307 changes: 283 additions & 24 deletions poetry.lock

Large diffs are not rendered by default.

13 changes: 10 additions & 3 deletions pyproject.toml
@@ -24,6 +24,15 @@ build-backend = "poetry.masonry.api"
[tool.poetry.dependencies]
python = ">=3.9,<4.0"

+# Below are the optional dependencies
+pyarrow = "13.0.0"
+pandas = { version = "^1.5.3", optional = true }
+numpy = { version = "^1.21.0", optional = true }
+grpcio = { version = "^1.48.1", optional = true }
+grpcio-status = { version = "^1.64.1", optional = true }
+
+[tool.poetry.extras]
+connect = ["pyarrow", "pandas", "numpy", "grpcio", "grpcio-status"]

###########################################################################
# DEPENDENCY GROUPS
@@ -102,11 +111,9 @@ ignore = [
]

[tool.ruff.lint.per-file-ignores]
-"quinn/extensions/column_ext.py" = ["FBT003", "N802"]
-"quinn/extensions/__init__.py" = ["F401", "F403"]
"quinn/__init__.py" = ["F401", "F403"]
"quinn/functions.py" = ["FBT003"]
"quinn/keyword_finder.py" = ["A002"]

-[tool.ruff.isort]
+[tool.ruff.lint.isort]
required-imports = ["from __future__ import annotations"]
2 changes: 1 addition & 1 deletion quinn/schema_helpers.py
@@ -131,7 +131,7 @@ def _lookup_type(type_str: str) -> T.DataType:

    return type_lookup[type_str]

-def _convert_nullable(null_str: Optional[str]) -> bool:
+def _convert_nullable(null_str: str | None) -> bool:
    if null_str is None:
        return True
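Context for the annotation change above (background, not part of the diff): on Python 3.9, which pyproject.toml still supports, a PEP 604 union like str | None would fail at runtime if the annotation were evaluated eagerly, since the | operator on types only exists from Python 3.10. It is safe here because from __future__ import annotations postpones annotation evaluation, and the ruff required-imports setting shown above enforces that import in every file. A minimal illustration:

    # Illustration only: PEP 604 unions in annotations work on Python 3.9
    # once postponed evaluation is enabled.
    from __future__ import annotations


    def greet(name: str | None) -> str:
        return f"hello, {name}" if name is not None else "hello, world"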
24 changes: 24 additions & 0 deletions scripts/run_spark_connect_server.sh
@@ -0,0 +1,24 @@
#!/usr/bin/bash

# This script was inspired by https://github.com/pyspark-ai/pyspark-ai/blob/master/run_spark_connect.sh

# The Spark version is set as an environment variable for this script.
echo "The SPARK_VERSION is $SPARK_VERSION"

# Download the spark binaries. If the download fails, throw an error message
if ! wget -q https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz; then
    echo "Error: Unable to download Spark binaries"
    exit 1
fi

# Extract the downloaded spark binaries and check if the extraction is successful or not
if ! tar -xzf spark-$SPARK_VERSION-bin-hadoop3.tgz; then
    echo "Error: Unable to extract Spark binaries"
    exit 1
fi

# Start the Spark server
echo "Starting the Spark-Connect server"
./spark-$SPARK_VERSION-bin-hadoop3/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:$SPARK_VERSION

# TODO: Check if the server is running or not (maybe using netstat) and throw an error message if it is not running
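The script's closing TODO is left open in this PR. One possible way to close it (a hypothetical sketch, not part of the merged change, and using a plain TCP probe rather than the netstat idea the TODO mentions) is to poll the Spark Connect default port, 15002, the same port the tests connect to, until the server accepts connections:

    # wait_for_spark_connect.py: hypothetical readiness probe, not part of this PR.
    # Polls the Spark Connect default port until a TCP connection succeeds or the
    # timeout expires; exits non-zero on failure so CI can abort early.
    import socket
    import sys
    import time


    def wait_for_spark_connect(host: str = "localhost", port: int = 15002,
                               timeout: float = 60.0) -> bool:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with socket.create_connection((host, port), timeout=2):
                    return True  # server is accepting connections
            except OSError:
                time.sleep(2)  # not up yet; retry shortly
        return False


    if __name__ == "__main__":
        sys.exit(0 if wait_for_spark_connect() else 1)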
6 changes: 5 additions & 1 deletion tests/spark.py
@@ -1,3 +1,7 @@
+import os
from pyspark.sql import SparkSession

-spark = SparkSession.builder.master("local").appName("chispa").getOrCreate()
+if "SPARK_CONNECT_MODE_ENABLED" in os.environ:
+    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
+else:
+    spark = SparkSession.builder.master("local").appName("chispa").getOrCreate()
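The PR exposes the session as a module-level spark object. An alternative shape (a hypothetical sketch, not what this PR does) is a session-scoped pytest fixture that applies the same SPARK_CONNECT_MODE_ENABLED switch and tears the session down when the run ends; tests would then take spark_session as an argument instead of importing spark:

    # conftest.py sketch (hypothetical): same selection logic as tests/spark.py,
    # wrapped in a session-scoped fixture with explicit teardown.
    import os

    import pytest
    from pyspark.sql import SparkSession


    @pytest.fixture(scope="session")
    def spark_session():
        if "SPARK_CONNECT_MODE_ENABLED" in os.environ:
            session = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
        else:
            session = SparkSession.builder.master("local").appName("chispa").getOrCreate()
        yield session
        session.stop()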
20 changes: 20 additions & 0 deletions tests/test_spark_connect.py
@@ -0,0 +1,20 @@
import chispa
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

import quinn
from .spark import spark


def test_create_df():
    rows_data = [("abc", 1), ("lu", 2), ("torrence", 3)]
    col_specs = [("name", StringType()), ("age", IntegerType())]

    expected_schema = StructType(
        [
            StructField("name", StringType(), True),
            StructField("age", IntegerType(), True),
        ],
    )
    actual = quinn.create_df(spark, rows_data, col_specs)
    expected = spark.createDataFrame(rows_data, expected_schema)
    chispa.assert_df_equality(actual, expected)
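For readers new to quinn: create_df builds a schema from the (name, type) pairs in col_specs and then delegates to spark.createDataFrame. Assuming that behaviour (a sketch for illustration, with fields left nullable as in the expected schema above, reusing the test's rows_data and col_specs), the call is roughly equivalent to:

    # Rough plain-PySpark equivalent of quinn.create_df(spark, rows_data, col_specs);
    # assumes col_specs is a list of (column_name, data_type) pairs as in the test.
    from pyspark.sql.types import StructField, StructType

    schema = StructType([StructField(name, dtype, True) for name, dtype in col_specs])
    df = spark.createDataFrame(rows_data, schema)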