Added instructions for using Bitnami Spark images
normanj-bitquill committed Dec 5, 2024
1 parent c1edd9b commit 8a3155b
Showing 6 changed files with 216 additions and 7 deletions.
41 changes: 41 additions & 0 deletions docker/apache-spark-sample/docker-compose.yml
@@ -0,0 +1,41 @@
services:
  spark:
    image: bitnami/spark:3.5.3
    ports:
      - "8080:8080"
      - "7077:7077"
      - "4040:4040"
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_PUBLIC_DNS=localhost
    volumes:
      - type: bind
        source: ./spark-defaults.conf
        target: /opt/bitnami/spark/conf/spark-defaults.conf
      - type: bind
        source: ../../ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar
        target: /opt/bitnami/spark/jars/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar

  spark-worker:
    image: bitnami/spark:3.5.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_PUBLIC_DNS=localhost
    volumes:
      - type: bind
        source: ./spark-defaults.conf
        target: /opt/bitnami/spark/conf/spark-defaults.conf
      - type: bind
        source: ../../ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar
        target: /opt/bitnami/spark/jars/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar
29 changes: 29 additions & 0 deletions docker/apache-spark-sample/spark-defaults.conf
@@ -0,0 +1,29 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.sql.extensions org.opensearch.flint.spark.FlintPPLSparkExtensions
spark.sql.catalog.dev org.apache.spark.opensearch.catalog.OpenSearchCatalog
File renamed without changes.
130 changes: 130 additions & 0 deletions docs/spark-docker.md
@@ -0,0 +1,130 @@
# Running Queries with Apache Spark in Docker

Bitnami provides [Apache Spark docker images](https://hub.docker.com/r/bitnami/spark). These
can be modified to include the OpenSearch Spark PPL extension, which makes the image usable
for testing PPL commands.

The Bitnami Apache Spark image can be used to run a Spark cluster, and to run
`spark-shell` for executing queries.

## Setup

### spark-conf

This directory contains the Apache Spark configuration. Add the following three lines to the
`spark-defaults.conf` file:
```
spark.sql.legacy.createHiveTableByDefault false
spark.sql.extensions org.opensearch.flint.spark.FlintPPLSparkExtensions
spark.sql.catalog.dev org.apache.spark.opensearch.catalog.OpenSearchCatalog
```
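
The first line makes `CREATE TABLE` default to a Spark datasource table rather than a Hive
table; the second registers the PPL extension, and the third registers the OpenSearch
catalog under the name `dev`.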

An example file is available in this repository at `docker/apache-spark-sample/spark-defaults.conf`.

## Prepare OpenSearch Spark PPL Extension

Create a local build or copy of the OpenSearch Spark PPL extension. Make a note of the
location and filename of the Jar file.
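
If you are building the extension locally, the sketch below shows one way to produce the
assembly Jar. It assumes the sbt-assembly setup used by this repository; the exact task
name and version may differ:
```
# From the root of this repository (assumed build setup)
sbt clean assembly

# The PPL assembly Jar is then expected at:
#   ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-<version>.jar
```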

## Run the Spark Cluster

You need to run a master node and a worker node. For these to communicate, first create a
Docker network for them to use.

```
docker network create spark-network
```

### Master Node

The master node can be run with the following command:
```
docker run \
-d \
--name spark \
--network spark-network \
-p 8080:8080 \
-p 7077:7077 \
-p 4040:4040 \
-e SPARK_MODE=master \
-e SPARK_RPC_AUTHENTICATION_ENABLED=no \
-e SPARK_RPC_ENCRYPTION_ENABLED=no \
-e SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no \
-e SPARK_SSL_ENABLED=no \
-e SPARK_PUBLIC_DNS=localhost \
-v <PATH_TO_SPARK_CONFIG_FILE>:/opt/bitnami/spark/conf/spark-defaults.conf \
-v <PATH_TO_SPARK_PPL_JAR_FILE>/<SPARK_PPL_JAR_FILE>:/opt/bitnami/spark/jars/<SPARK_PPL_JAR_FILE> \
bitnami/spark:3.5.3
```

* `-d`
Run the container in the background and return to the shell
* `--name spark`
Name the docker container `spark`
* `<PATH_TO_SPARK_CONFIG_FILE>`
Replace with the path to the Spark configuration file.
* `<PATH_TO_SPARK_PPL_JAR_FILE>`
Replace with the path to the directory containing the OpenSearch Spark PPL extension
Jar file.
* `<SPARK_PPL_JAR_FILE>`
Replace with the filename of the OpenSearch Spark PPL extension Jar file.
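
For example, when run from the root of this repository with the sample configuration, the
two volume mounts might look like the following (the version in the Jar filename may differ
for your build):
```
-v ./docker/apache-spark-sample/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf \
-v ./ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar:/opt/bitnami/spark/jars/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar \
```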

### Worker Node

The worker node can be run with the following command:
```
docker run \
-d \
--name spark-worker \
--network spark-network \
-e SPARK_MODE=worker \
-e SPARK_MASTER_URL=spark://spark:7077 \
-e SPARK_WORKER_MEMORY=1G \
-e SPARK_WORKER_CORES=1 \
-e SPARK_RPC_AUTHENTICATION_ENABLED=no \
-e SPARK_RPC_ENCRYPTION_ENABLED=no \
-e SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no \
-e SPARK_SSL_ENABLED=no \
-e SPARK_PUBLIC_DNS=localhost \
-v <PATH_TO_SPARK_CONFIG_FILE>:/opt/bitnami/spark/conf/spark-defaults.conf \
-v <PATH_TO_SPARK_PPL_JAR_FILE>/<SPARK_PPL_JAR_FILE>:/opt/bitnami/spark/jars/<SPARK_PPL_JAR_FILE> \
bitnami/spark:3.5.3
```

* `-d`
Run the container in the background and return to the shell
* `--name spark-worker`
Name the docker container `spark-worker`
* `<PATH_TO_SPARK_CONFIG_FILE>`
Replace with the path to the Spark configuration file.
* `<PATH_TO_SPARK_PPL_JAR_FILE>`
Replace with the path to the directory containing the OpenSearch Spark PPL extension
Jar file.
* `<SPARK_PPL_JAR_FILE>`
Replace with the filename of the OpenSearch Spark PPL extension Jar file.
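
To verify that the worker has registered with the master, check the worker's container logs
(assuming the container name `spark-worker` used above), or look for the worker on the
master web UI at `http://localhost:8080`:
```
docker logs spark-worker
```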

## Running Spark Shell

You can run `spark-shell` on the master node.

```
docker exec -it spark /opt/bitnami/spark/bin/spark-shell
```

Within the Spark Shell, you can submit queries, including PPL queries.
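
For example, the following is a minimal sketch that creates a small test table with Spark
SQL and then queries it with PPL (the table name and data are hypothetical; PPL parsing of
`spark.sql` statements is enabled by the `FlintPPLSparkExtensions` extension configured
earlier):
```
// Create and populate a small test table using standard Spark SQL
spark.sql("CREATE TABLE IF NOT EXISTS test_table (name STRING, age INT)")
spark.sql("INSERT INTO test_table VALUES ('alice', 30), ('bob', 25)")

// Query the table with PPL
spark.sql("source = test_table | where age > 26 | fields name, age").show()
```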

## Docker Compose Sample

There is a sample `docker-compose.yml` file in this repository at
`docker/apache-spark-sample/docker-compose.yml`. It can be used to start both nodes with
the command:

```
docker compose up -d
```
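
Once the containers are up, their status can be checked with the standard Compose command
below; the Spark master web UI should also be reachable at `http://localhost:8080`.
```
docker compose ps
```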

The cluster can be stopped with:

```
docker compose down
```
23 changes: 16 additions & 7 deletions docs/spark-emr-docker.md
@@ -37,17 +37,17 @@ spark.sql.catalog.dev org.apache.spark.opensearch.catalog.OpenSearchCatalog
An Apache Spark app is needed to provide queries to be run on the Spark EMR instance.
The image has been tested with an app written in Scala.

-An example app is available in this repository in `docker/spark-emr-sample/example-app`.
+An example app is available in this repository in `docker/spark-sample-app`.

### Build the Example App

The example app can be built using [SBT](https://www.scala-sbt.org/).
```
-cd docker/spark-emr-sample
+cd docker/spark-sample-app
sbt clean package
```

-This will produce a Jar file in `docker/spark-emr-sample/example-app/target/scala-2.12`
+This will produce a Jar file in `docker/spark-sample-app/target/scala-2.12`
that can be used with the Spark EMR image.

## Prepare OpenSearch Spark PPL Extension
@@ -57,12 +57,12 @@ location of the Jar file as well as the name of the Jar file.

## Run the Spark EMR Image

-The Spark EMR image can be run with the following command:
+The Spark EMR image can be run with the following command from the root of this repository:
```
docker run \
--name spark-emr \
-v ./docker/spark-emr-sample/logging-conf:/var/loggingConfiguration/spark \
--v ./docker/spark-emr-sample/example-app/target/scala-2.12:/app \
+-v ./docker/spark-sample-app/target/scala-2.12:/app \
-v ./docker/spark-emr-sample/spark-conf:/etc/spark/conf \
-v <PATH_TO_SPARK_PPL_JAR_FILE>/<SPARK_PPL_JAR_FILE>:/usr/lib/spark/jars/<SPARK_PPL_JAR_FILE> \
public.ecr.aws/emr-serverless/spark/emr-7.5.0:20241125 \
@@ -77,7 +77,7 @@

Bind the directory containing the logging shell scripts into the docker image. It must bind
to `/var/loggingConfiguration/spark` in the image.
-* `-v ./docker/spark-emr-sample/example-app/target/scala-2.12:/app`
+* `-v ./docker/spark-sample-app/target/scala-2.12:/app`

Bind the directory containing the Apache Spark app Jar file to a location in the
docker image. The directory in the docker image must match the path used in the final
@@ -98,4 +98,13 @@ docker run \
The main class of the Spark App to run.
* `/app/myapp_2.12-1.0.jar`
The full path within the docker container where the Jar file of the Spark app is
located.

## Logs

The logs are available in `/var/log/spark` in the docker container.

STDERR for the app run is available in `/var/log/spark/user/stderr`.

STDOUT for the app run is available in `/var/log/spark/user/stdout`.
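
For example, the app's STDOUT can be copied out of the container for inspection (this works
even after the container has exited, assuming the container name `spark-emr` used above):
```
docker cp spark-emr:/var/log/spark/user/stdout ./spark-emr-stdout.log
```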
