To get the image:
docker pull foodytechnologies/spark-openjdk8-alpine
To run a simple container:
docker run -p 4040:4040 -dti --privileged foodytechnologies/spark-openjdk11-alpine
It's a Spark standalone cluster with ZooKeeper, composed of two ZooKeeper servers, two Spark masters, and two slaves (each running 5 workers), plus one application submitter:
docker-compose up -d --scale LocalClusterNetwork.spark.Slave=2
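For orientation, a rough sketch of what the compose topology behind that command might look like. This is an assumption reconstructed from the commands in this README (service names, image, ports, and volume paths are taken from them); the actual docker-compose.yml in the repository is authoritative:

```yaml
# Hypothetical docker-compose.yml excerpt matching the commands in this README.
version: "3"
services:
  # Slave service scaled to 2 replicas via --scale on the command line.
  LocalClusterNetwork.spark.Slave:
    image: foodytechnologies/spark-openjdk8-alpine
    volumes:
      - ./data/dockervolumes/applications:/apps
      - ./data/dockervolumes/data:/data
  # Container from which jobs are submitted; 4040 is the Spark application UI.
  ApplicationSubmitter:
    image: foodytechnologies/spark-openjdk8-alpine
    ports:
      - "4040:4040"
    volumes:
      - ./data/dockervolumes/applications:/apps
      - ./data/dockervolumes/data:/data
```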
- To launch a local Python application:
docker exec -ti ApplicationSubmitter sh StartApplication.sh /apps/python-apps/example.py
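StartApplication.sh ships inside the image; as a rough sketch, such a wrapper typically just forwards its arguments (flags, class, jar or script, job arguments) to spark-submit pointed at the standalone masters. The master hostnames and the script's internals below are assumptions, not the repository's actual script:

```shell
#!/bin/sh
# Hypothetical sketch of a StartApplication.sh-style wrapper.
# Assumed master hostnames; the real script may resolve these differently.
SPARK_MASTERS="spark://Master0:7077,Master1:7077"

build_cmd() {
  # Build the spark-submit command line, echoed so it can be inspected.
  # "$@" preserves every caller argument in order.
  echo spark-submit --master "$SPARK_MASTERS" "$@"
}

build_cmd "$@"
```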
- To launch a local Java application:
Compile your job sources:
$ cd ./data/dockervolumes/applications/java-apps/
$ mvn package
Docker Compose mounts the local
./data/dockervolumes/applications
directory on the /apps
directory of the ApplicationSubmitter and slave containers.
We can also pass files/data as arguments to jobs by placing them in the local directory ./data/dockervolumes/data
(grant write permission on this directory if jobs will save files to it). Docker Compose binds this local directory to the /data
directory of the started containers.
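The setup above amounts to preparing two host directories before running docker-compose up; a minimal sketch (paths are from this README, the permission mode is a choice):

```shell
# Create the host directories that Docker Compose bind-mounts into the
# containers (as /apps and /data respectively).
mkdir -p ./data/dockervolumes/applications
mkdir -p ./data/dockervolumes/data
# Make the data directory world-writable so jobs running in the containers
# can save output files into it.
chmod a+w ./data/dockervolumes/data
```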
Manipulate a JSON file and generate a new one:
docker exec -ti ApplicationSubmitter sh StartApplication.sh --class com.databootcamp.sparkjobs.BasicLoadJson /apps/java-apps/target/sparkjobs-1.0.0-SNAPSHOT.jar /data/tweets.json /data/HaveTweets
Basic flat map by reading a file:
docker exec -ti ApplicationSubmitter sh StartApplication.sh --class com.databootcamp.sparkjobs.BasicFlatMap /apps/java-apps/target/sparkjobs-1.0.0-SNAPSHOT.jar /data/spark.txt
Basic average:
docker exec -ti ApplicationSubmitter sh StartApplication.sh --class com.databootcamp.sparkjobs.BasicAvg /apps/java-apps/target/sparkjobs-1.0.0-SNAPSHOT.jar
Expose a context with the Thrift server:
- start a standalone Thrift server and expose a context in a temporary view:
docker exec -ti ApplicationSubmitter sh StartApplication.sh --class com.databootcamp.sparkjobs.ExposeContextWithLiveThrift /apps/java-apps/target/sparkjobs-1.0.0-SNAPSHOT.jar `docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}},{{end}}' ApplicationSubmitter | cut -d',' -f1` 10011 /data/tweets.json exposethecontext
- start a Thrift client and read the context:
beeline -u jdbc:hive2://IP_TO_SPARK_EXECUTOR:10011
0: jdbc:hive2://172.28.0.5:10011> show tables;
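The backticked expression in the submit command above extracts the container's first network IP: the Go template in docker inspect prints each network's IPAddress followed by a comma, and cut keeps the first field. The cut step can be exercised on sample inspect output:

```shell
# docker inspect with that range template prints one comma-terminated IP per
# network the container is attached to, e.g. "172.28.0.5,172.17.0.3,".
sample_inspect_output="172.28.0.5,172.17.0.3,"
# cut -d',' -f1 keeps only the first comma-separated field: the first IP.
first_ip=$(printf '%s\n' "$sample_inspect_output" | cut -d',' -f1)
echo "$first_ip"
```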
Export a JSON file to a Hive table: to keep data and metadata, we should configure the hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) files in conf/
docker exec -ti ApplicationSubmitter sh StartApplication.sh --class com.databootcamp.sparkjobs.SaveHive /apps/java-apps/target/sparkjobs-1.0.0-SNAPSHOT.jar /data/tweets.json tweets
Read streaming data from a Flume agent (polling mode) and save it to an HBase table (in batch mode):
please refer to the example described in the "Flume" repository: MedAmineBB/Flume
please refer to the how-to described in the "HBase" repository: MedAmineBB/HBaseWithHDFS, applying the section "Launch Hdfs and Hbase"
$ # Create a network where we will expose all dockers that shall communicate in this example
$ docker network create -d bridge --subnet 172.28.0.0/16 bridge_nw
$ # Expose Hbase layer (zookeeper, HMaster and Region Servers)
$ docker network connect bridge_nw zoo1
$ docker network connect bridge_nw zoo2
$ docker network connect bridge_nw zoo3
$ docker network connect bridge_nw rs1
$ docker network connect bridge_nw rs2
$ docker network connect bridge_nw rs3
$ docker network connect bridge_nw hm1
$ docker network connect bridge_nw hm2
$ # Expose Flume layer
$ docker network connect bridge_nw relayer
$ # Expose Spark ApplicationSubmitter, slaves, and masters
$ docker network connect bridge_nw ApplicationSubmitter
$ docker network connect bridge_nw ownspark_LocalClusterNetwork.spark.Slave_COMPLETE_THAT
$ docker network connect bridge_nw Master0
$ docker network connect bridge_nw Master1
EXTRA_JARS=`docker exec -ti ApplicationSubmitter sh -c 'ls -p /apps/java-apps/target/libs/*.jar | tr "\n" ","'` sh -c 'docker exec -ti ApplicationSubmitter sh StartApplication.sh --jars $EXTRA_JARS --class com.databootcamp.sparkjobs.StreamingFromFlumeToHBase /apps/java-apps/target/sparkjobs-1.0.0-SNAPSHOT.jar relayer 4545 zoo1,zoo2,zoo3 2181 databootcamp netcat data'
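The last command uses a trick worth unpacking: the inner docker exec lists every jar under /apps/java-apps/target/libs/, and tr joins the newline-separated listing into the comma-separated form that spark-submit's --jars flag expects; the result is exported as EXTRA_JARS for the outer submit command. The joining step can be sketched locally with illustrative jar names:

```shell
# Reproduce the jar-list joining step outside Docker.
# ls prints one path per line; tr "\n" "," replaces each newline with a comma,
# producing a comma-separated list (with a trailing comma).
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/libs"
touch "$tmpdir/libs/a.jar" "$tmpdir/libs/b.jar"   # stand-ins for dependency jars
EXTRA_JARS=$(ls "$tmpdir"/libs/*.jar | tr "\n" ",")
echo "$EXTRA_JARS"
```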