This tutorial can be run either in spark-shell or in an IDE (IntelliJ or Scala IDE for Eclipse).
Below are the steps for the setup.
Java/JDK 1.7+ must be installed before proceeding with the steps below.
Download Spark 2.1.0 from here: http://spark.apache.org/downloads.html
Direct download link: http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
On Mac/Linux:
- tar -zxvf spark-2.1.0-bin-hadoop2.7.tgz
- export PATH=$PATH:/Users/path_to_downloaded_spark/spark-2.1.0-bin-hadoop2.7/bin
- run spark-shell
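The PATH step above can be sketched as a self-contained snippet. The stub directory below is an assumption standing in for the real extracted Spark bin directory, so the mechanics can be seen without downloading anything:

```shell
# Stand-in for the extracted Spark bin directory (assumption: on a real
# machine this would be .../spark-2.1.0-bin-hadoop2.7/bin instead).
SPARK_BIN="$(mktemp -d)"
touch "$SPARK_BIN/spark-shell"
chmod +x "$SPARK_BIN/spark-shell"

# Appending the bin directory to PATH is what makes spark-shell
# resolvable by name from any working directory.
export PATH="$PATH:$SPARK_BIN"
command -v spark-shell
```

With the real Spark distribution in place of the stub, the same `export PATH=...` line can be added to your shell profile (e.g. `~/.bash_profile`) to make it permanent.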
On Windows:
- Unzip spark-2.1.0-bin-hadoop2.7.tgz
- Add the Spark bin directory to Path: ...\spark-2.1.0-bin-hadoop2.7\bin
- download winutils.exe from https://github.com/steveloughran/winutils/tree/master/hadoop-2.6.0/bin
- move it to c:\hadoop\bin
- set HADOOP_HOME in your environment variables:
- HADOOP_HOME = C:\hadoop
- run from command prompt:
- C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive
- run spark-shell from the command prompt with extra conf parameters:
- spark-shell --driver-memory 2G --executor-memory 3G --executor-cores 2 --conf spark.sql.warehouse.dir=file:///c:/tmp/spark-warehouse
When pasting larger sections of code into spark-shell, enter paste mode first:
scala> :paste
Then paste the code and press Ctrl+D to evaluate it.
If you prefer an IDE over spark-shell, below are the steps.
You can use either IntelliJ or Scala IDE for Eclipse.
- Install IntelliJ from https://www.jetbrains.com/idea/download/
- Add the Scala language plugin
- Import the code as a maven project and let it build
- If using Eclipse, use Scala IDE for Eclipse, available at: http://scala-ide.org/download/sdk.html
- Import the code as a maven project and let it build
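Both IDE imports above drive a standard Maven build. If you want to verify the project from the command line first, a minimal sketch (assuming `mvn` is on your PATH and you run it from the project root) is:

```shell
# Guard so the snippet degrades gracefully where Maven isn't installed.
if command -v mvn >/dev/null 2>&1; then
  # Compile and package the project, as the IDE's Maven import would.
  mvn clean package
  BUILD_MSG="maven build attempted"
else
  BUILD_MSG="maven not found - install it or build from the IDE"
fi
echo "$BUILD_MSG"
```

Either way, the first build will download the project's dependencies, so it can take a while on a fresh machine.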
Have the following ready before the session:
- JDK installed (1.7+)
- Spark binaries
- The code from https://github.com/WhiteFangBuck/strata-sanjose-2017
Nice to have: