
Hadoop/Spark with Terraform on AWS

This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform.

  1. Variables
  2. Software version
  3. Project Structure
  4. How to
  5. See also

Variables

| Name | Description | Default |
| --- | --- | --- |
| region | AWS region | us-east-1 |
| access_key | AWS access key | |
| secret_key | AWS secret key | |
| token | AWS token | null |
| instance_type | AWS instance type | m5.xlarge |
| ami_image | AWS AMI image | ami-0885b1f6bd170450c |
| key_name | Name of the key pair used between nodes | localkey |
| key_path | Path of the key pair used between nodes | . |
| aws_key_name | AWS key pair used to connect to nodes | amzkey |
| amz_key_path | AWS key pair path used to connect to nodes | amzkey.pem |
| namenode_count | Namenode count | 1 |
| datanode_count | Datanode count | 3 |
| ips | Default private IPs used for nodes | See variables.tf |
| hostnames | Default private hostnames used for nodes | See variables.tf |

Software version

  • Default AMI image: ami-0885b1f6bd170450c (Ubuntu 20.04, amd64, hvm-ssd)
  • Spark: 3.0.1
  • Hadoop: 2.7.7
  • Python: latest available (currently 3.8)
  • Java: OpenJDK 8u275 (JDK)
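
You can double-check these versions on a running node with standard commands (a quick sketch; it assumes the $HADOOP_HOME variable and the Spark install path used later in this README):

# Hadoop, Java, Python and Spark versions (run on any node once the cluster is up)
$HADOOP_HOME/bin/hadoop version
java -version
python3 --version
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --version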

Project Structure

  • app/: folder where you can put your application; it will be copied to the namenode
  • install-all.sh: script executed on every node; it installs Hadoop/Spark and does all the configuration for you
  • main.tf: definition of the resources
  • output.tf: terraform output declaration
  • variables.tf: terraform variable declaration

How to

  1. Download and install Terraform
  2. Download the project and unzip it
  3. Open the terraform project folder "spark-terraform-master/"
  4. Create a file named "terraform.tfvars" and paste this:
access_key="<YOUR AWS ACCESS KEY>"
secret_key="<YOUR AWS SECRET KEY>"
token="<YOUR AWS TOKEN>"

Note: if you don't set the other variables (you can find them in variables.tf), Terraform will create a cluster in region "us-east-1", with 1 namenode, 3 datanodes and instance type m5.xlarge.
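
If you want different defaults, you can override those variables in the same terraform.tfvars, for example (a sketch; the variable names come from the table above, the values are only examples and must match the types declared in variables.tf):

region="eu-west-1"
instance_type="m5.large"
datanode_count=5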

  5. Put your application files into the "app" terraform project folder
  6. Open a terminal and generate a new ssh-key
ssh-keygen -f <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/localkey

Where <PATH_TO_SPARK_TERRAFORM> is the path to the folder that contains spark-terraform-master/ (e.g. /home/user/)

  7. Log in to AWS and create a key pair named amzkey in PEM file format. Follow the guide on AWS DOCS. Download the key and put it in the spark-terraform-master/ folder.
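
Alternatively, if you already have the AWS CLI configured, the same key pair can be created from the terminal (a sketch equivalent to the console procedure; it writes the private key to amzkey.pem in the current folder):

# create the key pair on AWS and save the private key locally
aws ec2 create-key-pair --key-name amzkey --query 'KeyMaterial' --output text > amzkey.pem
chmod 400 amzkey.pem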

  8. Open a terminal, go to the spark-terraform-master/ folder and execute the commands

terraform init
terraform apply

After a while (be patient!) it should print some public DNS names in green; these are the public DNS names of your instances.
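
If you need those DNS names again later, Terraform can re-print the declared outputs (see output.tf) without re-applying:

terraform output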

  9. Connect via SSH to each of your instances:
ssh -i <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/amzkey.pem ubuntu@<PUBLIC DNS>
  10. Execute on the master (one by one):
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077
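
To check that everything started, you can run jps on the master: you should see Java processes such as NameNode, ResourceManager, JobHistoryServer and the Spark Master (the exact list depends on the configuration; datanodes run DataNode, NodeManager and Worker instead):

jps
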
  11. You are ready to execute your app! Execute this command on the master:
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://s01:7077  --executor-cores 2 --executor-memory 14g yourfile.py
  12. Remember to run terraform destroy to delete your EC2 instances when you are done
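
When you are done, run it from the same spark-terraform-master/ folder:

terraform destroy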

Note: Steps 1 to 6 (included) are needed only on the first execution.

See also