spark-hadoop_configuration

A guide to automating and simplifying the installation of Spark and Hadoop libraries for managing and using files in a cluster.

Before Start

  • This installation is intended for a native AWS machine running Ubuntu.
  • A static public IP and DNS are used, so an Elastic IP is required (a fee may apply).
  • This guide can also be used on other machines, even ones with dynamic IPs, but some precautions must be taken during installation.

Thanks to sedaatalay for the original guide.

Download and Install

Clone the repo:

$ git clone https://github.com/TheMastro-11/spark-hadoop_configuration.git

Select the command according to the role of the node:

$ sudo bash spark-hadoop_configuration/Master/installMaster.sh

or:

$ sudo bash spark-hadoop_configuration/Slave/installSlave.sh

Update the environment variables:

$ source /home/ubuntu/.bashrc
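
To confirm the variables were picked up, you can check that HADOOP_HOME is set (the commands that follow rely on it, assuming the install script exports it):

$ echo $HADOOP_HOME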

Give the ubuntu user ownership of the Hadoop directory:

$ sudo chown -R ubuntu $HADOOP_HOME

Delete the installation file:

$ rm installMaster.sh

or:

$ rm installSlave.sh

SSH (Master)

First, from your local PC, transfer the [name].pem key to the remote instance:

$ scp path/to/[name].pem ubuntu@[dns]:/home/ubuntu/.ssh

Then, on the instance, rename the key:

$ mv /home/ubuntu/.ssh/[name].pem /home/ubuntu/.ssh/my-key.pem
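
SSH will refuse a private key with open permissions, so it is usually worth restricting it (not part of the original scripts, but a common extra step):

$ chmod 600 /home/ubuntu/.ssh/my-key.pem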

FOR EACH NEW DATANODE ADDED:

You have to configure the SSH connection:

$ nano /home/ubuntu/.ssh/config

It must look like this:

Host namenode
    HostName <publicdns>
    User ubuntu
    IdentityFile /home/ubuntu/.ssh/my-key.pem

Host datanode1
    HostName <publicdns>
    User ubuntu
    IdentityFile /home/ubuntu/.ssh/my-key.pem

Host datanode2
    HostName <publicdns>
    User ubuntu
    IdentityFile /home/ubuntu/.ssh/my-key.pem
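
With these aliases in place you can quickly verify each connection, for example:

$ ssh datanode1 'hostname'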

Update the slaves file with the new datanodes:

$ sudo nano $HADOOP_CONF_DIR/slaves
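
As a sketch, assuming the datanodes are reachable through the aliases defined in the SSH config above, the file simply lists them one per line:

datanode1
datanode2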

Transfer the RSA public key to the slave:

$ ssh datanode2 'cat >> /home/ubuntu/.ssh/authorized_keys '</home/ubuntu/.ssh/id_rsa.pub
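
Repeat the command above for every datanode alias. If the master does not yet have an RSA key pair to share, generate one first (a standard step, not part of the install scripts):

$ ssh-keygen -t rsa -N "" -f /home/ubuntu/.ssh/id_rsa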

Run these commands to reset the configuration before starting the cluster with the new datanodes. On the master:

$ bash resetHadoop.sh

On slaves:

$ sudo rm -rf $HADOOP_HOME/data/hdfs/*

START

Hadoop

USE THESE COMMANDS FROM THE MASTER.

To start Hadoop for the first time, you need to format the namenode:

$ hdfs namenode -format

Then simply run this script, even on subsequent reboots:

$ bash startHadoop.sh

To check that everything started correctly, run:

$ hdfs dfsadmin -report

and you will see how many datanodes are online.
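
As an additional check, jps on each node should list the expected Hadoop daemons (e.g. NameNode on the master, DataNode on each slave):

$ jps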

To stop:

$ bash stopHadoop.sh

Spark

To start Spark, launch this from the master:

$ ./spark/sbin/start-master.sh

If it has started up correctly, you can view the web UI at http://IPMASTERPUBLIC:8080
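
If the page does not load, make sure the instance's security group allows inbound traffic on port 8080 (and 7077 for the workers). A quick check from the master itself:

$ curl -I http://localhost:8080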

Then use this command to start the workers (replace ADDRESS with the spark://privateaddress:7077 URL shown on the web page above):

$ ./spark/sbin/start-slave.sh ADDRESS

To submit a job, use this (replace ADDRESS with the same value as above and PATHFILE with the file you want to run):

$ ./spark/bin/spark-submit --master ADDRESS PATHFILE
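
As a concrete sketch, the examples bundled with Spark make a good first test (the exact jar name depends on the Spark and Scala versions installed by the script):

$ ./spark/bin/spark-submit --master ADDRESS --class org.apache.spark.examples.SparkPi ./spark/examples/jars/spark-examples_*.jar 100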

WARNING
Python files are not supported in cluster mode, so only the master's local cores will be used.
For more information about spark-submit, see the Official Documentation.
