egoEconomotrics

This project runs Apache Spark on a single-node Hadoop/YARN cluster.

Warning:

This install will modify your ~/.ssh folder, more specifically the .ssh/authorized_keys file.
It allows Hadoop to run the ssh localhost command using a passphrase-less DSA key.

  • Components:

    • Hadoop HDFS (Hadoop Distributed File System)
    • YARN (MapReduce 2.0)
    • Spark, a general engine for large-scale data processing
  • Languages:

    • Scala and Python

Prerequisites:

  • wget
  • java (JVM)
  • *nix - Darwin; Cygwin (not yet supported)
  • Python (to run the Python examples)
  • sbt (to run the Scala examples)

Running the install:

./bootstrap/install.sh

Hadoop NameNode Daemons

  • Set the Hadoop Home

HDFS_HOME=~/bin/local/bigdata/hadoop

  • Starting the services

$HDFS_HOME/sbin/start-dfs.sh

Note: On macOS, make sure SSH is enabled: System Preferences → Sharing → Remote Login [ON]

  • Checking that the services are running

jps

13049 NameNode (HDFS Name Node) -- Make sure this is running
13241 DataNode (HDFS Data Node)
22752 ResourceManager (Yarn Resource)
22894 NodeManager (Yarn Node)


Monitoring DFS Health

Browse the file system's health at:

http://localhost:50070

Yarn Daemons

  • Start ResourceManager daemon and NodeManager daemon:

$HDFS_HOME/sbin/start-yarn.sh

Monitoring Resource Manager

To view running or already executed jobs (a Jobwatch equivalent), browse to:

http://localhost:8088

Hadoop Distributed File System (Hadoop DFS) handling

  • Format a new Hadoop DFS

${HDFS_HOME}/bin/hdfs namenode -format

Note: You need to restart HDFS after formatting

  • Create a directory in Hadoop DFS

Create the /user directory along with the current user's home directory

${HDFS_HOME}/bin/hdfs dfs -mkdir -p /user/${USER}

Running the example

Package a jar containing your application

sbt package

... [success] Total time: ...
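For reference, a minimal build.sbt that could produce the Scala 2.10 artifact referenced below is sketched here. The name, version, and scalaVersion are taken from the jar filename; the Spark version (1.2.0) is an assumption, so match it to the Spark installed under $SPARK_HOME.

```scala
// build.sbt — a minimal sketch; the Spark version is an assumption
name := "egoeconometrics"

version := "0.1-SNAPSHOT"

scalaVersion := "2.10.4"

// "provided" because spark-submit supplies the Spark classes at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.2.0" % "provided"
)
```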

Set SPARK_HOME

SPARK_HOME=~/bin/local/bigdata/spark

Use spark-submit to run your application

${SPARK_HOME}/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/egoeconometrics_2.10-0.1-SNAPSHOT.jar

... Lines with a: 41, Lines with b: 17
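The output above matches the Spark quick-start line counter, so SimpleApp plausibly looks like the minimal sketch below; the body and the README.md input path are assumptions modeled on that quick-start example.

```scala
// SimpleApp.scala — a minimal sketch; the input path is an assumption
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Cache the file since it is scanned twice below
    val logData = sc.textFile("README.md").cache()

    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

    sc.stop()
  }
}
```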

Running an Interactive Shell

  • Scala

${SPARK_HOME}/bin/spark-shell

  • Python

${SPARK_HOME}/bin/pyspark --master local[4]
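Either shell starts with a preconfigured SparkContext bound to sc. A minimal Scala session, assuming a README.md file in the working directory:

```scala
// Inside spark-shell: sc is the preconfigured SparkContext
val lines = sc.textFile("README.md")       // assumed input file
lines.count()                              // total number of lines
lines.filter(_.contains("Spark")).count()  // lines mentioning "Spark"
```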

Running Spark Streaming

  • You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server:

nc -lk 9999

  • Then, in a different terminal, start the example:

${SPARK_HOME}/bin/spark-submit --class "QuickStreamingApp" --master local[4] target/scala-2.10/egoeconometrics_2.10-0.1-SNAPSHOT.jar localhost 9999
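QuickStreamingApp consumes the host and port passed on the command line (localhost 9999, matching the Netcat server above). A minimal sketch of such an app; the word-count body is an assumption modeled on the standard Spark streaming network example.

```scala
// QuickStreamingApp.scala — a minimal sketch; the word-count logic is an
// assumption based on the standard Spark streaming network example
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits (Spark 1.x)

object QuickStreamingApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("QuickStreamingApp")
    // Process the socket stream in 1-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(1))

    // args(0) = host, args(1) = port, e.g. localhost 9999
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val wordCounts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```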

Stopping the Services

When you're done, stop the daemons with:

$HDFS_HOME/sbin/stop-yarn.sh

$HDFS_HOME/sbin/stop-dfs.sh

AWS Ubuntu

On a memory-constrained AWS Ubuntu instance, create a 1 GB swap file:

sudo /bin/dd if=/dev/zero of=/var/swap.1 bs=1M count=1024
sudo /sbin/mkswap /var/swap.1
sudo /sbin/swapon /var/swap.1

To turn off the swap, do the following:

sudo /sbin/swapoff /var/swap.1

Recommended reading for AWS: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Known issues:

License

Copyleft © 2014 EgoOyiri [AfricaCoin]

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.