Event specific cluster setup and job information
These instructions are meant to be used on the day of the HackReduce event. The servers will not be accessible except at the venue.
Placeholders such as {CLUSTER_NUMBER} and {team name} should be replaced with the values assigned to your team at the event.
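To avoid retyping the placeholders in every command below, you can capture them in shell variables. A minimal sketch, with illustrative values only (use the cluster number and team name assigned to you at the event):

```shell
# Illustrative values -- substitute the ones assigned at the event.
CLUSTER_NUMBER=3
TEAM="hopper"
KEY=~/.ssh/hackreduce-cambridge.pem
MASTER="cluster-${CLUSTER_NUMBER}-master.gg.hackreduce.net"

# Later commands can then reuse them, e.g.:
echo "ssh -i $KEY hackreduce@$MASTER"
```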
git clone https://github.com/hackreduce/Hackathon.git
cd ~/.ssh
Obtain the key:
- OSX:
curl -O http://manager.hackreduce.org/hackreduce.tar
- Linux:
wget http://manager.hackreduce.org/hackreduce.tar
Extract the archive:
tar xvf hackreduce.tar
Restrict the key's permissions (ssh refuses private keys that other users can read):
chmod 600 hackreduce-cambridge.pem
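The chmod 600 step matters because ssh rejects a private key whose permissions allow access by other users. A quick local demonstration on a throwaway file (the filename is just for the demo):

```shell
# Create a throwaway file and restrict it the same way as the key.
touch demo.pem
chmod 600 demo.pem

# Mode 600 = owner read/write only; ls shows it as -rw-------.
ls -l demo.pem
```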
The team folders will be used for storing your code and data on the cluster's master node.
Log onto the cluster:
ssh -i ~/.ssh/hackreduce-cambridge.pem hackreduce@cluster-{CLUSTER_NUMBER}-master.gg.hackreduce.net
Create the code folder:
mkdir -p ~/users/{team name}
This is where you will store all of your team's files.
Starting on your local system:
cd {HackReduce project}
Compile your code with the following commands depending on whether you're using Gradle or Ant:
- Gradle:
gradle
- Ant:
ant
Copy your jar to the cluster's master node:
scp -i ~/.ssh/hackreduce-cambridge.pem build/libs/{HackReduce custom}.jar hackreduce@cluster-{CLUSTER_NUMBER}-master.gg.hackreduce.net:~/users/{team name}/
Log onto the cluster:
ssh -i ~/.ssh/hackreduce-cambridge.pem hackreduce@cluster-{CLUSTER_NUMBER}-master.gg.hackreduce.net
Launch your job:
hadoop jar ~/users/{team name}/{HackReduce custom}.jar {Java job class} /datasets/{dataset chosen} /users/{team name}/job/
e.g. hadoop jar ~/users/hopper/myjar.jar org.hackreduce.examples.bixi.RecordCounter /datasets/bixi-montreal-2011/bixi.xml /users/hopper/bixi_recordcounts
Track the progress of your job on the Hadoop MapReduce job tracker:
http://cluster-{CLUSTER_NUMBER}-master.gg.hackreduce.net:50030
When the job is finished, you can download the output from HDFS into the local file system:
hadoop dfs -copyToLocal /users/{team name}/job ~/users/{team name}/
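The output directory of a finished job contains one part-NNNNN file per reducer. The sketch below fakes that layout locally, just to show how you would concatenate the results after copying them out of HDFS; the directory name and record contents are made up:

```shell
# Simulate the directory layout that hadoop dfs -copyToLocal produces
# (one part-NNNNN file per reducer; contents here are fabricated).
mkdir -p job
printf 'recordcount\t42\n' > job/part-00000
printf 'recordcount\t17\n' > job/part-00001

# Concatenate all reducer outputs to see the full result set.
cat job/part-*
```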
Simply pressing CTRL+C in the command line won't kill the running job in MapReduce. Follow these steps to kill the job:
Find the job ID of your MapReduce job (shown on the job tracker page), e.g. job_201108131339_0001
Log onto the cluster:
ssh -i ~/.ssh/hackreduce-cambridge.pem hackreduce@cluster-{CLUSTER_NUMBER}-master.gg.hackreduce.net
Kill the job with the hadoop command line utility:
hadoop job -kill job_201108131339_0001
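If several jobs are running, `hadoop job -list` prints one line per job with the job ID in the first column (0.20-era CLI output; the sample line below is fabricated). The ID can be pulled out with awk:

```shell
# Fabricated sample of one 'hadoop job -list' output line.
sample='job_201108131339_0001   RUNNING   hackreduce'

# The job ID is the first whitespace-separated field.
echo "$sample" | awk '{print $1}'
```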
To browse the files stored in HDFS through the NameNode web interface:
Visit http://cluster-{CLUSTER_NUMBER}-master.gg.hackreduce.net:50070
Go to "Browse the filesystem"
Alternatively, from the command line: log onto your namenode (cluster-{CLUSTER_NUMBER}-master.gg.hackreduce.net)
Run hadoop dfs with no arguments to see the list of available filesystem commands (e.g. -ls, -cat, -copyToLocal):
hadoop dfs
The number of reducers used by a job needs to be defined manually by one of the following methods:
- Java: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)
- Streaming: http://hadoop.apache.org/common/docs/current/streaming.html#Specifying+the+Number+of+Reducers
- More information: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Reducer (see the "How Many Reduces?" section)
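As a sketch of both methods for a 0.20-era cluster: in Java you would call job.setNumReduceTasks(n) in your driver before submitting, and on the command line you can pass the mapred.reduce.tasks property with -D, provided your job class parses generic options via ToolRunner. This is illustrative configuration, not verified against the event cluster; placeholders are as above:

```shell
# Illustrative: request 4 reducers via a generic option (requires the
# job's main class to use ToolRunner/GenericOptionsParser).
hadoop jar ~/users/{team name}/{HackReduce custom}.jar {Java job class} \
    -D mapred.reduce.tasks=4 \
    /datasets/{dataset chosen} /users/{team name}/job/
```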