Command line tool for monitoring Amazon Elastic MapReduce (Amazon EMR) jobflows and analyzing past jobflows.
- Amazon EMR via Amazon Elastic MapReduce Ruby Client to get the description of a jobflow:
```bash
$ elastic-mapreduce --describe …
```
- Amazon EC2 via Amazon EC2 API Tools to retrieve the spot instance price history:
```bash
$ ec2-describe-spot-price-history …
```
- Amazon S3 via S3cmd to get the size of both input and output files, to retrieve potential errors and to get the log summary:
```bash
$ s3cmd ls <input|output>
$ s3cmd get s3://…/steps/…/stderr
$ s3cmd get s3://…/jobs/job_…
```
- Amazon Elastic MapReduce Pricing of On-Demand instances via this URL and its underlying JSON service.
- Hadoop JobTracker running on the master node and accessed through an automatic SSH tunnel:
```bash
$ ssh -N -L 12345:localhost:9100 hadoop@ …
$ wget http://localhost:12345/jobtracker.jsp
```
- Additionally, EMR Monitoring computes elapsed times between various events and estimates the jobflow's total cost.
An animation is better than a thousand words:
Result with a completed jobflow (click for full resolution image):
- The ask price for spot instances comes in real time from EC2 API Tools.
- The total price in the general section is the sum of the prices of each instance group, i.e. for each group: `<instance-price> × <number-of-instances> × ceil(<number-of-hours>)`.
- Elapsed times in gray measure the time elapsed between the initialization and start dates of an instance/step, and between its start and end dates.
- When the start date or end date is unknown, elapsed times are computed from the local time and a `≈` sign is added.
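The per-group formula above can be illustrated with a small shell sketch. This is not the code used by EMR Monitoring; the function names `group_cost` and `total_cost` are hypothetical, the durations are given in seconds by assumption, and the prices in the comments are placeholder values only.

```bash
# Sketch of the cost formula, per instance group:
#   <instance-price> x <number-of-instances> x ceil(<number-of-hours>)
# Function names and input formats are assumptions made for this example.

# group_cost PRICE_PER_HOUR INSTANCE_COUNT ELAPSED_SECONDS
# Prints the cost of one instance group; every started hour counts as a full hour.
group_cost() {
    LC_ALL=C awk -v price="$1" -v count="$2" -v secs="$3" 'BEGIN {
        hours = int(secs / 3600)
        if (secs % 3600 > 0) hours++    # ceil(): round a partial hour up
        printf "%.3f\n", price * count * hours
    }'
}

# total_cost
# Reads one "PRICE_PER_HOUR INSTANCE_COUNT ELAPSED_SECONDS" line per instance
# group on stdin and prints the sum of the group costs.
total_cost() {
    LC_ALL=C awk '{
        hours = int($3 / 3600)
        if ($3 % 3600 > 0) hours++
        total += $1 * $2 * hours
    } END { printf "%.3f\n", total }'
}

# Example with placeholder prices, for groups that ran 1 h 15 min (4500 s):
#   group_cost 0.015 90 4500                                            # prints 2.700
#   printf '0.12 1 4500\n0.12 5 4500\n0.015 90 4500\n' | total_cost     # prints 4.140
```

Rounding the duration up before multiplying is what makes a group that ran for 1 h 15 min cost the same as one that ran for 2 h.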
Completion percentages are computed from Hadoop JobTracker data and are NOT the number of remaining tasks divided by the number of completed tasks.
Error messages, if any, are always displayed:
A task timeline covering all jobs of an in-progress or past jobflow is generated via gnuplot; it details the number of mapper, shuffle, merge and reducer tasks.
Animation from generated task timelines throughout jobflow run:
Create a folder, e.g. `/usr/local/lib/emr-monitoring`, and `cd` into it.
Then clone the repository (the folder must be empty!):
$ git clone git://github.com/Hi-Media/EmrMonitoring.git .
Initialize the configuration file from `conf/config-dist.php` and adapt it:
$ cp '/usr/local/lib/emr-monitoring/conf/config-dist.php' '/usr/local/lib/emr-monitoring/conf/config.php'
All dependencies are checked at launch and EMR Monitoring systematically helps to resolve them.
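If you want to verify the external tools yourself before the first launch, a pre-flight check along these lines may help. It is only a sketch and not the check performed by EMR Monitoring itself; the function name `check_deps` is hypothetical, and the list of commands to pass is up to you.

```bash
# check_deps CMD...
# Reports on stderr every command that cannot be found on the PATH and
# returns non-zero if at least one of them is missing.
check_deps() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (assumed tool list, adapt to your setup):
#   check_deps php ruby ssh wget s3cmd gnuplot || echo "install the missing tools first"
```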
PHP class autoloading and PHP dependencies are managed by composer.
To set up the project dependencies with composer, run one of the following commands:
$ composer install
# or
$ php composer.phar install
If needed, to install composer locally, run one of the following commands:
$ curl -sS https://getcomposer.org/installer | php
# or
$ wget --no-check-certificate -q -O- https://getcomposer.org/installer | php
Read http://getcomposer.org/doc/00-intro.md#installation-nix for more information.
Amazon Elastic MapReduce Ruby Client is needed to get the description of a jobflow.
To install Amazon EMR Command Line Interface:
$ sudo apt-get install ruby-full
$ mkdir /usr/local/lib/elastic-mapreduce-cli
$ wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
$ unzip -d /usr/local/lib/elastic-mapreduce-cli elastic-mapreduce-ruby.zip
Create a file named `/usr/local/lib/elastic-mapreduce-cli/credentials.json` with at least the following lines:
{
"keypair": "Your key pair name",
"key-pair-file": "The path and name of your PEM/private key file"
}
If necessary, adapt the `emr_cli_bin`, `aws_access_key` and `aws_secret_key` keys of `$aConfig['Himedia\EMR']` in `conf/config.php`.
Read http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-install.html for more information.
Amazon EC2 API Tools are used to retrieve the spot instance price history.
To install Amazon EC2 API Tools:
$ wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip
$ unzip -d /usr/local/lib ec2-api-tools.zip
If necessary, adapt the `ec2_api_tools_dir`, `aws_access_key` and `aws_secret_key` keys of `$aConfig['Himedia\EMR']` in `conf/config.php`.
Set and export both the `JAVA_HOME` and `EC2_HOME` environment variables.
For example, include these commands in your `~/.bashrc` and reload it:
export JAVA_HOME=/usr
export EC2_HOME=/usr/local/lib/ec2-api-tools-1.6.7.2
Read http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/setting_up_ec2_command_linux.html for more information.
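After reloading your shell configuration, you may want to confirm that both variables are really exported and point to existing directories. The following is a minimal sketch; the function name `check_ec2_env` is hypothetical and is not part of EC2 API Tools or EMR Monitoring.

```bash
# check_ec2_env
# Returns non-zero (with a message on stderr) when JAVA_HOME or EC2_HOME
# is unset, empty, or does not name an existing directory.
check_ec2_env() {
    local status=0
    if [ -z "${JAVA_HOME:-}" ] || [ ! -d "$JAVA_HOME" ]; then
        echo "JAVA_HOME is unset or not a directory" >&2
        status=1
    fi
    if [ -z "${EC2_HOME:-}" ] || [ ! -d "$EC2_HOME" ]; then
        echo "EC2_HOME is unset or not a directory" >&2
        status=1
    fi
    return "$status"
}

# Example:
#   check_ec2_env && echo "environment looks fine"
```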
S3cmd is required to get the size of both input and output files, to retrieve potential errors and to get the log summary.
Please run:
$ sudo apt-get install s3cmd
$ s3cmd --configure
Read http://s3tools.org/s3cmd for more information.
Task timelines are generated via gnuplot for in-progress or past jobflows and detail the number of mapper, shuffle, merge and reducer tasks.
$ sudo apt-get install gnuplot
You can view the options by running:
$ src/emr-monitoring.php [-h|--help]
Usage
emr_monitoring.php [OPTION]…
Options
-h, --help
Display this help.
-l, --list-all-jobflows
List all jobflows in the last 2 weeks.
-j, --jobflow-id <jobflowid>
Display statistics on any <jobflowid>, finished or in progress.
⇒ to monitor a jobflow in real-time: watch -n10 --color emr_monitoring.php -j <jobflowid>
--list-input-files
With -j, list all S3 input files really loaded by Hadoop instance of the completed <jobflowid>.
-p, --ssh-tunnel-port <port>
With -j, specify the <port> used to establish a connection to the master node and retrieve data
from the Hadoop jobtracker.
-d, --debug
Enable debug mode and list all shell commands.
Simply:
$ src/emr-monitoring.php -j <jobflowid>
1. Launch a jobflow using Amazon Elastic MapReduce:
```bash
$ /usr/local/lib/elastic-mapreduce-cli/elastic-mapreduce \
    --region us-east-1 --log-uri s3n://path/to/hadoop-logs \
    --create --name my-name --visible-to-all-users --enable-debugging \
    --pig-script s3://path/to/script.pig \
    --args "-p,INPUT=s3://path/to/input" \
    --args "-p,OUTPUT=s3://path/to/output" \
    --args … \
    --instance-group master --instance-type m1.medium --instance-count 1 \
    --instance-group core --instance-type m1.medium --instance-count 5 \
    --instance-group task --instance-type m1.medium --instance-count 90 --bid-price 0.015
```
2. You can see it in the list of all jobflows:
```bash
$ src/emr-monitoring.php -l
```
![All jobflows](doc/images/list-all-jobflows.png "All jobflows")
3. Start monitoring the jobflow:
```bash
$ watch -n15 --color src/emr-monitoring.php -j j-88OW7Z7O3T9H
```
You can easily view the task timeline with, for example, [Eye of Gnome](http://projects.gnome.org/eog/):
```bash
$ eog &
```
API documentation is generated by ApiGen and included in the `doc/api` folder.
Licensed under the Apache License 2.0. See LICENSE file for details.
See CHANGELOG file for details.
The git branching model used for development is the one described and assisted by the twgit tool: https://github.com/Twenga/twgit.