- Command line tool for launching Hadoop jobs using AWS EMR (Elastic MapReduce) to process Common Crawl data.
- Elasticrawl can be used with crawl data from April 2014 onwards.
- A list of crawls released by Common Crawl is maintained on the wiki.
- Common Crawl announce new crawls on their blog.
- Ships with a default configuration that launches the elasticrawl-examples jobs. This is an implementation of the standard Hadoop Word Count example.
This blog post has a walkthrough of running the example jobs on the November 2014 crawl.
- Elasticrawl needs a Ruby installation (2.1 or higher).
- Install the Elasticrawl gem from RubyGems.
gem install elasticrawl --no-document
If you get the error "EMR service role arn:aws:iam::156793023547:role/EMR_DefaultRole is invalid" when launching a cluster, your AWS account does not have the necessary IAM roles. To fix this, install the AWS CLI and run the command below.
aws emr create-default-roles
The init command takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created and will store your data and logs.
~$ elasticrawl init your-s3-bucket
Enter AWS Access Key ID: ************
Enter AWS Secret Access Key: ************
...
Bucket s3://elasticrawl-test created
Config dir /Users/ross/.elasticrawl created
Config complete
The parse command takes the crawl name and, optionally, the maximum number of segments and the maximum number of files per segment to parse.
~$ elasticrawl parse CC-MAIN-2015-48 --max-segments 2 --max-files 3
Segments
Segment: 1416400372202.67 Files: 150
Segment: 1416400372490.23 Files: 124
Job configuration
Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment
Cluster configuration
Master: 1 m1.medium (Spot: 0.12)
Core: 2 m1.medium (Spot: 0.12)
Task: --
Launch job? (y/n)
y
Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
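As a rough sizing check before answering the launch prompt, the number of files a parse job reads can be estimated from the segment file counts and the two limits. The sketch below is only an illustration: `files_to_parse` is a hypothetical helper, not part of Elasticrawl, and it assumes that the segment limit keeps the first N segments and that the file limit applies per segment, as the "3 files per segment" line suggests.

```ruby
# Hypothetical helper (not part of Elasticrawl): estimates how many files a
# parse job reads. Assumes max_segments keeps the first N segments and
# max_files is applied to each segment separately.
def files_to_parse(segment_file_counts, max_segments: nil, max_files: nil)
  segments = max_segments ? segment_file_counts.first(max_segments) : segment_file_counts
  per_segment = segments.map { |count| max_files ? [count, max_files].min : count }
  per_segment.inject(0, :+)
end

counts = [150, 124] # file counts from the two segments listed above

puts files_to_parse(counts, max_segments: 2, max_files: 3) # prints 6
puts files_to_parse(counts)                                 # prints 274
```

With the limits from the example command, the job reads 6 files instead of all 274, which keeps a first test run small.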
The combine command takes in the results of previous parse jobs and produces a combined set of results.
~$ elasticrawl combine --input-jobs 1420124830792
Job configuration
Combining: 2 segments
Cluster configuration
Master: 1 m1.medium (Spot: 0.12)
Core: 2 m1.medium (Spot: 0.12)
Task: --
Launch job? (y/n)
y
Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
The status command shows crawls and your job history.
~$ elasticrawl status
Crawl Status
CC-MAIN-2015-48 Segments: to parse 98, parsed 2, total 100
Job History (last 10)
1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment
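In the sample output, the job ID lines up with the launch time if it is read as a Unix timestamp in milliseconds. This is an inference from the output shown here, not documented behaviour, so treat the sketch below as an assumption; `job_launch_time` is a hypothetical helper.

```ruby
# Assumption: job IDs are Unix epoch timestamps in milliseconds. This is
# inferred from the sample output, not from Elasticrawl's documentation.
def job_launch_time(job_id)
  # Drop the millisecond part and interpret the rest as seconds since epoch.
  Time.at(job_id.to_i / 1000).utc
end

puts job_launch_time('1420124830792').strftime('%Y-%m-%d %H:%M:%S')
# prints 2015-01-01 15:07:10
```

If the assumption holds, sorting job IDs numerically also sorts jobs by launch time.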
The reset command resets a crawl so it is parsed again.
~$ elasticrawl reset CC-MAIN-2015-48
Reset crawl? (y/n)
y
CC-MAIN-2015-48 Segments: to parse 100, parsed 0, total 100
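A script that wraps the CLI may want to confirm that a reset took effect by reading the crawl status line. The sketch below parses a line shaped like the sample output; the pattern is derived from that sample only, so the real output may differ in spacing or wording, and `parse_status_line` is a hypothetical helper.

```ruby
# Pattern derived from the sample status line above; the real output format
# may differ between Elasticrawl versions.
STATUS_LINE = /\A(?<crawl>\S+)\s+Segments:\s+to parse\s+(?<to_parse>\d+),\s+parsed\s+(?<parsed>\d+),\s+total\s+(?<total>\d+)\z/

# Hypothetical helper: returns a hash of counts, or nil if the line does not match.
def parse_status_line(line)
  match = STATUS_LINE.match(line.strip)
  return nil unless match

  {
    crawl: match[:crawl],
    to_parse: match[:to_parse].to_i,
    parsed: match[:parsed].to_i,
    total: match[:total].to_i
  }
end

status = parse_status_line('CC-MAIN-2015-48 Segments: to parse 100, parsed 0, total 100')
puts status[:to_parse] == status[:total] ? 'crawl fully reset' : 'some segments still marked as parsed'
```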
The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.
~$ elasticrawl destroy
WARNING:
Bucket s3://elasticrawl-test and its data will be deleted
Config dir /home/vagrant/.elasticrawl will be deleted
Delete? (y/n)
y
Bucket s3://elasticrawl-test deleted
Config dir /home/vagrant/.elasticrawl deleted
Config deleted
The elasticrawl init command creates the ~/.elasticrawl/ directory, which contains:
- aws.yml - stores your AWS access credentials. Alternatively, you can set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
- cluster.yml - configures the EC2 instances that are launched to form your EMR cluster.
- jobs.yml - stores your S3 bucket name and the config for the parse and combine jobs.
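If you rely on the environment variables rather than aws.yml, a quick pre-flight check can catch a missing value before a cluster launch fails. The following is a minimal sketch; `missing_aws_vars` is a hypothetical helper and does not reflect how Elasticrawl itself resolves credentials.

```ruby
# Names of the environment variables mentioned above.
REQUIRED_AWS_VARS = %w[AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY].freeze

# Hypothetical helper: returns the names of variables that are unset or blank.
# Accepts any hash-like object so it can be exercised without touching ENV.
def missing_aws_vars(env = ENV)
  REQUIRED_AWS_VARS.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_aws_vars
if missing.empty?
  puts 'AWS credentials found in the environment'
else
  puts "Not set: #{missing.join(', ')} (credentials must then come from aws.yml)"
end
```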
Elasticrawl is developed in Ruby and requires Ruby 2.1.0 or later (Ruby 2.3 is recommended). The sqlite3 and nokogiri gems have C extensions, which means you may need to install development headers.
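To check the version requirement from within Ruby, compare versions with `Gem::Version` rather than as strings, since a plain string comparison would rank "2.10.0" below "2.2.0". A minimal sketch, where `ruby_supported?` is a hypothetical helper and the minimum is the one stated above:

```ruby
# Minimum version taken from the requirement stated above.
MINIMUM_RUBY = Gem::Version.new('2.1.0')

# Hypothetical helper: true when the given version string meets the minimum.
def ruby_supported?(version = RUBY_VERSION)
  Gem::Version.new(version) >= MINIMUM_RUBY
end

puts "Ruby #{RUBY_VERSION} supported: #{ruby_supported?}"
```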
- Add support for Streaming and Pig jobs
- Thanks to everyone at Common Crawl for making this awesome dataset available!
- Thanks to Robert Slifka for the elasticity gem which provides a nice Ruby wrapper for the EMR REST API.
- Thanks to Phusion for creating Traveling Ruby.
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
This code is licensed under the MIT license.