Example Hadoop jobs demonstrating the Elasticrawl tool. Elasticrawl launches AWS Elastic MapReduce jobs against the Common Crawl corpus.
- WordCount - An implementation of the standard Hadoop Word Count example that parses text data in Common Crawl WET (WARC Encoded Text) files. Each WordCount job parses a single segment of Common Crawl data.
- SegmentCombiner - Combines data from multiple Common Crawl segments to produce a single set of results.
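The two jobs above can be sketched in plain Java. This is a minimal, in-memory illustration only: the class and method names here are hypothetical, and the real jobs are Hadoop Mapper/Reducer implementations where `count` corresponds to the map/reduce phases of WordCount and `combine` to the SegmentCombiner merging per-segment results.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the logic behind the two example jobs.
// In the real WordCount job a Mapper emits (word, 1) pairs and a
// Reducer sums them; here both phases are collapsed into one method.
public class WordCountSketch {

    // Tokenize a block of extracted text and count word frequencies.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Merge per-segment results into a single set, as the
    // SegmentCombiner job does across multiple Common Crawl segments.
    public static Map<String, Integer> combine(Map<String, Integer> a,
                                               Map<String, Integer> b) {
        Map<String, Integer> merged = new HashMap<>(a);
        b.forEach((word, n) -> merged.merge(word, n, Integer::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> seg1 = count("hello common crawl hello");
        Map<String, Integer> seg2 = count("hello world");
        // "hello" appears 3 times across both segments
        System.out.println(combine(seg1, seg2));
    }
}
```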
See the Quick Start guide at http://github.com/rossf7/elasticrawl#quick-start
Developed on Ubuntu 12.04 and OpenJDK 6 using Eclipse Kepler and the m2e plugin.
To build from the command line:

git clone https://github.com/rossf7/elasticrawl-examples.git
cd elasticrawl-examples
mvn install
To develop in Eclipse, clone into your workspace and import as a Maven project:

cd ~/workspace
git clone https://github.com/rossf7/elasticrawl-examples.git

- Open Eclipse
- File --> Import
- Maven --> Existing Maven Project
- Run As --> Maven install
Thanks to:

- Mark Watson for his example-warc-java, which got me started with WARC files.
- The Lemur Project developers for their edu.cmu.lemurproject package. Its source is included here with a couple of minor changes needed to process WET files stored on S3.
This code is licensed under the MIT license.