This repository contains the tools needed to manage the operations of Common Search. A demo is currently hosted at uidemo.commonsearch.org.
Help is welcome! We have a complete guide on how to contribute.
We have early documentation available for operations.
In a nutshell, two components are managed from this repository:
- Our Elasticsearch cluster, using AWS CloudFormation.
- The Spark cluster for our Backend, using Flintrock.
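As an example, once the Elasticsearch stack has been created, its status can be checked with a few lines of boto3. This is only a sketch: it assumes the `configs/cosr-ops.prod.json` file described below and is not part of the cosr-ops tooling itself.

```python
# Illustrative only: check the status of the Elasticsearch CloudFormation stack
# using the AWS_STACKNAME / AWS_REGION values from configs/cosr-ops.prod.json.
import json

import boto3


def stack_status(config_path="configs/cosr-ops.prod.json"):
    with open(config_path) as f:
        config = json.load(f)

    cloudformation = boto3.client("cloudformation", region_name=config["AWS_REGION"])
    stacks = cloudformation.describe_stacks(StackName=config["AWS_STACKNAME"])["Stacks"]
    return stacks[0]["StackStatus"]


if __name__ == "__main__":
    print(stack_status())  # e.g. CREATE_COMPLETE or UPDATE_COMPLETE
```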
Here is how they fit in our current infrastructure:
A complete guide is available in INSTALL.md.
We have a first tutorial online:
- On the Spark workers, the bottleneck is the CPU, so all cores should be at 100% utilization at all times.
- The average CPU time on an EC2 c4 core is 17 minutes per ~1 GB Common Crawl file.
- The June 2016 crawl has ~20,000 of them, so you need ~5,500 core-hours.
- Spot prices can reach as low as $0.01/hour per core, so the whole job can be done for less than $60 (see the sketch below).
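A minimal back-of-the-envelope calculation, using only the rough figures above (these are estimates, not measurements):

```python
# Back-of-the-envelope cost estimate for processing a full Common Crawl dump.
CPU_MINUTES_PER_FILE = 17        # average per ~1 GB Common Crawl file on a c4 core
FILES_IN_CRAWL = 20_000          # approximate size of the June 2016 crawl
SPOT_PRICE_PER_CORE_HOUR = 0.01  # USD, a low observed spot price

core_hours = FILES_IN_CRAWL * CPU_MINUTES_PER_FILE / 60
total_cost = core_hours * SPOT_PRICE_PER_CORE_HOUR

print(f"~{core_hours:,.0f} core-hours, ~${total_cost:,.0f} at spot prices")
# => ~5,667 core-hours, ~$57 at spot prices
```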
- In progress.
- In progress.
You will need to create a `configs/cosr-ops.prod.json` file with the following template:
```json
{
    "AWS_STACKNAME": "mystack",
    "AWS_REGION": "us-east-1",
    "AWS_ZONE": "us-east-1a",
    "AWS_SUBNET": "subnet-xxxxxx",
    "AWS_VPC": "vpc-xxxxxxx",
    "AWS_SECURITYGROUP": "sg-xxxxxxx",
    "AWS_KEYNAME": "mykeyname",
    "AWS_USER": "root",
    "AWS_SPARK_AMI": "ami-668dba0c",
    "AWS_SPARK_SPOTBID": "0.1",
    "AWS_SPARK_INSTANCETYPE_MASTER": "c4.xlarge",
    "AWS_SPARK_INSTANCETYPE_WORKER": "c4.xlarge",
    "AWS_SPARK_WORKER_COUNT": 15,
    "AWS_SPARK_VERSION": "1.6.0",
    "AWS_SPARK_PLACEMENTGROUP": "myplacementgroup",
    "SPARK_PATH": "../spark-1.6.0"
}
```
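As a quick sanity check before launching anything, you can load the file and verify that the expected keys are present. This snippet is only illustrative and is not part of the cosr-ops commands:

```python
# Illustrative sanity check for configs/cosr-ops.prod.json; not part of cosr-ops itself.
import json

REQUIRED_KEYS = {
    "AWS_STACKNAME", "AWS_REGION", "AWS_ZONE", "AWS_SUBNET", "AWS_VPC",
    "AWS_SECURITYGROUP", "AWS_KEYNAME", "AWS_USER", "AWS_SPARK_AMI",
    "AWS_SPARK_SPOTBID", "AWS_SPARK_INSTANCETYPE_MASTER",
    "AWS_SPARK_INSTANCETYPE_WORKER", "AWS_SPARK_WORKER_COUNT",
    "AWS_SPARK_VERSION", "AWS_SPARK_PLACEMENTGROUP", "SPARK_PATH",
}

with open("configs/cosr-ops.prod.json") as f:
    config = json.load(f)

missing = REQUIRED_KEYS - config.keys()
if missing:
    raise SystemExit("Missing keys: %s" % ", ".join(sorted(missing)))

print("Config looks complete: %d keys found" % len(config))
```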