This repository contains the tools needed to manage the operations of Common Search. A demo is currently hosted at uidemo.commonsearch.org.
Help is welcome! We have a complete guide on how to contribute.
We have early documentation available for operations.
In a nutshell, two components are managed from this repository:
- Our Elasticsearch cluster, using AWS CloudFormation.
- The Spark cluster for our Backend, using Flintrock.
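As an example, once the Elasticsearch stack has been created, its status can be checked with a few lines of boto3. This is only a sketch: it assumes the `configs/cosr-ops.prod.json` file described below and is not part of the cosr-ops tooling itself.

```python
# Illustrative only: check the status of the Elasticsearch CloudFormation stack
# using the AWS_STACKNAME / AWS_REGION values from configs/cosr-ops.prod.json.
import json

import boto3


def stack_status(config_path="configs/cosr-ops.prod.json"):
    with open(config_path) as f:
        config = json.load(f)

    cloudformation = boto3.client("cloudformation", region_name=config["AWS_REGION"])
    stacks = cloudformation.describe_stacks(StackName=config["AWS_STACKNAME"])["Stacks"]
    return stacks[0]["StackStatus"]


if __name__ == "__main__":
    print(stack_status())  # e.g. CREATE_COMPLETE or UPDATE_COMPLETE
```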
Here is how they fit in our current infrastructure:
A complete guide is available in INSTALL.md.
We have a first tutorial online:
- On the Spark workers, the bottleneck is the CPU, so all cores should be at 100% utilization at all times.
- The average CPU time on an EC2 c4 core is 17 minutes per ~1 GB Common Crawl file.
- The June 2016 crawl has ~20,000 of them, so you need ~5,500 core-hours.
- Spot prices can reach as low as $0.01/hour per core, so the whole job can be done for less than $60 (see the sketch below).
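A minimal back-of-the-envelope calculation, using only the rough figures above (these are estimates, not measurements):

```python
# Back-of-the-envelope cost estimate for processing a full Common Crawl dump.
CPU_MINUTES_PER_FILE = 17        # average per ~1 GB Common Crawl file on a c4 core
FILES_IN_CRAWL = 20_000          # approximate size of the June 2016 crawl
SPOT_PRICE_PER_CORE_HOUR = 0.01  # USD, a low observed spot price

core_hours = FILES_IN_CRAWL * CPU_MINUTES_PER_FILE / 60
total_cost = core_hours * SPOT_PRICE_PER_CORE_HOUR

print(f"~{core_hours:,.0f} core-hours, ~${total_cost:,.0f} at spot prices")
# => ~5,667 core-hours, ~$57 at spot prices
```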
- In progress.
- In progress.
You will need to create a `configs/cosr-ops.prod.json` file with the following template:
```json
{
    "AWS_STACKNAME": "mystack",
    "AWS_REGION": "us-east-1",
    "AWS_ZONE": "us-east-1a",
    "AWS_SUBNET": "subnet-xxxxxx",
    "AWS_VPC": "vpc-xxxxxxx",
    "AWS_SECURITYGROUP": "sg-xxxxxxx",
    "AWS_KEYNAME": "mykeyname",
    "AWS_USER": "root",
    "AWS_SPARK_AMI": "ami-668dba0c",
    "AWS_SPARK_SPOTBID": "0.1",
    "AWS_SPARK_INSTANCETYPE_MASTER": "c4.xlarge",
    "AWS_SPARK_INSTANCETYPE_WORKER": "c4.xlarge",
    "AWS_SPARK_WORKER_COUNT": 15,
    "AWS_SPARK_VERSION": "1.6.0",
    "AWS_SPARK_PLACEMENTGROUP": "myplacementgroup",
    "SPARK_PATH": "../spark-1.6.0"
}
```
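As a quick sanity check before launching anything, you can load the file and verify that the expected keys are present. This snippet is only illustrative and is not part of the cosr-ops commands:

```python
# Illustrative sanity check for configs/cosr-ops.prod.json; not part of cosr-ops itself.
import json

REQUIRED_KEYS = {
    "AWS_STACKNAME", "AWS_REGION", "AWS_ZONE", "AWS_SUBNET", "AWS_VPC",
    "AWS_SECURITYGROUP", "AWS_KEYNAME", "AWS_USER", "AWS_SPARK_AMI",
    "AWS_SPARK_SPOTBID", "AWS_SPARK_INSTANCETYPE_MASTER",
    "AWS_SPARK_INSTANCETYPE_WORKER", "AWS_SPARK_WORKER_COUNT",
    "AWS_SPARK_VERSION", "AWS_SPARK_PLACEMENTGROUP", "SPARK_PATH",
}

with open("configs/cosr-ops.prod.json") as f:
    config = json.load(f)

missing = REQUIRED_KEYS - config.keys()
if missing:
    raise SystemExit("Missing keys: %s" % ", ".join(sorted(missing)))

print("Config looks complete: %d keys found" % len(config))
```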