
Crawler

An engineering exercise implemented in Go.

A simple web crawler that visits every page within a given domain but does not follow external links. It outputs a structured site map showing, for each page:

  1. domain-internal page links
  2. external page links
  3. links to static content such as images
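
An illustrative excerpt of such a site map (hypothetical; the actual output format may differ):

https://example.com/about
  internal: https://example.com/contact
  external: https://golang.org/
  static:   https://example.com/img/logo.png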

The entire project can be cloned directly from GitHub: https://github.com/nerophon/crawler

Prerequisites

  1. The Go Programming Language must be installed in order to build, test, and install this software.

Installation

  1. Clone this project.
  2. cd to the project directory.
  3. Run go install.

The software will be installed to the $GOPATH/bin directory by default.
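
For example, assuming the project is cloned into a directory named crawler:

git clone https://github.com/nerophon/crawler
cd crawler
go install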

Testing & Benchmarking

This software includes unit tests. They can be run in the standard way for Go tests:

  1. cd to the source folder containing the test files you wish to run.
  2. Run go test.

Benchmarks exist for key steps in the process. They can be run from the project root directory, where the crawler_test.go file lives. I suggest running each benchmark separately, using the following commands:

go test -bench=BenchmarkFetch -benchtime=7s
go test -bench=BenchmarkCrawl -benchtime=15s

Please be aware that this kind of benchmark could, if run carelessly, be interpreted as a denial-of-service (DoS) attack. The benchtime flag may need to be adjusted depending on which website is used in the test. I strongly advise NOT benchmarking against frequently attacked websites, such as those belonging to major corporations.
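
One way to avoid sending any external traffic at all is to benchmark against a local in-process test server. The sketch below is illustrative only, not part of this project; the package name and benchmark name are assumptions, and it uses Go's standard net/http/httptest package:

package crawler

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// BenchmarkFetchLocal fetches pages from an in-process test server,
// so no requests ever leave the local machine.
func BenchmarkFetchLocal(b *testing.B) {
	// Serve a tiny page containing a single link.
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte(`<a href="/next">next</a>`))
		}))
	defer srv.Close()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		resp, err := http.Get(srv.URL)
		if err != nil {
			b.Fatal(err)
		}
		resp.Body.Close()
	}
}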

Launching

  1. cd to the install directory, usually $GOPATH/bin.
  2. Run ./crawler.

Operation

At the application command prompt, the following commands are available:

crawl [URL]		begin crawling the specified domain
help			show available commands
quit			exit the application

Press Ctrl-C during a crawl to halt it and force-quit back to the OS command line.
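
For example, a session might look like this (hypothetical transcript; the elided output depends on the site crawled):

> crawl https://example.com
  ...site map output...
> quit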