This concurrent web crawler outputs a site map of crawled websites.
GOPATH should be set. Refer to the official Go documentation for an overview of how to organise Go code.
Download all the dependencies.
go get github.com/rs/zerolog/log
go get golang.org/x/net/html
go get github.com/stretchr/testify/assert
Run the program.
go build
./crawler -url https://golang.org/ -depth 4 -output out.txt
# or run the next command to suppress the logs
./crawler -url https://golang.org/ -depth 4 -output out.txt &> /dev/null
The output sitemap is written to out.txt.
This web crawler is inspired by the Web Crawler exercise from A Tour of Go.
Run the go test command to run all the tests.
- Does not follow robots.txt, and could overload a server with too many connections; ideally, there should be a delay between requests (see the throttling sketch after this list).
- Uses a global logger; the logger should be injected as a dependency (see the logger sketch after this list).
- The URL cache (collectedUrls) and the URL state map (urlState) are global variables. Possibly, one lock in web_crawler.go could be removed (see the shared-state sketch after this list).
- Does not print static links.
- The HTTP client should be created once and reused; http.Client is safe for concurrent use.
- There is no timeout on the client (see the client sketch after this list).
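
A minimal sketch of the request delay, assuming all fetches go through a single helper; politeFetch, requestGap, and the 200ms interval are illustrative and not part of the current code:

```go
package crawler

import (
	"io"
	"net/http"
	"time"
)

// requestGap spaces successive requests roughly 200ms apart so a
// single host is not flooded. The interval is a placeholder value.
var requestGap = time.NewTicker(200 * time.Millisecond)

// politeFetch waits for the next ticker slot before issuing a request.
func politeFetch(client *http.Client, url string) ([]byte, error) {
	<-requestGap.C // block until we are allowed to send another request
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}
```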
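One way the global logger could become a dependency; the Crawler struct and NewCrawler constructor are assumptions for illustration:

```go
package crawler

import "github.com/rs/zerolog"

// Crawler carries its own logger, so callers (and tests) decide where
// log output goes instead of relying on the global zerolog/log logger.
type Crawler struct {
	log zerolog.Logger
}

func NewCrawler(log zerolog.Logger) *Crawler {
	return &Crawler{log: log}
}

func (c *Crawler) visit(url string) {
	c.log.Info().Str("url", url).Msg("visiting")
}
```

Tests could then pass zerolog.Nop() to silence output entirely.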
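A sketch of moving the shared state behind one struct and one mutex, so a single critical section covers both maps; crawlState, seen, and markSeen are hypothetical names:

```go
package crawler

import "sync"

type urlState int

const (
	statePending urlState = iota
	stateDone
)

// crawlState owns the URL cache and per-URL state behind one mutex,
// replacing the package-level collectedUrls and urlState variables.
type crawlState struct {
	mu     sync.Mutex
	seen   map[string]bool
	states map[string]urlState
}

func newCrawlState() *crawlState {
	return &crawlState{
		seen:   make(map[string]bool),
		states: make(map[string]urlState),
	}
}

// markSeen records a URL exactly once and reports whether it was new;
// one lock guards both maps, so a second lock becomes unnecessary.
func (s *crawlState) markSeen(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[url] {
		return false
	}
	s.seen[url] = true
	s.states[url] = statePending
	return true
}
```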
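A sketch of a single shared client with a timeout; the 10-second value is an arbitrary placeholder:

```go
package crawler

import (
	"net/http"
	"time"
)

// httpClient is created once and shared by all goroutines; the
// zero-value client has no timeout, so a stuck server would hang a
// crawler goroutine forever.
var httpClient = &http.Client{Timeout: 10 * time.Second}

func fetch(url string) (*http.Response, error) {
	return httpClient.Get(url)
}
```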