An open source, RESTful, distributed crawler engine
- Persistence
- Dynamic Master
I wrote a crawler engine named ants in Python, based on Scrapy. But a dynamic language can get chaotic, so I started rewriting it in a compiled language.
I designed the crawler framework by imitating Scrapy: the downloader, the scraper, and the way users write custom spiders, but in a compiled language.
I designed the distributed architecture by imitating Elasticsearch. It inspired me to build an engine for distributed crawling.
go get github.com/PuerkitoBio/goquery
go get github.com/go-sql-driver/mysql
go get github.com/wcong/ants-go
go install github.com/wcong/ants-go
cd bin
./ants-go
curl 'http://localhost:8200/cluster'
curl 'http://localhost:8200/spiders'
curl 'http://localhost:8200/crawl?spider=spiderName'
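The same three endpoints can also be called from Go. The following is a minimal sketch, assuming a node is listening on the default HTTP port 8200. The helper names `crawlURL` and `get` are made up for this example and are not part of ants-go, and the response body is printed as raw text because its format is not described here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// crawlURL builds the URL that starts a crawl for the named spider.
// The spider name is query-escaped so unusual characters stay valid.
func crawlURL(base, spider string) string {
	return base + "/crawl?spider=" + url.QueryEscape(spider)
}

// get performs a GET request and returns the raw response body as text.
func get(client *http.Client, u string) (string, error) {
	resp, err := client.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	base := "http://localhost:8200" // default HTTP port of a local node
	client := &http.Client{Timeout: 5 * time.Second}

	urls := []string{
		base + "/cluster",
		base + "/spiders",
		crawlURL(base, "spiderName"), // replace with a real spider name
	}
	for _, u := range urls {
		body, err := get(client, u)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		fmt.Println(u, "->", body)
	}
}
```

Note that running this against a live node sends the crawl request as well, just like the last curl command.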
To test a cluster on one computer, you can run nodes on different ports in different terminals.
One node uses the default ports, TCP 8300 and HTTP 8200:
cd bin
./ants-go
The other node sets its TCP and HTTP ports explicitly:
cd bin
./ants-go -tcp 9300 -http 9200
There are some flags you can set; check the help message:
./ants-go -h
./ants-go -help
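For readers curious how flags like `-tcp` and `-http` are typically handled in Go, here is a minimal sketch using the standard `flag` package. It is not the ants-go source: the `ports` type, the `parsePorts` function, and the usage strings are assumptions for illustration, and only the flag names and default ports (8300 and 8200) come from the commands above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// ports holds the two listening ports a node needs.
type ports struct {
	TCP  int
	HTTP int
}

// parsePorts reads -tcp and -http from args and falls back to the
// defaults 8300 and 8200 when a flag is not given.
func parsePorts(args []string) (ports, error) {
	fs := flag.NewFlagSet("ants-go", flag.ContinueOnError)
	p := ports{}
	fs.IntVar(&p.TCP, "tcp", 8300, "tcp port (hypothetical usage text)")
	fs.IntVar(&p.HTTP, "http", 8200, "http port (hypothetical usage text)")
	if err := fs.Parse(args); err != nil {
		return ports{}, err
	}
	return p, nil
}

func main() {
	p, err := parsePorts(os.Args[1:])
	if err == flag.ErrHelp {
		// The flag package has already printed the usage message.
		os.Exit(0)
	}
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("tcp=%d http=%d\n", p.TCP, p.HTTP)
}
```

The standard `flag` package treats both `-h` and `-help` as a request for the usage message, which is consistent with the two help commands listed above.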
- go to the spiders directory
- write your spider, following the example in deap_loop_spider.go, or see the spider page
- add your spider to spiderMap, following the example in LoadAllSpiders in load_all_spider.go
- install again with go install github.com/wcong/ants-go
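The registration step above follows a common pattern: each spider is built by a constructor function and stored in a map keyed by its name. The sketch below illustrates that pattern only. The `Spider` type, its fields, `makeExampleSpider`, and `loadAllSpiders` are hypothetical stand-ins, not the actual ants-go definitions, so use deap_loop_spider.go and load_all_spider.go as the real reference.

```go
package main

import "fmt"

// Spider is a hypothetical stand-in for a spider definition.
// The real ants-go type may have different fields.
type Spider struct {
	Name      string
	StartURLs []string
}

// makeExampleSpider builds one spider. In a real project this would
// live in its own file next to the other spiders.
func makeExampleSpider() *Spider {
	return &Spider{
		Name:      "example_spider",
		StartURLs: []string{"http://example.com/"},
	}
}

// loadAllSpiders builds the name-to-spider map. Adding a new spider
// means adding its constructor to the slice below.
func loadAllSpiders() map[string]*Spider {
	spiderMap := make(map[string]*Spider)
	for _, s := range []*Spider{makeExampleSpider()} {
		spiderMap[s.Name] = s
	}
	return spiderMap
}

func main() {
	spiders := loadAllSpiders()
	if s, ok := spiders["example_spider"]; ok {
		fmt.Println("registered:", s.Name, s.StartURLs)
	}
}
```

Keying the map by name makes lookup by name a single map access. Presumably the name passed as `spider=` in the crawl request is matched against such a key, though that is an assumption here.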