CoCrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

Crawling the web can be easy or hard, depending upon the details. Mature crawlers like Nutch and Heritrix work great in many situations, and fall short in others. Some of the most demanding crawl situations include open-ended crawling of the whole web.

The goal of this project is a modular crawler with pluggable components that works well for a wide variety of crawl tasks. The core of the crawler is written in Python 3.7+ using coroutines.
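To give a feel for what "using coroutines" means in practice, here is a minimal sketch of a coroutine worker pool pulling URLs from a shared queue. It is not CoCrawler's actual code: the names fetch, worker, crawl and FAKE_LINKS are hypothetical, and fetch only simulates network latency so that the example runs without touching the network.

```python
import asyncio

# Hypothetical stand-in for real pages: maps a URL to the links found on it.
FAKE_LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
}


async def fetch(url):
    # Assumption: a real crawler would do an HTTP request here.
    await asyncio.sleep(0.01)  # pretend to wait on the network
    return FAKE_LINKS.get(url, [])


async def worker(queue, seen):
    # Each worker is a coroutine that pulls URLs until it is cancelled.
    while True:
        url = await queue.get()
        try:
            for link in await fetch(url):
                if link not in seen:
                    seen.add(link)
                    queue.put_nowait(link)
        finally:
            queue.task_done()


async def crawl(seeds, n_workers=4):
    queue = asyncio.Queue()
    seen = set(seeds)
    for url in seeds:
        queue.put_nowait(url)
    workers = [asyncio.ensure_future(worker(queue, seen)) for _ in range(n_workers)]
    await queue.join()  # returns once every queued URL has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return seen


if __name__ == "__main__":
    print(sorted(asyncio.run(crawl(["http://example.com/"]))))
```

While one worker is waiting on a fetch, the event loop runs the others, so a single thread can keep many requests in flight at once.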

Status

CoCrawler is pre-release, with major restructuring going on. It is currently able to crawl at around 170 megabits/sec (about 170 pages/sec) on a 4-core machine.

Installing

We recommend that you use pyenv / virtualenv to keep the Python executables and packages used by cocrawler separate from everything else.

You can install cocrawler from PyPI using "pip install cocrawler".

For a fresher version, clone the repo and install like this:

git clone https://github.com/cocrawler/cocrawler.git
cd cocrawler
pip install . .[test]
make pytest
make test_coverage

The CI for this repo uses the latest versions of everything. To see exactly what worked last, click on the "Build Status" link above. Alternatively, you can look at requirements.txt for a tested combination that I probably ran before checking in.

Credits

CoCrawler draws on ideas from the Python 3.4 code in "500 Lines or Less", which can be found at https://github.com/aosabook/500lines. It is also heavily influenced by the experiences that Greg acquired while working at blekko and the Internet Archive.

License

Apache 2.0
