CrawlNews

This project is built on Scrapy, a well-designed crawling framework, although it uses only a small part of Scrapy's functionality.

The first part is the spider, which currently crawls only parts of the website "bbc.com".

It can be improved in the future, and a themed crawl will be worked out in more detail.

The second part is storage, which is meant to support three options: MongoDB in singleton mode, MongoDB in distributed mode, and S3 storage. However, only singleton MongoDB works in this project so far.
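As an illustration of the singleton MongoDB case, here is a minimal sketch of a Scrapy-style item pipeline. It is not the code in this repository: the class name `MongoPipeline`, the helper `connect_single_node`, the database and collection names, and the assumption that every item has a `url` field are all hypothetical.

```python
class MongoPipeline:
    """Scrapy-style item pipeline that upserts each item into one MongoDB collection."""

    def __init__(self, collection):
        # `collection` is expected to behave like pymongo.collection.Collection
        # (only `update_one` is used here).
        self.collection = collection

    def process_item(self, item, spider=None):
        doc = dict(item)
        # Assumption: each item carries a "url" field. Upserting on it means
        # a re-crawled page overwrites its old record instead of duplicating it.
        self.collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
        return item


def connect_single_node(uri="mongodb://localhost:27017", db="crawlnews", coll="articles"):
    """Return a collection on a single MongoDB node (hypothetical names and URI)."""
    # Imported lazily so the pipeline class can be used without pymongo installed.
    from pymongo import MongoClient
    return MongoClient(uri)[db][coll]
```

With a local MongoDB running, `MongoPipeline(connect_single_node())` would give a pipeline that writes to that single node. The distributed and S3 options are not sketched here.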

The third part is the duplicate filter, which is based on a Redis server so that the crawl can scale to large volumes of internet information.
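The idea behind a Redis-backed duplicate filter can be sketched as follows. This is an assumption about the general technique, not this project's implementation: the names `request_fingerprint`, `RedisDupeFilter`, and the key `crawlnews:dupefilter` are hypothetical, and the fingerprint here covers only the method and URL.

```python
import hashlib


def request_fingerprint(url, method="GET"):
    """Return a stable SHA-1 hex digest for a request (method and URL only)."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class RedisDupeFilter:
    """Track seen requests in a Redis set shared by all crawler processes."""

    def __init__(self, client, key="crawlnews:dupefilter"):
        # `client` needs an `sadd(key, value)` method, as redis.Redis provides.
        self.client = client
        self.key = key

    def request_seen(self, url, method="GET"):
        # SADD returns the number of members newly added,
        # so 0 means the fingerprint was already in the set.
        added = self.client.sadd(self.key, request_fingerprint(url, method))
        return added == 0
```

Because the set lives in Redis rather than in the memory of one process, several crawler processes can share it, which is what makes this approach suitable for larger crawls.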

Finally, unit tests have not been written yet.

The spider uses JSON to read each page's JSON-LD context, which contains the headline, date, links, and so on.
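A minimal sketch of JSON-LD extraction, using only the Python standard library, might look like the following. The names `JsonLdParser` and `extract_article_fields` are hypothetical, and the field names `headline`, `datePublished`, and `url` are assumed from the schema.org vocabulary rather than taken from this project's code.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the decoded contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed JSON-LD rather than failing the page.


def extract_article_fields(html):
    """Return headline, date, and url from the first JSON-LD block with a headline."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        if isinstance(block, dict) and "headline" in block:
            return {
                "headline": block.get("headline"),
                "date": block.get("datePublished"),
                "url": block.get("url"),
            }
    return None
```

Calling `extract_article_fields(page_html)` returns a small dictionary when a matching block is found and `None` otherwise. Inside a Scrapy spider the same JSON text could be selected from the response instead of parsed with `HTMLParser`.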

In the future, we will use Natural Language Processing to analyze the text.

Thank you

Hao Li

[email protected]

