UniKrau/CrawlNews

CrawlNews

This project is built on Scrapy, a beautiful crawling framework, but uses only a small part of its functionality.

Section one: the spider currently crawls only part of the information on the website "bbc.com".

It can be improved in the future, and theme (topic-focused) crawling will be implemented in more detail.

The second part is storage, which is intended to support three kinds of stores: singleton MongoDB, distributed MongoDB, and S3 storage. However, only singleton MongoDB works in this case.
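As a rough illustration of the singleton MongoDB store, here is a minimal sketch of a Scrapy-style item pipeline. It is not the code in this repository: the class names `MongoStoragePipeline` and `FakeCollection` are hypothetical, and the pipeline only assumes that the collection object offers an `insert_one` method, as a pymongo collection does.

```python
class MongoStoragePipeline:
    """Hypothetical pipeline that writes each scraped item to one collection."""

    def __init__(self, collection):
        # Any object with an insert_one(document) method works here,
        # for example a pymongo collection (assumption, not the repo's code).
        self.collection = collection

    def process_item(self, item, spider):
        # Scrapy calls process_item once per scraped item.
        # Store a plain-dict copy and pass the item on to later pipelines.
        self.collection.insert_one(dict(item))
        return item


class FakeCollection:
    """In-memory stand-in for a MongoDB collection, for local experiments."""

    def __init__(self):
        self.docs = []

    def insert_one(self, document):
        self.docs.append(document)


if __name__ == "__main__":
    collection = FakeCollection()
    pipeline = MongoStoragePipeline(collection)
    pipeline.process_item({"headline": "Example headline"}, spider=None)
    print(len(collection.docs))
```

With pymongo installed, a real collection such as `pymongo.MongoClient("mongodb://localhost:27017")["news"]["articles"]` could presumably be passed instead; the database and collection names there are placeholders.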

Section three is the duplicate filter (dupefilter), which is based on a Redis server so that large-scale crawling of internet information is possible.
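The idea behind a Redis-based duplicate filter can be sketched as follows. This is a simplified illustration, not the filter used in this repository: the class names and the key `crawlnews:seen` are hypothetical, and the only thing assumed about the client is an `sadd` method that returns the number of newly added members, which is how the Redis SADD command behaves.

```python
import hashlib


class RedisSetDupeFilter:
    """Hypothetical URL duplicate filter backed by a Redis set."""

    def __init__(self, client, key="crawlnews:seen"):
        self.client = client  # any object exposing sadd(key, *values)
        self.key = key        # name of the set holding the fingerprints

    @staticmethod
    def fingerprint(url):
        # A fixed-length hash keeps the stored members small and uniform.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def seen(self, url):
        # SADD returns how many members were newly added:
        # 0 means the fingerprint was already in the set.
        added = self.client.sadd(self.key, self.fingerprint(url))
        return added == 0


class InMemoryClient:
    """Stand-in for a Redis client; implements only sadd."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        return len(members) - before


if __name__ == "__main__":
    dupefilter = RedisSetDupeFilter(InMemoryClient())
    for url in ["https://www.bbc.com/news", "https://www.bbc.com/news"]:
        print(url, "duplicate" if dupefilter.seen(url) else "new")
```

Because the set lives on the Redis server rather than in one process, several crawler processes pointed at the same server would share one view of which URLs have been visited. In a real deployment, a redis-py client could be passed in place of `InMemoryClient`.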

Lastly, the unit tests have not been written yet.

The spider uses JSON parsing to fetch the JSON-LD content of each page, which contains the headline, date, links, and so on.
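To show what reading JSON-LD can look like, here is a minimal standard-library sketch. It is an illustration, not the spider's actual code: the function and class names are hypothetical, and it assumes the page embeds a `<script type="application/ld+json">` block using the schema.org properties `headline`, `datePublished`, and `url`.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the decoded contents of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD blocks


def extract_article_fields(html):
    """Return headline, date and url from the first JSON-LD object with a headline."""
    parser = JsonLdParser()
    parser.feed(html)
    for doc in parser.documents:
        if isinstance(doc, dict) and "headline" in doc:
            return {
                "headline": doc.get("headline"),
                "date": doc.get("datePublished"),
                "url": doc.get("url"),
            }
    return None


# Made-up sample page, used only to exercise the sketch.
SAMPLE_HTML = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "NewsArticle", "headline": "Example headline", '
    '"datePublished": "2020-01-01T00:00:00Z", "url": "https://example.com/news/1"}'
    "</script></head><body></body></html>"
)

if __name__ == "__main__":
    print(extract_article_fields(SAMPLE_HTML))
```

Inside a Scrapy spider the same JSON text could instead be selected from the response and passed to `json.loads`; the standard-library parser is used here only to keep the sketch self-contained.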

In the future, we will use Natural Language Processing to analyze the text.

Thank you

Hao Li

[email protected]
