Scrapy is a powerful crawling framework, although only a small part of its functionality is used in this project.
First, the spider: it currently crawls only part of the "bbc.com" website. This can be improved later, and a more focused topic crawl will be implemented in more detail.
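A minimal sketch of such a spider follows; the spider name, start URL and CSS selectors here are illustrative assumptions, not the project's actual code:

```python
import scrapy


class BBCSpider(scrapy.Spider):
    """Crawl a small part of bbc.com and hand article pages to a parser."""

    name = "bbc"                      # assumed spider name
    allowed_domains = ["bbc.com"]
    start_urls = ["https://www.bbc.com/news"]  # assumed entry point

    def parse(self, response):
        # Follow in-site links found on the listing page.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Placeholder extraction; the real project reads JSON-LD (see below).
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }
```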
Second, storage: three storage back ends are supported, a standalone (single-instance) MongoDB store, a distributed MongoDB mode, and S3 storage; however, only the standalone MongoDB store works in this case.
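A sketch of how a standalone-MongoDB item pipeline could look; the `MONGO_URI`, database and collection names are placeholder assumptions:

```python
import pymongo


class MongoPipeline:
    """Write scraped items into a standalone MongoDB instance."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "bbc"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One document per scraped article.
        self.db["articles"].insert_one(dict(item))
        return item
```

The pipeline would be enabled through the project's `ITEM_PIPELINES` setting in the usual Scrapy way.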
Third, the dupefilter: it is backed by a Redis server so that large-scale crawls of internet information can share deduplication state.
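One possible shape for such a Redis-backed dupefilter is sketched below; the Redis URL, key name and URL-hash fingerprint are assumptions, and the project may instead rely on a ready-made package such as scrapy-redis:

```python
import hashlib

import redis
from scrapy.dupefilters import BaseDupeFilter


class RedisDupeFilter(BaseDupeFilter):
    """Keep request fingerprints in a Redis set so several crawler
    processes can share one deduplication store."""

    def __init__(self, server, key):
        self.server = server
        self.key = key

    @classmethod
    def from_settings(cls, settings):
        url = settings.get("REDIS_URL", "redis://localhost:6379/0")  # assumed setting
        return cls(redis.Redis.from_url(url), key="bbc:dupefilter")

    def request_seen(self, request):
        # Simple URL-based fingerprint; SADD returns 0 when the member
        # is already in the set, i.e. the request was seen before.
        fp = hashlib.sha1(request.url.encode("utf-8")).hexdigest()
        return self.server.sadd(self.key, fp) == 0
```

Scrapy would pick this up through the `DUPEFILTER_CLASS` setting pointing at the class path.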
Finally, unit tests have not been written yet.
The spider uses the json module to parse each page's JSON-LD content, which contains the headline, date, links and so on.
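A small helper along these lines could pull those fields out of the embedded `application/ld+json` script; the exact field names, taken from the schema.org NewsArticle vocabulary, are assumptions:

```python
import json


def extract_jsonld(response):
    """Return headline, publication date and link from the page's
    JSON-LD block (<script type="application/ld+json">), if any."""
    raw = response.xpath('//script[@type="application/ld+json"]/text()').get()
    if not raw:
        return None
    data = json.loads(raw)
    # Real BBC pages may nest or list these objects differently.
    return {
        "headline": data.get("headline"),
        "date": data.get("datePublished"),
        "url": data.get("url"),
    }
```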
In the future, we will use Natural Language Processing to analyse the text.
Thank you
Hao Li