ScrapyWeiboRelate

Crawl Weibo data with Scrapy and map out a network of people with AntV

Project description

Crawler functions

  1. Crawl Weibo user information

  2. Crawl users' recent posts

  3. Crawl users' social connections (followers/fans)

How to use

  1. Set user_id = int('user_id') in spiders/weibo.py to the ID of the user at the center of the relationship graph.
  2. Set relate_deep = 2 and deepth_fans = 2 to control how deep the crawl spreads. A depth of 2 also covers the followers/fans of my followers/fans, so the number of users grows exponentially with depth.
  3. Rewrite proxy_handle and get_cookies so that the middlewares can obtain valid cookies and proxy IPs.
  4. Run run.py.
  5. Open Draw/index.html. The important parameters are linkDistance: 50 (edge length), endArrow: true (whether edges have arrows), and lineWidth: 0.65 (edge thickness); change them as needed.
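
To illustrate how a depth setting bounds the crawl, here is a minimal sketch of a breadth-first, depth-limited expansion. It is not the project's spider: crawl_relations, get_neighbors, and the toy data are hypothetical names introduced for illustration, and max_depth only plays the role that relate_deep is described as playing above.

```python
from collections import deque


def crawl_relations(core_id, get_neighbors, max_depth=2):
    """Expand outward from core_id, stopping after max_depth hops.

    get_neighbors is a hypothetical callable that returns the
    followers/fans of a user; in a real crawler it would issue requests.
    """
    seen = {core_id: 0}  # user id -> depth at which it was first reached
    edges = []           # (user, neighbor) pairs collected along the way
    queue = deque([core_id])
    while queue:
        user = queue.popleft()
        depth = seen[user]
        if depth >= max_depth:
            continue  # users at the depth limit are recorded but not expanded
        for other in get_neighbors(user):
            edges.append((user, other))
            if other not in seen:
                seen[other] = depth + 1
                queue.append(other)
    return seen, edges


# Toy data standing in for real follower lists.
toy = {1: [2, 3], 2: [4, 5], 3: [5, 6], 4: [7]}
seen, edges = crawl_relations(1, lambda u: toy.get(u, []), max_depth=2)
print(sorted(seen))
```

With max_depth=2, user 7 is never reached because it sits three hops from the core user. Each extra level multiplies the number of users by roughly the average follower count, which is why small depth values are advisable.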

Special note

  1. Because my cookie pool is too small, I added time.sleep calls at lines 73, 107, 132, 166, and 258 of spiders/weibo.py. This is the only reason the crawler runs slowly.
  2. The crawler filters users: it skips big V (verified influencer) accounts and users with more than 10000 followers.
  3. The NLP accuracy is 89%, limited by the small training corpus. The training corpus is included in this project (its original source is unknown; it was downloaded from CSDN).
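
The filtering and throttling described in notes 1 and 2 could look roughly like the sketch below. The field names verified and followers_count, the helper names, and the delay values are assumptions made for illustration; the actual checks and sleep durations in spiders/weibo.py are not shown in this README.

```python
import random
import time

MAX_FOLLOWERS = 10000  # threshold taken from note 2 above


def should_crawl(user):
    """Return True for ordinary accounts, False for big V or very large ones.

    `user` is a dict with hypothetical keys "verified" and "followers_count".
    """
    if user.get("verified"):
        return False
    return user.get("followers_count", 0) <= MAX_FOLLOWERS


def polite_sleep(base=1.0, jitter=0.5, sleep=time.sleep):
    """Pause between requests so a small cookie pool is not exhausted.

    The sleep function is injectable so the helper can be tested
    without actually waiting.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


print(should_crawl({"verified": False, "followers_count": 500}))
print(should_crawl({"verified": False, "followers_count": 20000}))
```

A larger cookie and proxy pool would allow the delays to be shortened or removed.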

Final results

  1. About 1000 nodes (images: 800, 1000+)

  2. About 5000 nodes (image: 4800)

  3. 10000+ nodes (image: 14000)

  4. Fans and followers are distinguished by different colored lines (images: 1000, 1000)
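
As a sketch of how fan and follower edges might be given different colors, the snippet below converts crawled relations into a nodes/edges structure of the kind AntV's G6 graph library accepts. How Draw/index.html actually loads its data is not described here, so the function name, the relation tuples, and the two colors are all assumptions.

```python
import json

FOLLOW_COLOR = "#5B8FF9"  # hypothetical color for "follow" edges
FAN_COLOR = "#F6BD16"     # hypothetical color for "fan" edges


def to_graph_data(relations):
    """Build a nodes/edges dict from (source_id, target_id, kind) tuples.

    kind is either "follow" or "fan"; any other value is drawn as a fan edge.
    """
    nodes = {}
    edges = []
    for source, target, kind in relations:
        for uid in (source, target):
            # Register each user once, keyed by its string id.
            nodes.setdefault(str(uid), {"id": str(uid)})
        color = FOLLOW_COLOR if kind == "follow" else FAN_COLOR
        edges.append({
            "source": str(source),
            "target": str(target),
            "style": {"stroke": color},
        })
    return {"nodes": list(nodes.values()), "edges": edges}


data = to_graph_data([(1, 2, "follow"), (3, 1, "fan")])
print(json.dumps(data, indent=2))
```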

Completion progress

The main features are now complete. We are improving the readability of the images and other widgets, and working on correlating friendliness with social connections.

Thanks

Thanks for your help and guidance

Description

Because Weibo has strengthened its anti-crawling measures, the simulated login in this project no longer works, and followers/fans can only be crawled up to the first 20 pages. This part of the project is no longer updated, since the focus is on showing the relationships between users; this note is here for reference.