Skip to content

Latest commit

 

History

History
22 lines (11 loc) · 726 Bytes

README.md

File metadata and controls

22 lines (11 loc) · 726 Bytes

Hacker News Scraper

A scraper for Hacker News (duh)

Usage:

run hn_scraper.py (modify the main function call to change the end_id to parse to)

Filtering Details

Parallelism details

Generating comment trees can be expensive. We use two levels of parallelism to solve this. Items can either be stories or comments, we are given the total number of items on Hacker News (~25,000,000). One pool of workers check if items are stories, and if so initiate another pool of workers that traverse the n comment trees (n is the numebr of good top-level comments) in order to build a single comment chain for each top level comment.

Example:

TODO:

[] Add in flags for better data management [] Fix requests issue