- Simple: Must only fulfil core requirements, will evolve in iterations
- Single-domain: Must not follow external links
- Fast: Must be able to crawl sites with blogs
- Prints sitemap showing links between pages
- Stateful: Must preserve state beyond crawl lifecycle to be able to resume partial crawls
- Usable: Must allow users across the world to submit requests to crawl a domain or a specific page via a URL
- Asynchronous: Must not wait until the end of crawling to print the sitemap; users must be able to see crawling progress (number of links discovered and crawled) and request a partial sitemap
- Robust: Should not have single points of failure
- Scalable and Distributed: Must be able to crawl sites like Wikipedia or GitHub
- Polite: Introduces itself to sites as a crawler and respects Disallow and Crawl-delay policies from robots.txt (see the sketch after this list)
- Content Indexing: Must be able to extract and store relevant content by scraping pages
- Automatic Reindexing: Must regularly reindex known pages based on calculated page importance and the changefreq / priority values found in sitemap.xml
- Search: Must allow users to access indexed information and suggest related pages.
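
To make the politeness requirement concrete, here is a minimal sketch of parsing the `User-agent: *` group of a robots.txt file for Disallow and Crawl-delay rules. The `RobotsRules` class and its behaviour (wildcard group only, prefix matching) are illustrative assumptions, not the project's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses the "User-agent: *" group of a robots.txt body and
// answers whether a path may be crawled and how long to wait between requests.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();
    private long crawlDelayMillis = 0;

    RobotsRules(String robotsTxt) {
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                inWildcardGroup = line.substring("user-agent:".length()).trim().equals("*");
            } else if (inWildcardGroup && lower.startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) disallowed.add(path);
            } else if (inWildcardGroup && lower.startsWith("crawl-delay:")) {
                crawlDelayMillis = (long) (Double.parseDouble(line.substring("crawl-delay:".length()).trim()) * 1000);
            }
        }
    }

    boolean isAllowed(String path) {
        return disallowed.stream().noneMatch(path::startsWith);
    }

    long crawlDelayMillis() {
        return crawlDelayMillis;
    }
}
```

A fetching component would consult `isAllowed` before requesting a path and pause for `crawlDelayMillis()` between requests to the same host.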
Attaining speed and scalability in a streaming problem like web crawling naturally calls for microservices. For the scope of the MVP, however, Spring Integration will be used instead, for simplicity of deployment and demos.
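
As a rough illustration (not the actual configuration), the MVP pipeline could be wired with the Spring Integration Java DSL roughly as follows. The channel name `crawlRequests` and the handler method names are assumptions, and the `Fetcher`, `Parser` and `Crawler` beans refer to the components described below.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Configuration
public class CrawlerFlowConfig {

    // Assumed wiring: URLs flow to the Fetcher, fetched responses to the Parser,
    // and parsed Pages to the Crawler, which selects the outgoing links that are
    // fed back into the same channel for the next round of fetching.
    @Bean
    public IntegrationFlow crawlFlow(Fetcher fetcher, Parser parser, Crawler crawler) {
        return IntegrationFlows.from("crawlRequests")
                .handle(fetcher, "fetch")            // URL -> CachedResponse
                .handle(parser, "parse")             // CachedResponse -> Page
                .handle(crawler, "selectNextLinks")  // Page -> URLs to crawl next
                .split()                             // one message per outgoing link
                .channel("crawlRequests")            // loop back into the flow
                .get();
    }
}
```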
Caches the response from a given URL. It is most likely to be the slowest component in the MVP version and needs a good amount of parallelism and optimization. One such optimization is maintaining a pool of reusable, persistent connections; the fact that the Fetcher retrieves a response, caches the relevant parts of it, and releases the underlying resources helps us scale better.
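
A minimal sketch of what the pooled, persistent-connection setup might look like, assuming Apache HttpClient 4.x; `CachedResponse` and the pool sizes are illustrative rather than the project's actual types or tuning.

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

public class Fetcher {

    private final PoolingHttpClientConnectionManager connectionManager =
            new PoolingHttpClientConnectionManager();
    private final CloseableHttpClient httpClient;

    public Fetcher() {
        connectionManager.setMaxTotal(100);          // total pooled connections
        connectionManager.setDefaultMaxPerRoute(20); // per-host limit keeps us polite
        this.httpClient = HttpClients.custom()
                .setConnectionManager(connectionManager)
                .setUserAgent("my-crawler/0.1")      // introduce ourselves as a crawler
                .build();
    }

    // Fetches the URL, copies the parts we care about into a plain value object,
    // and releases the underlying connection back to the pool immediately.
    public CachedResponse fetch(String url) throws IOException {
        try (CloseableHttpResponse response = httpClient.execute(new HttpGet(url))) {
            String body = EntityUtils.toString(response.getEntity());
            return new CachedResponse(url, response.getStatusLine().getStatusCode(), body);
        }
    }

    // Illustrative value object holding only what the Parser needs.
    public record CachedResponse(String url, int statusCode, String body) {}
}
```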
Uses the CachedResponse to extract meaningful Page information in a single pass over the HTML string. As of the MVP version, the only information extracted is outgoing links and page titles.
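
For illustration, such a parser could be built on the jsoup library (whether the project actually uses jsoup is an assumption); this sketch reuses the illustrative `CachedResponse` type from the Fetcher sketch above.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Parser {

    // Single pass over the HTML: jsoup builds the document once, and both the
    // title and all absolute outgoing links are read from that one parse.
    public Page parse(Fetcher.CachedResponse response) {
        Document doc = Jsoup.parse(response.body(), response.url());
        List<String> outgoingLinks = new ArrayList<>();
        for (Element anchor : doc.select("a[href]")) {
            String absolute = anchor.absUrl("href"); // resolves relative hrefs against the page URL
            if (!absolute.isEmpty()) {
                outgoingLinks.add(absolute);
            }
        }
        return new Page(response.url(), doc.title(), outgoingLinks);
    }

    // Illustrative value object for the extracted Page information.
    public record Page(String url, String title, List<String> outgoingLinks) {}
}
```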
Maintains state about the Pages crawled and decides which outgoing links should be requested for crawling next. It discards links to external websites and ignores pages that have already been crawled.
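
A sketch of that filtering, assuming a single seed URL per crawl and using `java.net.URI` for host comparison plus a concurrent set of seen URLs (all names illustrative):

```java
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class Crawler {

    private final String allowedHost;
    // URLs already crawled or queued, so each page is fetched at most once.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    public Crawler(String seedUrl) {
        this.allowedHost = URI.create(seedUrl).getHost();
        seen.add(seedUrl);
    }

    // Keeps only same-host links that have not been seen before; Set#add is atomic,
    // so concurrent parsers cannot enqueue the same URL twice.
    public List<String> selectNextLinks(Parser.Page page) {
        return page.outgoingLinks().stream()
                .filter(this::sameHost)
                .filter(seen::add)
                .toList();
    }

    private boolean sameHost(String link) {
        String host = URI.create(link).getHost();
        return host != null && host.equalsIgnoreCase(allowedHost);
    }
}
```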
The idea behind persisting parsed Page information is that indexing and search functionality could later be built on the accumulated data. The Page Repository is currently implemented with a ConcurrentHashMap for the sake of the MVP; ideally, a NoSQL store such as Elasticsearch would work best.
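
A minimal version of that in-memory repository, behind an interface that an Elasticsearch-backed implementation could later replace (names are illustrative):

```java
import java.util.Collection;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Abstraction so the MVP's in-memory store can later be swapped for a NoSQL store.
interface PageRepository {
    void save(Parser.Page page);
    Optional<Parser.Page> findByUrl(String url);
    Collection<Parser.Page> findAll();
}

class InMemoryPageRepository implements PageRepository {

    private final Map<String, Parser.Page> pagesByUrl = new ConcurrentHashMap<>();

    @Override
    public void save(Parser.Page page) {
        pagesByUrl.put(page.url(), page);
    }

    @Override
    public Optional<Parser.Page> findByUrl(String url) {
        return Optional.ofNullable(pagesByUrl.get(url));
    }

    @Override
    public Collection<Parser.Page> findAll() {
        return pagesByUrl.values();
    }
}
```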
We use some low-level optimizations for fetching responses across multiple pages in parallel. However, the HttpClient API itself has certain limitations and requires us to regularly evict stale connections, so we use a scheduler with a configurable delay to manage connection eviction behind the scenes.
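
Assuming the Apache HttpClient pooling connection manager from the Fetcher sketch above is exposed to it, the eviction task could be a scheduled job along these lines; the delay and idle-timeout values are placeholders.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class ConnectionEvictor {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Periodically closes connections the server has already dropped (expired)
    // as well as connections that have sat idle longer than we want to keep them.
    public ConnectionEvictor(PoolingHttpClientConnectionManager connectionManager,
                             long delaySeconds) {
        scheduler.scheduleWithFixedDelay(() -> {
            connectionManager.closeExpiredConnections();
            connectionManager.closeIdleConnections(30, TimeUnit.SECONDS);
        }, delaySeconds, delaySeconds, TimeUnit.SECONDS);
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
```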