Notes for Apache Nutch Users and Developers

This document explains some of the design choices made for building Sparkler.

Sparkler is heavily influenced by the internals of Apache Nutch. In fact, some pieces of Apache Nutch are planned to be reused (which is why Nutch appears as a dependency in the build).

Note 0: Hello Apache Spark

The two main motivations for this crawler are:

  • to improve crawler performance by taking advantage of recent advances in distributed computing.
  • to make the system easy to use by packaging it well.

Sparkler runs on top of Apache Spark to benefit from Spark features such as caching and reusable containers. In other words, it runs on Apache Spark rather than directly on Hadoop's MapReduce.
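
To illustrate how a crawl phase maps onto Spark, here is a minimal sketch (not Sparkler's actual code): a batch of URLs is distributed as an RDD, fetched in parallel on the executors, and the result is cached so later phases can reuse it. The fetch helper and the URLs are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FetchSketch {
  // Hypothetical fetcher; a real one would use an HTTP client plus politeness rules.
  def fetch(url: String): (String, Int) = (url, 200)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparkler-fetch-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A batch of URLs becomes an RDD, fetches run in parallel on executors,
    // and the result is cached so later phases (parse, index) can reuse it.
    val urls = sc.parallelize(Seq("http://example.com/", "http://example.org/"))
    val fetched = urls.map(fetch).cache()

    fetched.collect().foreach(println)
    sc.stop()
  }
}
```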

Note 1: Crawldb is kept in an indexed store

We have decided to keep the crawldb entirely in a Solr index.

Why Solr instead of a custom file on a (distributed) file system?

  • Real-time crawl analytics/statistics are a must-have feature and should be supported by the core. Keeping the crawldb in an indexed store such as Solr makes this straightforward.
  • We can build powerful admin dashboards to interact with it. Questions like "What is the crawler doing right now?" or "Which sites did our crawler visit in the last hour?" can be translated into Lucene queries and answered in real time on d3 charts (see the query sketch after this list). We can even update the db from the UI if we need to drop or insert some URLs.
  • We may switch to streaming mode later, since Spark handles batch and streaming workloads together with ease.
  • We could have used some other store, but we love Lucene/Solr. Note that we do not put binary/raw crawled content into Solr.
  • We could have kept a copy on HDFS the way Nutch does, but copies can go out of sync and bring in the extra overhead of keeping them consistent.
  • We think SolrCloud can take the load of a large-scale crawler; if it cannot, we will need to achieve the same functionality with other stores.
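
To make the dashboard idea concrete, here is a minimal SolrJ sketch of the kind of query such a UI would issue. The collection name (crawldb), the field names (status, fetch_timestamp, host), and the Solr URL are assumptions for illustration; Sparkler's actual schema may differ.

```scala
import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object CrawldbStatsSketch {
  def main(args: Array[String]): Unit = {
    // Solr URL and collection name are assumptions for this sketch
    val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/crawldb").build()

    // "Which hosts did we fetch in the last hour?" -- field names are hypothetical
    val q = new SolrQuery("status:FETCHED AND fetch_timestamp:[NOW-1HOUR TO NOW]")
    q.setRows(0)            // only facet counts are needed, not the documents
    q.addFacetField("host") // count fetched pages per host

    val rsp = solr.query(q)
    for (count <- rsp.getFacetField("host").getValues.asScala) {
      println(s"${count.getName}: ${count.getCount}")
    }
    solr.close()
  }
}
```

A dashboard would render such facet counts as a d3 chart instead of printing them, and similar update requests would implement the drop/insert-URLs feature mentioned above.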

Note 2: Crawled content

  • Sparkler can produce segments on HDFS, aiming to stay compatible with the Nutch content format (see the sketch after this list).
  • A Kafka data sink is coming soon.
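
Here is a minimal sketch, under assumptions, of what writing crawled content to HDFS can look like from Spark: (URL, raw bytes) pairs saved as a Hadoop SequenceFile. The output path, record shape, and placeholder data are hypothetical; full Nutch segment compatibility would additionally use Nutch's own value classes (e.g. org.apache.nutch.protocol.Content) and segment directory layout.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SegmentWriteSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparkler-segment-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Placeholder crawled data: (url, raw page bytes)
    val pages: Seq[(String, Array[Byte])] = Seq(
      ("http://example.com/", "<html>...</html>".getBytes("UTF-8"))
    )

    // Spark converts String/Array[Byte] to Text/BytesWritable when writing a
    // SequenceFile; the segment path below is purely illustrative.
    sc.parallelize(pages)
      .saveAsSequenceFile("hdfs:///sparkler/segments/20160603000000/content")

    sc.stop()
  }
}
```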

If you see any issues or trade-offs with this approach, please let us know. Thanks!