Crawler


The crawler fetches web pages and gathers information about outgoing links, redirects, etc.

Each crawl task is defined by a set of input arguments:

  • the crawler runs as a service on scrapyd, which listens for crawl task requests
  • each crawl task triggers a spider run
  • crawl tasks are defined by a set of spider arguments

Controlling API

The crawler is controlled through the Scrapyd API, an HTTP JSON API (see the Scrapyd documentation).
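
For instance, assuming the same Scrapyd host and project name as in the example further down, pending, running and finished crawl jobs can be listed with Scrapyd's listjobs.json endpoint:

$ curl "http://scrapyd.host:6800/listjobs.json?project=hci"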

Spider arguments

  • the spider arguments are passed via the Scrapyd API using the schedule.json endpoint
  • the values of multi-valued fields (start_urls, follow_prefixes, etc.) are passed comma-separated: value1,value2,value3

Available arguments

The supported spider arguments are:

start_urls
the start URLs for the crawl
maxdepth
how deep to crawl from the start URLs
follow_prefixes
list of LRU prefixes that will be followed. See also: Link following
nofollow_prefixes
list of LRU prefixes that will not be followed. See also: Link following
discover_prefixes
list of LRU prefixes whose pages will be fetched but whose outgoing links will not be followed. See also: Link following
user_agent
the user agent to use when crawling

Example

Here is an example of how to trigger a crawl with curl:

$ curl http://scrapyd.host:6800/schedule.json \
  -d "project=hci" \
  -d "spider=pages" \
  -d "start_urls=http://www.mongodb.org/,http://blog.mongodb.org/" \
  -d "maxdepth=2" \
  -d "follow_prefixes=s:http|t:80|h:org|h:mongodb" \
  -d "nofollow_prefixes=s:http|t:80|h:org|h:mongodb|p:support" \
  -d "discover_prefixes=s:http|t:80|h:ly|h:bit,s:http|t:80|h:ly|h:bit" \
  -d "user_agent=Mozilla/5.0 (compatible; hcibot/0.1)"
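
The same crawl can be scheduled from Python. This is only a sketch using the requests library (not part of the crawler itself), with the same project, spider and arguments as the curl call above:

import requests

params = {
    "project": "hci",
    "spider": "pages",
    "start_urls": "http://www.mongodb.org/,http://blog.mongodb.org/",
    "maxdepth": 2,
    "follow_prefixes": "s:http|t:80|h:org|h:mongodb",
    "nofollow_prefixes": "s:http|t:80|h:org|h:mongodb|p:support",
    "user_agent": "Mozilla/5.0 (compatible; hcibot/0.1)",
}
# schedule.json expects form-encoded POST data, exactly like curl -d
response = requests.post("http://scrapyd.host:6800/schedule.json", data=params)
print(response.json())  # on success, Scrapyd answers with a job id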

Link following

There are 4 spider arguments that control how links are followed: maxdepth, follow_prefixes, nofollow_prefixes, discover_prefixes

Assuming the crawler is on a page whose LRU is current_lru, a link to next_lru will be followed if *all* of the following conditions are met (see the sketch after this list):

  • depth of current_lru is lower than maxdepth
  • next_lru matches any item in follow_prefixes or discover_prefixes
  • next_lru does not match any item in nofollow_prefixes
  • current_lru does not match any item in discover_prefixes
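
Put together, the decision can be sketched in Python as follows; should_follow and lru_matches are illustrative names (not the spider's actual code), and prefix matching is assumed to be a plain prefix test on the LRU string:

def lru_matches(lru, prefixes):
    # True if the LRU starts with any of the given LRU prefixes
    return any(lru.startswith(prefix) for prefix in prefixes)

def should_follow(current_lru, next_lru, depth, maxdepth,
                  follow_prefixes, nofollow_prefixes, discover_prefixes):
    # depth of current_lru must stay below maxdepth
    if depth >= maxdepth:
        return False
    # next_lru must match a followed or discovered prefix
    if not (lru_matches(next_lru, follow_prefixes) or
            lru_matches(next_lru, discover_prefixes)):
        return False
    # next_lru must not match an excluded prefix
    if lru_matches(next_lru, nofollow_prefixes):
        return False
    # pages under a discover prefix are fetched, but their links are not followed
    if lru_matches(current_lru, discover_prefixes):
        return False
    return True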

Item (scraped fields)

These are the fields scraped by the spider (a sample item is sketched after this list):

url
the URL of the page (string)
lru
the LRU of the page (string)
lrulinks
the LRUs of all outgoing links (list of strings)
timestamp
the unix timestamp when the page was crawled (integer)
body
the body of the page (only available in the page store, not in the output queue) (string)
encoding
the encoding of the page (string)
depth
the distance from the page to the initial URLs of the crawl (integer)
content_type
the content type of the page (for example: text/html; charset=utf-8) (string)
redirects_to
the URL this page redirects to (if this page is a redirection) (string)
error
if an error occurred while fetching this page, it is reported here (string). See also: Errors
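
For illustration only, a summarized item (as it would appear on the output queue, without body) might look roughly like this; every value below is made up, using the LRU notation from the example above:

item = {
    "url": "http://www.mongodb.org/",
    "lru": "s:http|t:80|h:org|h:mongodb",
    "lrulinks": ["s:http|t:80|h:org|h:mongodb|p:downloads"],
    "timestamp": 1356048000,
    "encoding": "utf-8",
    "depth": 0,
    "content_type": "text/html; charset=utf-8",
    # "redirects_to" and "error" only appear when relevant
}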

Output

The output is stored in MongoDB (hci database) in these two collections:

crawler.pages
used for storing full pages (including body), keyed by URL (_id is the URL)
crawler.queue
used for storing summarized pages (without body); this collection is meant to be consumed by the core as a queue, using the find_and_modify method to pop messages atomically (see the sketch below)
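
A minimal sketch of how the queue could be consumed with pymongo, assuming MongoDB runs on localhost; find_and_modify is the method named above (recent pymongo versions replace it with find_one_and_delete):

from pymongo import MongoClient

queue = MongoClient("localhost", 27017)["hci"]["crawler.queue"]

# atomically pop one summarized page off the queue
item = queue.find_and_modify(remove=True)
if item is not None:
    print(item["url"], item["depth"])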

Errors

Here are the possible errors that can be found in the error field:

connection_error
connection errors (unable to connect, connection dropped early)
dns_error
DNS lookup error (domain not found)
timeout_error
TCP timeout error