Crawler
The crawler fetches web pages and gathers information about outgoing links, redirects, etc.
Each crawl task is defined by a set of input arguments:
- the crawler runs as a service on scrapyd, which listens for crawl task requests
- each crawl task triggers a spider run
- crawl tasks are defined by a set of spider arguments
The crawler is controlled through the Scrapyd API, an HTTP JSON API (see the Scrapyd documentation)
- the spider arguments are passed to the Scrapyd API using the schedule.json endpoint
- the values of multi-valued fields (start_urls, follow_prefixes, etc.) should be passed as comma-separated lists: value1,value2,value3
The supported spider arguments are:
- start_urls: the start URLs for the crawl
- maxdepth: how deep to crawl from the start URLs
- follow_prefixes: list of LRU prefixes that will be followed. See also: Link following
- nofollow_prefixes: list of LRU prefixes that will not be followed. See also: Link following
- discover_prefixes: list of LRU prefixes whose pages are fetched but whose outgoing links are not followed (typically URL shorteners, so that redirections can be resolved). See also: Link following
- user_agent: the user agent to use when crawling
Here is an example of how to trigger a crawl with curl:
$ curl http://scrapyd.host:6800/schedule.json \
  -d "project=hci" \
  -d "spider=pages" \
  -d "start_urls=http://www.mongodb.org/,http://blog.mongodb.org/" \
  -d "maxdepth=2" \
  -d "follow_prefixes=s:http|t:80|h:org|h:mongodb" \
  -d "nofollow_prefixes=s:http|t:80|h:org|h:mongodb|p:support" \
  -d "discover_prefixes=s:http|t:80|h:ly|h:bit,s:http|t:80|h:ly|h:bit" \
  -d "user_agent=Mozilla/5.0 (compatible; hcibot/0.1)"
There are 4 spider arguments that control how links are followed: maxdepth, follow_prefixes, nofollow_prefixes, discover_prefixes
Assuming the crawler is currently on a page whose LRU is current_lru, a link to next_lru will be followed if *all* of the following conditions are met (see the sketch after this list):
- depth of current_lru is lower than maxdepth
- next_lru matches any item in follow_prefixes or discover_prefixes
- next_lru does not match any item in nofollow_prefixes
- current_lru does not match any item in discover_prefixes
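Here is a minimal Python sketch of this decision logic. It assumes LRUs are plain strings, that prefix matching is a simple startswith test, and that the comma-separated prefix lists have already been split; the function name should_follow is hypothetical.

def should_follow(current_lru, current_depth, next_lru, maxdepth,
                  follow_prefixes, nofollow_prefixes, discover_prefixes):
    # 1. the depth of current_lru must be lower than maxdepth
    if current_depth >= maxdepth:
        return False
    # 2. next_lru must match an item in follow_prefixes or discover_prefixes
    if not any(next_lru.startswith(p) for p in follow_prefixes + discover_prefixes):
        return False
    # 3. next_lru must not match any item in nofollow_prefixes
    if any(next_lru.startswith(p) for p in nofollow_prefixes):
        return False
    # 4. current_lru must not match any item in discover_prefixes
    if any(current_lru.startswith(p) for p in discover_prefixes):
        return False
    return True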
These are the fields scraped by the spider (a sample item follows the list):
- url: the URL of the page (string)
- lru: the LRU of the page (string)
- lrulinks: the LRUs of all outgoing links (list of strings)
- timestamp: the unix timestamp when the page was crawled (integer)
- body: the body of the page (only available in the page store, not in the output queue) (string)
- encoding: the encoding of the page (string)
- depth: the distance from the page to the initial URLs of the crawl (integer)
- content_type: the content type of the page (for example: text/html; charset=utf-8) (string)
- redirects_to: the URL this one redirects to (if the current URL is a redirection) (string)
- error: if there was an error trying to fetch this page, it will be populated here (string). See also: Errors
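For illustration, a summarized queue item could look like the following document. All values are made up, and body, redirects_to, and error are omitted here.

{
    "url": "http://www.mongodb.org/",
    "lru": "s:http|t:80|h:org|h:mongodb|p:",
    "lrulinks": ["s:http|t:80|h:org|h:mongodb|p:downloads"],
    "timestamp": 1357041600,
    "encoding": "utf-8",
    "depth": 0,
    "content_type": "text/html; charset=utf-8"
}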
The output is stored in MongoDB (hci database) in these two collections:
- crawler.pages: used for storing full pages (including body), keyed by URL (_id is the url)
- crawler.queue: used for storing summarized pages (without body); this collection is meant to be consumed by the core as a queue, popping messages atomically with the find_and_modify method (see the sketch below)
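A minimal consumer sketch with pymongo is shown below. The database and collection names follow the text above; the sort key and the rest of the snippet are assumptions. find_one_and_delete is the modern pymongo counterpart of the findAndModify command mentioned above.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
queue = client["hci"]["crawler.queue"]

# Atomically pop the oldest summarized page from the queue
# (find_one_and_delete is backed by MongoDB's findAndModify command).
item = queue.find_one_and_delete({}, sort=[("timestamp", 1)])

if item is not None:
    if item.get("error"):
        # The page could not be fetched; see the Errors section for possible values.
        print("fetch failed for %s: %s" % (item["url"], item["error"]))
    else:
        print("crawled %s at depth %d" % (item["url"], item["depth"]))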
Here are the possible errors that can be found in the error field:
- connection_error: connection errors (unable to connect, connection dropped early)
- dns_error: DNS lookup error (domain not found)
- timeout_error: TCP timeout error
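As a quick diagnostic, the page store can be grouped by error type to see how many fetches failed and why. This is a hypothetical pymongo query, not part of the crawler itself.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
pages = client["hci"]["crawler.pages"]

# Count stored pages per error value; pages fetched without error group under None.
for row in pages.aggregate([{"$group": {"_id": "$error", "count": {"$sum": 1}}}]):
    print(row["_id"], row["count"])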