Crawler
The crawler fetches web pages and gathers information about outgoing links, redirects, etc.
Each crawl task is defined by a set of input arguments:
- the crawler runs as a service on scrapyd, which listens for crawl task requests
- each crawl task triggers a spider run
- crawl tasks are defined by a set of spider arguments
The crawler is controlled through the Scrapyd API, an HTTP JSON API (see the Scrapyd documentation)
- the spider arguments are passed to the Scrapyd API using the schedule.json endpoint
- the values of multi-valued fields (start_urls, follow_prefixes, etc.) should be passed as comma-separated lists: value1,value2,value3
The supported spider arguments are:
- start_urls: the start URLs for the crawl
- maxdepth: how deep to crawl from the start URLs
- follow_prefixes: list of LRU prefixes that will be followed. See also: Link following
- nofollow_prefixes: list of LRU prefixes that will not be followed. See also: Link following
- discover_prefixes: list of LRU prefixes whose pages are fetched but whose outgoing links are not followed (typically URL shorteners, so that redirections can be resolved). See also: Link following
- user_agent: the user agent to use when crawling
Here is an example of how to trigger a crawl with curl:
$ curl http://scrapyd.host:6800/schedule.json \
  -d "project=hci" \
  -d "spider=pages" \
  -d "start_urls=http://www.mongodb.org/,http://blog.mongodb.org/" \
  -d "maxdepth=2" \
  -d "follow_prefixes=s:http|t:80|h:org|h:mongodb" \
  -d "nofollow_prefixes=s:http|t:80|h:org|h:mongodb|p:support" \
  -d "discover_prefixes=s:http|t:80|h:ly|h:bit,s:http|t:80|h:ly|h:bit" \
  -d "user_agent=Mozilla/5.0 (compatible; hcibot/0.1)"
There are 4 spider arguments that control how links are followed: maxdepth, follow_prefixes, nofollow_prefixes, discover_prefixes
Assuming the crawler is currently on a page whose LRU is current_lru, a link to next_lru will be followed if *all* of the following conditions are met (see the sketch after this list):
- depth of current_lru is lower than maxdepth
- next_lru matches any item in follow_prefixes or discover_prefixes
- next_lru does not match any item in nofollow_prefixes
- current_lru does not match any item in discover_prefixes
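Here is a minimal Python sketch of this decision logic. It assumes LRUs are plain strings, that prefix matching is a simple startswith test, and that the comma-separated prefix lists have already been split; the function name should_follow is hypothetical.

def should_follow(current_lru, current_depth, next_lru, maxdepth,
                  follow_prefixes, nofollow_prefixes, discover_prefixes):
    # 1. the depth of current_lru must be lower than maxdepth
    if current_depth >= maxdepth:
        return False
    # 2. next_lru must match an item in follow_prefixes or discover_prefixes
    if not any(next_lru.startswith(p) for p in follow_prefixes + discover_prefixes):
        return False
    # 3. next_lru must not match any item in nofollow_prefixes
    if any(next_lru.startswith(p) for p in nofollow_prefixes):
        return False
    # 4. current_lru must not match any item in discover_prefixes
    if any(current_lru.startswith(p) for p in discover_prefixes):
        return False
    return True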
These are the fields scraped by the spider (a sample item follows the list):
- url: the URL of the page (string)
- lru: the LRU of the page (string)
- lrulinks: the LRUs of all outgoing links (list of strings)
- timestamp: the unix timestamp when the page was crawled (integer)
- body: the body of the page (only available in the page store, not in the output queue) (string)
- encoding: the encoding of the page (string)
- depth: the distance from the page to the initial URLs of the crawl (integer)
- content_type: the content type of the page (for example: text/html; charset=utf-8) (string)
- redirects_to: the URL this one redirects to (if the current URL is a redirection) (string)
- error: if there was an error trying to fetch this page, it will be populated here (string). See also: Errors
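For illustration, a summarized queue item could look like the following document. All values are made up, and body, redirects_to, and error are omitted here.

{
    "url": "http://www.mongodb.org/",
    "lru": "s:http|t:80|h:org|h:mongodb|p:",
    "lrulinks": ["s:http|t:80|h:org|h:mongodb|p:downloads"],
    "timestamp": 1357041600,
    "encoding": "utf-8",
    "depth": 0,
    "content_type": "text/html; charset=utf-8"
}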
The output is stored in MongoDB (hci database) in these two collections:
- crawler.pages: used for storing full pages (including body), keyed by URL (_id is the url)
- crawler.queue: used for storing summarized pages (without body); this collection is meant to be consumed by the core as a queue, popping messages atomically with the find_and_modify method (see the sketch below)
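A minimal consumer sketch with pymongo is shown below. The database and collection names follow the text above; the sort key and the rest of the snippet are assumptions. find_one_and_delete is the modern pymongo counterpart of the findAndModify command mentioned above.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
queue = client["hci"]["crawler.queue"]

# Atomically pop the oldest summarized page from the queue
# (find_one_and_delete is backed by MongoDB's findAndModify command).
item = queue.find_one_and_delete({}, sort=[("timestamp", 1)])

if item is not None:
    if item.get("error"):
        # The page could not be fetched; see the Errors section for possible values.
        print("fetch failed for %s: %s" % (item["url"], item["error"]))
    else:
        print("crawled %s at depth %d" % (item["url"], item["depth"]))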
Here are the possible errors that can be found in the error field:
- connection_error: connection errors (unable to connect, connection dropped early)
- dns_error: DNS lookup error (domain not found)
- timeout_error: TCP timeout error
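As a quick diagnostic, the page store can be grouped by error type to see how many fetches failed and why. This is a hypothetical pymongo query, not part of the crawler itself.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
pages = client["hci"]["crawler.pages"]

# Count stored pages per error value; pages fetched without error group under None.
for row in pages.aggregate([{"$group": {"_id": "$error", "count": {"$sum": 1}}}]):
    print(row["_id"], row["count"])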