Pholcidae, commonly known as cellar spiders, are a spider family in the suborder Araneomorphae.
Pholcidae is a tiny Python module that lets you write your own crawl spider quickly and easily.
See the end of this README for the changes in v2.
- Python 2.7 or higher
pip install git+https://github.com/bbrodriges/pholcidae.git
from pholcidae2 import Pholcidae

class MySpider(Pholcidae):
    def crawl(self, data):
        print(data.url)

settings = {'domain': 'www.test.com', 'start_page': '/sitemap/'}

spider = MySpider()
spider.extend(settings)
spider.start()
Settings must be passed as a dictionary to the extend method of the crawler.
Params you can use:
Required
- domain string - the domain whose pages will be parsed. Specify it without a trailing slash.
Additional
- start_page string - URL used as the entry point to the parsed site. Default: /
- protocol string - protocol used by the crawler. Default: http://
- valid_links list - list of regular expression strings (or full URLs) used to filter the site URLs that are passed to the crawl() method. Default: ['(.*)']
- append_to_links string - text appended to each link before fetching it. Default: ''
- exclude_links list - list of regular expression strings (or full URLs) matching site URLs that must not be checked at all. Default: []
- cookies dict - dictionary of string key-value pairs giving the cookie names and values sent with each site URL request. Default: {}
- headers dict - dictionary of string key-value pairs giving the header names and values sent with each site URL request. Default: {}
- follow_redirects bool - whether the crawler follows redirects; when set to False, the crawler bypasses 30x headers and does not follow redirects. Default: True
- precrawl string - name of the function called before the crawler starts. Default: None
- postcrawl string - name of the function called after crawling ends. Default: None
- callbacks dict - dictionary mapping URL patterns from the valid_links list to the string names of self-defined methods that receive the parsed data. Default: {}
- proxy dict - dictionary mapping protocol names to proxy URLs, e.g. {'http': 'http://user:passwd@host:port'}. Default: {}
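To make the interplay between valid_links, exclude_links and callbacks more concrete, here is a minimal standalone sketch. It does not import Pholcidae and is not the library's actual matching code: the route helper, the use of re.search, the fallback to crawl, and the method names parse_product and parse_article are all assumptions made for illustration.

```python
import re

# Settings in the shape described above; the patterns and callback
# names are made up for this example.
settings = {
    'domain': 'www.test.com',
    'start_page': '/sitemap/',
    'valid_links': [r'/product-(\d+)/', r'/article/(.*)'],
    'exclude_links': [r'/private/(.*)'],
    'callbacks': {
        r'/product-(\d+)/': 'parse_product',
        r'/article/(.*)': 'parse_article',
    },
}


def route(url, settings):
    """Return the callback name for a URL, or None if the URL is skipped.

    Assumption: patterns are applied with re.search; the real crawler
    may match URLs differently.
    """
    for pattern in settings.get('exclude_links', []):
        if re.search(pattern, url):
            return None  # excluded URLs are not checked at all
    for pattern in settings.get('valid_links', ['(.*)']):
        if re.search(pattern, url):
            # Assumed fallback: use crawl when no callback is mapped.
            return settings.get('callbacks', {}).get(pattern, 'crawl')
    return None
```

In a spider subclass, the returned name could then be resolved with getattr(self, name) to obtain the bound method.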
New in v2:
- silent_links list - list of regular expression strings (or full URLs) matching site URLs whose page data must not be passed to the callback function, although URLs are still collected from those pages. Default: []
- valid_mimes list - list of strings representing valid MIME types. Only URLs identified with these MIME types will be parsed. Default: []
- threads int - number of concurrent page-fetcher threads. Default: 1
- with_lock bool - whether to use a lock while syncing URLs. It slightly decreases crawling speed but eliminates race conditions. Default: True
- hashed bool - whether to store parsed URLs as shortened SHA1 hashes. The crawler may run a little slower but consumes much less memory. Default: False
- respect_robots_txt bool - whether to read the robots.txt file before starting and add its Disallow directives to the exclude_links list. Default: True
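The hashed option trades a little speed for memory. A minimal sketch of the idea follows, assuming the shortened hash is a fixed-length prefix of the SHA1 hex digest. The 12-character length and the remember helper are hypothetical and are not taken from Pholcidae's source.

```python
import hashlib

HASH_LENGTH = 12  # hypothetical prefix length, chosen for illustration


def short_hash(url):
    """Return a shortened SHA1 hex digest of a URL."""
    return hashlib.sha1(url.encode('utf-8')).hexdigest()[:HASH_LENGTH]


def remember(url, seen):
    """Record a URL in the seen set; return True if it was new."""
    key = short_hash(url)
    if key in seen:
        return False
    seen.add(key)
    return True


seen = set()
first = remember('http://www.test.com/sitemap/', seen)   # new URL
second = remember('http://www.test.com/sitemap/', seen)  # already stored
```

A shorter prefix saves more memory but raises the chance that two different URLs collide, in which case one of them would be treated as already visited.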
When inheriting from the Pholcidae class, you can override the built-in crawl() method to retrieve the data gathered from a page. The response object passed to it contains different attributes depending on whether the page was parsed successfully.
Successful parsing
- body string - raw HTML/XML/XHTML etc. representation of page.
- url string - URL of parsed page.
- headers AttrDict - dictionary of response headers.
- cookies AttrDict - dictionary of response cookies.
- status int - HTTP status of response (e.g. 200).
- match list - matched part from valid_links regex.
Unsuccessful parsing
- body string - raw representation of error.
- status int - HTTP status of response (e.g. 400). Default: 500
- url string - URL of parsed page.
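As a rough illustration of handling both kinds of response inside crawl(), the sketch below builds a one-line summary from the attributes listed above. The summarize helper and the SimpleNamespace stand-ins are hypothetical, and treating any status below 400 as a successful parse is an assumption, not documented behavior.

```python
from types import SimpleNamespace


def summarize(data):
    """Build a one-line summary of a response object passed to crawl()."""
    # Assumption: a status below 400 means the page was parsed successfully.
    if data.status < 400:
        # 'match' is only documented for successful responses.
        match = getattr(data, 'match', [])
        return 'OK {} {} match={}'.format(data.status, data.url, match)
    # Unsuccessful responses only carry body, status and url.
    return 'FAIL {} {}: {}'.format(data.status, data.url, data.body[:80])


# Stand-ins for the objects the crawler would pass to crawl().
ok = SimpleNamespace(body='<html></html>', url='http://www.test.com/a/',
                     headers={}, cookies={}, status=200, match=['a'])
failed = SimpleNamespace(body='Not Found', url='http://www.test.com/b/',
                         status=404)
```

Inside a spider, crawl(self, data) could simply call print(summarize(data)).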
See test.py
Pholcidae does not contain any built-in XML, XHTML, HTML or other parser. You can add your own response body parsing using any available Python libraries you want.
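Since body parsing is left to you, one possible approach is the standard library's html.parser (Python 3 module name). The sketch below collects the page title and anchor hrefs from a body string. The LinkAndTitleParser class and the parse_body helper are hypothetical names for this example and are not part of Pholcidae.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collect the page title and all href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip anchors without an href attribute
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_body(body):
    """Return (title, links) extracted from an HTML string."""
    parser = LinkAndTitleParser()
    parser.feed(body)
    parser.close()
    return parser.title.strip(), parser.links
```

A crawl() override could call parse_body(data.body) for successful responses; third-party parsers can be swapped in the same way.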
Major changes have been made in version 2.0:
- All code has been completely rewritten from scratch
- Less abstractions = more speed
- Threads support
- Matches in page data are now a list and are no longer optional
- Option stay_in_domain has been removed. The crawler can no longer break out of the initial domain.
There are some minor code changes that break backward compatibility between versions 1.x and 2.0:
- You need to explicitly pass settings to the extend method of your crawler
- Option autostart has been removed. You must call spider.start() explicitly
- Module is now called pholcidae2