Software architecture group


Team members

Paul, Erik, Guillaume, Thomas, Elias, Sébastian, Camille, Mathieu, Guillaume, Raphaël

Questions

  • What should we do with web contents? Indexing vs. archiving? Scraping?
  • How should qualitative navigation and quantitative crawling work with each other?
  • What place should exploration tools take in the method?
  • Server side vs. client side?
  • Which crawler?
  • Standalone version?

Existing tools review

Issue Crawler

  • written in Java
  • only crawls

Navicrawler

  • Firefox plugin
  • XUL and JavaScript
  • crawls using the Firefox browser engine to harvest links
  • sets a corpus: boundaries + tag system
  • exports in GDF for graph view in Gephi
  • saves in WXSF XML format

INA crawler - phagosite:

  • crawls and archives
  • bots (phagosite, Heritrix, automated Firefox) connect to a scheduler which distributes jobs, with a crawl policy on top (prospect, archive, ...)
  • output = XML
  • easier to distribute on different machines
  • uses DAFF (handles redundancy, in contrast to ARC, ...)
  • VORTEX (proxy, scheduler, application proxy)
  • lots of libraries to parse links and provide ‘services’
  • can be a shared server / repository
  • installable on ‘any’ Linux machine, turning that machine into a proxy which properly archives whatever is requested through it; it outputs DAFF or ARC, the latter of which can be converted into WARC

IIPC tools

Formats

  • Topology: GEXF, Guess .GDF, list of Gephi-supported formats
  • Web corpus:
    • ARC
      • Internet Archive
      • very simple
      • record-based HTTP response storage
    • WARC:
      • IIPC, Heritrix
      • ISO standard
      • complex
      • must define a policy
    • IssueCrawler format (based on XML)
    • Navicrawler formats: GDF, WXSF (based on XML)
    • CSV
    • DAFF
  • Analysis: statistics

Identified modules

  • live crawling
  • archiving
  • web corpus handler
  • web corpus definition
  • exploration tool (see the sketch after this list)
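
A hypothetical sketch, in Python, of how these five modules could be kept separate; every class, method, and type name below is illustrative, not an agreed interface:

```python
from abc import ABC, abstractmethod
from typing import Iterable

class WebCorpusDefinition(ABC):
    """Defines the corpus boundaries (WebEntities) before a crawling iteration."""
    @abstractmethod
    def contains(self, url: str) -> bool: ...

class LiveCrawler(ABC):
    """Fetches pages and harvests links, staying inside the corpus definition."""
    @abstractmethod
    def crawl(self, start_urls: Iterable[str], definition: WebCorpusDefinition) -> None: ...

class Archiver(ABC):
    """Stores fetched HTTP responses (e.g. in DAFF, ARC or WARC files)."""
    @abstractmethod
    def store(self, url: str, response: bytes) -> None: ...

class WebCorpusHandler(ABC):
    """Maintains the graph of pages (or stems) and the WebEntities built on it."""
    @abstractmethod
    def add_link(self, source_url: str, target_url: str) -> None: ...

class ExplorationTool(ABC):
    """Reads from the corpus handler, e.g. to export the topology as GEXF/GDF."""
    @abstractmethod
    def export_topology(self, path: str) -> None: ...
```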

Questions - Answers

Crawl session

  • granularity of crawling
    • The granularity defines a tree of “grains”
    • Granularity is the technical precision (an approximation that may be necessary in large crawls)
    • Examples:
      • “Full” precision: we take all URLs and consider them just as strings.
        1. The graph of pages is stored
        2. We can define WebEntities by regexp
      • “Stem” precision: URLs are considered as a series of stems.
        For example, “blogspot.com/tommaso” is the series of stems “blogspot”, “com” and “tommaso” (simplification). So the minimal nodes are stemmed URLs (we have a graph of stemmed URLs) AND WebEntities are defined only according to stems.
        1. The graph of stemmed URLs (and links from a stemmed URL to other stemmed URLs) is stored
        2. We can define WebEntities as all the stemmed URLs that begin with the same series of stems, e.g. “blogspot.com” followed by anything (see the sketch after this list).
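
To make the two precision levels concrete, here is a minimal, hypothetical sketch in Python (every function and variable name is illustrative, not the crawler's actual API): a URL is split into a series of stems as in the simplified example above, and a WebEntity is either a prefix of stems (“stem” precision) or a regexp over raw URLs (“full” precision).

```python
import re
from urllib.parse import urlparse

def url_to_stems(url):
    """Split a URL into a series of stems, following the simplified example
    in these notes: host labels, then path segments.
    e.g. "http://blogspot.com/tommaso" -> ["blogspot", "com", "tommaso"]
    """
    parsed = urlparse(url)
    host_stems = parsed.hostname.split(".") if parsed.hostname else []
    path_stems = [segment for segment in parsed.path.split("/") if segment]
    return host_stems + path_stems

def in_web_entity(url, entity_prefix):
    """'Stem' precision: a WebEntity defined by a prefix of stems contains
    every stemmed URL that begins with that same series of stems."""
    stems = url_to_stems(url)
    return stems[:len(entity_prefix)] == entity_prefix

# The WebEntity "blogspot.com/tommaso" expressed as a stem prefix:
tommaso = ["blogspot", "com", "tommaso"]
print(in_web_entity("http://blogspot.com/tommaso/2012/post.html", tommaso))  # True
print(in_web_entity("http://blogspot.com/another-blog", tommaso))            # False

# 'Full' precision keeps raw URLs as plain strings, so a WebEntity can be any regexp:
tommaso_re = re.compile(r"^https?://blogspot\.com/tommaso(/|$)")
print(bool(tommaso_re.match("http://blogspot.com/tommaso/2012/post.html")))  # True
```

Note that a real implementation would likely reverse the host labels (com, blogspot, ...) so that a prefix can also group subdomains; the order above simply mirrors the simplification used in the example.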

What is a web entity?

  • boundaries on the URLs
  • qualification

Can we change web entities?

  • yes, within the limits of the granularity

When to define web entities?

  • before a crawling iteration

Work at page level?

  • depends on the granularity
