Software architecture group


Team members

Paul, Erik, Guillaume, Thomas, Elias, Sébastian, Camille, Mathieu, Guillaume, Raphaël

Questions

  • What should we do with web contents? Indexing vs. archiving? Scraping?
  • How should qualitative navigation and quantitative crawling work with each other?
  • What place should exploration tools take in the method?
  • Server side vs. client side?
  • Which crawler?
  • Standalone version?

Existing tools review

Issue Crawler

  • written in Java
  • only crawls

Navicrawler

  • Firefox plugin
  • XUL and JavaScript
  • crawls using the Firefox browser engine to harvest links
  • sets a corpus: boundaries + tag system
  • exports in GDF for graph view in Gephi
  • saves in WXSF XML format

INA crawler - phagosite:

  • crawls and archives
  • bots (phagosite, Heritrix, automated Firefox) connect to a scheduler which distributes jobs, with a crawl policy on top (prospect, archive, ...)
  • output = XML
  • easier to distribute on different machines
  • uses DAFF (handles redundancy, in contrast to ARC, ...)
  • VORTEX (proxy, scheduler, application proxy)
  • lots of libraries to parse links and provide ‘services’
  • can be a shared server / repository
  • installable on ‘any’ Linux machine, turning that machine into a proxy which properly archives whatever is requested through it; it outputs DAFF or ARC, the latter of which can be converted into WARC

IIPC tools

Formats

  • Topology: GEXF, Guess .GDF, list of Gephi-supported formats
  • Web corpus:
    • ARC
      • Internet Archive
      • very simple
      • record-based HTTP response storage
    • WARC:
      • IIPC, Heritrix
      • ISO standard
      • complex
      • must define a policy
    • IssueCrawler format (based on XML)
    • Navicrawler formats: GDF, WXSF (based on XML)
    • CSV
    • DAFF
  • Analysis: statistics

Identified modules

  • live crawling
  • archiving
  • web corpus handler
  • web corpus definition
  • exploration tool (see the sketch after this list)
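
A hypothetical sketch, in Python, of how these five modules could be kept separate; every class, method, and type name below is illustrative, not an agreed interface:

```python
from abc import ABC, abstractmethod
from typing import Iterable

class WebCorpusDefinition(ABC):
    """Defines the corpus boundaries (WebEntities) before a crawling iteration."""
    @abstractmethod
    def contains(self, url: str) -> bool: ...

class LiveCrawler(ABC):
    """Fetches pages and harvests links, staying inside the corpus definition."""
    @abstractmethod
    def crawl(self, start_urls: Iterable[str], definition: WebCorpusDefinition) -> None: ...

class Archiver(ABC):
    """Stores fetched HTTP responses (e.g. in DAFF, ARC or WARC files)."""
    @abstractmethod
    def store(self, url: str, response: bytes) -> None: ...

class WebCorpusHandler(ABC):
    """Maintains the graph of pages (or stems) and the WebEntities built on it."""
    @abstractmethod
    def add_link(self, source_url: str, target_url: str) -> None: ...

class ExplorationTool(ABC):
    """Reads from the corpus handler, e.g. to export the topology as GEXF/GDF."""
    @abstractmethod
    def export_topology(self, path: str) -> None: ...
```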

Questions - Answers

Crawl session

  • granularity of crawling
    • The granularity defines a tree of “grains”
    • Granularity is the technical precision (an approximation that may be necessary in large crawls)
    • Examples:
      • “Full” precision: we take all URLs and consider them just as strings.
        1. The graph of pages is stored
        2. We can define WebEntities by regexp
      • “Stem” precision: URLs are considered as a series of stems.
        For example, “blogspot.com/tommaso” is the series of stems “blogspot”, “com” and “tommaso” (simplification). So the minimal nodes are stemmed URLs (we have a graph of stemmed URLs) AND WebEntities are defined only according to stems.
        1. The graph of stemmed URLs (and links from a stemmed URL to other stemmed URLs) is stored
        2. We can define WebEntities as all the stemmed URLs that begin with the same series of stems, e.g. “blogspot.com” followed by anything (see the sketch after this list).
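
To make the two precision levels concrete, here is a minimal, hypothetical sketch in Python (every function and variable name is illustrative, not the crawler's actual API): a URL is split into a series of stems as in the simplified example above, and a WebEntity is either a prefix of stems (“stem” precision) or a regexp over raw URLs (“full” precision).

```python
import re
from urllib.parse import urlparse

def url_to_stems(url):
    """Split a URL into a series of stems, following the simplified example
    in these notes: host labels, then path segments.
    e.g. "http://blogspot.com/tommaso" -> ["blogspot", "com", "tommaso"]
    """
    parsed = urlparse(url)
    host_stems = parsed.hostname.split(".") if parsed.hostname else []
    path_stems = [segment for segment in parsed.path.split("/") if segment]
    return host_stems + path_stems

def in_web_entity(url, entity_prefix):
    """'Stem' precision: a WebEntity defined by a prefix of stems contains
    every stemmed URL that begins with that same series of stems."""
    stems = url_to_stems(url)
    return stems[:len(entity_prefix)] == entity_prefix

# The WebEntity "blogspot.com/tommaso" expressed as a stem prefix:
tommaso = ["blogspot", "com", "tommaso"]
print(in_web_entity("http://blogspot.com/tommaso/2012/post.html", tommaso))  # True
print(in_web_entity("http://blogspot.com/another-blog", tommaso))            # False

# 'Full' precision keeps raw URLs as plain strings, so a WebEntity can be any regexp:
tommaso_re = re.compile(r"^https?://blogspot\.com/tommaso(/|$)")
print(bool(tommaso_re.match("http://blogspot.com/tommaso/2012/post.html")))  # True
```

Note that a real implementation would likely reverse the host labels (com, blogspot, ...) so that a prefix can also group subdomains; the order above simply mirrors the simplification used in the example.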

What is a web entity?

  • boundaries on the URLs
  • qualification

Can we change web entities?

  • yes, within the limits of the granularity

When to define web entities?

  • before a crawling iteration

Work at page level?

  • depends on the granularity
