Skip to content

danielfagerstrom/nutch-xquery-parser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Nutch XQuery Parser

Parses the html content in segment directories from Apache nutch using XQuery scripts. What XQuery script that is used for the parsing is based on:

  • Domain name
  • regexp on the URL path
  • XPath on the document (optionally)

This is configured in an XML file.

Build

Note: Maven 3.X is required to build nutch-xquery-parser

Download from http://maven.apache.org/download.html

Download Nutch 1.4

bin/get-parse-html.sh

bin/get-saxon.sh

mvn install

To create an eclipse project:

mvn eclipse:eclipse -DdownloadJavadocs=true -DdownloadSources=true

To create a runable jar

mvn package

Execute

On a segment directory:

java -jar target/nutch-xquery-parser-1.0-SNAPSHOT-jar-with-dependencies.jar -libjars <conf dir> -D xqueryparser.rules.file=xquery/parse-rules.xml <segment dir(s)> <result dir>

On a sigle URL:

java  -cp "<conf dir>:target/nutch-xquery-parser-1.0-SNAPSHOT-jar-with-dependencies.jar" com.peer2gear.nutch.xquery.XQueryParser <url>

About

Parse HTML pages fetched with Apache Nutch with XQuery scripts

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published