This repository has been archived by the owner on Nov 21, 2018. It is now read-only.

Document extraction app added
chauff committed Jul 26, 2013
1 parent db38e53 commit b7c327d
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -102,10 +102,10 @@ $ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
```

The parameters are:
-+`docidsfile`: a file with one docid per line; all docids are extracted from the WARC input files
-+`input`: list of WARC files
-+`keephtml`: parameter that is either `true` (keep the HTML source of each document) or `false` (parse the documents, remove HTML)
-+`output`: folder where the documents' content is stored - one file per docid
++ `docidsfile`: a file with one docid per line; all docids are extracted from the WARC input files
++ `input`: list of WARC files
++ `keephtml`: parameter that is either `true` (keep the HTML source of each document) or `false` (parse the documents, remove HTML)
++ `output`: folder where the documents' content is stored - one file per docid


Retrieval runs
