
ClueWeb Tools (Fork) - Retrieval models

The code is forked from Jimmy Lin's clueweb repository. The only changes are the addition of a retrieval app and a spam filter app.

Retrieval

Currently implemented is the basic language modeling approach to IR; the supported smoothing types are linear interpolation and Dirichlet smoothing.
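For reference, the two variants score a query term t in a document d with the standard query-likelihood formulas below (the notation is ours, not taken from the code):

	\[ P_{\mathrm{JM}}(t \mid d) = \lambda \frac{tf_{t,d}}{|d|} + (1 - \lambda)\,\frac{cf_t}{|C|} \]
	\[ P_{\mathrm{Dir}}(t \mid d) = \frac{tf_{t,d} + \mu \, cf_t / |C|}{|d| + \mu} \]

Here tf_{t,d} is the frequency of t in d, |d| the document length, cf_t the collection frequency of t, |C| the total number of terms in the collection, lambda the interpolation weight, and mu the Dirichlet parameter.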

To run the code, first follow the installation guidelines of the original [clueweb repository](https://github.com/lintool/clueweb) and build the dictionary and document vectors as described there.

To conduct a retrieval run, call:

$ hadoop jar clueweb-tools-X.X-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.LMRetrieval \
	-dictionary /data/private/clueweb12/derived/dictionary.XXX \
	-smoothing 1000 \
	-output /user/chauff/res.dir1000 \
	-queries /user/chauff/web.queries.trec2013 \
	-docvector /data/private/clueweb12/derived/docvectors.XXX/*/part* \
	-topk 1000 \
	-preprocessing porter

The parameters are:

  • dictionary: HDFS path to the dictionary created by the clueweb tools
  • smoothing: the smoothing parameter of the LM-based retrieval model; a value of <=1 automatically backs off to smoothing with linear interpolation, while a value >1 runs Dirichlet smoothing (default is 1000); see the sketch after this list
  • output: folder in which the TREC results are collected (in TREC result file format); to merge everything into one file at the end, call hadoop fs -getmerge /user/chauff/res.dir1000 res.dir1000; the merged result file should run smoothly through trec_eval
  • queries: HDFS path to the query file (the assumed format is the same as this year's distributed query file, i.e. one query per line in the form [queryID]:[term1] [term2] ...)
  • docvector: HDFS path to the document vectors (PFor format) created by the clueweb tools; note that * must be used to identify the files (instead of just the folder)
  • topk: number of results to return per query (default is 1000)
  • preprocessing: the tokenization/stemming procedure; currently either porter or standard; must match the preprocessing used to build the dictionary and document vectors
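How the smoothing parameter switches between the two models can be illustrated with a minimal, hypothetical sketch (the class and method names below are ours, not those of LMRetrieval):

```java
// Hypothetical sketch of the smoothing switch; not the actual LMRetrieval code.
public class LMScorerSketch {

    private final double smoothing; // <=1: interpolation weight lambda; >1: Dirichlet mu

    public LMScorerSketch(double smoothing) {
        this.smoothing = smoothing;
    }

    /**
     * Log-probability of one query term for one document.
     * tf: term frequency in the document, docLen: document length,
     * cf: collection frequency of the term, colLen: total terms in the collection.
     */
    public double logProb(double tf, double docLen, double cf, double colLen) {
        double colProb = cf / colLen; // collection language model P(t|C)
        double p;
        if (smoothing <= 1.0) {
            // linear interpolation (Jelinek-Mercer) with lambda = smoothing
            p = smoothing * (tf / docLen) + (1.0 - smoothing) * colProb;
        } else {
            // Dirichlet smoothing with mu = smoothing
            p = (tf + smoothing * colProb) / (docLen + smoothing);
        }
        return Math.log(p);
    }
}
```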

Spam Filter

The spam scores provided by UWaterloo are also available on sara.

The spam filtering app takes as input a TREC result file (e.g. one generated by the retrieval app) and filters out all documents with a spam score below a given threshold. Spam scores range from 0 to 99, with the spammiest documents scoring 0 and the most non-spam documents scoring 99. Using 70 as the threshold usually works well.
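The filtering rule itself is a simple threshold test; a minimal sketch, assuming the spam scores have already been loaded into a map keyed by docid (the names below are ours, not those of SpamScoreFiltering):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the spam threshold test; not the actual SpamScoreFiltering code.
public final class SpamFilterSketch {

    /** Keeps only docids whose spam score is at or above the threshold. */
    public static List<String> filter(List<String> rankedDocids,
                                      Map<String, Integer> spamScores,
                                      int threshold) {
        return rankedDocids.stream()
                // docids without a score are treated as maximally spammy here (an assumption)
                .filter(docid -> spamScores.getOrDefault(docid, 0) >= threshold)
                .collect(Collectors.toList());
    }
}
```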

To run the code, call:

$ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.SpamScoreFiltering \
	-output /user/chauff/res.dir1000.porter.spamFiltered \
	-spamScoreFolder /data/private/clueweb12/derived/waterloo-spam-cw12-decoded \
	-spamThreshold 70 \
	-trecinputfile /user/chauff/res.dir1000.porter

The parameters are:

  • output: folder in which the TREC results are collected (in TREC result file format)
  • spamScoreFolder: HDFS path to the folder where the UWaterloo spam scores reside
  • spamThreshold: documents with a spam score BELOW this number are considered spam
  • trecinputfile: HDFS path to the TREC result file which is used as starting point for filtering

De-duplication

To increase diversity, duplicate documents can be removed from the result ranking (in effect pushing lower-ranked results up the ranking).

A simple cosine-based similarity approach is implemented in DuplicateFiltering: every document at rank x is compared to all non-duplicate documents at higher ranks; if its cosine similarity to any of them exceeds the threshold, it is filtered out. A sketch of the cosine computation is shown below.
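A minimal sketch of the cosine computation, assuming each document is represented as a termId-to-weight map (the names below are ours, not those of DuplicateFiltering):

```java
import java.util.Map;

// Hypothetical sketch of the cosine similarity test; not the actual DuplicateFiltering code.
public final class CosineSketch {

    /** Cosine similarity between two documents given as termId -> weight maps. */
    public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    private static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

A document at rank x is kept only if its cosine similarity to every already-kept document at a better rank stays below cosineSimThreshold.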

To run the code, call:

$ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.DuplicateFiltering \
	-cosineSimThreshold 0.8 \
	-dictionary /data/private/clueweb12/derived/dictionary.XXX \
	-docvector /data/private/clueweb12/derived/docvectors.XXX/*/part* \
	-output /user/chauff/res.dir1000.porter.deduplicated \
	-topk 1000 \
	-trecinputfile /user/chauff/res.dir1000.porter

The parameters (apart from the usual ones) are:

  • cosineSimThreshold: documents having a cosine similarity above this threshold are removed from the result file
  • trecinputfile: file in TREC result format which is used as a starting point for de-duplication

Document extraction

A helper app: given a file with a list of docids, it extracts the documents' content from the WARC files.

To run the code, call:

$ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.DocumentExtractor \
	-docidsfile /user/chauff/docids \
	-input /data/private/clueweb12/Disk*/*/*/*.warc.gz \
	-keephtml false \
	-output /user/chauff/docids-output

The parameters are:

  • docidsfile: a file with one docid per line; the documents with these docids are extracted from the WARC input files
  • input: list of WARC files
  • keephtml: parameter that is either true (keep the HTML source of each document) or false (parse the documents, remove HTML)
  • output: folder where the documents' content is stored - one file per docid

Retrieval runs

The files runs/res.dir1000.{standard,porter} contain the baseline results when running the above retrieval program (i.e. LM with Dirichlet smoothing and mu=1000) with standard and porter preprocessing respectively. On an empty sara cluster, this run on 50 queries takes about one hour.

The file runs/res.dir1000.porter.spamFiltered is based on runs/res.dir1000.porter and filters out all documents with a spam score below 70.

Sanity check

To gain confidence in the implementation, the baseline runs are compared with the official 2013 baselines (Indri runs) provided by the TREC Web track organizers.

Since no relevance judgments are available for ClueWeb12, we report the overlap in document ids among the top 10 / top 1000 ranked documents for each query between our baselines and the organizers' QL baseline (results-catA.txt). The Perl script to compute the overlap is available as well; a sketch of the computation follows below.
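The overlap itself is straightforward to compute; a minimal sketch in Java (the repository's actual script is written in Perl), assuming the top-k docids per query have already been parsed from both result files:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the overlap computation; the repository uses a Perl script instead.
public final class OverlapSketch {

    /** Fraction of docids in our top-k ranking that also appear in the baseline's top-k. */
    public static double overlap(List<String> oursTopK, List<String> baselineTopK) {
        Set<String> baseline = new HashSet<>(baselineTopK);
        long shared = oursTopK.stream().filter(baseline::contains).count();
        return (double) shared / oursTopK.size();
    }
}
```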

The organizers' baseline was run with Krovetz stemming, so we expect the Porter-based run to have a higher overlap than the standard run. This is indeed the case. The few 0% queries can be explained by differences in tokenization, HTML parsing, and stemming (Porter is more aggressive than Krovetz).

| Query | Standard Top 10 | Standard Top 1000 | Porter Top 10 | Porter Top 1000 |
|-------|-----------------|-------------------|---------------|-----------------|
| 201 | 90% | 84% | 90% | 85% |
| 202 | 60% | 88% | 70% | 88% |
| 203 | 60% | 66% | 70% | 73% |
| 204 | 20% | 70% | 70% | 83% |
| 205 | 30% | 46% | 60% | 70% |
| 206 | 70% | 85% | 70% | 87% |
| 207 | 0% | 15% | 0% | 15% |
| 208 | 60% | 89% | 60% | 91% |
| 209 | 30% | 57% | 80% | 81% |
| 210 | 50% | 81% | 70% | 83% |
| 211 | 20% | 22% | 50% | 52% |
| 212 | 30% | 46% | 60% | 86% |
| 213 | 90% | 92% | 90% | 95% |
| 214 | 60% | 67% | 100% | 83% |
| 215 | 10% | 53% | 20% | 60% |
| 216 | 20% | 50% | 60% | 82% |
| 217 | 30% | 63% | 40% | 58% |
| 218 | 50% | 59% | 80% | 89% |
| 219 | 0% | 14% | 0% | 15% |
| 220 | 10% | 24% | 40% | 67% |
| 221 | 40% | 69% | 60% | 71% |
| 222 | 90% | 73% | 100% | 88% |
| 223 | 100% | 81% | 100% | 86% |
| 224 | 40% | 47% | 40% | 59% |
| 225 | 80% | 88% | 80% | 83% |
| 226 | 70% | 88% | 70% | 88% |
| 227 | 0% | 3% | 0% | 5% |
| 228 | 10% | 42% | 40% | 57% |
| 229 | 60% | 80% | 90% | 91% |
| 230 | 0% | 29% | 50% | 28% |
| 231 | 0% | 28% | 30% | 27% |
| 232 | 80% | 63% | 80% | 74% |
| 233 | 70% | 86% | 70% | 94% |
| 234 | 70% | 85% | 80% | 89% |
| 235 | 60% | 76% | 60% | 84% |
| 236 | 70% | 74% | 80% | 84% |
| 237 | 90% | 52% | 80% | 60% |
| 238 | 70% | 63% | 70% | 68% |
| 239 | 70% | 92% | 90% | 93% |
| 240 | 80% | 45% | 80% | 75% |
| 241 | 0% | 2% | 0% | 33% |
| 242 | 60% | 89% | 100% | 94% |
| 243 | 40% | 82% | 60% | 84% |
| 244 | 70% | 92% | 80% | 92% |
| 245 | 50% | 78% | 30% | 83% |
| 246 | 30% | 72% | 80% | 81% |
| 247 | 80% | 56% | 90% | 60% |
| 248 | 50% | 63% | 90% | 87% |
| 249 | 0% | 2% | 0% | 4% |
| 250 | 90% | 53% | 90% | 57% |
