The code is forked from Jimmy Lin's clueweb repository. The only changes are the addition of a retrieval app and a spam filter app.
Currently implemented is the basic language modeling approach to IR; the available smoothing types are linear interpolation and Dirichlet.
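For reference, here is a minimal Java sketch of the two scoring variants. It is illustrative only and not the code of `LMRetrieval`; the exact formulas and conventions used by the app may differ.

```java
/**
 * Sketch of query-likelihood scoring with the two smoothing variants.
 * All counts (term frequencies, document length, collection probabilities)
 * are assumed to be looked up elsewhere, e.g. from the dictionary and the
 * document vectors.
 */
public class LMScoringSketch {

    /** Dirichlet smoothing: p(t|d) = (tf + mu * p(t|C)) / (|d| + mu). */
    static double dirichlet(double tf, double docLength,
                            double termCollectionProb, double mu) {
        return Math.log((tf + mu * termCollectionProb) / (docLength + mu));
    }

    /**
     * Linear interpolation (Jelinek-Mercer) smoothing, one common convention:
     * p(t|d) = (1 - lambda) * tf/|d| + lambda * p(t|C).
     */
    static double linearInterpolation(double tf, double docLength,
                                      double termCollectionProb, double lambda) {
        return Math.log((1.0 - lambda) * (tf / docLength)
                + lambda * termCollectionProb);
    }

    /**
     * A document's score is the sum of per-term log-probabilities over the
     * query terms; the smoothing parameter selects the variant
     * (<= 1 -> linear interpolation with lambda, > 1 -> Dirichlet with mu).
     */
    static double score(double[] tfs, double docLength,
                        double[] collectionProbs, double smoothing) {
        double sum = 0.0;
        for (int i = 0; i < tfs.length; i++) {
            sum += (smoothing <= 1.0)
                    ? linearInterpolation(tfs[i], docLength, collectionProbs[i], smoothing)
                    : dirichlet(tfs[i], docLength, collectionProbs[i], smoothing);
        }
        return sum;
    }
}
```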
To run the code, first follow the installation guidelines of the original [clueweb repository](https://github.com/lintool/clueweb) and build the dictionary and document vectors as described there.
To conduct a retrieval run, call:
```
$ hadoop jar clueweb-tools-X.X-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.LMRetrieval \
	-dictionary /data/private/clueweb12/derived/dictionary.XXX \
	-smoothing 1000 \
	-output /user/chauff/res.dir1000 \
	-queries /user/chauff/web.queries.trec2013 \
	-docvector /data/private/clueweb12/derived/docvectors.XXX/*/part* \
	-topk 1000 \
	-preprocessing porter
```
The parameters are:
`dictionary`
: HDFS path to the dictionary created by the clueweb tools

`smoothing`
: the smoothing parameter in the LM-based retrieval model; a value of <=1 automatically backs off to smoothing with linear interpolation, while a value >1 runs Dirichlet smoothing (default is 1000)

`output`
: folder in which the TREC results are collected (in TREC result file format); to merge everything into one file in the end, call `hadoop fs -getmerge /user/chauff/res.dir1000 res.dir1000`; the merged result file should run smoothly through `trec_eval`

`queries`
: HDFS path to the query file (assumed format is the same as this year's distributed query file, i.e. one query per line in the form [queryID]:[term1] [term2] ...)

`docvector`
: HDFS path to the document vectors (PFor format) created by the clueweb tools; beware of the necessity to use `*` to identify the files (instead of just the folder)

`topk`
: number of results that should be returned per query (default is 1000)

`preprocessing`
: indicates the tokenization/stemming procedure; either `porter` or `standard` at the moment; needs to be in line with the dictionary/docvector
The spam scores provided by UWaterloo are also available on sara.
The spam filtering app takes as input a TREC result file (generated, for instance, by the retrieval app) and filters out all documents with a spam score below a certain threshold. Spam scores range between 0 and 99, with the spammiest documents having a score of 0 and the most non-spam documents a score of 99. Using 70 as the threshold usually works well.
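To make the filtering rule concrete, here is a stand-alone sketch (not the Hadoop app itself) of what the filtering amounts to; how the real app reads its input and handles unknown docids is an assumption here.

```java
import java.io.*;
import java.util.*;

/**
 * Sketch of spam filtering: read a TREC result file line by line, look up each
 * document's spam percentile and keep only documents whose score is at or
 * above the threshold. The docid -> score map would be built from the decoded
 * UWaterloo spam score files.
 */
public class SpamFilterSketch {
    public static void filter(BufferedReader trecRun, Map<String, Integer> spamScores,
                              int threshold, PrintWriter out) throws IOException {
        String line;
        while ((line = trecRun.readLine()) != null) {
            // TREC result format: qid Q0 docid rank score runTag
            String docid = line.split("\\s+")[2];
            // Unknown docids are treated as spam in this sketch; the real app may differ.
            int score = spamScores.getOrDefault(docid, 0);
            if (score >= threshold) {   // below the threshold => spam => dropped
                out.println(line);
            }
        }
    }
}
```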
To run the code, call:
```
$ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.SpamScoreFiltering \
	-output /user/chauff/res.dir1000.porter.spamFiltered \
	-spamScoreFolder /data/private/clueweb12/derived/waterloo-spam-cw12-decoded \
	-spamThreshold 70 \
	-trecinputfile /user/chauff/res.dir1000.porter
```
The parameters are:
`output`
: folder in which the TREC results are collected (in TREC result file format)

`spamScoreFolder`
: HDFS path to the folder where the UWaterloo spam scores reside

`spamThreshold`
: documents with a spam score BELOW this number are considered spam

`trecinputfile`
: HDFS path to the TREC result file which is used as the starting point for filtering
To increase diversity, duplicate documents can be removed from the result ranking (in effect pushing lower-ranked results up the ranking).
A simple cosine-similarity approach is implemented in `DuplicateFiltering`: every document at rank x is compared to all non-duplicate documents at higher ranks; if its cosine similarity to any of them is high enough, it is filtered out.
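The following is a minimal, hypothetical sketch of that idea using dense term-frequency arrays; the actual app operates on the PFor-encoded document vectors and may differ in detail.

```java
import java.util.*;

/**
 * Sketch of cosine-based de-duplication over a single query's ranking:
 * walk the ranking top-down and drop any document that is too similar to a
 * document already kept. Vectors are plain tf arrays purely for illustration.
 */
public class DedupSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Returns the ranks of documents that survive de-duplication. */
    static List<Integer> deduplicate(List<double[]> rankedDocVectors, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int rank = 0; rank < rankedDocVectors.size(); rank++) {
            boolean duplicate = false;
            for (int keptRank : kept) {   // compare only against non-duplicates above
                if (cosine(rankedDocVectors.get(rank),
                           rankedDocVectors.get(keptRank)) > threshold) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(rank);
            }
        }
        return kept;
    }
}
```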
To run the code, call:
```
$ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.DuplicateFiltering \
	-cosineSimThreshold 0.8 \
	-dictionary /data/private/clueweb12/derived/dictionary.XXX \
	-docvector /data/private/clueweb12/derived/docvectors.XXX/*/part* \
	-output /user/chauff/res.dir1000.porter.deduplicated \
	-topk 1000 \
	-trecinputfile /user/chauff/res.dir1000.porter
```
The parameters (apart from the usual ones) are:
`cosineSimThreshold`
: documents having a cosine similarity above this threshold are removed from the result file

`trecinputfile`
: file in TREC result format which is used as the starting point for de-duplication
A helper app: given a file with a list of docids, it extracts the documents' content from the WARC files.
To run the code, call:
```
$ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.DocumentExtractor \
	-docidsfile /user/chauff/docids \
	-input /data/private/clueweb12/Disk*/*/*/*.warc.gz \
	-keephtml false \
	-output /user/chauff/docids-output
```
The parameters are:
`docidsfile`
: a file with one docid per line; all listed docids are extracted from the WARC input files

`input`
: list of WARC files

`keephtml`
: parameter that is either `true` (keep the HTML source of each document) or `false` (parse the documents and remove the HTML); see the sketch below

`output`
: folder where the documents' content is stored, one file per docid
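As a rough illustration of what the `keephtml` switch amounts to, here is a hypothetical sketch; Jsoup is used purely for illustration and is an assumption, not necessarily the parser the `DocumentExtractor` app relies on.

```java
import org.jsoup.Jsoup;

/**
 * Sketch only: with -keephtml true the raw record content is written out,
 * with -keephtml false the HTML is parsed and only the visible text is kept.
 */
public class KeepHtmlSketch {
    static String extractContent(String warcRecordHtml, boolean keepHtml) {
        return keepHtml ? warcRecordHtml : Jsoup.parse(warcRecordHtml).text();
    }
}
```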
The files `runs/res.dir1000.{standard,porter}` contain the baseline results when running the above retrieval program (i.e. LM with Dirichlet smoothing and mu=1000) with `standard` and `porter` preprocessing respectively.
On an empty sara cluster, this run on 50 queries takes about one hour.
The file `runs/res.dir1000.porter.spamFiltered` is based on `runs/res.dir1000.porter` and filters out all documents with a spam score below 70.
To have confidence in the implementation, the baseline runs are compared with the official 2013 baselines (Indri runs) provided by the TREC Web track organizers.
Since no relevance judgments are available for ClueWeb12, we report, for each query, the overlap in document ids among the top 10 / top 1000 ranked documents between our baseline and the organizers' QL baseline `results-catA.txt`. The Perl script to compute the overlap is available as well.
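The overlap script in the repository is written in Perl; as a hypothetical illustration of the metric itself, here is a Java sketch that computes, for one query, the fraction of top-k docids two runs share (reading and ranking the TREC result files is assumed to happen elsewhere).

```java
import java.util.*;

/**
 * Sketch of the overlap measure reported in the table below: for one query,
 * take the top-k docids of each run and report |intersection| / k.
 */
public class OverlapSketch {
    static double topKOverlap(List<String> runA, List<String> runB, int k) {
        Set<String> topA = new HashSet<>(runA.subList(0, Math.min(k, runA.size())));
        Set<String> topB = new HashSet<>(runB.subList(0, Math.min(k, runB.size())));
        topA.retainAll(topB);   // intersection of the two top-k sets
        return (double) topA.size() / k;
    }
}
```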
The organizers' baseline was run with Krovetz stemming, so we expect the Porter-based run to have a higher overlap than the `standard` run. This is indeed the case. The few 0% queries can be explained by differences in tokenization and HTML parsing and by the different stemming approaches (Porter is more aggressive than Krovetz).
Query | Standard Top 10 | Standard Top 1000 | Porter Top 10 | Porter Top 1000 |
---|---|---|---|---|
201 | 90% | 84% | 90% | 85% |
202 | 60% | 88% | 70% | 88% |
203 | 60% | 66% | 70% | 73% |
204 | 20% | 70% | 70% | 83% |
205 | 30% | 46% | 60% | 70% |
206 | 70% | 85% | 70% | 87% |
207 | 0% | 15% | 0% | 15% |
208 | 60% | 89% | 60% | 91% |
209 | 30% | 57% | 80% | 81% |
210 | 50% | 81% | 70% | 83% |
211 | 20% | 22% | 50% | 52% |
212 | 30% | 46% | 60% | 86% |
213 | 90% | 92% | 90% | 95% |
214 | 60% | 67% | 100% | 83% |
215 | 10% | 53% | 20% | 60% |
216 | 20% | 50% | 60% | 82% |
217 | 30% | 63% | 40% | 58% |
218 | 50% | 59% | 80% | 89% |
219 | 0% | 14% | 0% | 15% |
220 | 10% | 24% | 40% | 67% |
221 | 40% | 69% | 60% | 71% |
222 | 90% | 73% | 100% | 88% |
223 | 100% | 81% | 100% | 86% |
224 | 40% | 47% | 40% | 59% |
225 | 80% | 88% | 80% | 83% |
226 | 70% | 88% | 70% | 88% |
227 | 0% | 3% | 0% | 5% |
228 | 10% | 42% | 40% | 57% |
229 | 60% | 80% | 90% | 91% |
230 | 0% | 29% | 50% | 28% |
231 | 0% | 28% | 30% | 27% |
232 | 80% | 63% | 80% | 74% |
233 | 70% | 86% | 70% | 94% |
234 | 70% | 85% | 80% | 89% |
235 | 60% | 76% | 60% | 84% |
236 | 70% | 74% | 80% | 84% |
237 | 90% | 52% | 80% | 60% |
238 | 70% | 63% | 70% | 68% |
239 | 70% | 92% | 90% | 93% |
240 | 80% | 45% | 80% | 75% |
241 | 0% | 2% | 0% | 33% |
242 | 60% | 89% | 100% | 94% |
243 | 40% | 82% | 60% | 84% |
244 | 70% | 92% | 80% | 92% |
245 | 50% | 78% | 30% | 83% |
246 | 30% | 72% | 80% | 81% |
247 | 80% | 56% | 90% | 60% |
248 | 50% | 63% | 90% | 87% |
249 | 0% | 2% | 0% | 4% |
250 | 90% | 53% | 90% | 57% |