
ClueWeb Tools (Fork) - Retrieval models

The code is forked from Jimmy Lin's clueweb repository. The only changes are the addition of a retrieval app and a spam filter app.

Retrieval

Currently implemented is the basic language modeling approach to IR; the supported smoothing types are linear interpolation and Dirichlet smoothing.
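For reference, the two variants score a query term t in a document d with the standard query-likelihood formulas below (the notation is ours, not taken from the code):

	\[ P_{\mathrm{JM}}(t \mid d) = \lambda \frac{tf_{t,d}}{|d|} + (1 - \lambda)\,\frac{cf_t}{|C|} \]
	\[ P_{\mathrm{Dir}}(t \mid d) = \frac{tf_{t,d} + \mu \, cf_t / |C|}{|d| + \mu} \]

Here tf_{t,d} is the frequency of t in d, |d| the document length, cf_t the collection frequency of t, |C| the total number of terms in the collection, lambda the interpolation weight, and mu the Dirichlet parameter.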

To run the code, first follow the installation guidelines of the original [clueweb repository](https://github.com/lintool/clueweb) and build the dictionary and document vectors as described there.

To conduct a retrieval run, call:

$ hadoop jar clueweb-tools-X.X-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.LMRetrieval \
	-dictionary /data/private/clueweb12/derived/dictionary.XXX \
	-smoothing 1000 \
	-output /user/chauff/res.dir1000 \
	-queries /user/chauff/web.queries.trec2013 \
	-docvector /data/private/clueweb12/derived/docvectors.XXX/*/part* \
	-topk 1000 \
	-preprocessing porter

The parameters are:

  • dictionary: HDFS path to the dictionary created by the clueweb tools
  • smoothing: the smoothing parameter of the LM-based retrieval model; a value of <=1 automatically backs off to smoothing with linear interpolation, while a value >1 runs Dirichlet smoothing (default is 1000); see the sketch after this list
  • output: folder in which the TREC results are collected (in TREC result file format); to merge everything into one file at the end, call hadoop fs -getmerge /user/chauff/res.dir1000 res.dir1000; the merged result file should run smoothly through trec_eval
  • queries: HDFS path to the query file (the assumed format is the same as this year's distributed query file, i.e. one query per line in the form [queryID]:[term1] [term2] ...)
  • docvector: HDFS path to the document vectors (PFor format) created by the clueweb tools; note that * must be used to identify the files (instead of just the folder)
  • topk: number of results to return per query (default is 1000)
  • preprocessing: the tokenization/stemming procedure; currently either porter or standard; must match the preprocessing used to build the dictionary and document vectors
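How the smoothing parameter switches between the two models can be illustrated with a minimal, hypothetical sketch (the class and method names below are ours, not those of LMRetrieval):

```java
// Hypothetical sketch of the smoothing switch; not the actual LMRetrieval code.
public class LMScorerSketch {

    private final double smoothing; // <=1: interpolation weight lambda; >1: Dirichlet mu

    public LMScorerSketch(double smoothing) {
        this.smoothing = smoothing;
    }

    /**
     * Log-probability of one query term for one document.
     * tf: term frequency in the document, docLen: document length,
     * cf: collection frequency of the term, colLen: total terms in the collection.
     */
    public double logProb(double tf, double docLen, double cf, double colLen) {
        double colProb = cf / colLen; // collection language model P(t|C)
        double p;
        if (smoothing <= 1.0) {
            // linear interpolation (Jelinek-Mercer) with lambda = smoothing
            p = smoothing * (tf / docLen) + (1.0 - smoothing) * colProb;
        } else {
            // Dirichlet smoothing with mu = smoothing
            p = (tf + smoothing * colProb) / (docLen + smoothing);
        }
        return Math.log(p);
    }
}
```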

Spam Filter

The spam scores provided by UWaterloo are also available on sara.

The spam filtering app takes as input a TREC result file (e.g. one generated by the retrieval app) and filters out all documents with a spam score below a given threshold. Spam scores range from 0 to 99, with the spammiest documents scoring 0 and the most non-spam documents scoring 99. Using 70 as the threshold usually works well.
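The filtering rule itself is a simple threshold test; a minimal sketch, assuming the spam scores have already been loaded into a map keyed by docid (the names below are ours, not those of SpamScoreFiltering):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the spam threshold test; not the actual SpamScoreFiltering code.
public final class SpamFilterSketch {

    /** Keeps only docids whose spam score is at or above the threshold. */
    public static List<String> filter(List<String> rankedDocids,
                                      Map<String, Integer> spamScores,
                                      int threshold) {
        return rankedDocids.stream()
                // docids without a score are treated as maximally spammy here (an assumption)
                .filter(docid -> spamScores.getOrDefault(docid, 0) >= threshold)
                .collect(Collectors.toList());
    }
}
```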

To run the code, call:

$ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.SpamScoreFiltering \
	-output /user/chauff/res.dir1000.porter.spamFiltered \
	-spamScoreFolder /data/private/clueweb12/derived/waterloo-spam-cw12-decoded \
	-spamThreshold 70 \
	-trecinputfile /user/chauff/res.dir1000.porter

The parameters are:

  • output: folder in which the TREC results are collected (in TREC result file format)
  • spamScoreFolder: HDFS path to the folder where the UWaterloo spam scores reside
  • spamThreshold: documents with a spam score BELOW this number are considered spam
  • trecinputfile: HDFS path to the TREC result file which is used as starting point for filtering

De-duplication

To increase diversity, duplicate documents can be removed from the result ranking (in effect pushing lower-ranked results up the ranking).

A simple cosine-based similarity approach is implemented in DuplicateFiltering: every document at rank x is compared to all non-duplicate documents at higher ranks; if its cosine similarity to any of them exceeds the threshold, it is filtered out. A sketch of the cosine computation is shown below.
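A minimal sketch of the cosine computation, assuming each document is represented as a termId-to-weight map (the names below are ours, not those of DuplicateFiltering):

```java
import java.util.Map;

// Hypothetical sketch of the cosine similarity test; not the actual DuplicateFiltering code.
public final class CosineSketch {

    /** Cosine similarity between two documents given as termId -> weight maps. */
    public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    private static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

A document at rank x is kept only if its cosine similarity to every already-kept document at a better rank stays below cosineSimThreshold.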

To run the code, call:

$ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.DuplicateFiltering \
	-cosineSimThreshold 0.8 \
	-dictionary /data/private/clueweb12/derived/dictionary.XXX \
	-docvector /data/private/clueweb12/derived/docvectors.XXX/*/part* \
	-output /user/chauff/res.dir1000.porter.deduplicated \
	-topk 1000 \
	-trecinputfile /user/chauff/res.dir1000.porter

The parameters (apart from the usual ones) are:

  • cosineSimThreshold: documents having a cosine similarity above this threshold are removed from the result file
  • trecinputfile: file in TREC result format which is used as a starting point for de-duplication

Document extraction

A helper app: given a file with a list of docids, it extracts the documents' content from the WARC files.

To run the code, call:

$ hadoop jar clueweb-tools-0.3-SNAPSHOT-fatjar.jar \
	org.clueweb.clueweb12.app.DocumentExtractor \
	-docidsfile /user/chauff/docids \
	-input /data/private/clueweb12/Disk*/*/*/*.warc.gz \
	-keephtml false \
	-output /user/chauff/docids-output

The parameters are:

  • docidsfile: a file with one docid per line; the documents with these docids are extracted from the WARC input files
  • input: list of WARC files
  • keephtml: parameter that is either true (keep the HTML source of each document) or false (parse the documents, remove HTML)
  • output: folder where the documents' content is stored - one file per docid

Retrieval runs

The files runs/res.dir1000.{standard,porter} contain the baseline results when running the above retrieval program (i.e. LM with Dirichlet smoothing and mu=1000) with standard and porter preprocessing respectively. On an empty sara cluster, this run on 50 queries takes about one hour.

The file runs/res.dir1000.porter.spamFiltered is based on runs/res.dir1000.porter and filters out all documents with a spam score below 70.

Sanity check

To gain confidence in the implementation, the baseline runs are compared with the official 2013 baselines (Indri runs) provided by the TREC Web track organizers.

Since no relevance judgments are available for ClueWeb12, we report the overlap in document ids among the top 10 / top 1000 ranked documents for each query between our baselines and the organizers' QL baseline (results-catA.txt). The Perl script to compute the overlap is available as well; a sketch of the computation follows below.
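The overlap itself is straightforward to compute; a minimal sketch in Java (the repository's actual script is written in Perl), assuming the top-k docids per query have already been parsed from both result files:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the overlap computation; the repository uses a Perl script instead.
public final class OverlapSketch {

    /** Fraction of docids in our top-k ranking that also appear in the baseline's top-k. */
    public static double overlap(List<String> oursTopK, List<String> baselineTopK) {
        Set<String> baseline = new HashSet<>(baselineTopK);
        long shared = oursTopK.stream().filter(baseline::contains).count();
        return (double) shared / oursTopK.size();
    }
}
```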

The organizers' baseline was run with Krovetz stemming, so we expect the Porter-based run to have a higher overlap than the standard run. This is indeed the case. The few 0% queries can be explained by differences in tokenization, HTML parsing, and stemming (Porter is more aggressive than Krovetz).

| Query | Standard Top 10 | Standard Top 1000 | Porter Top 10 | Porter Top 1000 |
|-------|-----------------|-------------------|---------------|-----------------|
| 201 | 90% | 84% | 90% | 85% |
| 202 | 60% | 88% | 70% | 88% |
| 203 | 60% | 66% | 70% | 73% |
| 204 | 20% | 70% | 70% | 83% |
| 205 | 30% | 46% | 60% | 70% |
| 206 | 70% | 85% | 70% | 87% |
| 207 | 0% | 15% | 0% | 15% |
| 208 | 60% | 89% | 60% | 91% |
| 209 | 30% | 57% | 80% | 81% |
| 210 | 50% | 81% | 70% | 83% |
| 211 | 20% | 22% | 50% | 52% |
| 212 | 30% | 46% | 60% | 86% |
| 213 | 90% | 92% | 90% | 95% |
| 214 | 60% | 67% | 100% | 83% |
| 215 | 10% | 53% | 20% | 60% |
| 216 | 20% | 50% | 60% | 82% |
| 217 | 30% | 63% | 40% | 58% |
| 218 | 50% | 59% | 80% | 89% |
| 219 | 0% | 14% | 0% | 15% |
| 220 | 10% | 24% | 40% | 67% |
| 221 | 40% | 69% | 60% | 71% |
| 222 | 90% | 73% | 100% | 88% |
| 223 | 100% | 81% | 100% | 86% |
| 224 | 40% | 47% | 40% | 59% |
| 225 | 80% | 88% | 80% | 83% |
| 226 | 70% | 88% | 70% | 88% |
| 227 | 0% | 3% | 0% | 5% |
| 228 | 10% | 42% | 40% | 57% |
| 229 | 60% | 80% | 90% | 91% |
| 230 | 0% | 29% | 50% | 28% |
| 231 | 0% | 28% | 30% | 27% |
| 232 | 80% | 63% | 80% | 74% |
| 233 | 70% | 86% | 70% | 94% |
| 234 | 70% | 85% | 80% | 89% |
| 235 | 60% | 76% | 60% | 84% |
| 236 | 70% | 74% | 80% | 84% |
| 237 | 90% | 52% | 80% | 60% |
| 238 | 70% | 63% | 70% | 68% |
| 239 | 70% | 92% | 90% | 93% |
| 240 | 80% | 45% | 80% | 75% |
| 241 | 0% | 2% | 0% | 33% |
| 242 | 60% | 89% | 100% | 94% |
| 243 | 40% | 82% | 60% | 84% |
| 244 | 70% | 92% | 80% | 92% |
| 245 | 50% | 78% | 30% | 83% |
| 246 | 30% | 72% | 80% | 81% |
| 247 | 80% | 56% | 90% | 60% |
| 248 | 50% | 63% | 90% | 87% |
| 249 | 0% | 2% | 0% | 4% |
| 250 | 90% | 53% | 90% | 57% |
