This repository has been archived by the owner on Nov 21, 2018. It is now read-only.

Merging in Claudia's LM Retrieval code #2

Open
wants to merge 96 commits into
base: master
Changes from 1 commit
b596009
added basic JM language model
chauff Jul 16, 2013
f0c887b
added basic JM language model
chauff Jul 16, 2013
76bf3e8
added selective term statistics reading
chauff Jul 16, 2013
58fa597
removed settings folder
chauff Jul 16, 2013
c05593e
retrieval functionality
chauff Jul 17, 2013
9fa704a
retrieval functionality
chauff Jul 17, 2013
0b2f22c
updated retrieval functionality
chauff Jul 17, 2013
c8356a7
updated retrieval functionality
chauff Jul 17, 2013
ed36a3e
Updated retrieval functionality
chauff Jul 17, 2013
e38af82
updated retrieval code
chauff Jul 18, 2013
f4e18e4
LM retrieval model in semi-working form
chauff Jul 18, 2013
5d092f9
Updated retrieval model; baseline run result added
chauff Jul 19, 2013
864fd89
Updated retrieval model; baseline run result added
chauff Jul 19, 2013
bd866fa
Updated retrieval model; baseline run result added
chauff Jul 19, 2013
d4ff2ce
Added overlap Perl script
chauff Jul 19, 2013
2814ea9
Added overlap Perl script
chauff Jul 19, 2013
ee73c71
updated README
chauff Jul 19, 2013
057ed13
updated README
chauff Jul 19, 2013
47efb6d
updated README
chauff Jul 19, 2013
6aad52c
updated README
chauff Jul 19, 2013
8b9d164
updated README
chauff Jul 19, 2013
94368b3
updated README
chauff Jul 19, 2013
020a10c
updated README
chauff Jul 19, 2013
f5505dd
updated README
chauff Jul 19, 2013
6c5ab08
Non-working version of cosine based document filtering added
chauff Jul 19, 2013
b89df4d
Refined filtering
chauff Jul 22, 2013
a8fa533
Refined filtering
chauff Jul 22, 2013
42e8749
Merging changes from lintool clueweb
chauff Jul 23, 2013
d7c0b03
fixing merge/push/pull issue
chauff Jul 23, 2013
93c50a4
updated for use with working branch of lintool/clueweb
chauff Jul 23, 2013
c63c320
Updated README
chauff Jul 23, 2013
4fd0b07
Updated filtering for near-duplicates and spam scores
chauff Jul 23, 2013
7a3c70c
Added switch for Lucene Analyzer
chauff Jul 23, 2013
f57aaca
moved Analyzer code to new class
chauff Jul 24, 2013
425e341
Fixed Build*DocVectors (misaligned Lucene Analyzer)
chauff Jul 24, 2013
2722800
Updated README, new baseline run added
chauff Jul 25, 2013
d5869ea
Updated README, new baseline run added
chauff Jul 25, 2013
9516464
Updated README, new baseline run added
chauff Jul 25, 2013
dd0abbc
Updated README, new baseline run added
chauff Jul 25, 2013
5ece04f
Updated README, new baseline run added
chauff Jul 25, 2013
d7ac619
Updated README, new baseline run added
chauff Jul 25, 2013
f0367c4
Updated README, new baseline run added
chauff Jul 25, 2013
b2fc146
Updated README, new baseline run added
chauff Jul 25, 2013
2aa5cf9
Updated Spam scoring
chauff Jul 25, 2013
2fda847
Spam filtering revised, working version
chauff Jul 25, 2013
d9dc41b
Updated README with spam filtering instructions
chauff Jul 25, 2013
8b7d392
safety commit
chauff Jul 25, 2013
db38e53
Document extraction app added
chauff Jul 26, 2013
b7c327d
Document extraction app added
chauff Jul 26, 2013
fca438b
Fixed stream closing issues
chauff Jul 27, 2013
ae40796
Added Krovetz stemmer.
jimmy0017 Jul 27, 2013
dbfd070
Started relevance model implementation
chauff Jul 29, 2013
850bda6
Krovetz stemmer code from lintool-clueweb added
chauff Jul 29, 2013
f4b00ef
small fixes
chauff Jul 30, 2013
31cb4c5
eclipse settings added
chauff Jul 30, 2013
c8854c3
Added sanity check to TRECResultFileParser
chauff Jul 30, 2013
fa285c6
small fixes
chauff Jul 30, 2013
3d1016b
Tika boilerpipe HTML parser added
chauff Jul 31, 2013
72d33e0
updated HTML parser options
chauff Aug 3, 2013
b3892f5
RM3 adaptation
chauff Aug 3, 2013
a2a8525
updated HTML parser
chauff Aug 4, 2013
59fd49a
updated html parser
chauff Aug 4, 2013
55694c9
Merge pull request #1 from chauff/htmlParser
chauff Aug 4, 2013
c96abbf
Updated README
chauff Aug 4, 2013
9903285
Updated README
chauff Aug 4, 2013
dd991cf
RMModel working, started RMRetrieval
chauff Aug 6, 2013
9c0ce63
RMModel working, started RMRetrieval
chauff Aug 6, 2013
3748880
deleted unused settings files
chauff Aug 6, 2013
2a7a31c
RM retrieval model
chauff Aug 6, 2013
013e1bb
RM retrieval model
chauff Aug 6, 2013
b3939fe
updates on RMRetrieval
chauff Aug 7, 2013
a229ff8
Updated RMRetrieval app
chauff Aug 7, 2013
986047b
Fixed RMRetrieval
chauff Aug 8, 2013
aa26f1b
Fixed RMRetrieval
chauff Aug 8, 2013
f9181ca
manually resolving conflicts
chauff Aug 8, 2013
1218718
Merge pull request #2 from chauff/RMModel
chauff Aug 8, 2013
9bcf992
Manually fixing remaining conflicts
chauff Aug 8, 2013
65bb9fd
Manually fixing conflicts
chauff Aug 8, 2013
146cea5
Added description for RM1/RM3 retrieval runs
chauff Aug 9, 2013
c124190
Add a Bitdeli badge to README
bitdeli-chef Feb 13, 2014
4e99015
Merge pull request #3 from bitdeli-chef/master
chauff Feb 13, 2014
846d24e
Added additional helper progs, additional work on relevance model
chauff Jun 3, 2014
cb965a4
Updated RM
chauff Aug 13, 2014
4ce1a25
Exception handling improved
chauff Aug 15, 2014
d287827
Added contenteditable material
chauff Aug 21, 2014
a884243
Added comparison to Indri runs
chauff Aug 29, 2014
bcbd3c3
Move to Hadoop 2.X
chauff Jul 11, 2016
c2a934f
Separate parsing of <title> and <body> tags added.
chauff Jul 15, 2016
9e64452
Unit testing HTML parsers
chauff Jul 15, 2016
b184d7f
-
chauff Jul 15, 2016
650f3d0
-
chauff Jul 15, 2016
fa896af
-
chauff Jul 15, 2016
168aa2f
Extraction of all documents in the specified path; duplicate check
chauff Jul 15, 2016
7da0362
Document extraction for Solr
chauff Jul 15, 2016
6fbe9f9
clueweb09.app.DocumentExtractor documentation updated
chauff Jul 15, 2016
5c0e91c
Version nudge
chauff Jul 15, 2016
Added Krovetz stemmer.
jimmy0017 committed Jul 27, 2013

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
commit ae407960f13e0b597aefa167abd4971c1976b2e9
486 changes: 486 additions & 0 deletions src/main/java/org/clueweb/dictionary/KrovetzAnalyzer.java
@@ -0,0 +1,486 @@
package org.clueweb.dictionary;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

import com.google.common.collect.Lists;

/**
* Filters {@link StandardTokenizer} with {@link StandardFilter}, {@link LowerCaseFilter},
* {@link StopFilter}, and {@link KStemFilter}.
*/
public final class KrovetzAnalyzer extends StopwordAnalyzerBase {

// Stopwords used in the baseline run of TREC 2013 Web track.
static final String[] STOPWORDS = {
"a",
"about",
"above",
"according",
"across",
"after",
"afterwards",
"again",
"against",
"albeit",
"all",
"almost",
"alone",
"along",
"already",
"also",
"although",
"always",
"am",
"among",
"amongst",
"an",
"and",
"another",
"any",
"anybody",
"anyhow",
"anyone",
"anything",
"anyway",
"anywhere",
"apart",
"are",
"around",
"as",
"at",
"av",
"be",
"became",
"because",
"become",
"becomes",
"becoming",
"been",
"before",
"beforehand",
"behind",
"being",
"below",
"beside",
"besides",
"between",
"beyond",
"both",
"but",
"by",
"can",
"cannot",
"canst",
"certain",
"cf",
"choose",
"contrariwise",
"cos",
"could",
"cu",
"day",
"do",
"does",
"doesn't",
"doing",
"dost",
"doth",
"double",
"down",
"dual",
"during",
"each",
"either",
"else",
"elsewhere",
"enough",
"et",
"etc",
"even",
"ever",
"every",
"everybody",
"everyone",
"everything",
"everywhere",
"except",
"excepted",
"excepting",
"exception",
"exclude",
"excluding",
"exclusive",
"far",
"farther",
"farthest",
"few",
"ff",
"first",
"for",
"formerly",
"forth",
"forward",
"from",
"front",
"further",
"furthermore",
"furthest",
"get",
"go",
"had",
"halves",
"hardly",
"has",
"hast",
"hath",
"have",
"he",
"hence",
"henceforth",
"her",
"here",
"hereabouts",
"hereafter",
"hereby",
"herein",
"hereto",
"hereupon",
"hers",
"herself",
"him",
"himself",
"hindmost",
"his",
"hither",
"hitherto",
"how",
"however",
"howsoever",
"i",
"ie",
"if",
"in",
"inasmuch",
"inc",
"include",
"included",
"including",
"indeed",
"indoors",
"inside",
"insomuch",
"instead",
"into",
"inward",
"inwards",
"is",
"it",
"its",
"itself",
"just",
"kind",
"kg",
"km",
"last",
"latter",
"latterly",
"less",
"lest",
"let",
"like",
"little",
"ltd",
"many",
"may",
"maybe",
"me",
"meantime",
"meanwhile",
"might",
"moreover",
"most",
"mostly",
"more",
"mr",
"mrs",
"ms",
"much",
"must",
"my",
"myself",
"namely",
"need",
"neither",
"never",
"nevertheless",
"next",
"no",
"nobody",
"none",
"nonetheless",
"noone",
"nope",
"nor",
"not",
"nothing",
"notwithstanding",
"now",
"nowadays",
"nowhere",
"of",
"off",
"often",
"ok",
"on",
"once",
"one",
"only",
"onto",
"or",
"other",
"others",
"otherwise",
"ought",
"our",
"ours",
"ourselves",
"out",
"outside",
"over",
"own",
"per",
"perhaps",
"plenty",
"provide",
"quite",
"rather",
"really",
"round",
"said",
"sake",
"same",
"sang",
"save",
"saw",
"see",
"seeing",
"seem",
"seemed",
"seeming",
"seems",
"seen",
"seldom",
"selves",
"sent",
"several",
"shalt",
"she",
"should",
"shown",
"sideways",
"since",
"slept",
"slew",
"slung",
"slunk",
"smote",
"so",
"some",
"somebody",
"somehow",
"someone",
"something",
"sometime",
"sometimes",
"somewhat",
"somewhere",
"spake",
"spat",
"spoke",
"spoken",
"sprang",
"sprung",
"stave",
"staves",
"still",
"such",
"supposing",
"than",
"that",
"the",
"thee",
"their",
"them",
"themselves",
"then",
"thence",
"thenceforth",
"there",
"thereabout",
"thereabouts",
"thereafter",
"thereby",
"therefore",
"therein",
"thereof",
"thereon",
"thereto",
"thereupon",
"these",
"they",
"this",
"those",
"thou",
"though",
"thrice",
"through",
"throughout",
"thru",
"thus",
"thy",
"thyself",
"till",
"to",
"together",
"too",
"toward",
"towards",
"ugh",
"unable",
"under",
"underneath",
"unless",
"unlike",
"until",
"up",
"upon",
"upward",
"upwards",
"us",
"use",
"used",
"using",
"very",
"via",
"vs",
"want",
"was",
"we",
"week",
"well",
"were",
"what",
"whatever",
"whatsoever",
"when",
"whence",
"whenever",
"whensoever",
"where",
"whereabouts",
"whereafter",
"whereas",
"whereat",
"whereby",
"wherefore",
"wherefrom",
"wherein",
"whereinto",
"whereof",
"whereon",
"wheresoever",
"whereto",
"whereunto",
"whereupon",
"wherever",
"wherewith",
"whether",
"whew",
"which",
"whichever",
"whichsoever",
"while",
"whilst",
"whither",
"who",
"whoa",
"whoever",
"whole",
"whom",
"whomever",
"whomsoever",
"whose",
"whosoever",
"why",
"will",
"wilt",
"with",
"within",
"without",
"worse",
"worst",
"would",
"wow",
"ye",
"yet",
"year",
"yippee",
"you",
"your",
"yours",
"yourself",
"yourselves",
};

/** Default maximum allowed token length */
public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;

private int maxTokenLength = DEFAULT_MAX_TOKEN_LENGTH;

public static final CharArraySet STOP_WORDS_SET = new CharArraySet(Version.LUCENE_43,
Lists.newArrayList(STOPWORDS), true);

public KrovetzAnalyzer() {
super(Version.LUCENE_43, STOP_WORDS_SET);
}

/**
* Set maximum allowed token length. If a token is seen that exceeds this length then it is
* discarded. This setting only takes effect the next time tokenStream is called.
*/
public void setMaxTokenLength(int length) {
maxTokenLength = length;
}

public int getMaxTokenLength() {
return maxTokenLength;
}

@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new StopFilter(matchVersion, tok, stopwords);
tok = new KStemFilter(tok);
return new TokenStreamComponents(src, tok) {
@Override
protected void setReader(final Reader reader) throws IOException {
src.setMaxTokenLength(KrovetzAnalyzer.this.maxTokenLength);
super.setReader(reader);
}
};
}
}
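The order of the stages in `createComponents` above (tokenize, lowercase, remove stopwords, stem) can be illustrated without Lucene. The following is a minimal sketch under stated assumptions, not the commit's code: `PipelineSketch`, `analyze`, and the five-word `STOP` set are hypothetical names introduced here, the regex split is only a crude stand-in for `StandardTokenizer`, and the stemmer is passed in as a plain function because Krovetz stemming itself lives in Lucene's `KStemFilter`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not part of the commit and not Lucene code.
final class PipelineSketch {
  // A tiny subset of the STOPWORDS array above, chosen for the example.
  static final Set<String> STOP =
      new HashSet<>(Arrays.asList("the", "of", "and", "has", "its"));

  static List<String> analyze(String text, int maxTokenLength, UnaryOperator<String> stemmer) {
    List<String> out = new ArrayList<>();
    // Crude stand-in for StandardTokenizer: split on runs of non-alphanumerics.
    for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
      // Skip empty fragments and discard tokens longer than the limit.
      if (raw.isEmpty() || raw.length() > maxTokenLength) {
        continue;
      }
      String lower = raw.toLowerCase(Locale.ROOT); // LowerCaseFilter stage
      if (STOP.contains(lower)) {                  // StopFilter stage
        continue;
      }
      out.add(stemmer.apply(lower));               // stemming stage (placeholder)
    }
    return out;
  }
}
```

Calling `PipelineSketch.analyze("The Dept of Justice has announced", 255, s -> s)` with an identity stemmer keeps `dept`, `justice`, and `announced`; lowering the limit to 4 leaves only `dept`, which mirrors how `setMaxTokenLength` drops over-long tokens rather than truncating them.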
26 changes: 24 additions & 2 deletions src/main/java/org/clueweb/util/AnalyzerFactory.java
@@ -1,24 +1,46 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.util;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.clueweb.dictionary.KrovetzAnalyzer;
import org.clueweb.dictionary.PorterAnalyzer;

public class AnalyzerFactory {

public static Analyzer getAnalyzer(String analyzerType) {
if (analyzerType.equals("standard")) {
- return new org.apache.lucene.analysis.standard.StandardAnalyzer(Version.LUCENE_43);
+ return new StandardAnalyzer(Version.LUCENE_43);
}

if (analyzerType.equals("porter")) {
return new PorterAnalyzer();
}

if (analyzerType.equals("krovetz")) {
return new KrovetzAnalyzer();
}

return null;
}

public static String getOptions() {
- return "standard|porter";
+ return "standard|porter|krovetz";
}
}
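Note that `getAnalyzer` as shown returns `null` for an unrecognized type, so callers need a null check, and the option string in `getOptions` has to be kept in sync by hand. One possible alternative, sketched here purely as an assumption and not as something the commit does, is a small name-to-supplier registry that derives the option string from its keys and fails loudly on unknown names. `RegistrySketch`, `register`, `options`, and `create` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; the class and method names are not from the commit.
final class RegistrySketch<T> {
  // LinkedHashMap keeps registration order, so options() is stable.
  private final Map<String, Supplier<T>> suppliers = new LinkedHashMap<>();

  RegistrySketch<T> register(String name, Supplier<T> supplier) {
    suppliers.put(name, supplier);
    return this;
  }

  // Mirrors getOptions(): registered names joined with '|'.
  String options() {
    return String.join("|", suppliers.keySet());
  }

  // Unlike getAnalyzer(), throws on an unknown or null name instead of returning null.
  T create(String name) {
    Supplier<T> supplier = suppliers.get(name);
    if (supplier == null) {
      throw new IllegalArgumentException(
          "Unknown type '" + name + "', expected one of: " + options());
    }
    return supplier.get();
  }
}
```

In the ClueWeb code the type parameter would presumably be `Analyzer` and the suppliers constructor references such as `KrovetzAnalyzer::new`; the sketch uses a generic parameter so it compiles without Lucene on the classpath.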
16 changes: 16 additions & 0 deletions src/test/java/org/clueweb/data/PForDocVectorTest.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.data;

import static org.junit.Assert.assertEquals;
16 changes: 16 additions & 0 deletions src/test/java/org/clueweb/data/VByteDocVectorTest.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.data;

import static org.junit.Assert.assertEquals;
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.dictionary;

import static org.junit.Assert.assertEquals;
@@ -15,7 +31,7 @@

import com.google.common.base.Joiner;

- public class PorterAnalyzerTest {
+ public class AnalyzerTest {
@Test
public void test1() throws Exception {
Analyzer analyzer = new PorterAnalyzer();
@@ -28,6 +44,16 @@ public void test1() throws Exception {

@Test
public void test2() throws Exception {
Analyzer analyzer = new KrovetzAnalyzer();
List<String> tokens = AnalyzerUtils.parse(analyzer,
"The U.S. Dept. of Justice has announced that Panasonic and its subsidiary Sanyo have been fined $56.5 million for their roles in price fixing conspiracies involving battery cells and car parts.");

System.out.println(Joiner.on(",").join(tokens));
assertEquals(19, tokens.size());
}

@Test
public void test3() throws Exception {
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
List<String> tokens = AnalyzerUtils.parse(analyzer,
"The U.S. Dept. of Justice has announced that Panasonic and its subsidiary Sanyo have been fined $56.5 million for their roles in price fixing conspiracies involving battery cells and car parts.");
@@ -37,6 +63,6 @@ public void test2() throws Exception {
}

public static junit.framework.Test suite() {
- return new JUnit4TestAdapter(PorterAnalyzerTest.class);
+ return new JUnit4TestAdapter(AnalyzerTest.class);
}
}