Tweaked regression for NTCIR-8 Monolingual Chinese to build directly from LDC source (#822)

Previously, regression was using the incorrect collection and needed an extra Python script to convert from TREC doc format to JSON.
castorini · Oct 11, 2019 · 445bb45
1 parent c87824f
commit 445bb45

Showing 10 changed files with 100 additions and 188 deletions.
diff --git a/README.md b/README.md
@@ -75,6 +75,7 @@ Note that these regressions capture the "out of the box" experience, based on [_
 + [Regressions for the MS MARCO Passage Task](docs/regressions-msmarco-passage.md)
 + [Regressions for the MS MARCO Passage Task with Doc2query expansion](docs/regressions-msmarco-passage-doc2query.md)
 + [Regressions for the MS MARCO Document Task](docs/regressions-msmarco-doc.md)
++ [Regressions for NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual)](docs/regressions-ntcir8-zh.md)
 
 Other experiments:
 

diff --git a/docs/experiments-ntcir8-zh.md b/docs/experiments-ntcir8-zh.md
diff --git a/docs/regressions-msmarco-doc.md b/docs/regressions-msmarco-doc.md
@@ -26,21 +26,21 @@ The regression experiments here evaluate on the 5193 dev set questions; see [thi
 After indexing has completed, you should be able to perform retrieval as follows:
 
 ```
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default.topics.msmarco-doc.dev.txt -bm25 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default.topics.msmarco-doc.dev.txt -bm25 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+rm3.topics.msmarco-doc.dev.txt -bm25 -rm3 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+rm3.topics.msmarco-doc.dev.txt -bm25 -rm3 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+ax.topics.msmarco-doc.dev.txt -bm25 -axiom -rerankCutoff 20 -axiom.deterministic &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+ax.topics.msmarco-doc.dev.txt -bm25 -axiom -rerankCutoff 20 -axiom.deterministic &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+prf.topics.msmarco-doc.dev.txt -bm25 -bm25prf &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+prf.topics.msmarco-doc.dev.txt -bm25 -bm25prf &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+rm3.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -rm3 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+rm3.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -rm3 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+ax.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -axiom -rerankCutoff 20 -axiom.deterministic &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+ax.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -axiom -rerankCutoff 20 -axiom.deterministic &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+prf.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -bm25prf &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+prf.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -bm25prf &
 
 ```
 

diff --git a/docs/regressions-msmarco-passage-doc2query.md b/docs/regressions-msmarco-passage-doc2query.md
@@ -33,13 +33,13 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi
 After indexing has completed, you should be able to perform retrieval as follows:
 
 ```
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -rm3 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -rm3 &
 
 ```
 

diff --git a/docs/regressions-msmarco-passage.md b/docs/regressions-msmarco-passage.md
@@ -27,21 +27,21 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi
 After indexing has completed, you should be able to perform retrieval as follows:
 
 ```
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+ax.topics.msmarco-passage.dev-subset.txt -bm25 -axiom -rerankCutoff 20 -axiom.deterministic &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+ax.topics.msmarco-passage.dev-subset.txt -bm25 -axiom -rerankCutoff 20 -axiom.deterministic &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+prf.topics.msmarco-passage.dev-subset.txt -bm25 -bm25prf &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+prf.topics.msmarco-passage.dev-subset.txt -bm25 -bm25prf &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -rm3 &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -rm3 &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+ax.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -axiom -rerankCutoff 20 -axiom.deterministic &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+ax.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -axiom -rerankCutoff 20 -axiom.deterministic &
 
-nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+prf.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -bm25prf &
+nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+prf.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -bm25prf &
 
 ```
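The hunks above change the `-topicreader` value from `Tsv` to `TsvInt`, while the NTCIR-8 retrieval command in this commit uses `TsvString`. One plausible reading, which is an assumption here and not something the diff states, is that the two readers differ only in how they type the topic id. The sketch below illustrates that idea in Python; it is not Anserini's implementation, and `read_tsv_topics`, the two-column `id<TAB>query` layout, and the sample lines are all hypothetical.

```python
from typing import Callable, Dict, Iterable, TypeVar

K = TypeVar("K")


def read_tsv_topics(lines: Iterable[str], parse_id: Callable[[str], K]) -> Dict[K, str]:
    """Parse `id<TAB>query` lines into a dict keyed by the parsed topic id.

    Hypothetical helper: the two-column layout is an assumption.
    """
    topics: Dict[K, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        topic_id, query = line.split("\t", 1)
        topics[parse_id(topic_id.strip())] = query.strip()
    return topics


# Made-up sample lines, not taken from the real topic files.
int_lines = ["10\tsample query two\n", "2\tsample query one\n"]
str_lines = ["TOPIC-0001\tsample query three\n"]

# Integer ids: keys are ints, so they sort numerically (2 before 10).
int_topics = read_tsv_topics(int_lines, int)

# String ids: keys are kept verbatim, so non-numeric ids are accepted.
str_topics = read_tsv_topics(str_lines, str)
```

Under this reading, passing `int` as the id parser to a file with non-numeric ids raises `ValueError`, which would be a reason to pick a string-keyed reader for a collection whose topic ids are not plain integers.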
 

diff --git a/docs/regressions-ntcir8-zh.md b/docs/regressions-ntcir8-zh.md
@@ -0,0 +1,55 @@
+# Anserini: Regressions for [NTCIR-8 Monolingual Chinese](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)
+
+This page documents regression experiments for [NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual topics)](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html).
+The description of the document collection can be found in the [NTCIR-8 data page](http://research.nii.ac.jp/ntcir/permission/ntcir-8/perm-en-ACLIA.html): Xinhua articles from 2002-2005, totalling 308,845 documents, from [LDC2007T38: Chinese Gigaword Third Edition](https://catalog.ldc.upenn.edu/LDC2007T38).
+We build the index directly from the raw LDC data: `data/xin_cmn/xin_cmn_200[2-5]*` (48 files).
+
+## Indexing
+
+Typical indexing command:
+
+```
+nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \
+-generator LuceneDocumentGenerator -threads 16 -input /path/to/ntcir8-zh -index \
+lucene-index.ntcir8-zh.pos+docvectors+rawdocs -storePositions -storeDocvectors \
+-storeRawDocs -language zh -uniqueDocid -optimize >& \
+log.ntcir8-zh.pos+docvectors+rawdocs &
+```
+
+The directory `/path/to/ntcir8-zh/` should be a directory containing the collection, 48 gzipped files matching the pattern `xin_cmn_200[2-5]*` from LDC2007T38.
+
+For additional details, see explanation of [common indexing options](common-indexing-options.md).
+
+## Retrieval
+
+Topics and qrels are stored in `src/main/resources/topics-and-qrels/`.
+The regression experiments here evaluate on the 73 questions.
+
+After indexing has completed, you should be able to perform retrieval as follows:
+
+```
+nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.ntcir8-zh.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.ntcir8zh.eval.txt -output run.ntcir8-zh.bm25.topics.ntcir8zh.eval.txt -language zh -bm25 &
+
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```
+eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.ntcir8.eval.txt run.ntcir8-zh.bm25.topics.ntcir8zh.eval.txt
+
+```
+
+## Effectiveness
+
+With the above commands, you should be able to replicate the following results:
+
+MAP                                     | BM25      |
+:---------------------------------------|-----------|
+[NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual)](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)| 0.4014    |
+
+
+P30                                     | BM25      |
+:---------------------------------------|-----------|
+[NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual)](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)| 0.3365    |
+
+
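The new page evaluates with `trec_eval -m map -m P.30` and reports MAP and P30. As a rough illustration of what those two numbers measure, here is a minimal Python sketch of the textbook definitions of average precision and precision at a cutoff. It is not `trec_eval`'s code and will not reproduce the scores above; the function names, topic ids, docids, and relevance judgments are all made up for the example.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Sum of precision at each relevant hit, divided by the number of relevant docs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def precision_at(ranking: List[str], relevant: Set[str], k: int = 30) -> float:
    """Relevant docs in the top k, divided by k (even if fewer than k were retrieved)."""
    return sum(1 for docid in ranking[:k] if docid in relevant) / k


# Hypothetical run and judgments for two topics.
run: Dict[str, List[str]] = {
    "t1": ["d1", "d2", "d3", "d4"],
    "t2": ["d5", "d6"],
}
qrels: Dict[str, Set[str]] = {
    "t1": {"d1", "d3", "d9"},  # d9 is relevant but never retrieved
    "t2": {"d6"},
}

# Both summary numbers are means of the per-topic values.
mean_ap = sum(average_precision(run[t], qrels[t]) for t in run) / len(run)
mean_p30 = sum(precision_at(run[t], qrels[t], 30) for t in run) / len(run)
print(f"map {mean_ap:.4f}  P_30 {mean_p30:.4f}")
```

Note that a relevant document that is never retrieved still counts in the denominator of average precision, which is why the first topic scores below what its two hits alone would suggest.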
diff --git a/docs/regressions.md b/docs/regressions.md
@@ -45,6 +45,8 @@ nohup python src/main/python/run_regression.py --collection car17v2.0 >& log.car
 nohup python src/main/python/run_regression.py --collection msmarco-passage >& log.msmarco-passage &
 nohup python src/main/python/run_regression.py --collection msmarco-passage-doc2query >& log.msmarco-passage-doc2query &
 nohup python src/main/python/run_regression.py --collection msmarco-doc >& log.msmarco-doc &
+
+nohup python src/main/python/run_regression.py --collection ntcir8-zh >& log.ntcir8-zh &
 ```
 
 Copy and paste the following lines into console on `tuna` to run the regressions from the raw collection, which includes building indexes from scratch (note difference is the additional `--index` option):
@@ -71,6 +73,8 @@ nohup python src/main/python/run_regression.py --index --collection car17v2.0 >&
 nohup python src/main/python/run_regression.py --index --collection msmarco-passage >& log.msmarco-passage &
 nohup python src/main/python/run_regression.py --index --collection msmarco-passage-doc2query >& log.msmarco-passage-doc2query &
 nohup python src/main/python/run_regression.py --index --collection msmarco-doc >& log.msmarco-doc &
+
+nohup python src/main/python/run_regression.py --index --collection ntcir8-zh >& log.ntcir8-zh &
 ```
 
 Watch out: the full `cw12` regress takes a couple days to run and generates a 12TB index!
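
The two command lists in `docs/regressions.md` differ only in the `--index` option. A small Python sketch of that relationship follows; the script path and the `--index` and `--collection` flags come from the lines above, while `regression_command` is a hypothetical helper written for this example, and the `nohup` wrapper and log redirection are left out.

```python
import shlex
from typing import List

SCRIPT = "src/main/python/run_regression.py"


def regression_command(collection: str, index: bool = False) -> List[str]:
    """Build the argv for one regression run; index=True also rebuilds the index."""
    argv = ["python", SCRIPT]
    if index:
        argv.append("--index")
    argv += ["--collection", collection]
    return argv


# Print the from-scratch variant for a few of the collections named above.
for name in ["msmarco-passage", "msmarco-doc", "ntcir8-zh"]:
    print(shlex.join(regression_command(name, index=True)))
```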