Tweaked regression for NTCIR-8 Monolingual Chinese to build directly from LDC source (#822)

Previously, the regression used the wrong collection and needed an extra Python script to convert from TREC doc format to JSON.
lintool authored Oct 11, 2019
1 parent c87824f commit 445bb45
Showing 10 changed files with 100 additions and 188 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -75,6 +75,7 @@ Note that these regressions capture the "out of the box" experience, based on [_
+ [Regressions for the MS MARCO Passage Task](docs/regressions-msmarco-passage.md)
+ [Regressions for the MS MARCO Passage Task with Doc2query expansion](docs/regressions-msmarco-passage-doc2query.md)
+ [Regressions for the MS MARCO Document Task](docs/regressions-msmarco-doc.md)
+ [Regressions for NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual)](docs/regressions-ntcir8-zh.md)

Other experiments:

52 changes: 0 additions & 52 deletions docs/experiments-ntcir8-zh.md

This file was deleted.

16 changes: 8 additions & 8 deletions docs/regressions-msmarco-doc.md
@@ -26,21 +26,21 @@ The regression experiments here evaluate on the 5193 dev set questions; see [thi
After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default.topics.msmarco-doc.dev.txt -bm25 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default.topics.msmarco-doc.dev.txt -bm25 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+rm3.topics.msmarco-doc.dev.txt -bm25 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+rm3.topics.msmarco-doc.dev.txt -bm25 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+ax.topics.msmarco-doc.dev.txt -bm25 -axiom -rerankCutoff 20 -axiom.deterministic &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+ax.topics.msmarco-doc.dev.txt -bm25 -axiom -rerankCutoff 20 -axiom.deterministic &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+prf.topics.msmarco-doc.dev.txt -bm25 -bm25prf &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-default+prf.topics.msmarco-doc.dev.txt -bm25 -bm25prf &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+rm3.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+rm3.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+ax.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -axiom -rerankCutoff 20 -axiom.deterministic &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+ax.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -axiom -rerankCutoff 20 -axiom.deterministic &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+prf.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -bm25prf &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output run.msmarco-doc.bm25-tuned+prf.topics.msmarco-doc.dev.txt -bm25 -k1 3.44 -b 0.87 -bm25prf &
```
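The only change in the commands above is the topic reader name, `Tsv` to `TsvInt`. As a rough illustration of the distinction the reader names suggest, the sketch below parses a topics file under the assumption that each line holds one tab-separated `id<TAB>query` pair. The helper `read_tsv_topics` and its `int_ids` flag are hypothetical names used for illustration only; this is not Anserini's implementation.

```python
from typing import Dict, Iterable, Union

TopicId = Union[int, str]


def read_tsv_topics(lines: Iterable[str], int_ids: bool) -> Dict[TopicId, str]:
    """Parse 'id<TAB>query' lines into a dict keyed by topic id.

    Hypothetical helper: ids are converted to int when int_ids is True
    and kept as strings otherwise.
    """
    topics: Dict[TopicId, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        topic_id, query = line.split("\t", 1)
        topics[int(topic_id) if int_ids else topic_id] = query
    return topics
```

With integer ids, a non-numeric id raises `ValueError`, which is one reason a collection whose topic ids are not numeric would need a string-keyed reader instead.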

8 changes: 4 additions & 4 deletions docs/regressions-msmarco-passage-doc2query.md
@@ -33,13 +33,13 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi
After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage-doc2query.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage-doc2query.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -rm3 &
```

16 changes: 8 additions & 8 deletions docs/regressions-msmarco-passage.md
@@ -27,21 +27,21 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi
After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+ax.topics.msmarco-passage.dev-subset.txt -bm25 -axiom -rerankCutoff 20 -axiom.deterministic &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+ax.topics.msmarco-passage.dev-subset.txt -bm25 -axiom -rerankCutoff 20 -axiom.deterministic &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+prf.topics.msmarco-passage.dev-subset.txt -bm25 -bm25prf &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+prf.topics.msmarco-passage.dev-subset.txt -bm25 -bm25prf &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -rm3 &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+ax.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -axiom -rerankCutoff 20 -axiom.deterministic &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+ax.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -axiom -rerankCutoff 20 -axiom.deterministic &
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+prf.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -bm25prf &
nohup target/appassembler/bin/SearchCollection -topicreader TsvInt -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+prf.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.68 -bm25prf &
```

55 changes: 55 additions & 0 deletions docs/regressions-ntcir8-zh.md
@@ -0,0 +1,55 @@
# Anserini: Regressions for [NTCIR-8 Monolingual Chinese](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)

This page documents regression experiments for [NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual topics)](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html).
The description of the document collection can be found in the [NTCIR-8 data page](http://research.nii.ac.jp/ntcir/permission/ntcir-8/perm-en-ACLIA.html): Xinhua articles from 2002-2005, totalling 308,845 documents, from [LDC2007T38: Chinese Gigaword Third Edition](https://catalog.ldc.upenn.edu/LDC2007T38).
We build the index directly from the raw LDC data: `data/xin_cmn/xin_cmn_200[2-5]*` (48 files).

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \
-generator LuceneDocumentGenerator -threads 16 -input /path/to/ntcir8-zh -index \
lucene-index.ntcir8-zh.pos+docvectors+rawdocs -storePositions -storeDocvectors \
-storeRawDocs -language zh -uniqueDocid -optimize >& \
log.ntcir8-zh.pos+docvectors+rawdocs &
```

The directory `/path/to/ntcir8-zh/` should be a directory containing the collection, 48 gzipped files matching the pattern `xin_cmn_200[2-5]*` from LDC2007T38.

For additional details, see explanation of [common indexing options](common-indexing-options.md).
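Before indexing, it may help to confirm that the input directory holds the expected files. The following is a minimal sketch, assuming only the pattern `xin_cmn_200[2-5]*` and the count of 48 quoted above; the function names are hypothetical and not part of Anserini.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

PATTERN = "xin_cmn_200[2-5]*"  # file name pattern quoted in the text above
EXPECTED_COUNT = 48            # file count quoted in the text above


def select_collection_files(names: Iterable[str]) -> List[str]:
    """Return the file names matching the collection pattern, sorted."""
    return sorted(n for n in names if fnmatchcase(n, PATTERN))


def check_collection(names: Iterable[str]) -> bool:
    """True when exactly the expected number of matching files is present."""
    return len(select_collection_files(names)) == EXPECTED_COUNT
```

A typical call would be `check_collection(os.listdir("/path/to/ntcir8-zh"))` after `import os`.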

## Retrieval

Topics and qrels are stored in `src/main/resources/topics-and-qrels/`.
The regression experiments here evaluate on the 73 questions.

After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.ntcir8-zh.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.ntcir8zh.eval.txt -output run.ntcir8-zh.bm25.topics.ntcir8zh.eval.txt -language zh -bm25 &
```

Evaluation can be performed using `trec_eval`:

```
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.ntcir8.eval.txt run.ntcir8-zh.bm25.topics.ntcir8zh.eval.txt
```

## Effectiveness

With the above commands, you should be able to replicate the following results:

MAP | BM25 |
:---------------------------------------|-----------|
[NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual)](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)| 0.4014 |


P30 | BM25 |
:---------------------------------------|-----------|
[NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual)](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)| 0.3365 |
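To compare a run against the two tables above, one option is to parse the summary lines that `trec_eval` prints. The sketch below assumes the usual three-column output (`measure`, `all`, `value`) and assumes that `-m P.30` is reported under the name `P_30`; `parse_trec_eval` and `within_tolerance` are hypothetical helpers, and the tolerance is an arbitrary choice.

```python
from typing import Dict

# Expected values taken from the tables above; the key "P_30" is an
# assumption about how trec_eval labels the -m P.30 measure.
EXPECTED = {"map": 0.4014, "P_30": 0.3365}


def parse_trec_eval(output: str) -> Dict[str, float]:
    """Collect summary scores from lines of the form 'measure all value'."""
    scores: Dict[str, float] = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "all":
            scores[parts[0]] = float(parts[2])
    return scores


def within_tolerance(scores: Dict[str, float], tol: float = 1e-4) -> bool:
    """True when every expected measure is present and close to its target."""
    return all(
        name in scores and abs(scores[name] - target) <= tol
        for name, target in EXPECTED.items()
    )
```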


4 changes: 4 additions & 0 deletions docs/regressions.md
@@ -45,6 +45,8 @@ nohup python src/main/python/run_regression.py --collection car17v2.0 >& log.car
nohup python src/main/python/run_regression.py --collection msmarco-passage >& log.msmarco-passage &
nohup python src/main/python/run_regression.py --collection msmarco-passage-doc2query >& log.msmarco-passage-doc2query &
nohup python src/main/python/run_regression.py --collection msmarco-doc >& log.msmarco-doc &
nohup python src/main/python/run_regression.py --collection ntcir8-zh >& log.ntcir8-zh &
```

Copy and paste the following lines into console on `tuna` to run the regressions from the raw collection, which includes building indexes from scratch (note difference is the additional `--index` option):
@@ -71,6 +73,8 @@ nohup python src/main/python/run_regression.py --index --collection car17v2.0 >&
nohup python src/main/python/run_regression.py --index --collection msmarco-passage >& log.msmarco-passage &
nohup python src/main/python/run_regression.py --index --collection msmarco-passage-doc2query >& log.msmarco-passage-doc2query &
nohup python src/main/python/run_regression.py --index --collection msmarco-doc >& log.msmarco-doc &
nohup python src/main/python/run_regression.py --index --collection ntcir8-zh >& log.ntcir8-zh &
```

Watch out: the full `cw12` regress takes a couple days to run and generates a 12TB index!
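The regression commands in both hunks follow one pattern, differing only in the collection name and the optional `--index` flag. A small sketch that generates such a command line is shown below; `regression_command` is a hypothetical helper and only reproduces the strings visible in this diff.

```python
def regression_command(collection: str, build_index: bool = False) -> str:
    """Build the shell command for one regression run.

    Mirrors the command lines shown in docs/regressions.md: --index is
    added only when the index should be rebuilt from the raw collection.
    """
    index_flag = "--index " if build_index else ""
    return (
        "nohup python src/main/python/run_regression.py "
        f"{index_flag}--collection {collection} >& log.{collection} &"
    )
```

For example, `regression_command("ntcir8-zh", build_index=True)` yields the line added in the second hunk.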