Tweaked regression for NTCIR-8 Monolingual Chinese to build directly from LDC source (#822)

Previously, regression was using the incorrect collection and needed an extra Python script to convert from TREC doc format to JSON.
Showing 10 changed files with 100 additions and 188 deletions.
This file was deleted.
@@ -0,0 +1,55 @@

# Anserini: Regressions for [NTCIR-8 Monolingual Chinese](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)

This page documents regression experiments for [NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual topics)](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html).
The description of the document collection can be found in the [NTCIR-8 data page](http://research.nii.ac.jp/ntcir/permission/ntcir-8/perm-en-ACLIA.html): Xinhua articles from 2002-2005, totalling 308,845 documents, from [LDC2007T38: Chinese Gigaword Third Edition](https://catalog.ldc.upenn.edu/LDC2007T38).
We build the index directly from the raw LDC data: `data/xin_cmn/xin_cmn_200[2-5]*` (48 files).

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \
 -generator LuceneDocumentGenerator -threads 16 -input /path/to/ntcir8-zh \
 -index lucene-index.ntcir8-zh.pos+docvectors+rawdocs \
 -storePositions -storeDocvectors -storeRawDocs -language zh -uniqueDocid -optimize \
 >& log.ntcir8-zh.pos+docvectors+rawdocs &
```

The directory `/path/to/ntcir8-zh/` should contain the collection: the 48 gzipped files matching the pattern `xin_cmn_200[2-5]*` from LDC2007T38.

For additional details, see the explanation of [common indexing options](common-indexing-options.md).

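Before indexing, it can help to confirm that the input directory really holds the 48 expected files. The following is a minimal Python sketch of such a check and is not part of Anserini; the function names `find_collection_files` and `check_collection` are hypothetical, and the only facts taken from this page are the glob pattern and the file count.

```python
import glob
import os

# Number of collection files stated on this page.
EXPECTED_FILES = 48


def find_collection_files(collection_dir):
    """Return the sorted paths in collection_dir matching xin_cmn_200[2-5]*."""
    # Escape the directory part so special characters in it are not
    # interpreted as glob syntax; only the file pattern is a wildcard.
    pattern = os.path.join(glob.escape(collection_dir), "xin_cmn_200[2-5]*")
    return sorted(glob.glob(pattern))


def check_collection(collection_dir, expected=EXPECTED_FILES):
    """Return (ok, count): whether the number of matching files equals expected."""
    count = len(find_collection_files(collection_dir))
    return count == expected, count
```

For a directory laid out as described above, `check_collection("/path/to/ntcir8-zh")` would return `(True, 48)`; any other count suggests missing or extra files.
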
## Retrieval

Topics and qrels are stored in `src/main/resources/topics-and-qrels/`.
The regression experiments here evaluate on the 73 questions.

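A quick sanity check on the topics file is to count its entries, which should come to 73. The sketch below assumes one topic per line in the form `id<TAB>text`, in line with the `TsvString` topic reader named in the retrieval command; that layout and the helper name `read_tsv_topics` are assumptions for illustration and are not taken from Anserini's source.

```python
def read_tsv_topics(path):
    """Read topics into a dict {topic_id: text}.

    Assumes (hypothetically) one 'id<TAB>text' pair per line; blank lines
    are skipped, and malformed or duplicate entries raise ValueError.
    """
    topics = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue
            topic_id, sep, text = line.partition("\t")
            if not sep or not text.strip():
                raise ValueError(f"line {line_no}: expected 'id<TAB>text'")
            if topic_id in topics:
                raise ValueError(f"line {line_no}: duplicate topic id {topic_id!r}")
            topics[topic_id] = text
    return topics
```

If the file follows that layout, `len(read_tsv_topics("src/main/resources/topics-and-qrels/topics.ntcir8zh.eval.txt"))` should equal 73.
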
After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader TsvString \
 -index lucene-index.ntcir8-zh.pos+docvectors+rawdocs \
 -topics src/main/resources/topics-and-qrels/topics.ntcir8zh.eval.txt \
 -output run.ntcir8-zh.bm25.topics.ntcir8zh.eval.txt -language zh -bm25 &
```

Evaluation can be performed using `trec_eval`:

```
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \
 src/main/resources/topics-and-qrels/qrels.ntcir8.eval.txt \
 run.ntcir8-zh.bm25.topics.ntcir8zh.eval.txt
```

## Effectiveness

With the above commands, you should be able to replicate the following results:

MAP                                     | BM25      |
:---------------------------------------|-----------|
[NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual)](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)| 0.4014    |


P30                                     | BM25      |
:---------------------------------------|-----------|
[NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual)](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)| 0.3365    |

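To compare a run against the figures above without reading the numbers by eye, the output of `trec_eval` can be parsed and checked. This is a minimal sketch, not Anserini's regression script: it assumes `trec_eval` prints one `metric  all  value` line per measure and reports the two measures as `map` and `P_30`; the helper names are hypothetical.

```python
# Expected BM25 scores, copied from the tables on this page.
EXPECTED = {"map": 0.4014, "P_30": 0.3365}


def parse_trec_eval(output):
    """Parse trec_eval text into {metric: value}.

    Only summary lines of the assumed form 'metric all value' are kept;
    anything else is ignored.
    """
    scores = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "all":
            try:
                scores[parts[0]] = float(parts[2])
            except ValueError:
                continue  # non-numeric summary field, skip it
    return scores


def check_scores(scores, expected=EXPECTED):
    """Return a list of mismatch messages; an empty list means all scores agree."""
    problems = []
    for metric, want in expected.items():
        got = scores.get(metric)
        if got is None:
            problems.append(f"{metric}: missing from trec_eval output")
        # The documented values have four decimal places, so compare at that precision.
        elif round(got, 4) != round(want, 4):
            problems.append(f"{metric}: expected {want:.4f}, got {got:.4f}")
    return problems
```

The text to parse could be captured from the evaluation command, for instance with `subprocess.run(..., capture_output=True, text=True).stdout`; an empty list from `check_scores` then indicates that both documented values were reproduced.
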