Fixed MS MARCO docs to latest version of pyserini==0.9.3.0 (#1238)
lintool authored May 28, 2020
1 parent f3bf7d2 commit 2b8453c
Showing 2 changed files with 41 additions and 47 deletions.
40 changes: 19 additions & 21 deletions docs/experiments-msmarco-passage.md
@@ -9,7 +9,7 @@ We also have a [separate page](experiments-doc2query.md) describing document exp
We're going to use `msmarco-passage/` as the working directory.
First, we need to download and extract the MS MARCO passage dataset:

```
```bash
mkdir collections/msmarco-passage
mkdir indexes/msmarco-passage

@@ -21,17 +21,17 @@ To confirm, `collectionandqueries.tar.gz` should have MD5 checksum of `31644046b

Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):

```
python ./src/main/python/msmarco/convert_collection_to_jsonl.py \
```bash
python src/main/python/msmarco/convert_collection_to_jsonl.py \
--collection_path collections/msmarco-passage/collection.tsv --output_folder collections/msmarco-passage/collection_jsonl
```

The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
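
To make the conversion step concrete, here is a minimal sketch of how a two-column tsv collection could be split into jsonl shards. It is not the repository's `convert_collection_to_jsonl.py`: the function name, the shard file naming, and the assumption that each input line is `id<TAB>text` are illustrative, while the `id` and `contents` field names follow the layout that `JsonCollection` reads. With 1M documents per shard, the line counts quoted above imply 8 × 1,000,000 + 841,823 = 8,841,823 passages in total.

```python
import json
import os


def convert_tsv_to_jsonl(collection_path, output_folder, max_docs_per_file=1000000):
    """Split a tab-separated (id, text) collection into JSONL shards.

    Hypothetical helper for illustration; shard names are an assumption.
    """
    os.makedirs(output_folder, exist_ok=True)
    shard_paths = []
    output_file = None
    with open(collection_path, encoding='utf-8') as collection:
        for line_number, line in enumerate(collection):
            # Start a new shard every max_docs_per_file documents.
            if line_number % max_docs_per_file == 0:
                if output_file is not None:
                    output_file.close()
                shard_name = 'docs{:02d}.json'.format(len(shard_paths))
                shard_path = os.path.join(output_folder, shard_name)
                shard_paths.append(shard_path)
                output_file = open(shard_path, 'w', encoding='utf-8')
            # Assumed input layout: passage id, a tab, then the passage text.
            doc_id, text = line.rstrip('\n').split('\t', 1)
            # One JSON object per line, with the fields JsonCollection reads.
            output_file.write(json.dumps({'id': doc_id, 'contents': text}) + '\n')
    if output_file is not None:
        output_file.close()
    return shard_paths
```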

We can now index these docs as a `JsonCollection` using Anserini:

```
sh ./target/appassembler/bin/IndexCollection -collection JsonCollection \
```bash
sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 9 -input collections/msmarco-passage/collection_jsonl \
-index indexes/msmarco-passage/lucene-index-msmarco -storePositions -storeDocvectors -storeRaw
```
@@ -43,17 +43,17 @@ The indexing speed may vary... on a modern desktop with an SSD, indexing takes l

The full dev set contains more than 100k queries, so retrieving all of them would take a long time. To speed this up, we use only the queries that appear in the qrels file:

```
python ./src/main/python/msmarco/filter_queries.py --qrels collections/msmarco-passage/qrels.dev.small.tsv \
--queries msmarco-passage/queries.dev.tsv --output_queries collections/msmarco-passage/queries.dev.small.tsv
```bash
python src/main/python/msmarco/filter_queries.py --qrels collections/msmarco-passage/qrels.dev.small.tsv \
--queries collections/msmarco-passage/queries.dev.tsv --output_queries collections/msmarco-passage/queries.dev.small.tsv
```

The output queries file should contain 6980 lines.
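
The filtering step amounts to a set-membership test on query ids. The sketch below is an illustration rather than the repository's `filter_queries.py`. It assumes that both files are tab-separated with the query id in the first column, and the function name is hypothetical.

```python
def filter_queries(qrels_lines, query_lines):
    """Keep only the query lines whose id appears in the qrels.

    Assumes tab-separated input with the query id in the first column.
    """
    # Collect the ids of all queries that have at least one judgment.
    judged_qids = {line.split('\t')[0] for line in qrels_lines if line.strip()}
    # Preserve the original order of the query file.
    return [line for line in query_lines if line.split('\t')[0] in judged_qids]
```

Because the judged ids are held in a set, each lookup takes constant time on average, so the full query file is processed in a single pass.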

We can now retrieve this smaller set of queries:

```
python ./src/main/python/msmarco/retrieve.py --hits 1000 --threads 1 \
```bash
python src/main/python/msmarco/retrieve.py --hits 1000 --threads 1 \
--index indexes/msmarco-passage/lucene-index-msmarco --qid_queries collections/msmarco-passage/queries.dev.small.tsv \
--output runs/run.msmarco-passage.dev.small.tsv
```
@@ -67,8 +67,8 @@ On a modern desktop with an SSD, we can get ~0.06 s/query (taking about seven mi

Alternatively, we can run the same script implemented in Java, which is a bit faster:

```
./target/appassembler/bin/SearchMsmarco -hits 1000 -threads 1 \
```bash
sh target/appassembler/bin/SearchMsmarco -hits 1000 -threads 1 \
-index indexes/msmarco-passage/lucene-index-msmarco -qid_queries collections/msmarco-passage/queries.dev.small.tsv \
-output runs/run.msmarco-passage.dev.small.tsv
```
@@ -77,8 +77,8 @@ Similarly, we can perform multithreaded retrieval by changing the `-threads` arg

Finally, we can evaluate the retrieved documents using the official MS MARCO evaluation script:

```
python ./src/main/python/msmarco/msmarco_eval.py \
```bash
python src/main/python/msmarco/msmarco_eval.py \
collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.dev.small.tsv
```
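
For readers unfamiliar with the metric, MRR@10 can be sketched as follows. This is not the official `msmarco_eval.py`: the function name and the in-memory dictionaries are hypothetical, and averaging over every judged query (scoring 0 for a query with no relevant passage in its top 10) is an assumption of this sketch.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    qrels: dict mapping query id -> set of relevant passage ids.
    run:   dict mapping query id -> list of passage ids in ranked order.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                # Only the first relevant hit counts towards the score.
                total += 1.0 / rank
                break
    return total / len(qrels)
```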

@@ -94,18 +94,18 @@ QueriesRanked: 6980
We can also use the official TREC evaluation tool, `trec_eval`, to compute metrics other than MRR@10.
For that we first need to convert runs and qrels files to the TREC format:

```
python ./src/main/python/msmarco/convert_msmarco_to_trec_run.py \
```bash
python src/main/python/msmarco/convert_msmarco_to_trec_run.py \
--input_run runs/run.msmarco-passage.dev.small.tsv --output_run runs/run.msmarco-passage.dev.small.trec

python ./src/main/python/msmarco/convert_msmarco_to_trec_qrels.py \
python src/main/python/msmarco/convert_msmarco_to_trec_qrels.py \
--input_qrels collections/msmarco-passage/qrels.dev.small.tsv --output_qrels collections/msmarco-passage/qrels.dev.small.trec
```
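
As a rough illustration of what the run conversion involves, the sketch below rewrites MS MARCO run lines into the six-column TREC run layout (`qid Q0 docid rank score tag`). It is not the repository's converter. It assumes each input line is `qid<TAB>pid<TAB>rank`, and because such lines carry no retrieval score, it substitutes `1/rank` so that the ordering is preserved. The function name and the default run tag are hypothetical.

```python
def msmarco_run_to_trec(lines, tag='sketch'):
    """Convert 'qid<TAB>pid<TAB>rank' lines into TREC run lines."""
    trec_lines = []
    for line in lines:
        qid, pid, rank = line.rstrip('\n').split('\t')
        # Assumption: no score is available, so derive one from the rank.
        score = 1.0 / int(rank)
        trec_lines.append('{} Q0 {} {} {:.6f} {}'.format(qid, pid, rank, score, tag))
    return trec_lines
```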

And run the `trec_eval` tool:

```
./eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
```bash
eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.dev.small.trec
```

@@ -145,8 +145,6 @@ Setting | MRR@10 | MAP | Recall@1000 |
Default (`k1=0.9`, `b=0.4`) | 0.1839 | 0.1925 | 0.8526
Tuned (`k1=0.82`, `b=0.72`) | 0.1875 | 0.1956 | 0.8578
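
The two settings in the table differ only in `k1` and `b`. As a rough illustration of what those parameters control, the sketch below scores a single query term with a textbook BM25 formula. It is not Lucene's or Anserini's exact implementation, which differs in details, and the function name and arguments are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one term to one document's score."""
    # Rarer terms (small doc_freq) receive a larger inverse document frequency.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly documents longer than average are penalized.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

With `b=0` document length is ignored entirely, and a larger `b`, as in the tuned setting, penalizes longer-than-average passages more heavily.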



## Replication Log

+ Results replicated by [@ronakice](https://github.com/ronakice) on 2019-08-12 (commit [`5b29d16`](https://github.com/castorini/anserini/commit/5b29d1654abc5e8a014c2230da990ab2f91fb340))
48 changes: 22 additions & 26 deletions src/main/python/msmarco/retrieve.py
@@ -1,36 +1,32 @@
# -*- coding: utf-8 -*-
'''
Anserini: A Lucene toolkit for replicable information retrieval research
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
'''
#
# Pyserini: Python interface to the Anserini IR toolkit built on Lucene
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import time

# Pyserini setup
import os, sys
sys.path += ['src/main/python']

from pyserini.search import pysearch

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Retrieve MS MARCO Passages.')
    parser.add_argument('--qid_queries', required=True, default='', help='query id - query mapping file')
    parser.add_argument('--output', required=True, default='', help='output file')
    parser.add_argument('--index', required=True, default='', help='index path')
    parser.add_argument('--hits', default=10, help='number of hits to retrieve')
    parser.add_argument('--k1', default=0.82, help='BM25 k1 parameter')
    parser.add_argument('--b', default=0.68, help='BM25 b parameter')
    parser.add_argument('--hits', default=10, type=int, help='number of hits to retrieve')
    parser.add_argument('--k1', default=0.82, type=float, help='BM25 k1 parameter')
    parser.add_argument('--b', default=0.68, type=float, help='BM25 b parameter')
    # See our MS MARCO documentation to understand how these parameter values were tuned.
    parser.add_argument('--rm3', action='store_true', default=False, help='use RM3')
    parser.add_argument('--fbTerms', default=10, type=int, help='RM3 parameter: number of expansion terms')
@@ -43,10 +39,10 @@
    total_start_time = time.time()

    searcher = pysearch.SimpleSearcher(args.index)
    searcher.set_bm25_similarity(float(args.k1), float(args.b))
    searcher.set_bm25(args.k1, args.b)
    print('Initializing BM25, setting k1={} and b={}'.format(args.k1, args.b), flush=True)
    if args.rm3:
        searcher.set_rm3_reranker(args.fbTerms, args.fbDocs, args.originalQueryWeight)
        searcher.set_rm3(args.fbTerms, args.fbDocs, args.originalQueryWeight)
        print('Initializing RM3, setting fbTerms={}, fbDocs={} and originalQueryWeight={}'.format(args.fbTerms, args.fbDocs, args.originalQueryWeight), flush=True)

    if args.threads == 1:
@@ -55,7 +51,7 @@
            start_time = time.time()
            for line_number, line in enumerate(open(args.qid_queries, 'r', encoding='utf8')):
                qid, query = line.strip().split('\t')
                hits = searcher.search(query.encode('utf8'), int(args.hits))
                hits = searcher.search(query, args.hits)
                if line_number % 100 == 0:
                    time_per_query = (time.time() - start_time) / (line_number + 1)
                    print('Retrieving query {} ({:0.3f} s/query)'.format(line_number, time_per_query), flush=True)
@@ -72,7 +68,7 @@
            qids.append(qid)
            queries.append(query)

        results = searcher.batch_search(queries, qids, args.hits, -1, args.threads)
        results = searcher.batch_search(queries, qids, args.hits, args.threads)

        with open(args.output, 'w') as fout:
            for qid in qids: