PyGaggle: Baselines on MS MARCO Document Retrieval

This page contains instructions for running various neural reranking baselines on the MS MARCO document ranking task. Note that there is also a separate MS MARCO passage ranking task (dev subset) and a separate MS MARCO passage ranking task (entire dev set).

Prior to running this, we suggest looking at our first-stage BM25 ranking instructions. We rerank the BM25 run files that contain ~1000 documents per query using monoT5. monoT5 is a pointwise reranker. This means that each document is scored independently using T5.
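To make "pointwise" concrete, here is a minimal sketch of the idea: each candidate document receives a score that depends only on the query and that document, and the candidates are then sorted by score. The names `rerank_pointwise` and `overlap_score` are hypothetical, and the term-overlap scorer is only a toy stand-in for monoT5, not how PyGaggle scores documents.

```python
from typing import Callable, List, Tuple


def rerank_pointwise(
    query: str,
    docs: List[Tuple[str, str]],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each (doc_id, text) pair independently, then sort by score."""
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in docs]
    # Highest score first; sorted() is stable, so ties keep first-stage order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def overlap_score(query: str, text: str) -> float:
    """Toy scorer (assumption): fraction of query terms present in the text."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


candidates = [
    ("D1", "weather forecast for tomorrow"),
    ("D2", "what is a pointwise reranker"),
    ("D3", "a reranker reorders candidate documents"),
]
ranking = rerank_pointwise("pointwise reranker", candidates, overlap_score)
print(ranking)
```

Because no score depends on any other document, the candidates can be scored in batches of any size without changing the final ranking.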

Since it can take many hours to run these models on all of the 5193 queries from the MS MARCO dev set, we will instead use a subset of 50 queries randomly sampled from the dev set.

Note 1: Run the following instructions from the root of this repo.

Note 2: Make sure that you have access to a GPU.

Note 3: PyGaggle must be installed from source, with the anserini-eval submodule pulled. To do this, first clone the repository recursively.

git clone --recursive https://github.com/castorini/pygaggle.git

Then install PyGaggle using:

pip install pygaggle/

Models

Data Prep

We're first going to download the queries, qrels and run files for the MS MARCO subset used here. The run file is generated by following the BM25 ranking instructions. We'll store all these files in the data directory.

wget https://www.dropbox.com/s/8lvdkgzjjctxhzy/msmarco_doc_ans_small.zip -P data

To confirm, msmarco_doc_ans_small.zip should have MD5 checksum of aeed5902c23611e21eaa156d908c4748.
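If no `md5sum` tool is at hand, the checksum can also be verified from Python. This is a minimal sketch using only the standard library; the helper names `md5_of_file` and `verify_md5` are hypothetical and not part of PyGaggle.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected: str) -> bool:
    """Compare the file's digest against an expected hex string, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()
```

For the archive above, the call would be `verify_md5("data/msmarco_doc_ans_small.zip", "aeed5902c23611e21eaa156d908c4748")`. The same helper can be applied to any other download that lists an MD5 checksum.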

Next, we extract the contents into data.

unzip data/msmarco_doc_ans_small.zip -d data

msmarco_doc_ans_small contains two disjoint sets, fh and sh, and each set has 25 queries.

Let's download the pre-built MS MARCO index:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/index-msmarco-doc-20201117-f87c94.tar.gz

index-msmarco-doc-20201117-f87c94.tar.gz should have MD5 checksum of ac747860e7a37aed37cc30ed3990f273. Then, we can extract it into indexes:

tar xvfz index-msmarco-doc-20201117-f87c94.tar.gz -C indexes
rm index-msmarco-doc-20201117-f87c94.tar.gz

Now, we can begin with re-ranking the set.

Re-Ranking with monoT5

Let us now re-rank the first half:

python -um pygaggle.run.evaluate_document_ranker --split dev \
                                                --method t5 \
                                                --model castorini/monot5-base-msmarco \
                                                --dataset data/msmarco_doc_ans_small/fh \
                                                --model-type t5-base \
                                                --task msmarco \
                                                --index-dir indexes/index-msmarco-doc-20201117-f87c94 \
                                                --batch-size 32 \
                                                --output-file runs/run.monot5.doc_fh.dev.tsv

The following output will be visible after it has finished:

precision@1 0.2
recall@3  0.56
recall@50  0.84
recall@1000 0.88
mrr     0.38882
mrr@10   0.38271
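For readers unfamiliar with these metrics, the sketch below shows the usual textbook definitions of reciprocal rank and recall at a cutoff, averaged over queries. It is an illustration under that assumption only: the evaluation code used by PyGaggle may handle edge cases differently, and all function names and the toy data here are hypothetical.

```python
from typing import Dict, List, Optional, Set


def reciprocal_rank(ranking: List[str], relevant: Set[str],
                    k: Optional[int] = None) -> float:
    """1 / rank of the first relevant document within the top k, else 0."""
    top = ranking if k is None else ranking[:k]
    for rank, doc_id in enumerate(top, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def mean(values: List[float]) -> float:
    """Plain average; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Toy data (not from MS MARCO): ranked doc ids and relevant doc ids per query.
runs: Dict[str, List[str]] = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
qrels: Dict[str, Set[str]] = {"q1": {"d1"}, "q2": {"d9"}}

mrr_at_10 = mean([reciprocal_rank(runs[q], qrels[q], k=10) for q in runs])
recall_at_3 = mean([recall_at_k(runs[q], qrels[q], k=3) for q in runs])
print(mrr_at_10, recall_at_3)
```

Under these definitions, a query whose relevant document is never retrieved contributes zero to every metric, which is one reason recall@1000 can stay below 1.0 after reranking: reranking only reorders the first-stage candidates.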

It takes about 5 hours to re-rank this subset of MS MARCO using a P100. Note that you might need to modify the batch size to best fit the GPU at hand.

Upon completion, the re-ranked run file will be available at runs/run.monot5.doc_fh.dev.tsv.

To re-rank the second half of the dataset, change the --dataset argument to data/msmarco_doc_ans_small/sh, and remember to change the output file name as well.
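Since only two arguments differ between the halves, the command can be assembled programmatically. The sketch below reuses exactly the flags from the command above; the helper `build_rerank_command` is hypothetical, and the output name for the second half (`doc_sh`) is an assumption made by analogy with the first.

```python
import shlex
from typing import List


def build_rerank_command(half: str) -> List[str]:
    """Build the monoT5 re-ranking command for one half ("fh" or "sh")."""
    if half not in ("fh", "sh"):
        raise ValueError(f"half must be 'fh' or 'sh', got {half!r}")
    return [
        "python", "-um", "pygaggle.run.evaluate_document_ranker",
        "--split", "dev",
        "--method", "t5",
        "--model", "castorini/monot5-base-msmarco",
        # The dataset directory and the output file are the only per-half changes.
        "--dataset", f"data/msmarco_doc_ans_small/{half}",
        "--model-type", "t5-base",
        "--task", "msmarco",
        "--index-dir", "indexes/index-msmarco-doc-20201117-f87c94",
        "--batch-size", "32",
        "--output-file", f"runs/run.monot5.doc_{half}.dev.tsv",
    ]


# Print a shell-ready command line for the second half.
print(shlex.join(build_rerank_command("sh")))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the root of the repo to launch the run directly.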

The results are as follows:

precision@1 0.28
recall@3  0.32
recall@50  0.8
recall@1000 0.88
mrr     0.33617
mrr@10   0.31978
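Because the two halves are disjoint and hold 25 queries each, a figure for all 50 queries can be estimated by weighting each half by its query count. The sketch below assumes every reported metric is a plain mean over the queries of its half; since the inputs are rounded, the combined values are approximate. The helper `combine_halves` is hypothetical.

```python
from typing import Dict


def combine_halves(first: Dict[str, float], second: Dict[str, float],
                   n_first: int = 25, n_second: int = 25) -> Dict[str, float]:
    """Query-count-weighted mean of per-half metrics (assumes per-query means)."""
    total = n_first + n_second
    return {
        name: (first[name] * n_first + second[name] * n_second) / total
        for name in first
    }


# Values copied from the two result listings above.
fh = {"precision@1": 0.2, "recall@3": 0.56, "recall@50": 0.84,
      "recall@1000": 0.88, "mrr": 0.38882, "mrr@10": 0.38271}
sh = {"precision@1": 0.28, "recall@3": 0.32, "recall@50": 0.8,
      "recall@1000": 0.88, "mrr": 0.33617, "mrr@10": 0.31978}

combined = combine_halves(fh, sh)
for name, value in combined.items():
    print(f"{name}\t{value:.5f}")
```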

If you were able to replicate these results, please submit a PR adding to the replication log!

Replication Log