# Pyserini: TCT-ColBERTv2 for MS MARCO (V2) Collections

This guide provides instructions to reproduce experiments using TCT-ColBERTv2 dense retrieval models on the MS MARCO (V2) collections.
The model is described in the following paper:

> Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. [In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval.](https://aclanthology.org/2021.repl4nlp-1.17/) _Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)_, pages 163-173, August 2021.

At present, all indexes are referenced as absolute paths on our Waterloo machine `orca`, so these results are not broadly reproducible.
We are working on figuring out ways to distribute the indexes.

For the TREC 2021 Deep Learning Track, we applied our TCT-ColBERTv2 model trained on MS MARCO (V1) in a zero-shot manner.
Specifically, we applied inference over the MS MARCO V2 [passage corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#passage-collection) and [segmented document corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#document-collection-segmented) to obtain the dense vectors.

Let's prepare our environment variables:

```bash
export PASSAGE_INDEX0="/store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-passage-v2-augmented"
export DOC_INDEX0="/store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-doc-v2-segmented"
export ENCODER0="castorini/tct_colbert-v2-hnp-msmarco"
```

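Since these are absolute paths on `orca`, a quick sanity check before kicking off a long retrieval job can save time. A minimal sketch, not part of the original guide, assuming you are on a machine where the paths should resolve:

```python
import os

# Print each variable and warn if a local path doesn't resolve.
# Assumption: run on a machine (e.g., orca) where the indexes live.
for var in ('PASSAGE_INDEX0', 'DOC_INDEX0', 'ENCODER0'):
    val = os.environ.get(var)
    print(f'{var} = {val}')
    if val and val.startswith('/') and not os.path.isdir(val):
        print(f'  warning: {val} is not a directory on this machine')
```
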
## Passage V2

Dense retrieval with TCT-ColBERT-V2, brute-force index:

```bash
$ python -m pyserini.dsearch --topics collections/passv2_dev_queries.tsv \
    --index ${PASSAGE_INDEX0} \
    --encoder ${ENCODER0} \
    --batch-size 144 \
    --threads 36 \
    --output runs/run.msmarco-passage-v2-augmented.tct_colbert-v2-hnp.0shot.dev1.trec \
    --output-format trec
```

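The same retrieval can also be spot-checked interactively from Python. Below is a sketch, not part of the original guide, assuming the dense-search API of the Pyserini version contemporary with this doc (`SimpleDenseSearcher` and `TctColBertQueryEncoder` in `pyserini.dsearch`) and an arbitrary example query:

```python
from pyserini.dsearch import SimpleDenseSearcher, TctColBertQueryEncoder

# Assumption: the FAISS index path from PASSAGE_INDEX0 is accessible locally.
encoder = TctColBertQueryEncoder('castorini/tct_colbert-v2-hnp-msmarco')
searcher = SimpleDenseSearcher(
    '/store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-passage-v2-augmented',
    encoder,
)

# Arbitrary example query, just to eyeball the top hits.
hits = searcher.search('how do planes fly', k=10)
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:40} {hit.score:.5f}')
```
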
To evaluate using `trec_eval`:

```bash
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank collections/passv2_dev_qrels.tsv runs/run.msmarco-passage-v2-augmented.tct_colbert-v2-hnp.0shot.dev1.trec
Results:
map                     all     0.1472
recip_rank              all     0.1483

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 collections/passv2_dev_qrels.tsv runs/run.msmarco-passage-v2-augmented.tct_colbert-v2-hnp.0shot.dev1.trec
Results:
recall_100              all     0.5873
recall_1000             all     0.8321
```

We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics.
However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.

Because there are duplicate passages in the MS MARCO V2 collections, score differences might be observed due to tie-breaking effects.
For example, if we output in MS MARCO format with `--output-format msmarco` and then convert to TREC format with `pyserini.eval.convert_msmarco_run_to_trec_run`, the scores will be different.

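To make the tie-breaking effect concrete, here is a toy illustration with made-up passage ids and scores:

```python
# Toy illustration (made-up ids and scores): a judged passage and its
# duplicate receive exactly the same score, so rank-based metrics depend
# on how ties are broken when the run is (re)sorted.
run = [('passage_A_dup', 82.5), ('passage_A', 82.5), ('passage_B', 81.9)]

asc = sorted(run, key=lambda h: (-h[1], h[0]))                 # ties: docid ascending
desc = sorted(run, key=lambda h: (h[1], h[0]), reverse=True)   # ties: docid descending

# If only 'passage_A' is relevant, its reciprocal rank is 1/1 under one
# rule and 1/2 under the other -- same run, two different scores.
print([d for d, _ in asc])   # ['passage_A', 'passage_A_dup', 'passage_B']
print([d for d, _ in desc])  # ['passage_A_dup', 'passage_A', 'passage_B']
```
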
## Document V2

Dense retrieval with TCT-ColBERT-V2, brute-force index:

```bash
$ python -m pyserini.dsearch --topics collections/docv2_dev_queries.tsv \
    --index ${DOC_INDEX0} \
    --encoder ${ENCODER0} \
    --batch-size 144 \
    --threads 36 \
    --hits 10000 \
    --max-passage-hits 1000 \
    --max-passage \
    --output runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.dev1.trec \
    --output-format trec
```

To evaluate using `trec_eval`:

```bash
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.dev1.trec
Results:
map                     all     0.2440
recip_rank              all     0.2464

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.dev1.trec
Results:
recall_100              all     0.7873
recall_1000             all     0.9161
```

We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics.
However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.

Same comment about duplicate passages and score ties applies here as well.

## Reproduction Log[*](reproducibility.md)