From 9b4ec1156c0ec2d3084105cb5b4a6edde6101a44 Mon Sep 17 00:00:00 2001 From: Jimmy Lin Date: Sat, 14 Aug 2021 16:47:53 -0400 Subject: [PATCH] Update of TCT-ColBERTv2 docs for MS MARCO V2 (#736) --- README.md | 15 +-- docs/experiments-msmarco-v2-tct_colbert-v2.md | 96 ++++++++++--------- docs/experiments-msmarco-v2-unicoil.md | 12 +-- docs/experiments-tct_colbert-v2.md | 12 +-- docs/experiments-tct_colbert.md | 6 +- 5 files changed, 71 insertions(+), 70 deletions(-) diff --git a/README.md b/README.md index 616e53a23..1b4912574 100644 --- a/README.md +++ b/README.md @@ -387,18 +387,19 @@ With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a numbe + [Reproducing runs directly from the Python package](docs/pypi-reproduction.md) + [Reproducing Robust04 baselines for ad hoc retrieval](docs/experiments-robust04.md) -+ [Reproducing the BM25 baseline for MS MARCO Passage Ranking](docs/experiments-msmarco-passage.md) -+ [Reproducing the BM25 baseline for MS MARCO Document Ranking](docs/experiments-msmarco-doc.md) -+ [Reproducing the multi-field BM25 baseline for MS MARCO Document Ranking from Elasticsearch](docs/experiments-elastic.md) ++ [Reproducing the BM25 baseline for MS MARCO (V1) Passage Ranking](docs/experiments-msmarco-passage.md) ++ [Reproducing the BM25 baseline for MS MARCO (V1) Document Ranking](docs/experiments-msmarco-doc.md) ++ [Reproducing the multi-field BM25 baseline for MS MARCO (V1) Document Ranking from Elasticsearch](docs/experiments-elastic.md) + [Reproducing BM25 baselines on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2.md) -+ [Reproducing DeepImpact experiments for MS MARCO Passage Ranking](docs/experiments-deepimpact.md) -+ [Reproducing uniCOIL experiments for MS MARCO Passage Ranking](docs/experiments-unicoil.md) ++ [Reproducing DeepImpact experiments for MS MARCO (V1) Passage Ranking](docs/experiments-deepimpact.md) ++ [Reproducing uniCOIL experiments for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil.md) + [Reproducing uniCOIL experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-unicoil.md) ### Dense Retrieval -+ [Reproducing TCT-ColBERTv2 experiments](docs/experiments-tct_colbert-v2.md) -+ [Reproducing TCT-ColBERTv1 experiments](docs/experiments-tct_colbert.md) ++ [Reproducing TCT-ColBERTv1 experiments on the MS MARCO (V1) Collections](docs/experiments-tct_colbert.md) ++ [Reproducing TCT-ColBERTv2 experiments on the MS MARCO (V1) Collections](docs/experiments-tct_colbert-v2.md) ++ [Reproducing TCT-ColBERTv2 experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-tct_colbert-v2.md) + [Reproducing DPR experiments](docs/experiments-dpr.md) + [Reproducing ANCE experiments](docs/experiments-ance.md) + [Reproducing DistilBERT KD experiments](docs/experiments-distilbert_kd.md) diff --git a/docs/experiments-msmarco-v2-tct_colbert-v2.md b/docs/experiments-msmarco-v2-tct_colbert-v2.md index 87944edc0..69f5c080c 100644 --- a/docs/experiments-msmarco-v2-tct_colbert-v2.md +++ b/docs/experiments-msmarco-v2-tct_colbert-v2.md @@ -1,91 +1,93 @@ -# Pyserini: Baseline for MS MARCO V2: TCT-ColBERT-V2 +# Pyserini: TCT-ColBERTv2 for MS MARCO (V2) Collections -This guide provides instructions to reproduce the family of TCT-ColBERT-V2 dense retrieval models described in the following paper: +This guide provides instructions to reproduce experiments using TCT-ColBERTv2 dense retrieval models on the MS MARCO (V2) collections. 
+The model is described in the following paper:
 
-> Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. [In-Batch Negatives for Knowledge Distillation with Tightly-CoupledTeachers for Dense Retrieval.](https://cs.uwaterloo.ca/~jimmylin/publications/Lin_etal_2021_RepL4NLP.pdf) _RepL4NLP 2021_.
+> Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. [In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval.](https://aclanthology.org/2021.repl4nlp-1.17/) _Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)_, pages 163-173, August 2021.
 
+At present, all indexes are referenced as absolute paths on our Waterloo machine `orca`, so these results are not broadly reproducible.
+We are working on ways to distribute the indexes.
 
-## Data Prep
-
-
-
-If you're having issues downloading the collection via `wget`, try using [AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).
-
-1. We use [augmented passage collection](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#passage-collection-augmented) and [segmented document collection](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#document-collection-segmented)
-2. Currently, the prebuilt index is on our Waterloo machine `orca`.
-3. We only encode `title`, `headings`, and `passage` (or `segment`) for passage (or document) collections.
+For the TREC 2021 Deep Learning Track, we applied our TCT-ColBERTv2 model trained on MS MARCO (V1) in a zero-shot manner.
+Specifically, we applied inference over the MS MARCO V2 [passage corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#passage-collection) and [segmented document corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#document-collection-segmented) to obtain the dense vectors.
 
 Let's prepare our environment variables:
 
 ```bash
-export PSG_INDEX="/store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-passage-v2-augmented"
-export DOC_INDEX="/store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-doc-v2-segmented"
-export ENCODER="castorini/tct_colbert-v2-hnp-msmarco"
+export PASSAGE_INDEX0="/store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-passage-v2-augmented"
+export DOC_INDEX0="/store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-doc-v2-segmented"
+export ENCODER0="castorini/tct_colbert-v2-hnp-msmarco"
 ```
 
-## MS MARCO Passage V2
+## Passage V2
 
 Dense retrieval with TCT-ColBERT-V2, brute-force index:
 
 ```bash
 $ python -m pyserini.dsearch --topics collections/passv2_dev_queries.tsv \
-                             --index ${PSG_INDEX} \
-                             --encoder ${ENCODER} \
+                             --index ${PASSAGE_INDEX0} \
+                             --encoder ${ENCODER0} \
                              --batch-size 144 \
                              --threads 36 \
-                             --output runs/run.msmarco-passage-v2-augmented.tct_colbert-v2-hnp.0shot.top1k.dev1.trec \
+                             --output runs/run.msmarco-passage-v2-augmented.tct_colbert-v2-hnp.0shot.dev1.trec \
                              --output-format trec
 ```
 
-To evaluate:
-
-We use the official TREC evaluation tool `trec_eval` to compute metrics.
-> Note: There are duplicated passages in msmarco v2, the following results will be different from using `--output-format msmarco` with `pyserini.eval.convert_msmarco_run_to_trec_run` because of tie breaking.
+To evaluate using `trec_eval`: ```bash -$ python -m pyserini.eval.trec_eval -c -m recall.10,100,1000 -mmap -m -m recip_rank collections/passv2_dev_qrels.uniq.tsv runs/run.msmarco-passage-v2-augmented.tct_colbert-v2-hnp.0shot.top1k.dev1.trec +$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank collections/passv2_dev_qrels.tsv runs/run.msmarco-passage-v2-augmented.tct_colbert-v2-hnp.0shot.dev1.trec +Results: +map all 0.1461 +recip_rank all 0.1473 + +$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 collections/passv2_dev_qrels.tsv runs/run.msmarco-passage-v2-augmented.tct_colbert-v2-hnp.0shot.dev1.trec Results: -map all 0.1472 -recip_rank all 0.1483 -recall_10 all 0.2743 -recall_100 all 0.5873 -recall_1000 all 0.8321 +recall_100 all 0.5873 +recall_1000 all 0.8321 ``` -## MS MARCO Document V2 +We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. +However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking. -Dense retrieval with TCT-ColBERT-V2, brute-force index: +Because there are duplicate passages in MS MARCO V2 collections, score differences might be observed due to tie-breaking effects. +For example, if we output in MS MARCO format `--output-format msmarco` and then convert to TREC format with `pyserini.eval.convert_msmarco_run_to_trec_run`, the scores will be different. +## Document V2 -```bash +Dense retrieval with TCT-ColBERT-V2, brute-force index: +```bash $ python -m pyserini.dsearch --topics collections/docv2_dev_queries.tsv \ - --index ${DOC_INDEX} \ - --encoder ${ENCODER} \ + --index ${DOC_INDEX0} \ + --encoder ${ENCODER0} \ --batch-size 144 \ --threads 36 \ - --hits 1000 \ - --max-passage-hits 100 \ + --hits 10000 \ + --max-passage-hits 1000 \ --max-passage \ - --output runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.maxp.top100.dev1.trec \ + --output runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.dev1.trec \ --output-format trec ``` -To evaluate: - -We use the official TREC evaluation tool `trec_eval` to compute metrics. +To evaluate using `trec_eval`: ```bash -$ python -m pyserini.eval.trec_eval -c -m recall.10,100 -mmap -m -m recip_rank collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.maxp.top100.dev1.trec +$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.dev1.trec Results: -map all 0.2440 -recip_rank all 0.2464 -recall_10 all 0.4784 -recall_100 all 0.7873 +map all 0.2440 +recip_rank all 0.2464 + +$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.dev1.trec +Results: +recall_100 all 0.7873 +recall_1000 all 0.9161 ``` +We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. +However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking. + +Same comment about duplicate passages and score ties applies here as well. 
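To make the tie-breaking caveat concrete, the sketch below shows the alternative path the note refers to: write the run in MS MARCO format and then convert it to TREC format. The output file names are illustrative, and the converter's `--input`/`--output` options are assumptions carried over from other Pyserini guides; small score differences from the TREC-format run above are expected.

```bash
# Sketch (assumed file names): produce the document run in MS MARCO format instead of TREC format.
$ python -m pyserini.dsearch --topics collections/docv2_dev_queries.tsv \
                             --index ${DOC_INDEX0} \
                             --encoder ${ENCODER0} \
                             --batch-size 144 \
                             --threads 36 \
                             --hits 10000 \
                             --max-passage-hits 1000 \
                             --max-passage \
                             --output runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.dev1.txt \
                             --output-format msmarco

# Then convert to TREC format for trec_eval; ties may be broken differently than in the run above.
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
    --input runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.dev1.txt \
    --output runs/run.msmarco-document-v2-segmented.tct_colbert-v2-hnp.0shot.dev1.converted.trec
```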
 ## Reproduction Log[*](reproducibility.md)
 
diff --git a/docs/experiments-msmarco-v2-unicoil.md b/docs/experiments-msmarco-v2-unicoil.md
index 69ee623b5..9604f79e8 100644
--- a/docs/experiments-msmarco-v2-unicoil.md
+++ b/docs/experiments-msmarco-v2-unicoil.md
@@ -15,7 +15,7 @@ Thus, we applied uniCOIL without expansions in a zero-shot manner using the mode
 Specifically, we applied inference over the MS MARCO V2 [passage corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#passage-collection) and [segmented document corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#document-collection-segmented) to obtain the term weights.
 
-### Passage V2 Corpus
+### Passage V2
 
 Sparse retrieval with uniCOIL:
 
@@ -48,7 +48,7 @@ recall_1000 all 0.7013
 Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics.
 However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
 
-### Document V2 Corpus
+### Document V2
 
 Sparse retrieval with uniCOIL:
 
@@ -82,15 +82,15 @@ recall_100 all 0.7190
 recall_1000 all 0.8813
 ```
 
-Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics.
+We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics.
 However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
 
 ## Zero-Shot uniCOIL + Dense Retrieval Hybrid
 
-Note that there are duplicate passages in MS MARCO V2 collections, so score differences might be observed due to tie-breaking effects.
+Because there are duplicate passages in MS MARCO V2 collections, score differences might be observed due to tie-breaking effects.
 For example, if we output in MS MARCO format `--output-format msmarco` and then convert to TREC format with `pyserini.eval.convert_msmarco_run_to_trec_run`, the scores will be different.
 
-### Passage V2 Corpus
+### Passage V2
 
 Dense-sparse hybrid retrieval (uniCOIL zero-shot + TCT_ColBERT_v2 zero-shot):
 
@@ -148,7 +148,7 @@ recall_100 all 0.6701
 recall_1000 all 0.8748
 ```
 
-### Document V2 Corpus
+### Document V2
 
 Dense-sparse hybrid retrieval (uniCOIL zero-shot + TCT_ColBERT_v2 zero-shot):
 
diff --git a/docs/experiments-tct_colbert-v2.md b/docs/experiments-tct_colbert-v2.md
index 612603a21..989a8fe87 100644
--- a/docs/experiments-tct_colbert-v2.md
+++ b/docs/experiments-tct_colbert-v2.md
@@ -1,8 +1,8 @@
-# Pyserini: Reproducing TCT-ColBERT-V2 Results
+# Pyserini: TCT-ColBERTv2 for MS MARCO (V1) Collections
 
 This guide provides instructions to reproduce the family of TCT-ColBERT-V2 dense retrieval models described in the following paper:
 
-> Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. [In-Batch Negatives for Knowledge Distillation with Tightly-CoupledTeachers for Dense Retrieval.](https://cs.uwaterloo.ca/~jimmylin/publications/Lin_etal_2021_RepL4NLP.pdf) _RepL4NLP 2021_.
+> Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. [In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval.](https://aclanthology.org/2021.repl4nlp-1.17/) _Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)_, pages 163-173, August 2021.
 
 Since dense retrieval depends on neural networks, Pyserini requires a more complex set of dependencies to use this feature.
 See [package installation notes](../README.md#package-installation) for more details.
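For readers who skip the linked notes, a minimal sketch of what that setup typically involves is below. The package set is an assumption based on the installation notes at the time of writing; take exact, supported versions from the linked guide rather than from this sketch.

```bash
# Minimal sketch (assumed package set): dense retrieval needs PyTorch, Hugging Face
# Transformers, and Faiss on top of the base Pyserini install.
pip install pyserini
pip install torch transformers faiss-cpu
```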
@@ -25,7 +25,7 @@ Summary of results (figures from the paper are in parentheses): The slight differences between the reproduced scores and those reported in the paper can be attributed to TensorFlow implementations in the published paper vs. PyTorch implementations here in this reproduction guide. -## TCT_ColBERT-V2 +### TCT_ColBERT-V2 Dense retrieval with TCT-ColBERT, brute-force index: @@ -61,7 +61,7 @@ map all 0.3509 recall_1000 all 0.9670 ``` -## TCT_ColBERT-V2-HN +### TCT_ColBERT-V2-HN ```bash $ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \ @@ -88,7 +88,7 @@ map all 0.3608 recall_1000 all 0.9708 ``` -## TCT_ColBERT-V2-HN+ +### TCT_ColBERT-V2-HN+ ```bash $ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \ @@ -119,7 +119,6 @@ To perform on-the-fly query encoding with our [pretrained encoder model](https:/ Query encoding will run on the CPU by default. To perform query encoding on the GPU, use the option `--device cuda:0`. - ### Hybrid Dense-Sparse Retrieval with TCT_ColBERT-V2-HN+ Hybrid retrieval with dense-sparse representations (without document expansion): @@ -295,7 +294,6 @@ ndcg_cut_10 all 0.6592 ``` - ## Reproduction Log[*](reproducibility.md) + Results reproduced by [@lintool](https://github.com/lintool) on 2021-07-01 (commit [`b1576a2`](https://github.com/castorini/pyserini/commit/b1576a2c3e899349be12e897f92f3ad75ec82d6f)) diff --git a/docs/experiments-tct_colbert.md b/docs/experiments-tct_colbert.md index bd9f79089..661a0c43e 100644 --- a/docs/experiments-tct_colbert.md +++ b/docs/experiments-tct_colbert.md @@ -1,4 +1,4 @@ -# Pyserini: Reproducing TCT-ColBERT Results +# Pyserini: TCT-ColBERT for MS MARCO (V1) Collections This guide provides instructions to reproduce the TCT-ColBERT dense retrieval model described in the following paper: @@ -23,7 +23,7 @@ Summary of results: | TCT-ColBERT (brute-force index) + BoW BM25 | 0.3529 | 0.3594 | 0.9698 | | TCT-ColBERT (brute-force index) + BM25 w/ doc2query-T5 | 0.3647 | 0.3711 | 0.9751 | -## Dense Retrieval +### Dense Retrieval Dense retrieval with TCT-ColBERT, brute-force index: @@ -91,7 +91,7 @@ recall_1000 all 0.9618 Follow the same instructions above to perform on-the-fly query encoding. The caveat about minor differences in score applies here as well. -## Hybrid Dense-Sparse Retrieval +### Hybrid Dense-Sparse Retrieval Hybrid retrieval with dense-sparse representations (without document expansion): - dense retrieval with TCT-ColBERT, brute force index.