Refactor MS MARCO v1 doc segmented uniCOIL regressions (#1854)
Ref detailed discussions in #1853
lintool authored Apr 23, 2022
1 parent 35d7801 commit b429218
Showing 16 changed files with 75 additions and 77 deletions.
20 changes: 10 additions & 10 deletions docs/regressions-dl19-doc-segmented-unicoil-noexp.md
@@ -1,6 +1,6 @@
# Anserini Regressions: TREC 2019 Deep Learning Track (Document)

-**Model**: uniCOIL (without any expansions) on segmented documents
+**Model**: uniCOIL (without any expansions) on segmented documents (title/segment encoding)

This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (without any expansions) on the [TREC 2019 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2019.html).
The uniCOIL model is described in the following paper:
@@ -22,19 +22,19 @@ python src/main/python/run_regression.py --index --verify --search --regression

## Corpus

-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

Download the corpus and unpack into `collections/`:

```
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar -P collections/
-tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
+tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
```

-To confirm, `msmarco-doc-segmented-unicoil.tar` is 18 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
+To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.
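
The checksum comparison can also be scripted. The following is a minimal sketch and not part of Anserini: the helper name `md5_of_file` is hypothetical, and the path and expected digest are the ones quoted for the `-noexp` tarball above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Reading in chunks avoids loading a multi-gigabyte tarball into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Path and digest as quoted above; adjust if the corpus lives elsewhere.
    tarball = Path("collections/msmarco-doc-segmented-unicoil-noexp.tar")
    expected = "11b226e1cacd9c8ae0a660fd14cdd710"
    if tarball.exists():
        actual = md5_of_file(tarball)
        print("OK" if actual == expected else f"MISMATCH: {actual}")
    else:
        print(f"{tarball} not found; download it first")
```

Reading in chunks keeps memory use flat regardless of the tarball size.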

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:

@@ -59,7 +59,7 @@ target/appassembler/bin/IndexCollection \
>& logs/log.msmarco-doc-segmented-unicoil-noexp &
```

-The directory `/path/to/msmarco-doc-segmented-unicoil/` should point to the corpus downloaded above.
+The directory `/path/to/msmarco-doc-segmented-unicoil-noexp/` should point to the corpus downloaded above.

The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
Upon completion, we should have an index with 20,545,677 documents.
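
To make the `-impact -pretokenized` options more concrete: conceptually, an impact index scores a document as the sum, over query terms that also occur in the document, of the query term weight times the stored document term weight, with no BM25 length normalization. The sketch below illustrates that idea only. It is not Anserini's implementation, and the function name, document ids, tokens, and weights are all made up for illustration.

```python
def impact_score(query_weights, doc_weights):
    """Sum of query_weight * doc_weight over terms shared by query and document."""
    return sum(
        q_weight * doc_weights[term]
        for term, q_weight in query_weights.items()
        if term in doc_weights
    )


# Hypothetical pre-tokenized, pre-weighted representations (integer impacts).
query = {"deep": 2, "learning": 3}
docs = {
    "doc1#0": {"deep": 5, "learning": 4, "track": 1},
    "doc1#1": {"learning": 7},
    "doc2#0": {"ranking": 6},
}

# Rank segments by descending impact score.
ranked = sorted(docs, key=lambda d: impact_score(query, docs[d]), reverse=True)
for doc_id in ranked:
    print(doc_id, impact_score(query, docs[doc_id]))
```

Because the weights are fixed before indexing, no further tokenization or length-based rescaling is needed at index time.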
@@ -98,22 +98,22 @@ With the above commands, you should be able to reproduce the following results:

| AP@100 | uniCOIL (no expansions)|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.2621 |
+| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.2665 |


| nDCG@10 | uniCOIL (no expansions)|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.6118 |
+| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.6349 |


| R@100 | uniCOIL (no expansions)|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.3956 |
+| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.3943 |


| R@1000 | uniCOIL (no expansions)|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.6382 |
+| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.6391 |

Note that in the official evaluation for document ranking, all runs were truncated to top-100 hits per query (whereas all top-1000 hits per query were retained for passage ranking).
Thus, average precision is computed to depth 100 (i.e., AP@100); nDCG@10 remains unaffected.
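
As a rough illustration of what computing average precision to depth 100 means, the sketch below truncates a ranked list at depth `k` before accumulating precision at each relevant hit. It assumes binary relevance and normalizes by the total number of relevant documents. The helper name and the toy document ids are hypothetical; the reported numbers come from the evaluation tooling used by the regression framework, not from this function.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=100):
    """Average precision over the top-k hits, with binary relevance."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Only the first k hits contribute; anything ranked deeper is ignored.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by all relevant documents, including those never retrieved.
    return precision_sum / len(relevant)


run = ["d1", "d2", "d3"]
qrels = {"d1", "d3"}
print(average_precision_at_k(run, qrels, k=3))
print(average_precision_at_k(run, qrels, k=2))
```

Truncating at a shallower depth can only lower or preserve the score, since relevant documents below the cutoff stop contributing.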
4 changes: 2 additions & 2 deletions docs/regressions-dl19-doc-segmented-unicoil.md
@@ -1,6 +1,6 @@
# Anserini Regressions: TREC 2019 Deep Learning Track (Document)

-**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents
+**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents (title/segment encoding)

This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (with doc2query-T5 expansions) on the [TREC 2019 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2019.html).
The uniCOIL model is described in the following paper:
@@ -22,7 +22,7 @@ python src/main/python/run_regression.py --index --verify --search --regression

## Corpus

-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

20 changes: 10 additions & 10 deletions docs/regressions-dl20-doc-segmented-unicoil-noexp.md
@@ -1,6 +1,6 @@
# Anserini Regressions: TREC 2020 Deep Learning Track (Document)

-**Model**: uniCOIL (without any expansions) on segmented documents
+**Model**: uniCOIL (without any expansions) on segmented documents (title/segment encoding)

This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (without any expansions) on the [TREC 2020 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2020.html).
The uniCOIL model is described in the following paper:
@@ -22,19 +22,19 @@ python src/main/python/run_regression.py --index --verify --search --regression

## Corpus

-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

Download the corpus and unpack into `collections/`:

```
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar -P collections/
-tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
+tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
```

-To confirm, `msmarco-doc-segmented-unicoil.tar` is 18 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
+To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:

@@ -59,7 +59,7 @@ target/appassembler/bin/IndexCollection \
>& logs/log.msmarco-doc-segmented-unicoil-noexp &
```

-The directory `/path/to/msmarco-doc-segmented-unicoil/` should point to the corpus downloaded above.
+The directory `/path/to/msmarco-doc-segmented-unicoil-noexp/` should point to the corpus downloaded above.

The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
Upon completion, we should have an index with 20,545,677 documents.
@@ -98,22 +98,22 @@ With the above commands, you should be able to reproduce the following results:

| AP@100 | uniCOIL w/ doc2query-T5 expansion|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.3586 |
+| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.3698 |


| nDCG@10 | uniCOIL w/ doc2query-T5 expansion|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.5632 |
+| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.5893 |


| R@100 | uniCOIL w/ doc2query-T5 expansion|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.5932 |
+| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.5872 |


| R@1000 | uniCOIL w/ doc2query-T5 expansion|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.7562 |
+| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.7623 |

Note that in the official evaluation for document ranking, all runs were truncated to top-100 hits per query (whereas all top-1000 hits per query were retained for passage ranking).
Thus, average precision is computed to depth 100 (i.e., AP@100); nDCG@10 remains unaffected.
4 changes: 2 additions & 2 deletions docs/regressions-dl20-doc-segmented-unicoil.md
@@ -1,6 +1,6 @@
# Anserini Regressions: TREC 2020 Deep Learning Track (Document)

-**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents
+**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents (title/segment encoding)

This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (with doc2query-T5 expansions) on the [TREC 2020 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2020.html).
The uniCOIL model is described in the following paper:
@@ -22,7 +22,7 @@ python src/main/python/run_regression.py --index --verify --search --regression

## Corpus

-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

20 changes: 10 additions & 10 deletions docs/regressions-msmarco-doc-segmented-unicoil-noexp.md
@@ -1,6 +1,6 @@
# Anserini Regressions: MS MARCO Document Ranking

-**Model**: uniCOIL (without any expansions) on segmented documents
+**Model**: uniCOIL (without any expansions) on segmented documents (title/segment encoding)

This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (without any expansions) on the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking).
The uniCOIL model is described in the following paper:
@@ -22,19 +22,19 @@ python src/main/python/run_regression.py --index --verify --search --regression

## Corpus

-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

Download the corpus and unpack into `collections/`:

```
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar -P collections/
-tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
+tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
```

-To confirm, `msmarco-doc-segmented-unicoil.tar` is 18 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
+To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:

@@ -59,7 +59,7 @@ target/appassembler/bin/IndexCollection \
>& logs/log.msmarco-doc-segmented-unicoil-noexp &
```

-The directory `/path/to/msmarco-doc-segmented-unicoil/` should point to the corpus downloaded above.
+The directory `/path/to/msmarco-doc-segmented-unicoil-noexp/` should point to the corpus downloaded above.

The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
Upon completion, we should have an index with 20,545,677 documents.
@@ -97,22 +97,22 @@ With the above commands, you should be able to reproduce the following results:

| AP@1000 | uniCOIL (no expansions)|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.3200 |
+| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.3413 |


| RR@100 | uniCOIL (no expansions)|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.3195 |
+| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.3409 |


| R@100 | uniCOIL (no expansions)|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.8398 |
+| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.8639 |


| R@1000 | uniCOIL (no expansions)|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.9286 |
+| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.9420 |

This model corresponds to the run named "uniCOIL-d2q" on the official MS MARCO Document Ranking Leaderboard, submitted 2021/09/16.
The following command generates a comparable run:
4 changes: 2 additions & 2 deletions docs/regressions-msmarco-doc-segmented-unicoil.md
@@ -1,6 +1,6 @@
# Anserini Regressions: MS MARCO Document Ranking

-**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents
+**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents (title/segment encoding)

This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (with doc2query-T5 expansions) on the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking).
The uniCOIL model is described in the following paper:
@@ -22,7 +22,7 @@ python src/main/python/run_regression.py --index --verify --search --regression

## Corpus

-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
