diff --git a/README.md b/README.md
index 8b7be2c46d..0b56c7d70c 100644
--- a/README.md
+++ b/README.md
@@ -54,16 +54,16 @@ For the most part, these runs are based on [_default_ parameter settings](https:
   + Bag-of-words models: [baselines](docs/regressions-msmarco-passage.md), [doc2query](docs/regressions-msmarco-passage-doc2query.md), [doc2query-T5](docs/regressions-msmarco-passage-docTTTTTquery.md)
   + Sparse learned models: [DeepImpact](docs/regressions-msmarco-passage-deepimpact.md), [uniCOIL with doc2query-T5](docs/regressions-msmarco-passage-unicoil.md), [uniCOIL with TILDE](docs/regressions-msmarco-passage-unicoil-tilde-expansion.md), [SPLADEv2](docs/regressions-msmarco-passage-distill-splade-max.md)
 + Regressions for MS MARCO (V1) Document Ranking:
-  + Per doc method: [baselines](docs/regressions-msmarco-doc.md), [doc2query-T5](docs/regressions-msmarco-doc-docTTTTTquery-per-doc.md)
-  + Per passage method: [baselines](docs/regressions-msmarco-doc-per-passage.md) ([v2](docs/regressions-msmarco-doc-per-passage-v2.md), [v3](docs/regressions-msmarco-doc-per-passage-v3.md))[*](docs/experiments-msmarco-doc-doc2query-details.md), [doc2query-T5](docs/regressions-msmarco-doc-docTTTTTquery-per-passage.md) ([v3](docs/regressions-msmarco-doc-docTTTTTquery-per-passage-v3.md))[*](docs/experiments-msmarco-doc-doc2query-details.md)
+  + Complete doc[*](docs/experiments-msmarco-doc-doc2query-details.md): [baselines](docs/regressions-msmarco-doc.md), [doc2query-T5](docs/regressions-msmarco-doc-docTTTTTquery.md)
+  + Segmented doc[*](docs/experiments-msmarco-doc-doc2query-details.md): [baselines](docs/regressions-msmarco-doc-segmented.md), [doc2query-T5](docs/regressions-msmarco-doc-segmented-docTTTTTquery.md)
 + Regressions for TREC 2019 Deep Learning Track:
   + Passage ranking: [baselines](docs/regressions-dl19-passage.md), [doc2query-T5](docs/regressions-dl19-passage-docTTTTTquery.md)
-  + Document ranking, per doc method: [baselines](docs/regressions-dl19-doc.md), [doc2query-T5](docs/regressions-dl19-doc-docTTTTTquery-per-doc.md)
-  + Document ranking, per passage method: [baselines](docs/regressions-dl19-doc-per-passage.md), [doc2query-T5](docs/regressions-dl19-doc-docTTTTTquery-per-passage.md)
+  + Document ranking, complete doc[*](docs/experiments-msmarco-doc-doc2query-details.md): [baselines](docs/regressions-dl19-doc.md), [doc2query-T5](docs/regressions-dl19-doc-docTTTTTquery.md)
+  + Document ranking, segmented doc[*](docs/experiments-msmarco-doc-doc2query-details.md): [baselines](docs/regressions-dl19-doc-segmented.md), [doc2query-T5](docs/regressions-dl19-doc-segmented-docTTTTTquery.md)
 + Regressions for TREC 2020 Deep Learning Track:
   + Passage ranking: [baselines](docs/regressions-dl20-passage.md), [doc2query-T5](docs/regressions-dl20-passage-docTTTTTquery.md)
-  + Document ranking, per doc method: [baselines](docs/regressions-dl20-doc.md), [doc2query-T5](docs/regressions-dl20-doc-docTTTTTquery-per-doc.md)
-  + Document ranking, per passage method: [baselines](docs/regressions-dl20-doc-per-passage.md), [doc2query-T5](docs/regressions-dl20-doc-docTTTTTquery-per-passage.md)
+  + Document ranking, complete doc[*](docs/experiments-msmarco-doc-doc2query-details.md): [baselines](docs/regressions-dl20-doc.md), [doc2query-T5](docs/regressions-dl20-doc-docTTTTTquery.md)
+  + Document ranking, segmented doc[*](docs/experiments-msmarco-doc-doc2query-details.md): [baselines](docs/regressions-dl20-doc-segmented.md), [doc2query-T5](docs/regressions-dl20-doc-segmented-docTTTTTquery.md)
 + Regressions for MS MARCO (V2) Passage Ranking:
   + Bag-of-words models: [baselines](docs/regressions-msmarco-v2-passage.md), [on augmented corpus](docs/regressions-msmarco-v2-passage-augmented.md)
   + Sparse learned models: [uniCOIL noexp zero-shot](docs/regressions-msmarco-v2-passage-unicoil-noexp-0shot.md)
diff --git a/docs/experiments-msmarco-doc-doc2query-details.md b/docs/experiments-msmarco-doc-doc2query-details.md
index c39bb63129..02c0989a44 100644
--- a/docs/experiments-msmarco-doc-doc2query-details.md
+++ b/docs/experiments-msmarco-doc-doc2query-details.md
@@ -1,4 +1,4 @@
-# Anserini: Reproducibility Notes for MS MARCO V1 Doc Ranking
+# Anserini: Reproducibility Notes for MS MARCO V1

Reproducibility is hard.

— Jimmy Lin (@lintool) November 11, 2021
@@ -22,8 +22,22 @@ This was for dense retrieval experiments, as we were not aware of the doc2query-
 It is very likely, but we cannot know for sure, that this was the same segmentation that generated the original doc2query-T5 expansions.
 Fortunately, Xueguang was able to save a copy of this segmented corpus.
 
-So, now we have:
+---
 
-+ `doc-per-passage-v2`: materialized corpus with 20,545,677 segments.
-+ `doc-per-passage-v3`: same as above, except with URL. Note that bag-of-words search over this variant yields higher effectiveness than above, but for input to an encoder, you probably don't want to include the URL.
-+ `doc-docTTTTTquery-per-doc-v3`: `doc-per-passage-v3`, but with the doc2query-T5 expansions added in.
+In January 2022, we completely refactored the doc2query-T5 expansion data for the MS MARCO (V1) corpora.
+They are now available as Huggingface Datasets:
+
++ [`msmarco_v1_passage_doc2query-t5_expansions`](https://huggingface.co/datasets/castorini/msmarco_v1_passage_doc2query-t5_expansions): passage expansions
++ [`msmarco_v1_doc_doc2query-t5_expansions`](https://huggingface.co/datasets/castorini/msmarco_v1_doc_doc2query-t5_expansions): document expansions
++ [`msmarco_v1_doc_segmented_doc2query-t5_expansions`](https://huggingface.co/datasets/castorini/msmarco_v1_doc_segmented_doc2query-t5_expansions): document segment expansions
+
+So now we have the following new regressions:
+
++ `msmarco-doc`: document corpus in Anserini's jsonl format with 3,213,835 documents. Each contains URL, title, body, delimited by newlines.
++ `msmarco-doc-docTTTTTquery`: same as above, but with docTTTTTquery expansions, delimited by another newline.
++ `msmarco-segmented`: segmented document corpus in Anserini's jsonl format with 20,545,677 segments. Each contains URL, title, segment, delimited by newlines.
++ `msmarco-segmented-docTTTTTquery`: same as above, but with docTTTTTquery expansions, delimited by another newline.
+
+These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match, this may be the reason.
+
+*TODO:* Circle back and add links to scripts once everything has been verified and checked in.
diff --git a/docs/regressions-backgroundlinking18.md b/docs/regressions-backgroundlinking18.md
index 67c1aa8001..8ee485daef 100644
--- a/docs/regressions-backgroundlinking18.md
+++ b/docs/regressions-backgroundlinking18.md
@@ -12,7 +12,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection WashingtonPostCollection \
   -input /path/to/wapo.v2 \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -generator WashingtonPostGenerator \
   -threads 1 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.wapo.v2 &
@@ -34,19 +34,19 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.backgroundlinking18.txt -topicreader BackgroundLinking \
   -output runs/run.wapo.v2.bm25.topics.backgroundlinking18.txt \
   -backgroundlinking -backgroundlinking.k 100 -bm25 -hits 100 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.backgroundlinking18.txt -topicreader BackgroundLinking \
   -output runs/run.wapo.v2.bm25+rm3.topics.backgroundlinking18.txt \
   -backgroundlinking -backgroundlinking.k 100 -bm25 -rm3 -hits 100 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.backgroundlinking18.txt -topicreader BackgroundLinking \
   -output runs/run.wapo.v2.bm25+rm3+df.topics.backgroundlinking18.txt \
   -backgroundlinking -backgroundlinking.datefilter -backgroundlinking.k 100 -bm25 -rm3 -hits 100 &
diff --git a/docs/regressions-backgroundlinking19.md b/docs/regressions-backgroundlinking19.md
index dbeec9cd53..c57965ea71 100644
--- a/docs/regressions-backgroundlinking19.md
+++ b/docs/regressions-backgroundlinking19.md
@@ -12,7 +12,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection WashingtonPostCollection \
   -input /path/to/wapo.v2 \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -generator WashingtonPostGenerator \
   -threads 1 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.wapo.v2 &
@@ -34,19 +34,19 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.backgroundlinking19.txt -topicreader BackgroundLinking \
   -output runs/run.wapo.v2.bm25.topics.backgroundlinking19.txt \
   -backgroundlinking -backgroundlinking.k 100 -bm25 -hits 100 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.backgroundlinking19.txt -topicreader BackgroundLinking \
   -output runs/run.wapo.v2.bm25+rm3.topics.backgroundlinking19.txt \
   -backgroundlinking -backgroundlinking.k 100 -bm25 -rm3 -hits 100 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.backgroundlinking19.txt -topicreader BackgroundLinking \
   -output runs/run.wapo.v2.bm25+rm3+df.topics.backgroundlinking19.txt \
   -backgroundlinking -backgroundlinking.datefilter -backgroundlinking.k 100 -bm25 -rm3 -hits 100 &
diff --git a/docs/regressions-backgroundlinking20.md b/docs/regressions-backgroundlinking20.md
index 9905891d01..b2fcec96c3 100644
--- a/docs/regressions-backgroundlinking20.md
+++ b/docs/regressions-backgroundlinking20.md
@@ -12,7 +12,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection WashingtonPostCollection \
   -input /path/to/wapo.v3 \
-  -index indexes/lucene-index.wapo.v3 \
+  -index indexes/lucene-index.wapo.v3/ \
   -generator WashingtonPostGenerator \
   -threads 1 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.wapo.v3 &
@@ -34,19 +34,19 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v3 \
+  -index indexes/lucene-index.wapo.v3/ \
   -topics src/main/resources/topics-and-qrels/topics.backgroundlinking20.txt -topicreader BackgroundLinking \
   -output runs/run.wapo.v3.bm25.topics.backgroundlinking20.txt \
   -backgroundlinking -backgroundlinking.k 100 -bm25 -hits 100 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v3 \
+  -index indexes/lucene-index.wapo.v3/ \
   -topics src/main/resources/topics-and-qrels/topics.backgroundlinking20.txt -topicreader BackgroundLinking \
   -output runs/run.wapo.v3.bm25+rm3.topics.backgroundlinking20.txt \
   -backgroundlinking -backgroundlinking.k 100 -bm25 -rm3 -hits 100 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v3 \
+  -index indexes/lucene-index.wapo.v3/ \
   -topics src/main/resources/topics-and-qrels/topics.backgroundlinking20.txt -topicreader BackgroundLinking \
   -output runs/run.wapo.v3.bm25+rm3+df.topics.backgroundlinking20.txt \
   -backgroundlinking -backgroundlinking.datefilter -backgroundlinking.k 100 -bm25 -rm3 -hits 100 &
diff --git a/docs/regressions-car17v1.5.md b/docs/regressions-car17v1.5.md
index 2049086e4d..16684b2f77 100644
--- a/docs/regressions-car17v1.5.md
+++ b/docs/regressions-car17v1.5.md
@@ -12,7 +12,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection CarCollection \
   -input /path/to/car-paragraphCorpus.v1.5 \
-  -index indexes/lucene-index.car-paragraphCorpus.v1.5 \
+  -index indexes/lucene-index.car-paragraphCorpus.v1.5/ \
   -generator DefaultLuceneDocumentGenerator \
   -threads 1 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.car-paragraphCorpus.v1.5 &
@@ -35,37 +35,37 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v1.5 \
+  -index indexes/lucene-index.car-paragraphCorpus.v1.5/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v1.5.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v1.5.bm25.topics.car17v1.5.benchmarkY1test.txt \
   -bm25 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v1.5 \
+  -index indexes/lucene-index.car-paragraphCorpus.v1.5/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v1.5.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v1.5.bm25+rm3.topics.car17v1.5.benchmarkY1test.txt \
   -bm25 -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v1.5 \
+  -index indexes/lucene-index.car-paragraphCorpus.v1.5/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v1.5.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v1.5.bm25+ax.topics.car17v1.5.benchmarkY1test.txt \
   -bm25 -axiom -axiom.deterministic -rerankCutoff 20 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v1.5 \
+  -index indexes/lucene-index.car-paragraphCorpus.v1.5/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v1.5.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v1.5.ql.topics.car17v1.5.benchmarkY1test.txt \
   -qld &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v1.5 \
+  -index indexes/lucene-index.car-paragraphCorpus.v1.5/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v1.5.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v1.5.ql+rm3.topics.car17v1.5.benchmarkY1test.txt \
   -qld -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v1.5 \
+  -index indexes/lucene-index.car-paragraphCorpus.v1.5/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v1.5.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v1.5.ql+ax.topics.car17v1.5.benchmarkY1test.txt \
   -qld -axiom -axiom.deterministic -rerankCutoff 20 &
diff --git a/docs/regressions-car17v2.0-doc2query.md b/docs/regressions-car17v2.0-doc2query.md
index 0dd87a2e3e..21889cde9f 100644
--- a/docs/regressions-car17v2.0-doc2query.md
+++ b/docs/regressions-car17v2.0-doc2query.md
@@ -18,7 +18,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection JsonCollection \
   -input /path/to/car-paragraphCorpus.v2.0-doc2query \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query/ \
   -generator DefaultLuceneDocumentGenerator \
   -threads 30 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.car-paragraphCorpus.v2.0-doc2query &
@@ -41,37 +41,37 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0-doc2query.bm25.topics.car17v2.0.benchmarkY1test.txt \
   -bm25 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0-doc2query.bm25+rm3.topics.car17v2.0.benchmarkY1test.txt \
   -bm25 -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0-doc2query.bm25+ax.topics.car17v2.0.benchmarkY1test.txt \
   -bm25 -axiom -axiom.deterministic -rerankCutoff 20 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0-doc2query.ql.topics.car17v2.0.benchmarkY1test.txt \
   -qld &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0-doc2query.ql+rm3.topics.car17v2.0.benchmarkY1test.txt \
   -qld -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0-doc2query.ql+ax.topics.car17v2.0.benchmarkY1test.txt \
   -qld -axiom -axiom.deterministic -rerankCutoff 20 &
diff --git a/docs/regressions-car17v2.0.md b/docs/regressions-car17v2.0.md
index 4ece511e3a..176f6e307f 100644
--- a/docs/regressions-car17v2.0.md
+++ b/docs/regressions-car17v2.0.md
@@ -12,7 +12,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection CarCollection \
   -input /path/to/car-paragraphCorpus.v2.0 \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0 \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0/ \
   -generator DefaultLuceneDocumentGenerator \
   -threads 1 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.car-paragraphCorpus.v2.0 &
@@ -35,37 +35,37 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0 \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0.bm25.topics.car17v2.0.benchmarkY1test.txt \
   -bm25 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0 \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0.bm25+rm3.topics.car17v2.0.benchmarkY1test.txt \
   -bm25 -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0 \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0.bm25+ax.topics.car17v2.0.benchmarkY1test.txt \
   -bm25 -axiom -axiom.deterministic -rerankCutoff 20 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0 \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0.ql.topics.car17v2.0.benchmarkY1test.txt \
   -qld &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0 \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0.ql+rm3.topics.car17v2.0.benchmarkY1test.txt \
   -qld -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.car-paragraphCorpus.v2.0 \
+  -index indexes/lucene-index.car-paragraphCorpus.v2.0/ \
   -topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt -topicreader Car \
   -output runs/run.car-paragraphCorpus.v2.0.ql+ax.topics.car17v2.0.benchmarkY1test.txt \
   -qld -axiom -axiom.deterministic -rerankCutoff 20 &
diff --git a/docs/regressions-clef06-fr.md b/docs/regressions-clef06-fr.md
index 988de66e5a..e50ee2cd56 100644
--- a/docs/regressions-clef06-fr.md
+++ b/docs/regressions-clef06-fr.md
@@ -14,7 +14,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection JsonCollection \
   -input /path/to/clef06-fr \
-  -index indexes/lucene-index.clef06-fr \
+  -index indexes/lucene-index.clef06-fr/ \
   -generator DefaultLuceneDocumentGenerator \
   -threads 16 -storePositions -storeDocvectors -storeRaw -language fr \
   >& logs/log.clef06-fr &
@@ -37,7 +37,7 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.clef06-fr \
+  -index indexes/lucene-index.clef06-fr/ \
   -topics src/main/resources/topics-and-qrels/topics.clef06fr.mono.fr.txt -topicreader TsvString \
   -output runs/run.clef06-fr.bm25.topics.clef06fr.mono.fr.txt \
   -bm25 -language fr &
diff --git a/docs/regressions-core17.md b/docs/regressions-core17.md
index 364e516bb2..39fb774772 100644
--- a/docs/regressions-core17.md
+++ b/docs/regressions-core17.md
@@ -12,7 +12,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection NewYorkTimesCollection \
   -input /path/to/nyt \
-  -index indexes/lucene-index.nyt \
+  -index indexes/lucene-index.nyt/ \
   -generator DefaultLuceneDocumentGenerator \
   -threads 16 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.nyt &
@@ -34,37 +34,37 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.nyt \
+  -index indexes/lucene-index.nyt/ \
   -topics src/main/resources/topics-and-qrels/topics.core17.txt -topicreader Trec \
   -output runs/run.nyt.bm25.topics.core17.txt \
   -bm25 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.nyt \
+  -index indexes/lucene-index.nyt/ \
   -topics src/main/resources/topics-and-qrels/topics.core17.txt -topicreader Trec \
   -output runs/run.nyt.bm25+rm3.topics.core17.txt \
   -bm25 -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.nyt \
+  -index indexes/lucene-index.nyt/ \
   -topics src/main/resources/topics-and-qrels/topics.core17.txt -topicreader Trec \
   -output runs/run.nyt.bm25+ax.topics.core17.txt \
   -bm25 -axiom -axiom.deterministic -rerankCutoff 20 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.nyt \
+  -index indexes/lucene-index.nyt/ \
   -topics src/main/resources/topics-and-qrels/topics.core17.txt -topicreader Trec \
   -output runs/run.nyt.ql.topics.core17.txt \
   -qld &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.nyt \
+  -index indexes/lucene-index.nyt/ \
   -topics src/main/resources/topics-and-qrels/topics.core17.txt -topicreader Trec \
   -output runs/run.nyt.ql+rm3.topics.core17.txt \
   -qld -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.nyt \
+  -index indexes/lucene-index.nyt/ \
   -topics src/main/resources/topics-and-qrels/topics.core17.txt -topicreader Trec \
   -output runs/run.nyt.ql+ax.topics.core17.txt \
   -qld -axiom -axiom.deterministic -rerankCutoff 20 &
diff --git a/docs/regressions-core18.md b/docs/regressions-core18.md
index 01bdf605a5..29b08d8eae 100644
--- a/docs/regressions-core18.md
+++ b/docs/regressions-core18.md
@@ -12,7 +12,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection WashingtonPostCollection \
   -input /path/to/wapo.v2 \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -generator WashingtonPostGenerator \
   -threads 1 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.wapo.v2 &
@@ -34,37 +34,37 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.core18.txt -topicreader Trec \
   -output runs/run.wapo.v2.bm25.topics.core18.txt \
   -bm25 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.core18.txt -topicreader Trec \
   -output runs/run.wapo.v2.bm25+rm3.topics.core18.txt \
   -bm25 -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.core18.txt -topicreader Trec \
   -output runs/run.wapo.v2.bm25+ax.topics.core18.txt \
   -bm25 -axiom -axiom.deterministic -rerankCutoff 20 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.core18.txt -topicreader Trec \
   -output runs/run.wapo.v2.ql.topics.core18.txt \
   -qld &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.core18.txt -topicreader Trec \
   -output runs/run.wapo.v2.ql+rm3.topics.core18.txt \
   -qld -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.wapo.v2 \
+  -index indexes/lucene-index.wapo.v2/ \
   -topics src/main/resources/topics-and-qrels/topics.core18.txt -topicreader Trec \
   -output runs/run.wapo.v2.ql+ax.topics.core18.txt \
   -qld -axiom -axiom.deterministic -rerankCutoff 20 &
diff --git a/docs/regressions-cw09b.md b/docs/regressions-cw09b.md
index 070574c23c..44bb5f5b1f 100644
--- a/docs/regressions-cw09b.md
+++ b/docs/regressions-cw09b.md
@@ -12,7 +12,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection ClueWeb09Collection \
   -input /path/to/cw09b \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -generator DefaultLuceneDocumentGenerator \
   -threads 44 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.cw09b &
@@ -39,97 +39,97 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.51-100.txt -topicreader Webxml \
   -output runs/run.cw09b.bm25.topics.web.51-100.txt \
   -bm25 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.101-150.txt -topicreader Webxml \
   -output runs/run.cw09b.bm25.topics.web.101-150.txt \
   -bm25 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.151-200.txt -topicreader Webxml \
   -output runs/run.cw09b.bm25.topics.web.151-200.txt \
   -bm25 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.51-100.txt -topicreader Webxml \
   -output runs/run.cw09b.bm25+rm3.topics.web.51-100.txt \
   -bm25 -rm3 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.101-150.txt -topicreader Webxml \
   -output runs/run.cw09b.bm25+rm3.topics.web.101-150.txt \
   -bm25 -rm3 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.151-200.txt -topicreader Webxml \
   -output runs/run.cw09b.bm25+rm3.topics.web.151-200.txt \
   -bm25 -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.51-100.txt -topicreader Webxml \
   -output runs/run.cw09b.bm25+ax.topics.web.51-100.txt \
   -bm25 -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.101-150.txt -topicreader Webxml \
   -output runs/run.cw09b.bm25+ax.topics.web.101-150.txt \
   -bm25 -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.151-200.txt -topicreader Webxml \
   -output runs/run.cw09b.bm25+ax.topics.web.151-200.txt \
   -bm25 -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.51-100.txt -topicreader Webxml \
   -output runs/run.cw09b.ql.topics.web.51-100.txt \
   -qld &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.101-150.txt -topicreader Webxml \
   -output runs/run.cw09b.ql.topics.web.101-150.txt \
   -qld &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.151-200.txt -topicreader Webxml \
   -output runs/run.cw09b.ql.topics.web.151-200.txt \
   -qld &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.51-100.txt -topicreader Webxml \
   -output runs/run.cw09b.ql+rm3.topics.web.51-100.txt \
   -qld -rm3 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.101-150.txt -topicreader Webxml \
   -output runs/run.cw09b.ql+rm3.topics.web.101-150.txt \
   -qld -rm3 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.151-200.txt -topicreader Webxml \
   -output runs/run.cw09b.ql+rm3.topics.web.151-200.txt \
   -qld -rm3 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.51-100.txt -topicreader Webxml \
   -output runs/run.cw09b.ql+ax.topics.web.51-100.txt \
   -qld -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.101-150.txt -topicreader Webxml \
   -output runs/run.cw09b.ql+ax.topics.web.101-150.txt \
   -qld -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw09b \
+  -index indexes/lucene-index.cw09b/ \
   -topics src/main/resources/topics-and-qrels/topics.web.151-200.txt -topicreader Webxml \
   -output runs/run.cw09b.ql+ax.topics.web.151-200.txt \
   -qld -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 &
diff --git a/docs/regressions-cw12.md b/docs/regressions-cw12.md
index 1497c025af..be46e58f81 100644
--- a/docs/regressions-cw12.md
+++ b/docs/regressions-cw12.md
@@ -12,7 +12,7 @@ Typical indexing command:
 target/appassembler/bin/IndexCollection \
   -collection ClueWeb12Collection \
   -input /path/to/cw12 \
-  -index indexes/lucene-index.cw12 \
+  -index indexes/lucene-index.cw12/ \
   -generator DefaultLuceneDocumentGenerator \
   -threads 44 -storePositions -storeDocvectors -storeRaw \
   >& logs/log.cw12 &
@@ -35,45 +35,45 @@ After indexing has completed, you should be able to perform retrieval as follows
 
 ```
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw12 \
+  -index indexes/lucene-index.cw12/ \
   -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \
   -output runs/run.cw12.bm25.topics.web.201-250.txt \
   -bm25 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw12 \
+  -index indexes/lucene-index.cw12/ \
   -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt -topicreader Webxml \
   -output runs/run.cw12.bm25.topics.web.251-300.txt \
   -bm25 &
 
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw12 \
+  -index indexes/lucene-index.cw12/ \
   -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \
   -output runs/run.cw12.bm25+rm3.topics.web.201-250.txt \
   -bm25 -rm3 &
 target/appassembler/bin/SearchCollection \
-  -index indexes/lucene-index.cw12 \
+  -index indexes/lucene-index.cw12/ \
   -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt -topicreader Webxml \
   -output runs/run.cw12.bm25+rm3.topics.web.251-300.txt \
   -bm25 -rm3 &
 
target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12 \ + -index indexes/lucene-index.cw12/ \ -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \ -output runs/run.cw12.ql.topics.web.201-250.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12 \ + -index indexes/lucene-index.cw12/ \ -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt -topicreader Webxml \ -output runs/run.cw12.ql.topics.web.251-300.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12 \ + -index indexes/lucene-index.cw12/ \ -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \ -output runs/run.cw12.ql+rm3.topics.web.201-250.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12 \ + -index indexes/lucene-index.cw12/ \ -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt -topicreader Webxml \ -output runs/run.cw12.ql+rm3.topics.web.251-300.txt \ -qld -rm3 & diff --git a/docs/regressions-cw12b13.md b/docs/regressions-cw12b13.md index eef15ab17f..a00d7b1550 100644 --- a/docs/regressions-cw12b13.md +++ b/docs/regressions-cw12b13.md @@ -12,7 +12,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection ClueWeb12Collection \ -input /path/to/cw12b13 \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -generator DefaultLuceneDocumentGenerator \ -threads 44 -storePositions -storeDocvectors -storeRaw \ >& logs/log.cw12b13 & @@ -35,67 +35,67 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \ -output runs/run.cw12b13.bm25.topics.web.201-250.txt \ -bm25 & 
target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt -topicreader Webxml \ -output runs/run.cw12b13.bm25.topics.web.251-300.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \ -output runs/run.cw12b13.bm25+rm3.topics.web.201-250.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt -topicreader Webxml \ -output runs/run.cw12b13.bm25+rm3.topics.web.251-300.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \ -output runs/run.cw12b13.bm25+ax.topics.web.201-250.txt \ -bm25 -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt -topicreader Webxml \ -output runs/run.cw12b13.bm25+ax.topics.web.251-300.txt \ -bm25 -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \ -output runs/run.cw12b13.ql.topics.web.201-250.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt 
-topicreader Webxml \ -output runs/run.cw12b13.ql.topics.web.251-300.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \ -output runs/run.cw12b13.ql+rm3.topics.web.201-250.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt -topicreader Webxml \ -output runs/run.cw12b13.ql+rm3.topics.web.251-300.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.201-250.txt -topicreader Webxml \ -output runs/run.cw12b13.ql+ax.topics.web.201-250.txt \ -qld -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.cw12b13 \ + -index indexes/lucene-index.cw12b13/ \ -topics src/main/resources/topics-and-qrels/topics.web.251-300.txt -topicreader Webxml \ -output runs/run.cw12b13.ql+ax.topics.web.251-300.txt \ -qld -axiom -axiom.deterministic -axiom.beta 0.1 -rerankCutoff 20 & diff --git a/docs/regressions-disk12.md b/docs/regressions-disk12.md index 4a3daa9bb3..7bcd7a61d1 100644 --- a/docs/regressions-disk12.md +++ b/docs/regressions-disk12.md @@ -12,7 +12,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection TrecCollection \ -input /path/to/disk12 \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -storePositions -storeDocvectors -storeRaw \ >& logs/log.disk12 & @@ -37,97 +37,97 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index 
indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt -topicreader Trec \ -output runs/run.disk12.bm25.topics.adhoc.51-100.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt -topicreader Trec \ -output runs/run.disk12.bm25.topics.adhoc.101-150.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt -topicreader Trec \ -output runs/run.disk12.bm25.topics.adhoc.151-200.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt -topicreader Trec \ -output runs/run.disk12.bm25+rm3.topics.adhoc.51-100.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt -topicreader Trec \ -output runs/run.disk12.bm25+rm3.topics.adhoc.101-150.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt -topicreader Trec \ -output runs/run.disk12.bm25+rm3.topics.adhoc.151-200.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt -topicreader Trec \ -output runs/run.disk12.bm25+ax.topics.adhoc.51-100.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index 
indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt -topicreader Trec \ -output runs/run.disk12.bm25+ax.topics.adhoc.101-150.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt -topicreader Trec \ -output runs/run.disk12.bm25+ax.topics.adhoc.151-200.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt -topicreader Trec \ -output runs/run.disk12.ql.topics.adhoc.51-100.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt -topicreader Trec \ -output runs/run.disk12.ql.topics.adhoc.101-150.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt -topicreader Trec \ -output runs/run.disk12.ql.topics.adhoc.151-200.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt -topicreader Trec \ -output runs/run.disk12.ql+rm3.topics.adhoc.51-100.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt -topicreader Trec \ -output runs/run.disk12.ql+rm3.topics.adhoc.101-150.txt \ -qld -rm3 & 
target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt -topicreader Trec \ -output runs/run.disk12.ql+rm3.topics.adhoc.151-200.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt -topicreader Trec \ -output runs/run.disk12.ql+ax.topics.adhoc.51-100.txt \ -qld -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt -topicreader Trec \ -output runs/run.disk12.ql+ax.topics.adhoc.101-150.txt \ -qld -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk12 \ + -index indexes/lucene-index.disk12/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt -topicreader Trec \ -output runs/run.disk12.ql+ax.topics.adhoc.151-200.txt \ -qld -axiom -axiom.deterministic -rerankCutoff 20 & diff --git a/docs/regressions-disk45.md b/docs/regressions-disk45.md index d4d34013a4..a0bc17c295 100644 --- a/docs/regressions-disk45.md +++ b/docs/regressions-disk45.md @@ -12,7 +12,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection TrecCollection \ -input /path/to/disk45 \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -storePositions -storeDocvectors -storeRaw \ >& logs/log.disk45 & @@ -36,97 +36,97 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics 
src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt -topicreader Trec \ -output runs/run.disk45.bm25.topics.adhoc.351-400.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt -topicreader Trec \ -output runs/run.disk45.bm25.topics.adhoc.401-450.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.robust04.txt -topicreader Trec \ -output runs/run.disk45.bm25.topics.robust04.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt -topicreader Trec \ -output runs/run.disk45.bm25+rm3.topics.adhoc.351-400.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt -topicreader Trec \ -output runs/run.disk45.bm25+rm3.topics.adhoc.401-450.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.robust04.txt -topicreader Trec \ -output runs/run.disk45.bm25+rm3.topics.robust04.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt -topicreader Trec \ -output runs/run.disk45.bm25+ax.topics.adhoc.351-400.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics 
src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt -topicreader Trec \ -output runs/run.disk45.bm25+ax.topics.adhoc.401-450.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.robust04.txt -topicreader Trec \ -output runs/run.disk45.bm25+ax.topics.robust04.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt -topicreader Trec \ -output runs/run.disk45.ql.topics.adhoc.351-400.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt -topicreader Trec \ -output runs/run.disk45.ql.topics.adhoc.401-450.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.robust04.txt -topicreader Trec \ -output runs/run.disk45.ql.topics.robust04.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt -topicreader Trec \ -output runs/run.disk45.ql+rm3.topics.adhoc.351-400.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt -topicreader Trec \ -output runs/run.disk45.ql+rm3.topics.adhoc.401-450.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ 
-topics src/main/resources/topics-and-qrels/topics.robust04.txt -topicreader Trec \ -output runs/run.disk45.ql+rm3.topics.robust04.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt -topicreader Trec \ -output runs/run.disk45.ql+ax.topics.adhoc.351-400.txt \ -qld -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt -topicreader Trec \ -output runs/run.disk45.ql+ax.topics.adhoc.401-450.txt \ -qld -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.disk45 \ + -index indexes/lucene-index.disk45/ \ -topics src/main/resources/topics-and-qrels/topics.robust04.txt -topicreader Trec \ -output runs/run.disk45.ql+ax.topics.robust04.txt \ -qld -axiom -axiom.deterministic -rerankCutoff 20 & diff --git a/docs/regressions-dl19-doc-docTTTTTquery-per-doc.md b/docs/regressions-dl19-doc-docTTTTTquery.md similarity index 69% rename from docs/regressions-dl19-doc-docTTTTTquery-per-doc.md rename to docs/regressions-dl19-doc-docTTTTTquery.md index 8ff81c8743..704e4309b9 100644 --- a/docs/regressions-dl19-doc-docTTTTTquery-per-doc.md +++ b/docs/regressions-dl19-doc-docTTTTTquery.md @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) w/ per-doc docTTTTTquery +# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) w/ docTTTTTquery This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2019 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. 
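Reviewer note: the disk45 retrieval commands touched above are a regular grid of six ranking models over three topic files, so the trailing-slash change to `-index` can be checked mechanically. The following is a minimal sketch, not part of Anserini's docgen pipeline; `build_commands`, `TOPICS`, and `MODELS` are hypothetical names, and the flags are copied from the commands in this diff.

```python
from itertools import product

# Index path as it appears on the "+" side of this diff (note the trailing slash).
INDEX = "indexes/lucene-index.disk45/"

# Topic files and model flags copied from the disk45 commands in the diff.
TOPICS = [
    "topics.adhoc.351-400.txt",
    "topics.adhoc.401-450.txt",
    "topics.robust04.txt",
]
MODELS = {
    "bm25": "-bm25",
    "bm25+rm3": "-bm25 -rm3",
    "bm25+ax": "-bm25 -axiom -axiom.deterministic -rerankCutoff 20",
    "ql": "-qld",
    "ql+rm3": "-qld -rm3",
    "ql+ax": "-qld -axiom -axiom.deterministic -rerankCutoff 20",
}


def build_commands(index=INDEX, topics=TOPICS, models=MODELS):
    """Return one SearchCollection command string per (model, topics) pair."""
    commands = []
    # Models vary slowest, topics fastest, matching the order in the docs.
    for (name, flags), topic in product(models.items(), topics):
        commands.append(
            "target/appassembler/bin/SearchCollection "
            f"-index {index} "
            f"-topics src/main/resources/topics-and-qrels/{topic} -topicreader Trec "
            f"-output runs/run.disk45.{name}.{topic} "
            f"{flags} &"
        )
    return commands


if __name__ == "__main__":
    for command in build_commands():
        print(command)
```

Generating the grid this way makes it easy to confirm that every command uses the same index path and that the output file name stays in sync with the model and topic file.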
@@ -10,10 +10,14 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl19-doc-docTTTTTquery-per-doc.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl19-doc-docTTTTTquery-per-doc.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl19-doc-docTTTTTquery.yaml). +Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl19-doc-docTTTTTquery.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. + +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. 
## Indexing @@ -22,14 +26,15 @@ Typical indexing command: ``` target/appassembler/bin/IndexCollection \ -collection JsonCollection \ - -input /path/to/msmarco-doc-docTTTTTquery-per-doc \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -input /path/to/msmarco-doc-docTTTTTquery \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-docTTTTTquery-per-doc & + -threads 7 -storePositions -storeDocvectors -storeRaw \ + >& logs/log.msmarco-doc-docTTTTTquery & ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-docTTTTTquery/` should be a directory containing the expanded document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
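Reviewer note: the new wording says the input directory holds the expanded corpus "in Anserini's jsonl format". As a rough illustration of what one record might look like, here is a sketch that assumes the usual `JsonCollection` convention of one JSON object per line with `id` and `contents` fields. How the real corpus joins the document text with its doc2query-T5 predictions is an assumption here (a newline followed by space-joined queries); the linked details page is the authority. `to_jsonl_line` and the sample values are hypothetical.

```python
import json


def to_jsonl_line(doc_id, text, predicted_queries):
    """Serialize one expanded document as a single JSON line.

    Assumption: expansions are appended to the original text; the real
    corpus preparation may differ.
    """
    contents = text + "\n" + " ".join(predicted_queries)
    return json.dumps({"id": doc_id, "contents": contents})


if __name__ == "__main__":
    # Hypothetical document id, text, and predicted queries.
    line = to_jsonl_line(
        "D0000001",
        "Example document text.",
        ["what is an example document", "example document meaning"],
    )
    print(line)
```

Each call yields exactly one line, so writing the returned strings to a file with a newline after each produces a file that `-collection JsonCollection` can be pointed at, under the assumptions above.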
@@ -43,40 +48,40 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-default.topics.dl19-doc.txt \ -bm25 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default+rm3.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-default+rm3.topics.dl19-doc.txt \ -bm25 -rm3 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-tuned.topics.dl19-doc.txt \ -bm25 -bm25.k1 4.68 -bm25.b 0.87 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned+rm3.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-tuned+rm3.topics.dl19-doc.txt \ -bm25 -bm25.k1 4.68 -bm25.b 0.87 -rm3 -hits 100 & ``` Evaluation can be performed using `trec_eval`: ``` 
-tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery.bm25-default.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default+rm3.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery.bm25-default+rm3.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery.bm25-tuned.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned+rm3.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery.bm25-tuned+rm3.topics.dl19-doc.txt ``` ## Effectiveness @@ -85,7 +90,7 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| -[DL19 
(Doc)](https://trec.nist.gov/data/deep2019.html)| 0.2699 | 0.3044 | 0.2620 | 0.2812 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.2700 | 0.3045 | 0.2620 | 0.2814 | R@100 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | @@ -95,7 +100,7 @@ R@100 | BM25 (default)| +RM3 | BM25 (tune nDCG@10 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.5968 | 0.5895 | 0.5967 | 0.6075 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.5968 | 0.5897 | 0.5972 | 0.6080 | Explanation of settings: diff --git a/docs/regressions-dl19-doc-docTTTTTquery-per-passage.md b/docs/regressions-dl19-doc-segmented-docTTTTTquery.md similarity index 63% rename from docs/regressions-dl19-doc-docTTTTTquery-per-passage.md rename to docs/regressions-dl19-doc-segmented-docTTTTTquery.md index c9fafd645b..7503748baf 100644 --- a/docs/regressions-dl19-doc-docTTTTTquery-per-passage.md +++ b/docs/regressions-dl19-doc-segmented-docTTTTTquery.md @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) w/ per-passage docTTTTTquery +# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) Segmented w/ docTTTTTquery This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2019 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. 
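Reviewer note: the page now warns that the rebuilt regressions yield "slightly different" scores. To put a number on that for the complete-doc docTTTTTquery tables changed above, the sketch below compares the old and new MAP and nDCG@10 rows as they appear in this diff. `max_abs_change` is a hypothetical helper, not part of the regression framework.

```python
# Scores copied from the "-" and "+" table rows in this diff, in column order:
# BM25 (default), +RM3, BM25 (tuned), +RM3.
OLD = {
    "MAP": [0.2699, 0.3044, 0.2620, 0.2812],
    "nDCG@10": [0.5968, 0.5895, 0.5967, 0.6075],
}
NEW = {
    "MAP": [0.2700, 0.3045, 0.2620, 0.2814],
    "nDCG@10": [0.5968, 0.5897, 0.5972, 0.6080],
}


def max_abs_change(old, new):
    """Largest absolute per-column change for each metric, rounded to 4 places."""
    return {
        metric: round(max(abs(n - o) for o, n in zip(old[metric], new[metric])), 4)
        for metric in old
    }


if __name__ == "__main__":
    print(max_abs_change(OLD, NEW))
```

For these two tables the largest shift is in the fourth decimal place, which is consistent with the "slightly different" wording.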
@@ -10,11 +10,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl19-doc-docTTTTTquery-per-passage.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl19-doc-docTTTTTquery-per-passage.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl19-doc-segmented-docTTTTTquery.yaml). +Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl19-doc-segmented-docTTTTTquery.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. + +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). 
+As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. ## Indexing @@ -23,14 +27,15 @@ Typical indexing command: ``` target/appassembler/bin/IndexCollection \ -collection JsonCollection \ - -input /path/to/msmarco-doc-docTTTTTquery-per-passage \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -input /path/to/msmarco-doc-segmented-docTTTTTquery \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-docTTTTTquery-per-passage & + -threads 16 -storePositions -storeDocvectors -storeRaw \ + >& logs/log.msmarco-doc-segmented-docTTTTTquery & ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented-docTTTTTquery/` should be a directory containing the expanded segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
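Reviewer note: the MaxP description for the segmented condition (the score of a document is the score of its highest-scoring passage) can be illustrated with a short sketch. This is not Anserini's implementation of `-selectMaxPassage`; it only mirrors the idea, using the `"#"` delimiter and the 100-document cutoff that appear in this page's retrieval commands. The segment id shape (`D1#0`) and the name `max_passage` are assumptions for illustration.

```python
def max_passage(hits, delimiter="#", k=100):
    """Collapse ranked segment hits into a document ranking (MaxP).

    hits: iterable of (segment_id, score) pairs, where segment_id is assumed
    to look like "<docid><delimiter><segment number>".
    Returns up to k (doc_id, score) pairs, best score first.
    """
    best = {}
    for segment_id, score in hits:
        doc_id = segment_id.split(delimiter, 1)[0]
        # Keep only the highest segment score seen for each document.
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical segment hits for three documents.
    segment_hits = [("D1#0", 3.2), ("D2#1", 2.9), ("D1#4", 4.1), ("D3#0", 1.0)]
    print(max_passage(segment_hits))
```

Retrieving many more segments than the final number of documents (the commands use `-hits 10000` with `-selectMaxPassage.hits 100`) leaves room for several segments of the same document to collapse into one entry while still filling the final ranking.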
@@ -44,40 +49,40 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default.topics.dl19-doc.txt \ -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default+rm3.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default+rm3.topics.dl19-doc.txt \ -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned.topics.dl19-doc.txt \ -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics 
src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned+rm3.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned+rm3.topics.dl19-doc.txt \ -bm25 -bm25.k1 2.56 -bm25.b 0.59 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & ``` Evaluation can be performed using `trec_eval`: ``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default+rm3.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default+rm3.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned+rm3.topics.dl19-doc.txt 
+tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned+rm3.topics.dl19-doc.txt ``` ## Effectiveness @@ -86,17 +91,17 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.2791 | 0.3025 | 0.2655 | 0.2895 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.2798 | 0.3021 | 0.2658 | 0.2893 | R@100 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.4092 | 0.4394 | 0.4020 | 0.4235 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.4093 | 0.4392 | 0.4026 | 0.4237 | nDCG@10 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.6099 | 0.6318 | 0.6271 | 0.6256 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.6119 | 0.6297 | 0.6273 | 0.6239 | Explanation of settings: diff --git a/docs/regressions-dl19-doc-per-passage.md b/docs/regressions-dl19-doc-segmented.md similarity index 67% rename from docs/regressions-dl19-doc-per-passage.md rename to docs/regressions-dl19-doc-segmented.md index 257d8bd76e..ad90df2ce3 100644 --- a/docs/regressions-dl19-doc-per-passage.md +++ b/docs/regressions-dl19-doc-segmented.md @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) +# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) Segmented This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2019 
Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. @@ -10,11 +10,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** none -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl19-doc-per-passage.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl19-doc-per-passage.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl19-doc-segmented.yaml). +Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl19-doc-segmented.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. 
+ +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. ## Indexing @@ -23,14 +27,15 @@ Typical indexing command: ``` target/appassembler/bin/IndexCollection \ -collection JsonCollection \ - -input /path/to/msmarco-doc-per-passage \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -input /path/to/msmarco-doc-segmented \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-per-passage & + -threads 16 -storePositions -storeDocvectors -storeRaw \ + >& logs/log.msmarco-doc-segmented & ``` -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented/` should be a directory containing the segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
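The MaxP step described for the segmented condition (take the highest-scoring passage of each document as that document's score) can be illustrated with a minimal Python sketch. This is not Anserini's implementation. It assumes segment ids of the form `<docid>#<segment index>`, which matches the `#` delimiter passed to `-selectMaxPassage.delimiter` in the retrieval commands, and the names `max_passage`, `hits`, and `max_docs` are hypothetical.

```python
def max_passage(hits, delimiter="#", max_docs=100):
    """Collapse scored segment hits into a document ranking (MaxP).

    hits: iterable of (segment_id, score) pairs. Each segment_id is assumed
    to look like "<docid><delimiter><segment index>", e.g. "D1#0".
    Returns at most max_docs (docid, score) pairs, best score first.
    """
    best = {}
    for segment_id, score in hits:
        # Everything before the first delimiter is taken as the document id.
        docid = segment_id.split(delimiter, 1)[0]
        # Keep only the highest passage score seen for each document.
        if docid not in best or score > best[docid]:
            best[docid] = score
    # Sort by descending score; break ties by docid so the output is deterministic.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:max_docs]


# Hypothetical segment hits for one query.
example_hits = [
    ("D1#0", 7.2),
    ("D2#3", 6.9),
    ("D1#2", 6.5),
    ("D3#1", 5.0),
    ("D2#0", 7.5),
]
example_ranking = max_passage(example_hits, max_docs=2)
```

Under this reading, retrieving many segments (`-hits 10000`) before collapsing to a short document list (`-selectMaxPassage.hits 100`) leaves room for several segments of the same document to appear among the raw hits without shrinking the final ranking.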
@@ -44,72 +49,72 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default.topics.dl19-doc.txt \ -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default+rm3.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default+rm3.topics.dl19-doc.txt \ -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default+ax.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default+ax.topics.dl19-doc.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default+prf.topics.dl19-doc.txt \ + -output 
runs/run.msmarco-doc-segmented.bm25-default+prf.topics.dl19-doc.txt \ -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned.topics.dl19-doc.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned+rm3.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned+rm3.topics.dl19-doc.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned+ax.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned+ax.topics.dl19-doc.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt 
-topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned+prf.topics.dl19-doc.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned+prf.topics.dl19-doc.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & ``` Evaluation can be performed using `trec_eval`: ``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-per-passage.bm25-default.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented.bm25-default.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-per-passage.bm25-default+rm3.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented.bm25-default+rm3.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-per-passage.bm25-default+ax.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented.bm25-default+ax.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-per-passage.bm25-default+prf.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt 
runs/run.msmarco-doc-segmented.bm25-default+prf.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-per-passage.bm25-tuned.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented.bm25-tuned.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-per-passage.bm25-tuned+rm3.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented.bm25-tuned+rm3.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-per-passage.bm25-tuned+ax.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented.bm25-tuned+ax.topics.dl19-doc.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-per-passage.bm25-tuned+prf.topics.dl19-doc.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.dl19-doc.txt runs/run.msmarco-doc-segmented.bm25-tuned+prf.topics.dl19-doc.txt ``` ## Effectiveness @@ -118,17 +123,17 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | 
:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.2441 | 0.2880 | 0.3015 | 0.2821 | 0.2394 | 0.2656 | 0.2934 | 0.2838 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.2449 | 0.2884 | 0.2981 | 0.2827 | 0.2398 | 0.2658 | 0.2975 | 0.2828 | R@100 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.3840 | 0.4356 | 0.4501 | 0.4477 | 0.3903 | 0.4126 | 0.4437 | 0.4362 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.3840 | 0.4355 | 0.4490 | 0.4476 | 0.3903 | 0.4133 | 0.4491 | 0.4361 | nDCG@10 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.5276 | 0.5750 | 0.5590 | 0.5591 | 0.5364 | 0.5379 | 0.5546 | 0.5478 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.5302 | 0.5764 | 0.5556 | 0.5599 | 0.5389 | 0.5405 | 0.5574 | 0.5476 | Explanation of settings: diff --git a/docs/regressions-dl19-doc.md b/docs/regressions-dl19-doc.md index cd8ef510a6..071445ed05 100644 --- a/docs/regressions-dl19-doc.md +++ b/docs/regressions-dl19-doc.md @@ -10,26 +10,31 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** none -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. 
+All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl19-doc.yaml). Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl19-doc.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: ``` target/appassembler/bin/IndexCollection \ - -collection CleanTrecCollection \ + -collection JsonCollection \ -input /path/to/msmarco-doc \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ + -threads 7 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-doc & ``` -The directory `/path/to/msmarco-doc/` should be a directory containing the official document collection (a single file), in TREC format. +The directory `/path/to/msmarco-doc/` should be a directory containing the document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
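As a rough illustration of the jsonl input that `-collection JsonCollection` reads, the sketch below writes one JSON object per line. It assumes the usual `id` and `contents` field names. The helper names and the sample document are hypothetical, and the linked corpus preparation page remains the authoritative source for how the actual files are built.

```python
import json


def to_jsonl_line(docid, text):
    """Serialize one document as a single-line JSON object."""
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return json.dumps({"id": docid, "contents": text}, ensure_ascii=False)


def write_corpus(docs, path):
    """Write (docid, text) pairs to a jsonl file, one document per line."""
    with open(path, "w", encoding="utf-8") as out:
        for docid, text in docs:
            out.write(to_jsonl_line(docid, text) + "\n")


# Hypothetical document; real ids and text come from the prepared corpus.
sample_line = to_jsonl_line("D0000001", "title line\nbody text")
```

The directory passed to `-input` would then hold one or more files written this way.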
@@ -43,49 +48,49 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-default.topics.dl19-doc.txt \ -bm25 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-default+rm3.topics.dl19-doc.txt \ -bm25 -rm3 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-default+ax.topics.dl19-doc.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-default+prf.topics.dl19-doc.txt \ -bm25 -bm25prf -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-tuned.topics.dl19-doc.txt \ -bm25 -bm25.k1 3.44 -bm25.b 0.87 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-tuned+rm3.topics.dl19-doc.txt \ -bm25 -bm25.k1 
3.44 -bm25.b 0.87 -rm3 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-tuned+ax.topics.dl19-doc.txt \ -bm25 -bm25.k1 3.44 -bm25.b 0.87 -axiom -axiom.deterministic -rerankCutoff 20 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-doc.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-tuned+prf.topics.dl19-doc.txt \ -bm25 -bm25.k1 3.44 -bm25.b 0.87 -bm25prf -hits 100 & @@ -117,17 +122,17 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.2443 | 0.2772 | 0.2452 | 0.2541 | 0.2318 | 0.2700 | 0.2816 | 0.2758 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.2434 | 0.2774 | 0.2454 | 0.2541 | 0.2311 | 0.2684 | 0.2792 | 0.2774 | R@100 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.3948 | 0.4189 | 0.3945 | 0.4004 | 0.3862 | 0.4193 | 0.4399 | 0.4287 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.3949 | 0.4189 | 0.3946 | 0.4003 | 0.3853 | 0.4186 | 0.4378 | 0.4295 | nDCG@10 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | 
:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.5190 | 0.5169 | 0.4730 | 0.5105 | 0.5140 | 0.5485 | 0.5245 | 0.5280 | +[DL19 (Doc)](https://trec.nist.gov/data/deep2019.html)| 0.5176 | 0.5170 | 0.4732 | 0.5107 | 0.5139 | 0.5445 | 0.5203 | 0.5294 | Explanation of settings: diff --git a/docs/regressions-dl19-passage-docTTTTTquery.md b/docs/regressions-dl19-passage-docTTTTTquery.md index fceebed0f0..98dd75084b 100644 --- a/docs/regressions-dl19-passage-docTTTTTquery.md +++ b/docs/regressions-dl19-passage-docTTTTTquery.md @@ -17,7 +17,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonCollection \ -input /path/to/msmarco-passage-docTTTTTquery \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-passage-docTTTTTquery & @@ -38,37 +38,37 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-default.topics.dl19-passage.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-default+rm3.topics.dl19-passage.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index 
indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned.topics.dl19-passage.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned+rm3.topics.dl19-passage.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned2.topics.dl19-passage.txt \ -bm25 -bm25.k1 2.18 -bm25.b 0.86 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned2+rm3.topics.dl19-passage.txt \ -bm25 -bm25.k1 2.18 -bm25.b 0.86 -rm3 & diff --git a/docs/regressions-dl19-passage.md b/docs/regressions-dl19-passage.md index 7589791834..aeb07f5201 100644 --- a/docs/regressions-dl19-passage.md +++ b/docs/regressions-dl19-passage.md @@ -16,7 +16,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonCollection \ -input /path/to/msmarco-passage \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -storePositions 
-storeDocvectors -storeRaw \ >& logs/log.msmarco-passage & @@ -37,49 +37,49 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default.topics.dl19-passage.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default+rm3.topics.dl19-passage.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default+ax.topics.dl19-passage.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default+prf.topics.dl19-passage.txt \ -bm25 -bm25prf & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned.topics.dl19-passage.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics 
src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned+rm3.topics.dl19-passage.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned+ax.topics.dl19-passage.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned+prf.topics.dl19-passage.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -bm25prf & diff --git a/docs/regressions-dl20-doc-docTTTTTquery-per-doc.md b/docs/regressions-dl20-doc-docTTTTTquery.md similarity index 71% rename from docs/regressions-dl20-doc-docTTTTTquery-per-doc.md rename to docs/regressions-dl20-doc-docTTTTTquery.md index 121e5d9de9..2a131f3d29 100644 --- a/docs/regressions-dl20-doc-docTTTTTquery-per-doc.md +++ b/docs/regressions-dl20-doc-docTTTTTquery.md @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) w/ per-doc docTTTTTquery +# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) w/ docTTTTTquery This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2020 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. 
@@ -10,10 +10,14 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl20-doc-docTTTTTquery-per-doc.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl20-doc-docTTTTTquery-per-doc.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl20-doc-docTTTTTquery.yaml). +Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl20-doc-docTTTTTquery.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. + +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. 
## Indexing @@ -22,14 +26,15 @@ Typical indexing command: ``` target/appassembler/bin/IndexCollection \ -collection JsonCollection \ - -input /path/to/msmarco-doc-docTTTTTquery-per-doc \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -input /path/to/msmarco-doc-docTTTTTquery \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-docTTTTTquery-per-doc & + -threads 7 -storePositions -storeDocvectors -storeRaw \ + >& logs/log.msmarco-doc-docTTTTTquery & ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-docTTTTTquery/` should be a directory containing the expanded document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
@@ -43,40 +48,40 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default.topics.dl20.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-default.topics.dl20.txt \ -bm25 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default+rm3.topics.dl20.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-default+rm3.topics.dl20.txt \ -bm25 -rm3 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned.topics.dl20.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-tuned.topics.dl20.txt \ -bm25 -bm25.k1 4.68 -bm25.b 0.87 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned+rm3.topics.dl20.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-tuned+rm3.topics.dl20.txt \ -bm25 -bm25.k1 4.68 -bm25.b 0.87 -rm3 -hits 100 & ``` Evaluation can be performed using `trec_eval`: ``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m 
ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery.bm25-default.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default+rm3.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery.bm25-default+rm3.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery.bm25-tuned.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned+rm3.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery.bm25-tuned+rm3.topics.dl20.txt ``` ## Effectiveness @@ -85,7 +90,7 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| 
-[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.4230 | 0.4228 | 0.4098 | 0.4104 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.4230 | 0.4229 | 0.4099 | 0.4104 | nDCG@10 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | @@ -100,7 +105,7 @@ MRR | BM25 (default)| +RM3 | BM25 (tune R@100 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.6412 | 0.6555 | 0.6178 | 0.6127 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.6414 | 0.6555 | 0.6178 | 0.6127 | Explanation of settings: diff --git a/docs/regressions-dl20-doc-docTTTTTquery-per-passage.md b/docs/regressions-dl20-doc-segmented-docTTTTTquery.md similarity index 66% rename from docs/regressions-dl20-doc-docTTTTTquery-per-passage.md rename to docs/regressions-dl20-doc-segmented-docTTTTTquery.md index 7b64bc21e5..d59268c9c4 100644 --- a/docs/regressions-dl20-doc-docTTTTTquery-per-passage.md +++ b/docs/regressions-dl20-doc-segmented-docTTTTTquery.md @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) w/ per-passage docTTTTTquery +# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) Segmented w/ docTTTTTquery This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2020 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. 
@@ -10,11 +10,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl20-doc-docTTTTTquery-per-passage.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl20-doc-docTTTTTquery-per-passage.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl20-doc-segmented-docTTTTTquery.yaml). +Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl20-doc-segmented-docTTTTTquery.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. + +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). 
+As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. ## Indexing @@ -23,14 +27,15 @@ Typical indexing command: ``` target/appassembler/bin/IndexCollection \ -collection JsonCollection \ - -input /path/to/msmarco-doc-docTTTTTquery-per-passage \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -input /path/to/msmarco-doc-segmented-docTTTTTquery \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-docTTTTTquery-per-passage & + -threads 16 -storePositions -storeDocvectors -storeRaw \ + >& logs/log.msmarco-doc-segmented-docTTTTTquery & ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented-docTTTTTquery/` should be a directory containing the expanded segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
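The MaxP step described above (score each document by its highest-scoring segment, then keep the top documents) can be sketched in a few lines of Python. This is an illustration only, not Anserini's implementation: the function name `max_passage`, the sample ids such as `D1#0`, and the assumption that a segment id has the form `<docid>#<segment index>` are hypothetical, chosen to mirror the `-selectMaxPassage.delimiter "#"` and `-selectMaxPassage.hits 100` options in the retrieval commands.

```python
def max_passage(hits, delimiter="#", max_docs=100):
    """Collapse (segment_id, score) hits into a document ranking via MaxP."""
    best = {}
    for segment_id, score in hits:
        # Assumed id layout: "<docid><delimiter><segment index>".
        docid = segment_id.split(delimiter, 1)[0]
        # Keep only the highest segment score seen for each document.
        if docid not in best or score > best[docid]:
            best[docid] = score
    # Rank documents by their best segment score, highest first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max_docs]


# Hypothetical segment-level hits for a single query.
hits = [("D1#0", 7.5), ("D2#3", 9.1), ("D1#4", 8.2), ("D3#1", 6.0), ("D2#0", 4.4)]
print(max_passage(hits, max_docs=2))  # [('D2', 9.1), ('D1', 8.2)]
```

Retrieving many more segments than the final cutoff (10000 versus 100 in the commands) presumably leaves enough distinct documents after collapsing to fill the final ranking.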
@@ -44,40 +49,40 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default.topics.dl20.txt \ -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default+rm3.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default+rm3.topics.dl20.txt \ -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned.topics.dl20.txt \ -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt 
-topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned+rm3.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned+rm3.topics.dl20.txt \ -bm25 -bm25.k1 2.56 -bm25.b 0.59 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & ``` Evaluation can be performed using `trec_eval`: ``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default+rm3.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default+rm3.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt 
runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned+rm3.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned+rm3.topics.dl20.txt ``` ## Effectiveness @@ -86,12 +91,12 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.4150 | 0.4269 | 0.4042 | 0.4023 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.4150 | 0.4268 | 0.4047 | 0.4025 | nDCG@10 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.5957 | 0.5848 | 0.5931 | 0.5723 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.5957 | 0.5850 | 0.5943 | 0.5724 | MRR | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | @@ -101,7 +106,7 @@ MRR | BM25 (default)| +RM3 | BM25 (tune R@100 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.6201 | 0.6443 | 0.6192 | 0.6392 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.6201 | 0.6443 | 0.6195 | 0.6394 | Explanation of settings: diff --git a/docs/regressions-dl20-doc-per-passage.md b/docs/regressions-dl20-doc-segmented.md similarity index 66% rename from docs/regressions-dl20-doc-per-passage.md rename to docs/regressions-dl20-doc-segmented.md index b587a3bf18..1ba926308f 100644 --- a/docs/regressions-dl20-doc-per-passage.md +++ b/docs/regressions-dl20-doc-segmented.md @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) +# 
Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) Segmented This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2020 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. @@ -10,11 +10,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** none -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl20-doc-per-passage.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl20-doc-per-passage.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl20-doc-segmented.yaml). 
+Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl20-doc-segmented.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. + +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. ## Indexing @@ -23,14 +27,15 @@ Typical indexing command: ``` target/appassembler/bin/IndexCollection \ -collection JsonCollection \ - -input /path/to/msmarco-doc-per-passage \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -input /path/to/msmarco-doc-segmented \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-per-passage & + -threads 16 -storePositions -storeDocvectors -storeRaw \ + >& logs/log.msmarco-doc-segmented & ``` -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented/` should be a directory containing the segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
@@ -44,72 +49,72 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default.topics.dl20.txt \ -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default+rm3.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default+rm3.topics.dl20.txt \ -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default+ax.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default+ax.topics.dl20.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default+prf.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default+prf.topics.dl20.txt \ 
-bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned.topics.dl20.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned+rm3.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned+rm3.topics.dl20.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned+ax.topics.dl20.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned+ax.topics.dl20.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned+prf.topics.dl20.txt \ + -output 
runs/run.msmarco-doc-segmented.bm25-tuned+prf.topics.dl20.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 & ``` Evaluation can be performed using `trec_eval`: ``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-per-passage.bm25-default.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented.bm25-default.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-per-passage.bm25-default+rm3.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented.bm25-default+rm3.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-per-passage.bm25-default+ax.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented.bm25-default+ax.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-per-passage.bm25-default+prf.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt 
runs/run.msmarco-doc-segmented.bm25-default+prf.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-per-passage.bm25-tuned.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented.bm25-tuned.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-per-passage.bm25-tuned+rm3.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented.bm25-tuned+rm3.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-per-passage.bm25-tuned+ax.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented.bm25-tuned+ax.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-per-passage.bm25-tuned+prf.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m ndcg_cut.10 -c -m recip_rank -c -m recall.100 src/main/resources/topics-and-qrels/qrels.dl20-doc.txt runs/run.msmarco-doc-segmented.bm25-tuned+prf.topics.dl20.txt ``` ## Effectiveness @@ -118,22 +123,22 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | 
:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.3584 | 0.3769 | 0.3854 | 0.3672 | 0.3456 | 0.3471 | 0.3495 | 0.3629 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.3586 | 0.3774 | 0.3868 | 0.3686 | 0.3458 | 0.3472 | 0.3486 | 0.3627 | nDCG@10 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.5271 | 0.5159 | 0.5250 | 0.5217 | 0.5213 | 0.4983 | 0.4942 | 0.5260 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.5281 | 0.5179 | 0.5227 | 0.5238 | 0.5213 | 0.4979 | 0.4948 | 0.5251 | MRR | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.8479 | 0.8136 | 0.8123 | 0.7911 | 0.8684 | 0.7807 | 0.8102 | 0.8478 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.8479 | 0.8136 | 0.8028 | 0.7911 | 0.8684 | 0.7807 | 0.8019 | 0.8478 | R@100 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.5823 | 0.6224 | 0.6332 | 0.5994 | 0.5715 | 0.6013 | 0.6086 | 0.6064 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.5823 | 0.6224 | 0.6362 | 0.6012 | 0.5723 | 0.6025 | 0.6114 | 0.6048 | Explanation of settings: diff --git a/docs/regressions-dl20-doc.md b/docs/regressions-dl20-doc.md index 9d44ade798..8bd43b3758 100644 --- a/docs/regressions-dl20-doc.md +++ 
b/docs/regressions-dl20-doc.md @@ -10,26 +10,31 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** none -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/dl20-doc.yaml). Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/dl20-doc.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: ``` target/appassembler/bin/IndexCollection \ - -collection CleanTrecCollection \ + -collection JsonCollection \ -input /path/to/msmacro-doc \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ + -threads 7 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmacro-doc & ``` -The directory `/path/to/msmarco-doc/` should be a directory containing the official document collection (a single file), in TREC format. 
+The directory `/path/to/msmarco-doc/` should be a directory containing the document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -43,37 +48,37 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmacro-doc.bm25-default.topics.dl20.txt \ -bm25 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmacro-doc.bm25-default+rm3.topics.dl20.txt \ -bm25 -rm3 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmacro-doc.bm25-tuned.topics.dl20.txt \ -bm25 -bm25.k1 3.44 -bm25.b 0.87 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmacro-doc.bm25-tuned+rm3.topics.dl20.txt \ -bm25 -bm25.k1 3.44 -bm25.b 0.87 -rm3 -hits 100 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmacro-doc.bm25-tuned2.topics.dl20.txt \ -bm25 -bm25.k1 4.46 -bm25.b 0.82 -hits 100 & 
target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-tuned2+rm3.topics.dl20.txt \ -bm25 -bm25.k1 4.46 -bm25.b 0.82 -rm3 -hits 100 & @@ -101,22 +106,22 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | BM25 (tuned2)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.3791 | 0.4006 | 0.3630 | 0.3588 | 0.3583 | 0.3618 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.3793 | 0.4014 | 0.3631 | 0.3592 | 0.3581 | 0.3619 | nDCG@10 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | BM25 (tuned2)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.5271 | 0.5248 | 0.5087 | 0.5117 | 0.5078 | 0.5202 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.5286 | 0.5225 | 0.5070 | 0.5124 | 0.5061 | 0.5238 | MRR | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | BM25 (tuned2)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.8521 | 0.8541 | 0.8641 | 0.8188 | 0.8541 | 0.8458 | +[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.8521 | 0.8541 | 0.8641 | 0.8186 | 0.8522 | 0.8582 | R@100 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | BM25 (tuned2)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------| -[DL20 (Doc)](https://trec.nist.gov/data/deep2020.html)| 0.6110 | 0.6392 | 0.5926 | 0.5983 | 0.5860 | 0.5998 | +[DL20
(Doc)](https://trec.nist.gov/data/deep2020.html)| 0.6110 | 0.6414 | 0.5935 | 0.5977 | 0.5860 | 0.5995 | Explanation of settings: diff --git a/docs/regressions-dl20-passage-docTTTTTquery.md b/docs/regressions-dl20-passage-docTTTTTquery.md index 7bfe7b4232..fd1771189b 100644 --- a/docs/regressions-dl20-passage-docTTTTTquery.md +++ b/docs/regressions-dl20-passage-docTTTTTquery.md @@ -17,7 +17,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonCollection \ -input /path/to/msmarco-passage-docTTTTTquery \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-passage-docTTTTTquery & @@ -38,37 +38,37 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-default.topics.dl20.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-default+rm3.topics.dl20.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned.topics.dl20.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 & target/appassembler/bin/SearchCollection 
\ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned+rm3.topics.dl20.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned2.topics.dl20.txt \ -bm25 -bm25.k1 2.18 -bm25.b 0.86 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned2+rm3.topics.dl20.txt \ -bm25 -bm25.k1 2.18 -bm25.b 0.86 -rm3 & diff --git a/docs/regressions-dl20-passage.md b/docs/regressions-dl20-passage.md index 6e99131df4..d346ba9abe 100644 --- a/docs/regressions-dl20-passage.md +++ b/docs/regressions-dl20-passage.md @@ -16,7 +16,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonCollection \ -input /path/to/msmarco-passage \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-passage & @@ -37,49 +37,49 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output 
runs/run.msmarco-passage.bm25-default.topics.dl20.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default+rm3.topics.dl20.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default+ax.topics.dl20.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default+prf.topics.dl20.txt \ -bm25 -bm25prf & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned.topics.dl20.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned+rm3.topics.dl20.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned+ax.topics.dl20.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -axiom 
-axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned+prf.topics.dl20.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -bm25prf & diff --git a/docs/regressions-dl21-doc-segmented-unicoil-noexp-0shot.md b/docs/regressions-dl21-doc-segmented-unicoil-noexp-0shot.md index ff38bbb9f4..adf55fbd98 100644 --- a/docs/regressions-dl21-doc-segmented-unicoil-noexp-0shot.md +++ b/docs/regressions-dl21-doc-segmented-unicoil-noexp-0shot.md @@ -21,7 +21,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-v2-doc-segmented-unicoil-noexp-0shot \ - -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -impact -pretokenized \ >& logs/log.msmarco-v2-doc-segmented-unicoil-noexp-0shot & @@ -41,7 +41,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.unicoil-noexp.0shot.tsv.gz -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented-unicoil-noexp-0shot.unicoil-noexp-0shot.topics.dl21.unicoil-noexp.0shot.tsv.gz \ -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 -impact -pretokenized & diff --git a/docs/regressions-dl21-doc-segmented.md b/docs/regressions-dl21-doc-segmented.md index c1d8021aeb..ae1670692d 100644 --- a/docs/regressions-dl21-doc-segmented.md +++ 
b/docs/regressions-dl21-doc-segmented.md @@ -25,7 +25,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MsMarcoV2DocCollection \ -input /path/to/msmarco-v2-doc-segmented \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-v2-doc-segmented & @@ -46,25 +46,25 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default.topics.dl21.txt \ -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default+rm3.topics.dl21.txt \ -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default+ax.topics.dl21.txt \ -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index 
indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default+prf.topics.dl21.txt \ -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 -bm25 -bm25prf & diff --git a/docs/regressions-dl21-doc.md b/docs/regressions-dl21-doc.md index 1fddb1bd49..33e15b71a1 100644 --- a/docs/regressions-dl21-doc.md +++ b/docs/regressions-dl21-doc.md @@ -25,7 +25,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MsMarcoV2DocCollection \ -input /path/to/msmarco-v2-doc \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-v2-doc & @@ -46,25 +46,25 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default.topics.dl21.txt \ -hits 1000 -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default+rm3.topics.dl21.txt \ -hits 1000 -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default+ax.topics.dl21.txt \ -hits 1000 -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index 
indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default+prf.topics.dl21.txt \ -hits 1000 -bm25 -bm25prf & diff --git a/docs/regressions-dl21-passage-augmented.md b/docs/regressions-dl21-passage-augmented.md index baf66c709d..7f741f8b31 100644 --- a/docs/regressions-dl21-passage-augmented.md +++ b/docs/regressions-dl21-passage-augmented.md @@ -20,7 +20,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MsMarcoV2PassageCollection \ -input /path/to/msmarco-v2-passage-augmented \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-v2-passage-augmented & @@ -41,25 +41,25 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-augmented.bm25-default.topics.dl21.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-augmented.bm25-default+rm3.topics.dl21.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output 
runs/run.msmarco-v2-passage-augmented.bm25-default+ax.topics.dl21.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-augmented.bm25-default+prf.topics.dl21.txt \ -bm25 -bm25prf & diff --git a/docs/regressions-dl21-passage-unicoil-noexp-0shot.md b/docs/regressions-dl21-passage-unicoil-noexp-0shot.md index ece55f7d87..72d270a653 100644 --- a/docs/regressions-dl21-passage-unicoil-noexp-0shot.md +++ b/docs/regressions-dl21-passage-unicoil-noexp-0shot.md @@ -21,7 +21,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-v2-passage-unicoil-noexp-0shot \ - -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -impact -pretokenized \ >& logs/log.msmarco-v2-passage-unicoil-noexp-0shot & @@ -41,7 +41,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.unicoil-noexp.0shot.tsv.gz -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-unicoil-noexp-0shot.unicoil-noexp-0shot.topics.dl21.unicoil-noexp.0shot.tsv.gz \ -impact -pretokenized & diff --git a/docs/regressions-dl21-passage.md b/docs/regressions-dl21-passage.md index fdcd1ca289..ec3bb511ac 100644 --- a/docs/regressions-dl21-passage.md +++ b/docs/regressions-dl21-passage.md @@ -20,7 +20,7 @@ Typical indexing command: 
target/appassembler/bin/IndexCollection \ -collection MsMarcoV2PassageCollection \ -input /path/to/msmarco-v2-passage \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-v2-passage & @@ -41,25 +41,25 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default.topics.dl21.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default+rm3.topics.dl21.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default+ax.topics.dl21.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.dl21.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default+prf.topics.dl21.txt \ -bm25 -bm25prf & diff --git a/docs/regressions-fever.md b/docs/regressions-fever.md index 0f1d341a1c..daaae55207 100644 --- a/docs/regressions-fever.md +++ b/docs/regressions-fever.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ 
-collection FeverParagraphCollection \ -input /path/to/fever \ - -index indexes/lucene-index.fever-paragraph \ + -index indexes/lucene-index.fever-paragraph/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw \ >& logs/log.fever & @@ -33,13 +33,13 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.fever-paragraph \ + -index indexes/lucene-index.fever-paragraph/ \ -topics src/main/resources/topics-and-qrels/topics.fever.dev.txt -topicreader TsvInt \ -output runs/run.fever.bm25-default.topics.fever.dev.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.fever-paragraph \ + -index indexes/lucene-index.fever-paragraph/ \ -topics src/main/resources/topics-and-qrels/topics.fever.dev.txt -topicreader TsvInt \ -output runs/run.fever.bm25-tuned.topics.fever.dev.txt \ -bm25 -bm25.k1 0.9 -bm25.b 0.1 & diff --git a/docs/regressions-fire12-bn.md b/docs/regressions-fire12-bn.md index 635d26dad4..07c212fe98 100644 --- a/docs/regressions-fire12-bn.md +++ b/docs/regressions-fire12-bn.md @@ -14,7 +14,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection CleanTrecCollection \ -input /path/to/fire12-bn \ - -index indexes/lucene-index.fire12-bn \ + -index indexes/lucene-index.fire12-bn/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -storePositions -storeDocvectors -storeRaw -language bn \ >& logs/log.fire12-bn & @@ -36,7 +36,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.fire12-bn \ + -index indexes/lucene-index.fire12-bn/ \ -topics src/main/resources/topics-and-qrels/topics.fire12bn.176-225.txt -topicreader Trec \ -output runs/run.fire12-bn.bm25.topics.fire12bn.176-225.txt \ -bm25 -language bn & diff --git a/docs/regressions-fire12-en.md 
b/docs/regressions-fire12-en.md index 4224afa714..1ca9638db6 100644 --- a/docs/regressions-fire12-en.md +++ b/docs/regressions-fire12-en.md @@ -14,7 +14,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection CleanTrecCollection \ -input /path/to/fire12-en \ - -index indexes/lucene-index.fire12-en \ + -index indexes/lucene-index.fire12-en/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -storePositions -storeDocvectors -storeRaw -language en \ >& logs/log.fire12-en & @@ -36,7 +36,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.fire12-en \ + -index indexes/lucene-index.fire12-en/ \ -topics src/main/resources/topics-and-qrels/topics.fire12en.176-225.txt -topicreader Trec \ -output runs/run.fire12-en.bm25.topics.fire12en.176-225.txt \ -bm25 -language en & diff --git a/docs/regressions-fire12-hi.md b/docs/regressions-fire12-hi.md index a36557e2d7..8208dbc686 100644 --- a/docs/regressions-fire12-hi.md +++ b/docs/regressions-fire12-hi.md @@ -14,7 +14,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection CleanTrecCollection \ -input /path/to/fire12-hi \ - -index indexes/lucene-index.fire12-hi \ + -index indexes/lucene-index.fire12-hi/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -storePositions -storeDocvectors -storeRaw -language hi \ >& logs/log.fire12-hi & @@ -36,7 +36,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.fire12-hi \ + -index indexes/lucene-index.fire12-hi/ \ -topics src/main/resources/topics-and-qrels/topics.fire12hi.176-225.txt -topicreader Trec \ -output runs/run.fire12-hi.bm25.topics.fire12hi.176-225.txt \ -bm25 -language hi & diff --git a/docs/regressions-gov2.md b/docs/regressions-gov2.md index 980aa0a037..91bad5b258 100644 --- 
a/docs/regressions-gov2.md +++ b/docs/regressions-gov2.md @@ -12,7 +12,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection TrecwebCollection \ -input /path/to/gov2 \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -generator DefaultLuceneDocumentGenerator \ -threads 44 -storePositions -storeDocvectors -storeRaw \ >& logs/log.gov2 & @@ -37,97 +37,97 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte04.701-750.txt -topicreader Trec \ -output runs/run.gov2.bm25.topics.terabyte04.701-750.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte05.751-800.txt -topicreader Trec \ -output runs/run.gov2.bm25.topics.terabyte05.751-800.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte06.801-850.txt -topicreader Trec \ -output runs/run.gov2.bm25.topics.terabyte06.801-850.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte04.701-750.txt -topicreader Trec \ -output runs/run.gov2.bm25+rm3.topics.terabyte04.701-750.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte05.751-800.txt -topicreader Trec \ -output runs/run.gov2.bm25+rm3.topics.terabyte05.751-800.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + 
-index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte06.801-850.txt -topicreader Trec \ -output runs/run.gov2.bm25+rm3.topics.terabyte06.801-850.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte04.701-750.txt -topicreader Trec \ -output runs/run.gov2.bm25+ax.topics.terabyte04.701-750.txt \ -bm25 -axiom -axiom.beta 0.1 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte05.751-800.txt -topicreader Trec \ -output runs/run.gov2.bm25+ax.topics.terabyte05.751-800.txt \ -bm25 -axiom -axiom.beta 0.1 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte06.801-850.txt -topicreader Trec \ -output runs/run.gov2.bm25+ax.topics.terabyte06.801-850.txt \ -bm25 -axiom -axiom.beta 0.1 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte04.701-750.txt -topicreader Trec \ -output runs/run.gov2.ql.topics.terabyte04.701-750.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte05.751-800.txt -topicreader Trec \ -output runs/run.gov2.ql.topics.terabyte05.751-800.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte06.801-850.txt -topicreader Trec \ 
-output runs/run.gov2.ql.topics.terabyte06.801-850.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte04.701-750.txt -topicreader Trec \ -output runs/run.gov2.ql+rm3.topics.terabyte04.701-750.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte05.751-800.txt -topicreader Trec \ -output runs/run.gov2.ql+rm3.topics.terabyte05.751-800.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte06.801-850.txt -topicreader Trec \ -output runs/run.gov2.ql+rm3.topics.terabyte06.801-850.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte04.701-750.txt -topicreader Trec \ -output runs/run.gov2.ql+ax.topics.terabyte04.701-750.txt \ -qld -axiom -axiom.beta 0.1 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte05.751-800.txt -topicreader Trec \ -output runs/run.gov2.ql+ax.topics.terabyte05.751-800.txt \ -qld -axiom -axiom.beta 0.1 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.gov2 \ + -index indexes/lucene-index.gov2/ \ -topics src/main/resources/topics-and-qrels/topics.terabyte06.801-850.txt -topicreader Trec \ -output runs/run.gov2.ql+ax.topics.terabyte06.801-850.txt \ -qld -axiom -axiom.beta 0.1 -axiom.deterministic -rerankCutoff 20 & diff --git a/docs/regressions-mb11.md 
b/docs/regressions-mb11.md index 812c8b23d6..41b5ece3ab 100644 --- a/docs/regressions-mb11.md +++ b/docs/regressions-mb11.md @@ -15,7 +15,7 @@ Indexing the Tweets2011 collection: target/appassembler/bin/IndexCollection \ -collection TweetCollection \ -input /path/to/mb11 \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -generator TweetGenerator \ -threads 44 -storePositions -storeDocvectors -storeRaw -uniqueDocid -tweet.keepUrls -tweet.stemming \ >& logs/log.mb11 & @@ -43,67 +43,67 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -topicreader Microblog \ -output runs/run.mb11.bm25.topics.microblog2011.txt \ -searchtweets -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2012.txt -topicreader Microblog \ -output runs/run.mb11.bm25.topics.microblog2012.txt \ -searchtweets -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -topicreader Microblog \ -output runs/run.mb11.bm25+rm3.topics.microblog2011.txt \ -searchtweets -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2012.txt -topicreader Microblog \ -output runs/run.mb11.bm25+rm3.topics.microblog2012.txt \ -searchtweets -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -topicreader Microblog \ -output 
runs/run.mb11.bm25+ax.topics.microblog2011.txt \ -searchtweets -bm25 -axiom -axiom.beta 1.0 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2012.txt -topicreader Microblog \ -output runs/run.mb11.bm25+ax.topics.microblog2012.txt \ -searchtweets -bm25 -axiom -axiom.beta 1.0 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -topicreader Microblog \ -output runs/run.mb11.ql.topics.microblog2011.txt \ -searchtweets -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2012.txt -topicreader Microblog \ -output runs/run.mb11.ql.topics.microblog2012.txt \ -searchtweets -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -topicreader Microblog \ -output runs/run.mb11.ql+rm3.topics.microblog2011.txt \ -searchtweets -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2012.txt -topicreader Microblog \ -output runs/run.mb11.ql+rm3.topics.microblog2012.txt \ -searchtweets -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -topicreader Microblog \ -output runs/run.mb11.ql+ax.topics.microblog2011.txt \ -searchtweets -qld -axiom -axiom.beta 1.0 -axiom.deterministic -rerankCutoff 20 & 
target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb11 \ + -index indexes/lucene-index.mb11/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2012.txt -topicreader Microblog \ -output runs/run.mb11.ql+ax.topics.microblog2012.txt \ -searchtweets -qld -axiom -axiom.beta 1.0 -axiom.deterministic -rerankCutoff 20 & diff --git a/docs/regressions-mb13.md b/docs/regressions-mb13.md index b3b5692d2a..0fff7a1fcb 100644 --- a/docs/regressions-mb13.md +++ b/docs/regressions-mb13.md @@ -15,7 +15,7 @@ Indexing the Tweets2013 collection: target/appassembler/bin/IndexCollection \ -collection TweetCollection \ -input /path/to/mb13 \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -generator TweetGenerator \ -threads 44 -storePositions -storeDocvectors -storeRaw -uniqueDocid -optimize -tweet.keepUrls -tweet.stemming \ >& logs/log.mb13 & @@ -43,67 +43,67 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2013.txt -topicreader Microblog \ -output runs/run.mb13.bm25.topics.microblog2013.txt \ -searchtweets -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2014.txt -topicreader Microblog \ -output runs/run.mb13.bm25.topics.microblog2014.txt \ -searchtweets -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2013.txt -topicreader Microblog \ -output runs/run.mb13.bm25+rm3.topics.microblog2013.txt \ -searchtweets -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics 
src/main/resources/topics-and-qrels/topics.microblog2014.txt -topicreader Microblog \ -output runs/run.mb13.bm25+rm3.topics.microblog2014.txt \ -searchtweets -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2013.txt -topicreader Microblog \ -output runs/run.mb13.bm25+ax.topics.microblog2013.txt \ -searchtweets -bm25 -axiom -axiom.beta 1.0 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2014.txt -topicreader Microblog \ -output runs/run.mb13.bm25+ax.topics.microblog2014.txt \ -searchtweets -bm25 -axiom -axiom.beta 1.0 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2013.txt -topicreader Microblog \ -output runs/run.mb13.ql.topics.microblog2013.txt \ -searchtweets -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2014.txt -topicreader Microblog \ -output runs/run.mb13.ql.topics.microblog2014.txt \ -searchtweets -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2013.txt -topicreader Microblog \ -output runs/run.mb13.ql+rm3.topics.microblog2013.txt \ -searchtweets -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2014.txt -topicreader Microblog \ -output 
runs/run.mb13.ql+rm3.topics.microblog2014.txt \ -searchtweets -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2013.txt -topicreader Microblog \ -output runs/run.mb13.ql+ax.topics.microblog2013.txt \ -searchtweets -qld -axiom -axiom.beta 1.0 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mb13 \ + -index indexes/lucene-index.mb13/ \ -topics src/main/resources/topics-and-qrels/topics.microblog2014.txt -topicreader Microblog \ -output runs/run.mb13.ql+ax.topics.microblog2014.txt \ -searchtweets -qld -axiom -axiom.beta 1.0 -axiom.deterministic -rerankCutoff 20 & diff --git a/docs/regressions-mrtydi-v1.1-ar.md b/docs/regressions-mrtydi-v1.1-ar.md index 8c85707539..e9a98b1db3 100644 --- a/docs/regressions-mrtydi-v1.1-ar.md +++ b/docs/regressions-mrtydi-v1.1-ar.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-ar \ - -index indexes/lucene-index.mrtydi-v1.1-arabic \ + -index indexes/lucene-index.mrtydi-v1.1-arabic/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -language ar \ >& logs/log.mrtydi-v1.1-ar & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-arabic \ + -index indexes/lucene-index.mrtydi-v1.1-arabic/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ar.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ar.bm25.topics.mrtydi-v1.1-ar.train.txt.gz \ -bm25 -hits 100 -language ar & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-arabic \ + -index indexes/lucene-index.mrtydi-v1.1-arabic/ \ -topics 
src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ar.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ar.bm25.topics.mrtydi-v1.1-ar.dev.txt.gz \ -bm25 -hits 100 -language ar & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-arabic \ + -index indexes/lucene-index.mrtydi-v1.1-arabic/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ar.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ar.bm25.topics.mrtydi-v1.1-ar.test.txt.gz \ -bm25 -hits 100 -language ar & diff --git a/docs/regressions-mrtydi-v1.1-bn.md b/docs/regressions-mrtydi-v1.1-bn.md index 45354cf198..f27a1dcadd 100644 --- a/docs/regressions-mrtydi-v1.1-bn.md +++ b/docs/regressions-mrtydi-v1.1-bn.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-bn \ - -index indexes/lucene-index.mrtydi-v1.1-bengali \ + -index indexes/lucene-index.mrtydi-v1.1-bengali/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -language bn \ >& logs/log.mrtydi-v1.1-bn & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-bengali \ + -index indexes/lucene-index.mrtydi-v1.1-bengali/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-bn.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-bn.bm25.topics.mrtydi-v1.1-bn.train.txt.gz \ -bm25 -hits 100 -language bn & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-bengali \ + -index indexes/lucene-index.mrtydi-v1.1-bengali/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-bn.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-bn.bm25.topics.mrtydi-v1.1-bn.dev.txt.gz \ -bm25 -hits 100 -language bn & target/appassembler/bin/SearchCollection \ - 
-index indexes/lucene-index.mrtydi-v1.1-bengali \ + -index indexes/lucene-index.mrtydi-v1.1-bengali/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-bn.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-bn.bm25.topics.mrtydi-v1.1-bn.test.txt.gz \ -bm25 -hits 100 -language bn & diff --git a/docs/regressions-mrtydi-v1.1-en.md b/docs/regressions-mrtydi-v1.1-en.md index 0cf3e41f32..65dfc491c9 100644 --- a/docs/regressions-mrtydi-v1.1-en.md +++ b/docs/regressions-mrtydi-v1.1-en.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-en \ - -index indexes/lucene-index.mrtydi-v1.1-english \ + -index indexes/lucene-index.mrtydi-v1.1-english/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -language en \ >& logs/log.mrtydi-v1.1-en & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-english \ + -index indexes/lucene-index.mrtydi-v1.1-english/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-en.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-en.bm25.topics.mrtydi-v1.1-en.train.txt.gz \ -bm25 -hits 100 -language en & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-english \ + -index indexes/lucene-index.mrtydi-v1.1-english/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-en.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-en.bm25.topics.mrtydi-v1.1-en.dev.txt.gz \ -bm25 -hits 100 -language en & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-english \ + -index indexes/lucene-index.mrtydi-v1.1-english/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-en.test.txt.gz -topicreader TsvInt \ -output 
runs/run.mrtydi-v1.1-en.bm25.topics.mrtydi-v1.1-en.test.txt.gz \ -bm25 -hits 100 -language en & diff --git a/docs/regressions-mrtydi-v1.1-fi.md b/docs/regressions-mrtydi-v1.1-fi.md index df7c366795..c6a666b478 100644 --- a/docs/regressions-mrtydi-v1.1-fi.md +++ b/docs/regressions-mrtydi-v1.1-fi.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-fi \ - -index indexes/lucene-index.mrtydi-v1.1-finnish \ + -index indexes/lucene-index.mrtydi-v1.1-finnish/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -language fi \ >& logs/log.mrtydi-v1.1-fi & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-finnish \ + -index indexes/lucene-index.mrtydi-v1.1-finnish/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-fi.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-fi.bm25.topics.mrtydi-v1.1-fi.train.txt.gz \ -bm25 -hits 100 -language fi & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-finnish \ + -index indexes/lucene-index.mrtydi-v1.1-finnish/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-fi.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-fi.bm25.topics.mrtydi-v1.1-fi.dev.txt.gz \ -bm25 -hits 100 -language fi & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-finnish \ + -index indexes/lucene-index.mrtydi-v1.1-finnish/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-fi.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-fi.bm25.topics.mrtydi-v1.1-fi.test.txt.gz \ -bm25 -hits 100 -language fi & diff --git a/docs/regressions-mrtydi-v1.1-id.md b/docs/regressions-mrtydi-v1.1-id.md index 5f4a73881b..fe9ad21dca 100644 --- 
a/docs/regressions-mrtydi-v1.1-id.md +++ b/docs/regressions-mrtydi-v1.1-id.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-id \ - -index indexes/lucene-index.mrtydi-v1.1-indonesian \ + -index indexes/lucene-index.mrtydi-v1.1-indonesian/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -language id \ >& logs/log.mrtydi-v1.1-id & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-indonesian \ + -index indexes/lucene-index.mrtydi-v1.1-indonesian/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-id.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-id.bm25.topics.mrtydi-v1.1-id.train.txt.gz \ -bm25 -hits 100 -language id & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-indonesian \ + -index indexes/lucene-index.mrtydi-v1.1-indonesian/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-id.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-id.bm25.topics.mrtydi-v1.1-id.dev.txt.gz \ -bm25 -hits 100 -language id & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-indonesian \ + -index indexes/lucene-index.mrtydi-v1.1-indonesian/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-id.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-id.bm25.topics.mrtydi-v1.1-id.test.txt.gz \ -bm25 -hits 100 -language id & diff --git a/docs/regressions-mrtydi-v1.1-ja.md b/docs/regressions-mrtydi-v1.1-ja.md index 2fc54626df..74a88188f6 100644 --- a/docs/regressions-mrtydi-v1.1-ja.md +++ b/docs/regressions-mrtydi-v1.1-ja.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input 
/path/to/mrtydi-v1.1-ja \ - -index indexes/lucene-index.mrtydi-v1.1-japanese \ + -index indexes/lucene-index.mrtydi-v1.1-japanese/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -language ja \ >& logs/log.mrtydi-v1.1-ja & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-japanese \ + -index indexes/lucene-index.mrtydi-v1.1-japanese/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ja.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ja.bm25.topics.mrtydi-v1.1-ja.train.txt.gz \ -bm25 -hits 100 -language ja & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-japanese \ + -index indexes/lucene-index.mrtydi-v1.1-japanese/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ja.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ja.bm25.topics.mrtydi-v1.1-ja.dev.txt.gz \ -bm25 -hits 100 -language ja & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-japanese \ + -index indexes/lucene-index.mrtydi-v1.1-japanese/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ja.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ja.bm25.topics.mrtydi-v1.1-ja.test.txt.gz \ -bm25 -hits 100 -language ja & diff --git a/docs/regressions-mrtydi-v1.1-ko.md b/docs/regressions-mrtydi-v1.1-ko.md index 086b9f7013..e751484625 100644 --- a/docs/regressions-mrtydi-v1.1-ko.md +++ b/docs/regressions-mrtydi-v1.1-ko.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-ko \ - -index indexes/lucene-index.mrtydi-v1.1-korean \ + -index indexes/lucene-index.mrtydi-v1.1-korean/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw 
-language ko \ >& logs/log.mrtydi-v1.1-ko & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-korean \ + -index indexes/lucene-index.mrtydi-v1.1-korean/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ko.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ko.bm25.topics.mrtydi-v1.1-ko.train.txt.gz \ -bm25 -hits 100 -language ko & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-korean \ + -index indexes/lucene-index.mrtydi-v1.1-korean/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ko.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ko.bm25.topics.mrtydi-v1.1-ko.dev.txt.gz \ -bm25 -hits 100 -language ko & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-korean \ + -index indexes/lucene-index.mrtydi-v1.1-korean/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ko.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ko.bm25.topics.mrtydi-v1.1-ko.test.txt.gz \ -bm25 -hits 100 -language ko & diff --git a/docs/regressions-mrtydi-v1.1-ru.md b/docs/regressions-mrtydi-v1.1-ru.md index 3a3fa9ba64..674d6955d9 100644 --- a/docs/regressions-mrtydi-v1.1-ru.md +++ b/docs/regressions-mrtydi-v1.1-ru.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-ru \ - -index indexes/lucene-index.mrtydi-v1.1-russian \ + -index indexes/lucene-index.mrtydi-v1.1-russian/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -language ru \ >& logs/log.mrtydi-v1.1-ru & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-russian \ 
+ -index indexes/lucene-index.mrtydi-v1.1-russian/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ru.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ru.bm25.topics.mrtydi-v1.1-ru.train.txt.gz \ -bm25 -hits 100 -language ru & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-russian \ + -index indexes/lucene-index.mrtydi-v1.1-russian/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ru.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ru.bm25.topics.mrtydi-v1.1-ru.dev.txt.gz \ -bm25 -hits 100 -language ru & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-russian \ + -index indexes/lucene-index.mrtydi-v1.1-russian/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ru.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-ru.bm25.topics.mrtydi-v1.1-ru.test.txt.gz \ -bm25 -hits 100 -language ru & diff --git a/docs/regressions-mrtydi-v1.1-sw.md b/docs/regressions-mrtydi-v1.1-sw.md index a32c9a331d..7c11768aff 100644 --- a/docs/regressions-mrtydi-v1.1-sw.md +++ b/docs/regressions-mrtydi-v1.1-sw.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-sw \ - -index indexes/lucene-index.mrtydi-v1.1-swahili \ + -index indexes/lucene-index.mrtydi-v1.1-swahili/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -pretokenized \ >& logs/log.mrtydi-v1.1-sw & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-swahili \ + -index indexes/lucene-index.mrtydi-v1.1-swahili/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-sw.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-sw.bm25.topics.mrtydi-v1.1-sw.train.txt.gz \ -bm25 
-hits 100 -pretokenized & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-swahili \ + -index indexes/lucene-index.mrtydi-v1.1-swahili/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-sw.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-sw.bm25.topics.mrtydi-v1.1-sw.dev.txt.gz \ -bm25 -hits 100 -pretokenized & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-swahili \ + -index indexes/lucene-index.mrtydi-v1.1-swahili/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-sw.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-sw.bm25.topics.mrtydi-v1.1-sw.test.txt.gz \ -bm25 -hits 100 -pretokenized & diff --git a/docs/regressions-mrtydi-v1.1-te.md b/docs/regressions-mrtydi-v1.1-te.md index 4cce65e874..bfce2b842b 100644 --- a/docs/regressions-mrtydi-v1.1-te.md +++ b/docs/regressions-mrtydi-v1.1-te.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-te \ - -index indexes/lucene-index.mrtydi-v1.1-telugu \ + -index indexes/lucene-index.mrtydi-v1.1-telugu/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -pretokenized \ >& logs/log.mrtydi-v1.1-te & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-telugu \ + -index indexes/lucene-index.mrtydi-v1.1-telugu/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-te.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-te.bm25.topics.mrtydi-v1.1-te.train.txt.gz \ -bm25 -hits 100 -pretokenized & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-telugu \ + -index indexes/lucene-index.mrtydi-v1.1-telugu/ \ -topics 
src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-te.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-te.bm25.topics.mrtydi-v1.1-te.dev.txt.gz \ -bm25 -hits 100 -pretokenized & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-telugu \ + -index indexes/lucene-index.mrtydi-v1.1-telugu/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-te.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-te.bm25.topics.mrtydi-v1.1-te.test.txt.gz \ -bm25 -hits 100 -pretokenized & diff --git a/docs/regressions-mrtydi-v1.1-th.md b/docs/regressions-mrtydi-v1.1-th.md index f37c2d0a69..a9b32dc761 100644 --- a/docs/regressions-mrtydi-v1.1-th.md +++ b/docs/regressions-mrtydi-v1.1-th.md @@ -13,7 +13,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MrTyDiCollection \ -input /path/to/mrtydi-v1.1-th \ - -index indexes/lucene-index.mrtydi-v1.1-thai \ + -index indexes/lucene-index.mrtydi-v1.1-thai/ \ -generator DefaultLuceneDocumentGenerator \ -threads 1 -storePositions -storeDocvectors -storeRaw -language th \ >& logs/log.mrtydi-v1.1-th & @@ -28,17 +28,17 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-thai \ + -index indexes/lucene-index.mrtydi-v1.1-thai/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-th.train.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-th.bm25.topics.mrtydi-v1.1-th.train.txt.gz \ -bm25 -hits 100 -language th & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.mrtydi-v1.1-thai \ + -index indexes/lucene-index.mrtydi-v1.1-thai/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-th.dev.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-th.bm25.topics.mrtydi-v1.1-th.dev.txt.gz \ -bm25 -hits 100 -language th & target/appassembler/bin/SearchCollection \ - -index 
indexes/lucene-index.mrtydi-v1.1-thai \ + -index indexes/lucene-index.mrtydi-v1.1-thai/ \ -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-th.test.txt.gz -topicreader TsvInt \ -output runs/run.mrtydi-v1.1-th.bm25.topics.mrtydi-v1.1-th.test.txt.gz \ -bm25 -hits 100 -language th & diff --git a/docs/regressions-msmarco-doc-docTTTTTquery-per-passage-v3.md b/docs/regressions-msmarco-doc-docTTTTTquery-per-passage-v3.md deleted file mode 100644 index 02db374d6e..0000000000 --- a/docs/regressions-msmarco-doc-docTTTTTquery-per-passage-v3.md +++ /dev/null @@ -1,136 +0,0 @@ -# Anserini: Regressions for MS MARCO Document Ranking - -This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking), which is integrated into Anserini's regression testing framework. -Note that there are four different regression conditions for this task, and this page describes the following: - -+ **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing -+ **Expansion Condition:** doc2query-T5 - -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. - -**NOTE**: This is the `msmarco-doc-docTTTTTquery-per-passage-v3` variant (there's also `msmarco-doc-docTTTTTquery-per-passage`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. - -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage-v3.yaml). 
-Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage-v3.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. - -## Indexing - -Typical indexing command: - -``` -target/appassembler/bin/IndexCollection \ - -collection JsonCollection \ - -input /path/to/msmarco-doc-docTTTTTquery-per-passage-v3 \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage-v3 \ - -generator DefaultLuceneDocumentGenerator \ - -threads 16 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-docTTTTTquery-per-passage-v3 & -``` - -The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. - -For additional details, see explanation of [common indexing options](common-indexing-options.md). - -## Retrieval - -Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). -The regression experiments here evaluate on the 5193 dev set questions. 
- -After indexing has completed, you should be able to perform retrieval as follows: - -``` -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage-v3 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-default.topics.msmarco-doc.dev.txt \ - -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage-v3 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-tuned.topics.msmarco-doc.dev.txt \ - -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & -``` - -Evaluation can be performed using `trec_eval`: - -``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-default.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-tuned.topics.msmarco-doc.dev.txt -``` - -## Effectiveness - -With the above commands, you should be able to reproduce the following results: - -MAP | BM25 (default)| BM25 (tuned)| -:---------------------------------------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.3184 | 0.3213 | - - -R@100 | BM25 (default)| BM25 (tuned)| -:---------------------------------------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.8479 | 0.8627 | - - 
-R@1000 | BM25 (default)| BM25 (tuned)| -:---------------------------------------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.9490 | 0.9530 | - -Explanation of settings: - -+ The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`. -+ The setting "tuned" refers to `k1=2.56`, `b=0.59`, tuned to optimize for recall@100 (i.e., for first-stage retrieval) on 2019/12. - -In these runs, we are retrieving the top 1000 hits for each query and using `trec_eval` to evaluate all 1000 hits. -Since we're in the passage condition, we fetch the 10000 passages and select the top 1000 documents using MaxP. -This lets us measure R@100 and R@1000; the latter is particularly important when these runs are used as first-stage retrieval. -Beware, an official MS MARCO document ranking task leaderboard submission comprises only 100 hits per query. -See [this page](experiments-msmarco-doc-leaderboard.md) for details on Anserini baseline runs that were submitted to the official leaderboard. - -The MaxP passage retrieval functionality is available in `SearchCollection`. 
-To generate an MS MARCO submission with the BM25 default parameters, corresponding to "BM25 (default)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage-v3.pos+docvectors+raw \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-default.txt -format msmarco \ - -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-default.txt - -##################### -MRR @100: 0.317905445196054 -QueriesRanked: 5193 -##################### -``` - -Note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. 
- -To generate an MS MARCO submission with the BM25 tuned parameters, corresponding to "BM25 (tuned)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage-v3.pos+docvectors+raw \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-tuned.txt -format msmarco \ - -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-tuned.txt - -##################### -MRR @100: 0.3209184381409182 -QueriesRanked: 5193 -##################### -``` - -Again, note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. \ No newline at end of file diff --git a/docs/regressions-msmarco-doc-docTTTTTquery-per-doc.md b/docs/regressions-msmarco-doc-docTTTTTquery.md similarity index 76% rename from docs/regressions-msmarco-doc-docTTTTTquery-per-doc.md rename to docs/regressions-msmarco-doc-docTTTTTquery.md index 102ce83759..029149c8b8 100644 --- a/docs/regressions-msmarco-doc-docTTTTTquery-per-doc.md +++ b/docs/regressions-msmarco-doc-docTTTTTquery.md @@ -6,10 +6,14 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. 
-The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-docTTTTTquery-per-doc.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-doc.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-docTTTTTquery.yaml). +Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. + +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. 
## Indexing @@ -18,14 +22,15 @@ Typical indexing command: ``` target/appassembler/bin/IndexCollection \ -collection JsonCollection \ - -input /path/to/msmarco-doc-docTTTTTquery-per-doc \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -input /path/to/msmarco-doc-docTTTTTquery \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-docTTTTTquery-per-doc & + -threads 7 -storePositions -storeDocvectors -storeRaw \ + >& logs/log.msmarco-doc-docTTTTTquery & ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-docTTTTTquery/` should be a directory containing the expanded document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
@@ -38,24 +43,24 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-default.topics.msmarco-doc.dev.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc \ + -index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-docTTTTTquery.bm25-tuned.topics.msmarco-doc.dev.txt \ -bm25 -bm25.k1 4.68 -bm25.b 0.87 & ``` Evaluation can be performed using `trec_eval`: ``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-docTTTTTquery.bm25-default.topics.msmarco-doc.dev.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-tuned.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-docTTTTTquery.bm25-tuned.topics.msmarco-doc.dev.txt ``` ## Effectiveness 
@@ -64,12 +69,12 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| BM25 (tuned)| :---------------------------------------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2886 | 0.3270 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2886 | 0.3273 | R@100 | BM25 (default)| BM25 (tuned)| :---------------------------------------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.7990 | 0.8608 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.7993 | 0.8612 | R@1000 | BM25 (default)| BM25 (tuned)| diff --git a/docs/regressions-msmarco-doc-per-passage-v2.md b/docs/regressions-msmarco-doc-per-passage-v2.md deleted file mode 100644 index a798f521cd..0000000000 --- a/docs/regressions-msmarco-doc-per-passage-v2.md +++ /dev/null @@ -1,184 +0,0 @@ -# Anserini: Regressions for MS MARCO Document Ranking - -This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking), which is integrated into Anserini's regression testing framework. -Note that there are four different regression conditions for this task, and this page describes the following: - -+ **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing -+ **Expansion Condition:** none - -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. 
- -**NOTE**: This is the `msmarco-doc-per-passage-v2` variant (there's also `msmarco-doc-per-passage` and `msmarco-doc-per-passage-v3`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. - -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-per-passage-v2.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-per-passage-v2.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. - -## Indexing - -Typical indexing command: - -``` -target/appassembler/bin/IndexCollection \ - -collection JsonCollection \ - -input /path/to/msmarco-doc-per-passage-v2 \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2 \ - -generator DefaultLuceneDocumentGenerator \ - -threads 16 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-per-passage-v2 & -``` - -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. - -For additional details, see explanation of [common indexing options](common-indexing-options.md). - -## Retrieval - -Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). -The regression experiments here evaluate on the 5193 dev set questions. 
- -After indexing has completed, you should be able to perform retrieval as follows: - -``` -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-default.topics.msmarco-doc.dev.txt \ - -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-default+rm3.topics.msmarco-doc.dev.txt \ - -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-default+ax.topics.msmarco-doc.dev.txt \ - -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-default+prf.topics.msmarco-doc.dev.txt \ - -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-tuned.topics.msmarco-doc.dev.txt \ - 
-bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-tuned+rm3.topics.msmarco-doc.dev.txt \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-tuned+ax.topics.msmarco-doc.dev.txt \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-tuned+prf.topics.msmarco-doc.dev.txt \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & -``` - -Evaluation can be performed using `trec_eval`: - -``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v2.bm25-default.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v2.bm25-default+rm3.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m 
recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v2.bm25-default+ax.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v2.bm25-default+prf.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v2.bm25-tuned.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v2.bm25-tuned+rm3.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v2.bm25-tuned+ax.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v2.bm25-tuned+prf.topics.msmarco-doc.dev.txt -``` - -## Effectiveness - -With the above commands, you should be able to reproduce the following results: - -MAP | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | -:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2609 | 0.2324 | 0.2170 | 0.2189 | 0.2639 | 0.2342 | 0.2250 | 0.2184 | - - -R@100 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | -:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO 
Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.7737 | 0.7768 | 0.7578 | 0.7570 | 0.7884 | 0.7793 | 0.7730 | 0.7520 | - - -R@1000 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | -:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.9095 | 0.9266 | 0.9207 | 0.9135 | 0.9222 | 0.9239 | 0.9268 | 0.9101 | - -Explanation of settings: - -+ The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`. -+ The setting "tuned" refers to `k1=2.16`, `b=0.61`, tuned to optimize for recall@100 (i.e., for first-stage retrieval) on 2019/12. - -In these runs, we are retrieving the top 1000 hits for each query and using `trec_eval` to evaluate all 1000 hits. -Since we're in the passage condition, we fetch the 10000 passages and select the top 1000 documents using MaxP. -This lets us measure R@100 and R@1000; the latter is particularly important when these runs are used as first-stage retrieval. -Beware, an official MS MARCO document ranking task leaderboard submission comprises only 100 hits per query. -See [this page](experiments-msmarco-doc-leaderboard.md) for details on Anserini baseline runs that were submitted to the official leaderboard. - -The MaxP passage retrieval functionality is available in `SearchCollection`. 
-To generate an MS MARCO submission with the BM25 default parameters, corresponding to "BM25 (default)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2.pos+docvectors+raw \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-default.txt -format msmarco \ - -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-per-passage-v2.bm25-default.txt - -##################### -MRR @100: 0.26029445206377066 -QueriesRanked: 5193 -##################### -``` - -Note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. - -To generate an MS MARCO submission with the BM25 tuned parameters, corresponding to "BM25 (tuned)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2.pos+docvectors+raw \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-tuned.txt -format msmarco \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-per-passage-v2.bm25-tuned.txt - -##################### -MRR @100: 0.2633426142578288 -QueriesRanked: 5193 -##################### -``` - -Again, note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. 
diff --git a/docs/regressions-msmarco-doc-per-passage-v3.md b/docs/regressions-msmarco-doc-per-passage-v3.md deleted file mode 100644 index a4c8ec5259..0000000000 --- a/docs/regressions-msmarco-doc-per-passage-v3.md +++ /dev/null @@ -1,184 +0,0 @@ -# Anserini: Regressions for MS MARCO Document Ranking - -This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking), which is integrated into Anserini's regression testing framework. -Note that there are four different regression conditions for this task, and this page describes the following: - -+ **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing -+ **Expansion Condition:** none - -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. - -**NOTE**: This is the `msmarco-doc-per-passage-v3` variant (there's also `msmarco-doc-per-passage` and `msmarco-doc-per-passage-v2`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. - -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-per-passage-v3.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-per-passage-v3.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. 
- -## Indexing - -Typical indexing command: - -``` -target/appassembler/bin/IndexCollection \ - -collection JsonCollection \ - -input /path/to/msmarco-doc-per-passage-v3 \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3 \ - -generator DefaultLuceneDocumentGenerator \ - -threads 16 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-per-passage-v3 & -``` - -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. - -For additional details, see explanation of [common indexing options](common-indexing-options.md). - -## Retrieval - -Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). -The regression experiments here evaluate on the 5193 dev set questions. - -After indexing has completed, you should be able to perform retrieval as follows: - -``` -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-default.topics.msmarco-doc.dev.txt \ - -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-default+rm3.topics.msmarco-doc.dev.txt \ - -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3 \ - -topics 
src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-default+ax.topics.msmarco-doc.dev.txt \ - -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-default+prf.topics.msmarco-doc.dev.txt \ - -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-tuned.topics.msmarco-doc.dev.txt \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-tuned+rm3.topics.msmarco-doc.dev.txt \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-tuned+ax.topics.msmarco-doc.dev.txt \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" 
-selectMaxPassage.hits 1000 & - -target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3 \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-tuned+prf.topics.msmarco-doc.dev.txt \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & -``` - -Evaluation can be performed using `trec_eval`: - -``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v3.bm25-default.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v3.bm25-default+rm3.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v3.bm25-default+ax.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v3.bm25-default+prf.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v3.bm25-tuned.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v3.bm25-tuned+rm3.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 
src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v3.bm25-tuned+ax.topics.msmarco-doc.dev.txt - -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage-v3.bm25-tuned+prf.topics.msmarco-doc.dev.txt -``` - -## Effectiveness - -With the above commands, you should be able to reproduce the following results: - -MAP | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | -:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2690 | 0.2419 | 0.2208 | 0.2325 | 0.2762 | 0.2450 | 0.2330 | 0.2276 | - - -R@100 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | -:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.7847 | 0.7882 | 0.7710 | 0.7722 | 0.8013 | 0.7961 | 0.7888 | 0.7687 | - - -R@1000 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | -:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.9178 | 0.9355 | 0.9264 | 0.9185 | 0.9311 | 0.9363 | 0.9353 | 0.9157 | - -Explanation of settings: - -+ The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`. -+ The setting "tuned" refers to `k1=2.16`, `b=0.61`, tuned to optimize for recall@100 (i.e., for first-stage retrieval) on 2019/12. - -In these runs, we are retrieving the top 1000 hits for each query and using `trec_eval` to evaluate all 1000 hits. 
-Since we're in the passage condition, we fetch the 10000 passages and select the top 1000 documents using MaxP. -This lets us measure R@100 and R@1000; the latter is particularly important when these runs are used as first-stage retrieval. -Beware, an official MS MARCO document ranking task leaderboard submission comprises only 100 hits per query. -See [this page](experiments-msmarco-doc-leaderboard.md) for details on Anserini baseline runs that were submitted to the official leaderboard. - -The MaxP passage retrieval functionality is available in `SearchCollection`. -To generate an MS MARCO submission with the BM25 default parameters, corresponding to "BM25 (default)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3.pos+docvectors+raw \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-default.txt -format msmarco \ - -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-per-passage-v3.bm25-default.txt - -##################### -MRR @100: 0.26851990908986706 -QueriesRanked: 5193 -##################### -``` - -Note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. 
- -To generate an MS MARCO submission with the BM25 tuned parameters, corresponding to "BM25 (tuned)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3.pos+docvectors+raw \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-tuned.txt -format msmarco \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-per-passage-v3.bm25-tuned.txt - -##################### -MRR @100: 0.27551963417683756 -QueriesRanked: 5193 -##################### -``` - -Again, note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. diff --git a/docs/regressions-msmarco-doc-docTTTTTquery-per-passage.md b/docs/regressions-msmarco-doc-segmented-docTTTTTquery.md similarity index 75% rename from docs/regressions-msmarco-doc-docTTTTTquery-per-passage.md rename to docs/regressions-msmarco-doc-segmented-docTTTTTquery.md index 860aabbc19..32910387f4 100644 --- a/docs/regressions-msmarco-doc-docTTTTTquery-per-passage.md +++ b/docs/regressions-msmarco-doc-segmented-docTTTTTquery.md @@ -6,13 +6,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. 
All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -**NOTE**: This is the `msmarco-doc-docTTTTTquery-per-passage` variant (there's also `msmarco-doc-docTTTTTquery-per-passage-v3`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. +The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-segmented-docTTTTTquery.yaml). +Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-segmented-docTTTTTquery.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. 
 ## Indexing
@@ -21,14 +23,15 @@ Typical indexing command:
 ```
 target/appassembler/bin/IndexCollection \
 -collection JsonCollection \
- -input /path/to/msmarco-doc-docTTTTTquery-per-passage \
- -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \
+ -input /path/to/msmarco-doc-segmented-docTTTTTquery \
+ -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \
 -generator DefaultLuceneDocumentGenerator \
- -threads 1 -storePositions -storeDocvectors -storeRaw \
- >& logs/log.msmarco-doc-docTTTTTquery-per-passage &
+ -threads 16 -storePositions -storeDocvectors -storeRaw \
+ >& logs/log.msmarco-doc-segmented-docTTTTTquery &
 ```

-The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection.
+The directory `/path/to/msmarco-doc-segmented-docTTTTTquery/` should be a directory containing the expanded segmented corpus in Anserini's jsonl format.
+See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus.

 For additional details, see explanation of [common indexing options](common-indexing-options.md).
@@ -41,24 +44,24 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default.topics.msmarco-doc.dev.txt \ -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned.topics.msmarco-doc.dev.txt \ -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & ``` Evaluation can be performed using `trec_eval`: ``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default.topics.msmarco-doc.dev.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt 
runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned.topics.msmarco-doc.dev.txt ``` ## Effectiveness @@ -67,12 +70,12 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| BM25 (tuned)| :---------------------------------------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.3182 | 0.3211 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.3184 | 0.3213 | R@100 | BM25 (default)| BM25 (tuned)| :---------------------------------------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.8481 | 0.8627 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.8479 | 0.8627 | R@1000 | BM25 (default)| BM25 (tuned)| diff --git a/docs/regressions-msmarco-doc-per-passage.md b/docs/regressions-msmarco-doc-segmented.md similarity index 73% rename from docs/regressions-msmarco-doc-per-passage.md rename to docs/regressions-msmarco-doc-segmented.md index 6e31874cef..060dc4ca89 100644 --- a/docs/regressions-msmarco-doc-per-passage.md +++ b/docs/regressions-msmarco-doc-segmented.md @@ -6,13 +6,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** none -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. 
-All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -**NOTE**: This is the `msmarco-doc-per-passage` variant (there's also `msmarco-doc-per-passage-v2` and `msmarco-doc-per-passage-v3`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. +The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-segmented.yaml). +Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-segmented.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. -The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-per-passage.yaml). -Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-per-passage.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. 
## Indexing @@ -21,14 +23,15 @@ Typical indexing command: ``` target/appassembler/bin/IndexCollection \ -collection JsonCollection \ - -input /path/to/msmarco-doc-per-passage \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -input /path/to/msmarco-doc-segmented \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ - >& logs/log.msmarco-doc-per-passage & + -threads 16 -storePositions -storeDocvectors -storeRaw \ + >& logs/log.msmarco-doc-segmented & ``` -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented/` should be a directory containing the segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
@@ -41,72 +44,72 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default.topics.msmarco-doc.dev.txt \ -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default+rm3.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default+rm3.topics.msmarco-doc.dev.txt \ -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-default+ax.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default+ax.topics.msmarco-doc.dev.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output 
runs/run.msmarco-doc-per-passage.bm25-default+prf.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented.bm25-default+prf.topics.msmarco-doc.dev.txt \ -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned.topics.msmarco-doc.dev.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned+rm3.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned+rm3.topics.msmarco-doc.dev.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned+ax.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned+ax.topics.msmarco-doc.dev.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index 
indexes/lucene-index.msmarco-doc-per-passage \ + -index indexes/lucene-index.msmarco-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ - -output runs/run.msmarco-doc-per-passage.bm25-tuned+prf.topics.msmarco-doc.dev.txt \ + -output runs/run.msmarco-doc-segmented.bm25-tuned+prf.topics.msmarco-doc.dev.txt \ -bm25 -bm25.k1 2.16 -bm25.b 0.61 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & ``` Evaluation can be performed using `trec_eval`: ``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage.bm25-default.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented.bm25-default.topics.msmarco-doc.dev.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage.bm25-default+rm3.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented.bm25-default+rm3.topics.msmarco-doc.dev.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage.bm25-default+ax.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented.bm25-default+ax.topics.msmarco-doc.dev.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 
src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage.bm25-default+prf.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented.bm25-default+prf.topics.msmarco-doc.dev.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage.bm25-tuned.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented.bm25-tuned.topics.msmarco-doc.dev.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage.bm25-tuned+rm3.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented.bm25-tuned+rm3.topics.msmarco-doc.dev.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage.bm25-tuned+ax.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented.bm25-tuned+ax.topics.msmarco-doc.dev.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-per-passage.bm25-tuned+prf.topics.msmarco-doc.dev.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m map -c -m recall.100 -c -m recall.1000 
src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-segmented.bm25-tuned+prf.topics.msmarco-doc.dev.txt ``` ## Effectiveness @@ -115,17 +118,17 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2688 | 0.2416 | 0.2229 | 0.2325 | 0.2756 | 0.2443 | 0.2350 | 0.2271 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2690 | 0.2419 | 0.2208 | 0.2325 | 0.2762 | 0.2450 | 0.2330 | 0.2276 | R@100 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.7849 | 0.7876 | 0.7703 | 0.7714 | 0.8009 | 0.7955 | 0.7909 | 0.7685 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.7847 | 0.7882 | 0.7710 | 0.7722 | 0.8013 | 0.7961 | 0.7888 | 0.7687 | R@1000 | BM25 (default)| +RM3 | +Ax | +PRF | BM25 (tuned)| +RM3 | +Ax | +PRF | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.9180 | 0.9355 | 0.9266 | 0.9187 | 0.9311 | 0.9359 | 0.9341 | 0.9162 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.9178 | 0.9355 | 0.9264 | 0.9185 | 0.9311 | 0.9363 | 0.9353 | 0.9157 | Explanation of settings: diff --git a/docs/regressions-msmarco-doc.md b/docs/regressions-msmarco-doc.md index e157d6d213..487cd4f46b 100644 --- a/docs/regressions-msmarco-doc.md +++ 
b/docs/regressions-msmarco-doc.md @@ -6,26 +6,31 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** none -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc.yaml). Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: ``` target/appassembler/bin/IndexCollection \ - -collection CleanTrecCollection \ + -collection JsonCollection \ -input /path/to/msmarco-doc \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -generator DefaultLuceneDocumentGenerator \ - -threads 1 -storePositions -storeDocvectors -storeRaw \ + -threads 7 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-doc & ``` -The directory `/path/to/msmarco-doc/` should be a directory containing the official document collection (a single file), in TREC format. 
+The directory `/path/to/msmarco-doc/` should be a directory containing the document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -38,37 +43,37 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-default.topics.msmarco-doc.dev.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-default+rm3.topics.msmarco-doc.dev.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-tuned.topics.msmarco-doc.dev.txt \ -bm25 -bm25.k1 3.44 -bm25.b 0.87 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-tuned+rm3.topics.msmarco-doc.dev.txt \ -bm25 -bm25.k1 3.44 -bm25.b 0.87 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-tuned2.topics.msmarco-doc.dev.txt \ 
-bm25 -bm25.k1 4.46 -bm25.b 0.82 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-doc \ + -index indexes/lucene-index.msmarco-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-doc.bm25-tuned2+rm3.topics.msmarco-doc.dev.txt \ -bm25 -bm25.k1 4.46 -bm25.b 0.82 -rm3 & @@ -96,17 +101,17 @@ With the above commands, you should be able to reproduce the following results: MAP | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | BM25 (tuned2)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2310 | 0.1632 | 0.2788 | 0.2289 | 0.2775 | 0.2238 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2305 | 0.1631 | 0.2784 | 0.2289 | 0.2774 | 0.2239 | R@100 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | BM25 (tuned2)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.7279 | 0.6765 | 0.8065 | 0.7872 | 0.8076 | 0.7789 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.7281 | 0.6767 | 0.8069 | 0.7878 | 0.8070 | 0.7791 | R@1000 | BM25 (default)| +RM3 | BM25 (tuned)| +RM3 | BM25 (tuned2)| +RM3 | :---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------| -[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.8856 | 0.8785 | 0.9326 | 0.9320 | 0.9357 | 0.9307 | +[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.8856 | 0.8791 | 0.9324 | 0.9314 | 0.9357 | 0.9305 | Explanation of settings: diff --git a/docs/regressions-msmarco-passage-deepimpact.md b/docs/regressions-msmarco-passage-deepimpact.md index 0f7008c4e7..3c913074ce 100644 --- 
a/docs/regressions-msmarco-passage-deepimpact.md +++ b/docs/regressions-msmarco-passage-deepimpact.md @@ -18,7 +18,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-deepimpact \ - -index indexes/lucene-index.msmarco-passage-deepimpact \ + -index indexes/lucene-index.msmarco-passage-deepimpact/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -impact -pretokenized \ >& logs/log.msmarco-passage-deepimpact & @@ -38,7 +38,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-deepimpact \ + -index indexes/lucene-index.msmarco-passage-deepimpact/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz -topicreader TsvInt \ -output runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \ -impact -pretokenized & diff --git a/docs/regressions-msmarco-passage-distill-splade-max.md b/docs/regressions-msmarco-passage-distill-splade-max.md index 36ecaeecc2..9b3d9df809 100644 --- a/docs/regressions-msmarco-passage-distill-splade-max.md +++ b/docs/regressions-msmarco-passage-distill-splade-max.md @@ -18,7 +18,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-distill-splade-max \ - -index indexes/lucene-index.msmarco-passage-distill-splade-max \ + -index indexes/lucene-index.msmarco-passage-distill-splade-max/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -impact -pretokenized \ >& logs/log.msmarco-passage-distill-splade-max & @@ -38,7 +38,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-distill-splade-max \ + -index 
indexes/lucene-index.msmarco-passage-distill-splade-max/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz -topicreader TsvInt \ -output runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \ -impact -pretokenized & diff --git a/docs/regressions-msmarco-passage-doc2query.md b/docs/regressions-msmarco-passage-doc2query.md index 76c39f0842..24052505fc 100644 --- a/docs/regressions-msmarco-passage-doc2query.md +++ b/docs/regressions-msmarco-passage-doc2query.md @@ -18,7 +18,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonCollection \ -input /path/to/msmarco-passage-doc2query \ - -index indexes/lucene-index.msmarco-passage-doc2query \ + -index indexes/lucene-index.msmarco-passage-doc2query/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-passage-doc2query & @@ -38,25 +38,25 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-doc2query \ + -index indexes/lucene-index.msmarco-passage-doc2query/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-doc2query.bm25-default.topics.msmarco-passage.dev-subset.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-doc2query \ + -index indexes/lucene-index.msmarco-passage-doc2query/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-doc2query.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-doc2query \ + -index 
indexes/lucene-index.msmarco-passage-doc2query/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-doc2query.bm25-tuned.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-doc2query \ + -index indexes/lucene-index.msmarco-passage-doc2query/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-doc2query.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -rm3 & diff --git a/docs/regressions-msmarco-passage-docTTTTTquery.md b/docs/regressions-msmarco-passage-docTTTTTquery.md index 162b772ef8..6b37c997d8 100644 --- a/docs/regressions-msmarco-passage-docTTTTTquery.md +++ b/docs/regressions-msmarco-passage-docTTTTTquery.md @@ -17,7 +17,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonCollection \ -input /path/to/msmarco-passage-docTTTTTquery \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-passage-docTTTTTquery & @@ -37,37 +37,37 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-default.topics.msmarco-passage.dev-subset.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index 
indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned2.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 2.18 -bm25.b 0.86 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-docTTTTTquery \ + -index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage-docTTTTTquery.bm25-tuned2+rm3.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 2.18 -bm25.b 0.86 -rm3 & diff --git a/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md 
b/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md index a323047c50..5fc200f55f 100644 --- a/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md +++ b/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md @@ -18,7 +18,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-unicoil-tilde-expansion \ - -index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion \ + -index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -impact -pretokenized \ >& logs/log.msmarco-passage-unicoil-tilde-expansion & @@ -38,7 +38,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion \ + -index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz -topicreader TsvInt \ -output runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \ -impact -pretokenized & diff --git a/docs/regressions-msmarco-passage-unicoil.md b/docs/regressions-msmarco-passage-unicoil.md index 4c4510b648..426a149450 100644 --- a/docs/regressions-msmarco-passage-unicoil.md +++ b/docs/regressions-msmarco-passage-unicoil.md @@ -18,7 +18,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-unicoil \ - -index indexes/lucene-index.msmarco-passage-unicoil \ + -index indexes/lucene-index.msmarco-passage-unicoil/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -impact -pretokenized \ >& logs/log.msmarco-passage-unicoil & @@ -38,7 +38,7 @@ After indexing has completed, you should be able to perform 
retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage-unicoil \ + -index indexes/lucene-index.msmarco-passage-unicoil/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tsv.gz -topicreader TsvInt \ -output runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv.gz \ -impact -pretokenized & diff --git a/docs/regressions-msmarco-passage.md b/docs/regressions-msmarco-passage.md index 9326eca732..b30a865bb1 100644 --- a/docs/regressions-msmarco-passage.md +++ b/docs/regressions-msmarco-passage.md @@ -14,7 +14,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonCollection \ -input /path/to/msmarco-passage \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-passage & @@ -34,49 +34,49 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics 
src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default+ax.topics.msmarco-passage.dev-subset.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-default+prf.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25prf & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output runs/run.msmarco-passage.bm25-tuned+ax.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-passage \ + -index indexes/lucene-index.msmarco-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -topicreader TsvInt \ -output 
runs/run.msmarco-passage.bm25-tuned+prf.topics.msmarco-passage.dev-subset.txt \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -bm25prf & diff --git a/docs/regressions-msmarco-v2-doc-segmented-unicoil-noexp-0shot.md b/docs/regressions-msmarco-v2-doc-segmented-unicoil-noexp-0shot.md index 3da1c35b49..fc48d0930c 100644 --- a/docs/regressions-msmarco-v2-doc-segmented-unicoil-noexp-0shot.md +++ b/docs/regressions-msmarco-v2-doc-segmented-unicoil-noexp-0shot.md @@ -14,7 +14,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-v2-doc-segmented-unicoil-noexp-0shot \ - -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -impact -pretokenized \ >& logs/log.msmarco-v2-doc-segmented-unicoil-noexp-0shot & @@ -32,12 +32,12 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.unicoil-noexp.0shot.tsv.gz -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented-unicoil-noexp-0shot.unicoil-noexp-0shot.topics.msmarco-v2-doc.dev.unicoil-noexp.0shot.tsv.gz \ -impact -pretokenized -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.unicoil-noexp.0shot.tsv.gz -topicreader TsvInt \ -output 
runs/run.msmarco-v2-doc-segmented-unicoil-noexp-0shot.unicoil-noexp-0shot.topics.msmarco-v2-doc.dev2.unicoil-noexp.0shot.tsv.gz \ -impact -pretokenized -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & diff --git a/docs/regressions-msmarco-v2-doc-segmented.md b/docs/regressions-msmarco-v2-doc-segmented.md index 01b4e60d1e..e684aa4216 100644 --- a/docs/regressions-msmarco-v2-doc-segmented.md +++ b/docs/regressions-msmarco-v2-doc-segmented.md @@ -15,7 +15,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MsMarcoV2DocCollection \ -input /path/to/msmarco-v2-doc-segmented \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-v2-doc-segmented & @@ -35,45 +35,45 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default.topics.msmarco-v2-doc.dev.txt \ -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default.topics.msmarco-v2-doc.dev2.txt \ -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + 
-index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default+rm3.topics.msmarco-v2-doc.dev.txt \ -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default+rm3.topics.msmarco-v2-doc.dev2.txt \ -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default+ax.topics.msmarco-v2-doc.dev.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default+ax.topics.msmarco-v2-doc.dev2.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics 
src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default+prf.topics.msmarco-v2-doc.dev.txt \ -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc-segmented \ + -index indexes/lucene-index.msmarco-v2-doc-segmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc-segmented.bm25-default+prf.topics.msmarco-v2-doc.dev2.txt \ -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 & diff --git a/docs/regressions-msmarco-v2-doc.md b/docs/regressions-msmarco-v2-doc.md index 8ae40241af..4f79faf401 100644 --- a/docs/regressions-msmarco-v2-doc.md +++ b/docs/regressions-msmarco-v2-doc.md @@ -15,7 +15,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MsMarcoV2DocCollection \ -input /path/to/msmarco-v2-doc \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-v2-doc & @@ -35,45 +35,45 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default.topics.msmarco-v2-doc.dev.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt -topicreader TsvInt \ -output 
runs/run.msmarco-v2-doc.bm25-default.topics.msmarco-v2-doc.dev2.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default+rm3.topics.msmarco-v2-doc.dev.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default+rm3.topics.msmarco-v2-doc.dev2.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default+ax.topics.msmarco-v2-doc.dev.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default+ax.topics.msmarco-v2-doc.dev2.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default+prf.topics.msmarco-v2-doc.dev.txt \ -bm25 -bm25prf & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-doc \ + -index indexes/lucene-index.msmarco-v2-doc/ \ -topics 
src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-doc.bm25-default+prf.topics.msmarco-v2-doc.dev2.txt \ -bm25 -bm25prf & diff --git a/docs/regressions-msmarco-v2-passage-augmented.md b/docs/regressions-msmarco-v2-passage-augmented.md index bb3c586fe1..a330c7082f 100644 --- a/docs/regressions-msmarco-v2-passage-augmented.md +++ b/docs/regressions-msmarco-v2-passage-augmented.md @@ -14,7 +14,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MsMarcoV2PassageCollection \ -input /path/to/msmarco-v2-passage-augmented \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-v2-passage-augmented & @@ -34,45 +34,45 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-augmented.bm25-default.topics.msmarco-v2-passage.dev.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-augmented.bm25-default.topics.msmarco-v2-passage.dev2.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt -topicreader TsvInt \ 
-output runs/run.msmarco-v2-passage-augmented.bm25-default+rm3.topics.msmarco-v2-passage.dev.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-augmented.bm25-default+rm3.topics.msmarco-v2-passage.dev2.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-augmented.bm25-default+ax.topics.msmarco-v2-passage.dev.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-augmented.bm25-default+ax.topics.msmarco-v2-passage.dev2.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-augmented.bm25-default+prf.topics.msmarco-v2-passage.dev.txt \ -bm25 -bm25prf & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-augmented \ + -index indexes/lucene-index.msmarco-v2-passage-augmented/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt -topicreader TsvInt \ -output 
runs/run.msmarco-v2-passage-augmented.bm25-default+prf.topics.msmarco-v2-passage.dev2.txt \ -bm25 -bm25prf & diff --git a/docs/regressions-msmarco-v2-passage-unicoil-noexp-0shot.md b/docs/regressions-msmarco-v2-passage-unicoil-noexp-0shot.md index 17764a47c0..7f1e90b0b4 100644 --- a/docs/regressions-msmarco-v2-passage-unicoil-noexp-0shot.md +++ b/docs/regressions-msmarco-v2-passage-unicoil-noexp-0shot.md @@ -14,7 +14,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-v2-passage-unicoil-noexp-0shot \ - -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -impact -pretokenized \ >& logs/log.msmarco-v2-passage-unicoil-noexp-0shot & @@ -32,12 +32,12 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.unicoil-noexp.0shot.tsv.gz -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-unicoil-noexp-0shot.unicoil-noexp-0shot.topics.msmarco-v2-passage.dev.unicoil-noexp.0shot.tsv.gz \ -impact -pretokenized & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot \ + -index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.unicoil-noexp.0shot.tsv.gz -topicreader TsvInt \ -output runs/run.msmarco-v2-passage-unicoil-noexp-0shot.unicoil-noexp-0shot.topics.msmarco-v2-passage.dev2.unicoil-noexp.0shot.tsv.gz \ -impact -pretokenized & diff --git a/docs/regressions-msmarco-v2-passage.md b/docs/regressions-msmarco-v2-passage.md 
index 1150be220d..de3de6107c 100644 --- a/docs/regressions-msmarco-v2-passage.md +++ b/docs/regressions-msmarco-v2-passage.md @@ -15,7 +15,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection MsMarcoV2PassageCollection \ -input /path/to/msmarco-v2-passage \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -generator DefaultLuceneDocumentGenerator \ -threads 18 -storePositions -storeDocvectors -storeRaw \ >& logs/log.msmarco-v2-passage & @@ -35,45 +35,45 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default.topics.msmarco-v2-passage.dev.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default.topics.msmarco-v2-passage.dev2.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default+rm3.topics.msmarco-v2-passage.dev.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default+rm3.topics.msmarco-v2-passage.dev2.txt \ 
-bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default+ax.topics.msmarco-v2-passage.dev.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default+ax.topics.msmarco-v2-passage.dev2.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default+prf.topics.msmarco-v2-passage.dev.txt \ -bm25 -bm25prf & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.msmarco-v2-passage \ + -index indexes/lucene-index.msmarco-v2-passage/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt -topicreader TsvInt \ -output runs/run.msmarco-v2-passage.bm25-default+prf.topics.msmarco-v2-passage.dev2.txt \ -bm25 -bm25prf & diff --git a/docs/regressions-ntcir8-zh.md b/docs/regressions-ntcir8-zh.md index 7fdb6c5e19..9dab361776 100644 --- a/docs/regressions-ntcir8-zh.md +++ b/docs/regressions-ntcir8-zh.md @@ -14,7 +14,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection CleanTrecCollection \ -input /path/to/ntcir8-zh \ - -index indexes/lucene-index.ntcir8-zh \ + -index indexes/lucene-index.ntcir8-zh/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -storePositions -storeDocvectors 
-storeRaw -language zh -uniqueDocid -optimize \ >& logs/log.ntcir8-zh & @@ -38,7 +38,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.ntcir8-zh \ + -index indexes/lucene-index.ntcir8-zh/ \ -topics src/main/resources/topics-and-qrels/topics.ntcir8zh.eval.txt -topicreader TsvString \ -output runs/run.ntcir8-zh.bm25.topics.ntcir8zh.eval.txt \ -bm25 -language zh & diff --git a/docs/regressions-robust05.md b/docs/regressions-robust05.md index 1d22933aa4..074b3abcc8 100644 --- a/docs/regressions-robust05.md +++ b/docs/regressions-robust05.md @@ -12,7 +12,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection TrecCollection \ -input /path/to/robust05 \ - -index indexes/lucene-index.robust05 \ + -index indexes/lucene-index.robust05/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -storePositions -storeDocvectors -storeRaw \ >& logs/log.robust05 & @@ -33,37 +33,37 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.robust05 \ + -index indexes/lucene-index.robust05/ \ -topics src/main/resources/topics-and-qrels/topics.robust05.txt -topicreader Trec \ -output runs/run.robust05.bm25.topics.robust05.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.robust05 \ + -index indexes/lucene-index.robust05/ \ -topics src/main/resources/topics-and-qrels/topics.robust05.txt -topicreader Trec \ -output runs/run.robust05.bm25+rm3.topics.robust05.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.robust05 \ + -index indexes/lucene-index.robust05/ \ -topics src/main/resources/topics-and-qrels/topics.robust05.txt -topicreader Trec \ -output runs/run.robust05.bm25+ax.topics.robust05.txt \ -bm25 -axiom -axiom.deterministic -rerankCutoff 20 & 
target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.robust05 \ + -index indexes/lucene-index.robust05/ \ -topics src/main/resources/topics-and-qrels/topics.robust05.txt -topicreader Trec \ -output runs/run.robust05.ql.topics.robust05.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.robust05 \ + -index indexes/lucene-index.robust05/ \ -topics src/main/resources/topics-and-qrels/topics.robust05.txt -topicreader Trec \ -output runs/run.robust05.ql+rm3.topics.robust05.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.robust05 \ + -index indexes/lucene-index.robust05/ \ -topics src/main/resources/topics-and-qrels/topics.robust05.txt -topicreader Trec \ -output runs/run.robust05.ql+ax.topics.robust05.txt \ -qld -axiom -axiom.deterministic -rerankCutoff 20 & diff --git a/docs/regressions-trec02-ar.md b/docs/regressions-trec02-ar.md index 161bf4a766..8f6342feb5 100644 --- a/docs/regressions-trec02-ar.md +++ b/docs/regressions-trec02-ar.md @@ -14,7 +14,7 @@ Typical indexing command: target/appassembler/bin/IndexCollection \ -collection CleanTrecCollection \ -input /path/to/trec02-ar \ - -index indexes/lucene-index.trec02-ar \ + -index indexes/lucene-index.trec02-ar/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -storePositions -storeDocvectors -storeRaw -language ar \ >& logs/log.trec02-ar & @@ -38,7 +38,7 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.trec02-ar \ + -index indexes/lucene-index.trec02-ar/ \ -topics src/main/resources/topics-and-qrels/topics.trec02ar-ar.txt -topicreader Trec \ -output runs/run.trec02-ar.bm25.topics.trec02ar-ar.txt \ -bm25 -language ar & diff --git a/docs/regressions-wt10g.md b/docs/regressions-wt10g.md index f4b20896be..a56782de18 100644 --- a/docs/regressions-wt10g.md +++ b/docs/regressions-wt10g.md @@ -12,7 +12,7 @@ 
Typical indexing command: target/appassembler/bin/IndexCollection \ -collection TrecwebCollection \ -input /path/to/wt10g \ - -index indexes/lucene-index.wt10g \ + -index indexes/lucene-index.wt10g/ \ -generator DefaultLuceneDocumentGenerator \ -threads 16 -storePositions -storeDocvectors -storeRaw \ >& logs/log.wt10g & @@ -33,37 +33,37 @@ After indexing has completed, you should be able to perform retrieval as follows ``` target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.wt10g \ + -index indexes/lucene-index.wt10g/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.451-550.txt -topicreader Trec \ -output runs/run.wt10g.bm25.topics.adhoc.451-550.txt \ -bm25 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.wt10g \ + -index indexes/lucene-index.wt10g/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.451-550.txt -topicreader Trec \ -output runs/run.wt10g.bm25+rm3.topics.adhoc.451-550.txt \ -bm25 -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.wt10g \ + -index indexes/lucene-index.wt10g/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.451-550.txt -topicreader Trec \ -output runs/run.wt10g.bm25+ax.topics.adhoc.451-550.txt \ -bm25 -axiom -axiom.beta 0.1 -axiom.deterministic -rerankCutoff 20 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.wt10g \ + -index indexes/lucene-index.wt10g/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.451-550.txt -topicreader Trec \ -output runs/run.wt10g.ql.topics.adhoc.451-550.txt \ -qld & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.wt10g \ + -index indexes/lucene-index.wt10g/ \ -topics src/main/resources/topics-and-qrels/topics.adhoc.451-550.txt -topicreader Trec \ -output runs/run.wt10g.ql+rm3.topics.adhoc.451-550.txt \ -qld -rm3 & target/appassembler/bin/SearchCollection \ - -index indexes/lucene-index.wt10g \ + -index indexes/lucene-index.wt10g/ \ -topics 
src/main/resources/topics-and-qrels/topics.adhoc.451-550.txt -topicreader Trec \
   -output runs/run.wt10g.ql+ax.topics.adhoc.451-550.txt \
   -qld -axiom -axiom.beta 0.1 -axiom.deterministic -rerankCutoff 20 &
diff --git a/src/main/python/run_regression.py b/src/main/python/run_regression.py
index b61e50e2ce..e2ec0f5c57 100644
--- a/src/main/python/run_regression.py
+++ b/src/main/python/run_regression.py
@@ -135,6 +135,7 @@ def construct_search_commands(yaml_data):
 def evaluate_and_verify(yaml_data, dry_run):
     fail_str = '\033[91m[FAIL]\033[0m '
     ok_str = ' [OK] '
+    failures = False
 
     logger.info('='*10 + ' Verifying Results: ' + yaml_data['corpus'] + ' ' + '='*10)
     for model in yaml_data['models']:
@@ -161,13 +162,14 @@ def evaluate_and_verify(yaml_data, dry_run):
                 if is_close(expected, actual):
                     logger.info(ok_str + result_str)
                 else:
-                    # Fail fast.
-                    logger.error(fail_str + result_str + ' - Failure encountered. Aborting!')
-                    sys.exit()
+                    logger.error(fail_str + result_str)
+                    failures = True
 
-    # If we've gotten to here and it's not a dry run, then all the runs have passed.
     if not dry_run:
-        logger.info("All Tests Passed!")
+        if failures:
+            logger.info('\033[91mFailed tests!\033[0m')
+        else:
+            logger.info("All Tests Passed!")
 
 
 def run_search(cmd):
diff --git a/src/main/resources/docgen/templates/dl19-doc-docTTTTTquery-per-doc.template b/src/main/resources/docgen/templates/dl19-doc-docTTTTTquery.template
similarity index 79%
rename from src/main/resources/docgen/templates/dl19-doc-docTTTTTquery-per-doc.template
rename to src/main/resources/docgen/templates/dl19-doc-docTTTTTquery.template
index d6aa6c1899..4fc44e5b0d 100644
--- a/src/main/resources/docgen/templates/dl19-doc-docTTTTTquery-per-doc.template
+++ b/src/main/resources/docgen/templates/dl19-doc-docTTTTTquery.template
@@ -1,4 +1,4 @@
-# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) w/ per-doc docTTTTTquery
+# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) w/ docTTTTTquery
 
 This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2019 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST.
 
@@ -10,11 +10,15 @@ Note that there are four different regression conditions for this task, and this
 + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing
 + **Expansion Condition:** doc2query-T5
 
-All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5.
+All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5.
 The exact configurations for these regressions are stored in [this YAML file](${yaml}).
 Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.
 
+Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -23,7 +27,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-docTTTTTquery/` should be a directory containing the expanded document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
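Reviewer note on the `run_regression.py` hunk above: `evaluate_and_verify` no longer exits on the first mismatch; it records the failure and reports once at the end. The sketch below shows that pattern in isolation. The name `verify_all`, the `(name, expected, actual)` tuples, and the tolerance are hypothetical, and `math.isclose` stands in for the script's own `is_close`, whose definition is not shown in this diff.

```python
import math


def verify_all(results, abs_tol=1e-4):
    """Check every (name, expected, actual) triple; never abort early.

    Returns (failures, report): `failures` is True if any check failed,
    and `report` lists an ("OK" | "FAIL", name) pair for each check.
    The tolerance is an assumption, not the value used by Anserini.
    """
    failures = False
    report = []
    for name, expected, actual in results:
        if math.isclose(expected, actual, abs_tol=abs_tol):
            report.append(("OK", name))
        else:
            # Record the failure and keep going, so that one bad score
            # does not hide the status of the remaining checks.
            report.append(("FAIL", name))
            failures = True
    return failures, report


if __name__ == "__main__":
    failed, report = verify_all([
        ("map", 0.25, 0.25),
        ("ndcg", 0.50, 0.60),
        ("mrr", 0.30, 0.30),
    ])
    for status, name in report:
        print(f"[{status}] {name}")
    print("Failed tests!" if failed else "All Tests Passed!")
```

One consequence worth noting for callers: because the script in the diff now logs the failure instead of calling `sys.exit()`, anything that relied on the process terminating early would need to inspect the log output instead.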
diff --git a/src/main/resources/docgen/templates/dl19-doc-docTTTTTquery-per-passage.template b/src/main/resources/docgen/templates/dl19-doc-segmented-docTTTTTquery.template similarity index 74% rename from src/main/resources/docgen/templates/dl19-doc-docTTTTTquery-per-passage.template rename to src/main/resources/docgen/templates/dl19-doc-segmented-docTTTTTquery.template index 2e1051c649..46a06a0edd 100644 --- a/src/main/resources/docgen/templates/dl19-doc-docTTTTTquery-per-passage.template +++ b/src/main/resources/docgen/templates/dl19-doc-segmented-docTTTTTquery.template @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) w/ per-passage docTTTTTquery +# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) Segmented w/ docTTTTTquery This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2019 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. @@ -10,12 +10,16 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. 
+In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -24,7 +28,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented-docTTTTTquery/` should be a directory containing the expanded segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
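Reviewer note on the MaxP description in the segmented templates: the idea can be sketched in a few lines of Python. This is an illustration only, not Anserini's implementation (which lives in `SearchCollection` behind `-selectMaxPassage`). The function name `max_passage` and the segment ids of the form `docid#n` are assumptions; the `#` delimiter mirrors the `-selectMaxPassage.delimiter "#"` option that appears in the removed templates in this diff.

```python
def max_passage(hits, delimiter="#", k=1000):
    """Collapse segment-level hits into a document ranking using MaxP.

    `hits` is an iterable of (segment_id, score) pairs, where segment_id
    is assumed to look like "<docid><delimiter><segment number>".
    Each document receives the score of its highest-scoring segment.
    """
    best = {}
    for segment_id, score in hits:
        # Everything before the first delimiter is taken as the document id.
        doc_id = segment_id.split(delimiter, 1)[0]
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Rank documents by their best segment score, highest first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    segment_hits = [
        ("D1#0", 1.0),
        ("D2#0", 2.5),
        ("D1#3", 3.0),
        ("D2#1", 0.5),
    ]
    for doc_id, score in max_passage(segment_hits, k=2):
        print(doc_id, score)
```

Because several segments can map to the same document, more segment hits than final documents have to be retrieved; the removed templates describe fetching 10000 passages to select the top 1000 documents.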
diff --git a/src/main/resources/docgen/templates/dl19-doc-per-passage.template b/src/main/resources/docgen/templates/dl19-doc-segmented.template similarity index 75% rename from src/main/resources/docgen/templates/dl19-doc-per-passage.template rename to src/main/resources/docgen/templates/dl19-doc-segmented.template index 138f722c43..2f225d5b82 100644 --- a/src/main/resources/docgen/templates/dl19-doc-per-passage.template +++ b/src/main/resources/docgen/templates/dl19-doc-segmented.template @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) +# Anserini: Regressions for [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) Segmented This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2019 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. @@ -10,12 +10,16 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** none -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. 
The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -24,7 +28,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented/` should be a directory containing the segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
diff --git a/src/main/resources/docgen/templates/dl19-doc.template b/src/main/resources/docgen/templates/dl19-doc.template index bebfa2142d..5a5292ee7b 100644 --- a/src/main/resources/docgen/templates/dl19-doc.template +++ b/src/main/resources/docgen/templates/dl19-doc.template @@ -10,11 +10,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** none -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -23,7 +27,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc/` should be a directory containing the official document collection (a single file), in TREC format. +The directory `/path/to/msmarco-doc/` should be a directory containing the document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. 
For additional details, see explanation of [common indexing options](common-indexing-options.md). diff --git a/src/main/resources/docgen/templates/dl20-doc-docTTTTTquery-per-doc.template b/src/main/resources/docgen/templates/dl20-doc-docTTTTTquery.template similarity index 80% rename from src/main/resources/docgen/templates/dl20-doc-docTTTTTquery-per-doc.template rename to src/main/resources/docgen/templates/dl20-doc-docTTTTTquery.template index 5864d15613..01541f4074 100644 --- a/src/main/resources/docgen/templates/dl20-doc-docTTTTTquery-per-doc.template +++ b/src/main/resources/docgen/templates/dl20-doc-docTTTTTquery.template @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) w/ per-doc docTTTTTquery +# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) w/ docTTTTTquery This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2020 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. @@ -10,11 +10,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. 
+Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -23,7 +27,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-docTTTTTquery/` should be a directory containing the expanded document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
diff --git a/src/main/resources/docgen/templates/dl20-doc-docTTTTTquery-per-passage.template b/src/main/resources/docgen/templates/dl20-doc-segmented-docTTTTTquery.template similarity index 74% rename from src/main/resources/docgen/templates/dl20-doc-docTTTTTquery-per-passage.template rename to src/main/resources/docgen/templates/dl20-doc-segmented-docTTTTTquery.template index 9dcb73b3f9..a5fafdbcf4 100644 --- a/src/main/resources/docgen/templates/dl20-doc-docTTTTTquery-per-passage.template +++ b/src/main/resources/docgen/templates/dl20-doc-segmented-docTTTTTquery.template @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) w/ per-passage docTTTTTquery +# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) Segmented w/ docTTTTTquery This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2020 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. @@ -10,12 +10,16 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. 
+In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -24,7 +28,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented-docTTTTTquery/` should be a directory containing the expanded segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
diff --git a/src/main/resources/docgen/templates/dl20-doc-per-passage.template b/src/main/resources/docgen/templates/dl20-doc-segmented.template similarity index 75% rename from src/main/resources/docgen/templates/dl20-doc-per-passage.template rename to src/main/resources/docgen/templates/dl20-doc-segmented.template index 9d3093c0fa..27bbeaa5f9 100644 --- a/src/main/resources/docgen/templates/dl20-doc-per-passage.template +++ b/src/main/resources/docgen/templates/dl20-doc-segmented.template @@ -1,4 +1,4 @@ -# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) +# Anserini: Regressions for [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) Segmented This page describes experiments, integrated into Anserini's regression testing framework, for the TREC 2020 Deep Learning Track (Document Ranking Task) on the MS MARCO document collection using relevance judgments from NIST. @@ -10,12 +10,16 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** none -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. 
The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -24,7 +28,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented/` should be a directory containing the segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
diff --git a/src/main/resources/docgen/templates/dl20-doc.template b/src/main/resources/docgen/templates/dl20-doc.template index 679201a0bc..d39af387e9 100644 --- a/src/main/resources/docgen/templates/dl20-doc.template +++ b/src/main/resources/docgen/templates/dl20-doc.template @@ -10,11 +10,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** none -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -23,7 +27,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc/` should be a directory containing the official document collection (a single file), in TREC format. +The directory `/path/to/msmarco-doc/` should be a directory containing the document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. 
For additional details, see explanation of [common indexing options](common-indexing-options.md). diff --git a/src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage-v3.template b/src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage-v3.template deleted file mode 100644 index 77ff70f2dd..0000000000 --- a/src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage-v3.template +++ /dev/null @@ -1,106 +0,0 @@ -# Anserini: Regressions for MS MARCO Document Ranking - -This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking), which is integrated into Anserini's regression testing framework. -Note that there are four different regression conditions for this task, and this page describes the following: - -+ **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing -+ **Expansion Condition:** doc2query-T5 - -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. - -**NOTE**: This is the `msmarco-doc-docTTTTTquery-per-passage-v3` variant (there's also `msmarco-doc-docTTTTTquery-per-passage`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. - -The exact configurations for these regressions are stored in [this YAML file](${yaml}). -Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. 
- -## Indexing - -Typical indexing command: - -``` -${index_cmds} -``` - -The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. - -For additional details, see explanation of [common indexing options](common-indexing-options.md). - -## Retrieval - -Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). -The regression experiments here evaluate on the 5193 dev set questions. - -After indexing has completed, you should be able to perform retrieval as follows: - -``` -${ranking_cmds} -``` - -Evaluation can be performed using `trec_eval`: - -``` -${eval_cmds} -``` - -## Effectiveness - -With the above commands, you should be able to reproduce the following results: - -${effectiveness} - -Explanation of settings: - -+ The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`. -+ The setting "tuned" refers to `k1=2.56`, `b=0.59`, tuned to optimize for recall@100 (i.e., for first-stage retrieval) on 2019/12. - -In these runs, we are retrieving the top 1000 hits for each query and using `trec_eval` to evaluate all 1000 hits. -Since we're in the passage condition, we fetch the 10000 passages and select the top 1000 documents using MaxP. -This lets us measure R@100 and R@1000; the latter is particularly important when these runs are used as first-stage retrieval. -Beware, an official MS MARCO document ranking task leaderboard submission comprises only 100 hits per query. -See [this page](experiments-msmarco-doc-leaderboard.md) for details on Anserini baseline runs that were submitted to the official leaderboard. - -The MaxP passage retrieval functionality is available in `SearchCollection`. 
-To generate an MS MARCO submission with the BM25 default parameters, corresponding to "BM25 (default)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage-v3.pos+docvectors+raw \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-default.txt -format msmarco \ - -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-default.txt - -##################### -MRR @100: 0.317905445196054 -QueriesRanked: 5193 -##################### -``` - -Note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. 
- -To generate an MS MARCO submission with the BM25 tuned parameters, corresponding to "BM25 (tuned)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage-v3.pos+docvectors+raw \ - -output runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-tuned.txt -format msmarco \ - -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-docTTTTTquery-per-passage-v3.bm25-tuned.txt - -##################### -MRR @100: 0.3209184381409182 -QueriesRanked: 5193 -##################### -``` - -Again, note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. \ No newline at end of file diff --git a/src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-doc.template b/src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery.template similarity index 85% rename from src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-doc.template rename to src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery.template index e447ab706d..17cff0dab3 100644 --- a/src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-doc.template +++ b/src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery.template @@ -6,11 +6,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. 
+All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -19,7 +23,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-docTTTTTquery/` should be a directory containing the expanded document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
diff --git a/src/main/resources/docgen/templates/msmarco-doc-per-passage-v2.template b/src/main/resources/docgen/templates/msmarco-doc-per-passage-v2.template deleted file mode 100644 index 9af82d7f6c..0000000000 --- a/src/main/resources/docgen/templates/msmarco-doc-per-passage-v2.template +++ /dev/null @@ -1,106 +0,0 @@ -# Anserini: Regressions for MS MARCO Document Ranking - -This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking), which is integrated into Anserini's regression testing framework. -Note that there are four different regression conditions for this task, and this page describes the following: - -+ **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing -+ **Expansion Condition:** none - -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. - -**NOTE**: This is the `msmarco-doc-per-passage-v2` variant (there's also `msmarco-doc-per-passage` and `msmarco-doc-per-passage-v3`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. - -The exact configurations for these regressions are stored in [this YAML file](${yaml}). -Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. 
- -## Indexing - -Typical indexing command: - -``` -${index_cmds} -``` - -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. - -For additional details, see explanation of [common indexing options](common-indexing-options.md). - -## Retrieval - -Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). -The regression experiments here evaluate on the 5193 dev set questions. - -After indexing has completed, you should be able to perform retrieval as follows: - -``` -${ranking_cmds} -``` - -Evaluation can be performed using `trec_eval`: - -``` -${eval_cmds} -``` - -## Effectiveness - -With the above commands, you should be able to reproduce the following results: - -${effectiveness} - -Explanation of settings: - -+ The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`. -+ The setting "tuned" refers to `k1=2.16`, `b=0.61`, tuned to optimize for recall@100 (i.e., for first-stage retrieval) on 2019/12. - -In these runs, we are retrieving the top 1000 hits for each query and using `trec_eval` to evaluate all 1000 hits. -Since we're in the passage condition, we fetch the 10000 passages and select the top 1000 documents using MaxP. -This lets us measure R@100 and R@1000; the latter is particularly important when these runs are used as first-stage retrieval. -Beware, an official MS MARCO document ranking task leaderboard submission comprises only 100 hits per query. -See [this page](experiments-msmarco-doc-leaderboard.md) for details on Anserini baseline runs that were submitted to the official leaderboard. - -The MaxP passage retrieval functionality is available in `SearchCollection`. 
-To generate an MS MARCO submission with the BM25 default parameters, corresponding to "BM25 (default)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2.pos+docvectors+raw \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-default.txt -format msmarco \ - -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-per-passage-v2.bm25-default.txt - -##################### -MRR @100: 0.26029445206377066 -QueriesRanked: 5193 -##################### -``` - -Note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. - -To generate an MS MARCO submission with the BM25 tuned parameters, corresponding to "BM25 (tuned)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-per-passage-v2.pos+docvectors+raw \ - -output runs/run.msmarco-doc-per-passage-v2.bm25-tuned.txt -format msmarco \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-per-passage-v2.bm25-tuned.txt - -##################### -MRR @100: 0.2633426142578288 -QueriesRanked: 5193 -##################### -``` - -Again, note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. 
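The removed template above reports `MRR @100` over the 5193 dev queries. As a reading aid, here is a minimal sketch of how mean reciprocal rank at a cutoff is typically computed. It is not the code of `msmarco_doc_eval.py`: the function name `mrr_at_k`, the in-memory `run` and `qrels` structures, and the choice to average over the queries present in the run are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def mrr_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 100) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    Hypothetical helper: `run` maps query id -> ranked doc ids,
    `qrels` maps query id -> set of relevant doc ids.
    """
    total = 0.0
    for qid, ranked_docs in run.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    # Assumption: normalize by the number of queries in the run.
    return total / len(run) if run else 0.0


# Toy data, not taken from the MS MARCO dev set.
run = {"q1": ["D3", "D1"], "q2": ["D2"], "q3": ["D9"]}
qrels = {"q1": {"D1"}, "q2": {"D2"}, "q3": {"D4"}}
score = mrr_at_k(run, qrels, k=100)
print(score)  # 0.5
```

In the toy data, q1 finds its relevant document at rank 2 (0.5), q2 at rank 1 (1.0), and q3 not at all (0.0), giving a mean of 0.5.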
diff --git a/src/main/resources/docgen/templates/msmarco-doc-per-passage-v3.template b/src/main/resources/docgen/templates/msmarco-doc-per-passage-v3.template deleted file mode 100644 index 92c4ec6507..0000000000 --- a/src/main/resources/docgen/templates/msmarco-doc-per-passage-v3.template +++ /dev/null @@ -1,106 +0,0 @@ -# Anserini: Regressions for MS MARCO Document Ranking - -This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking), which is integrated into Anserini's regression testing framework. -Note that there are four different regression conditions for this task, and this page describes the following: - -+ **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing -+ **Expansion Condition:** none - -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. - -**NOTE**: This is the `msmarco-doc-per-passage-v3` variant (there's also `msmarco-doc-per-passage` and `msmarco-doc-per-passage-v2`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. - -The exact configurations for these regressions are stored in [this YAML file](${yaml}). -Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. 
- -## Indexing - -Typical indexing command: - -``` -${index_cmds} -``` - -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. - -For additional details, see explanation of [common indexing options](common-indexing-options.md). - -## Retrieval - -Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). -The regression experiments here evaluate on the 5193 dev set questions. - -After indexing has completed, you should be able to perform retrieval as follows: - -``` -${ranking_cmds} -``` - -Evaluation can be performed using `trec_eval`: - -``` -${eval_cmds} -``` - -## Effectiveness - -With the above commands, you should be able to reproduce the following results: - -${effectiveness} - -Explanation of settings: - -+ The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`. -+ The setting "tuned" refers to `k1=2.16`, `b=0.61`, tuned to optimize for recall@100 (i.e., for first-stage retrieval) on 2019/12. - -In these runs, we are retrieving the top 1000 hits for each query and using `trec_eval` to evaluate all 1000 hits. -Since we're in the passage condition, we fetch the 10000 passages and select the top 1000 documents using MaxP. -This lets us measure R@100 and R@1000; the latter is particularly important when these runs are used as first-stage retrieval. -Beware, an official MS MARCO document ranking task leaderboard submission comprises only 100 hits per query. -See [this page](experiments-msmarco-doc-leaderboard.md) for details on Anserini baseline runs that were submitted to the official leaderboard. - -The MaxP passage retrieval functionality is available in `SearchCollection`. 
-To generate an MS MARCO submission with the BM25 default parameters, corresponding to "BM25 (default)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3.pos+docvectors+raw \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-default.txt -format msmarco \ - -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-per-passage-v3.bm25-default.txt - -##################### -MRR @100: 0.26851990908986706 -QueriesRanked: 5193 -##################### -``` - -Note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. - -To generate an MS MARCO submission with the BM25 tuned parameters, corresponding to "BM25 (tuned)" above: - -```bash -$ target/appassembler/bin/SearchCollection -topicreader TsvString \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ - -index indexes/lucene-index.msmarco-doc-per-passage-v3.pos+docvectors+raw \ - -output runs/run.msmarco-doc-per-passage-v3.bm25-tuned.txt -format msmarco \ - -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 1000 \ - -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 - -$ python tools/scripts/msmarco/msmarco_doc_eval.py \ - --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc-per-passage-v3.bm25-tuned.txt - -##################### -MRR @100: 0.27551963417683756 -QueriesRanked: 5193 -##################### -``` - -Again, note that the above command uses `-format msmarco` to directly generate a run in the MS MARCO output format. 
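Both removed templates describe MaxP: the score of a document is the score of its highest-scoring passage. The sketch below illustrates that aggregation under stated assumptions. It is not Anserini's `SearchCollection` implementation; the function name `select_max_passage` is hypothetical, and passage ids are assumed to have the form `docid#segment`, which is consistent with the `-selectMaxPassage.delimiter "#"` option shown above but is not confirmed by this diff.

```python
from typing import Dict, List, Tuple


def select_max_passage(
    passage_hits: List[Tuple[str, float]],
    delimiter: str = "#",
    max_docs: int = 100,
) -> List[Tuple[str, float]]:
    """Collapse passage hits into a document ranking using MaxP.

    Hypothetical helper: each hit is (passage_id, score), and the document id
    is assumed to be the part of passage_id before the first delimiter.
    """
    best: Dict[str, float] = {}
    for passage_id, score in passage_hits:
        doc_id = passage_id.split(delimiter, 1)[0]
        # Keep only the best passage score seen so far for this document.
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Sort by descending score; break ties by doc id for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_docs]


# Toy passage hits, not real retrieval output.
hits = [("D1#0", 9.1), ("D2#3", 8.7), ("D1#2", 8.2), ("D3#1", 7.5), ("D2#0", 7.9)]
doc_ranking = select_max_passage(hits, max_docs=2)
print(doc_ranking)  # [('D1', 9.1), ('D2', 8.7)]
```

This also shows why the regression YAML files fetch many more passages (`-hits 10000`) than the number of documents finally kept (`-selectMaxPassage.hits 100`): several passages can collapse into a single document, so the passage list has to be deeper than the document list.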
diff --git a/src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage.template b/src/main/resources/docgen/templates/msmarco-doc-segmented-docTTTTTquery.template similarity index 85% rename from src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage.template rename to src/main/resources/docgen/templates/msmarco-doc-segmented-docTTTTTquery.template index 31e1f226c5..b4a8243859 100644 --- a/src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage.template +++ b/src/main/resources/docgen/templates/msmarco-doc-segmented-docTTTTTquery.template @@ -6,14 +6,16 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** doc2query-T5 -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. - -**NOTE**: This is the `msmarco-doc-docTTTTTquery-per-passage` variant (there's also `msmarco-doc-docTTTTTquery-per-passage-v3`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. The exact configurations for these regressions are stored in [this YAML file](${yaml}). 
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -22,7 +24,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented-docTTTTTquery/` should be a directory containing the expanded segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
diff --git a/src/main/resources/docgen/templates/msmarco-doc-per-passage.template b/src/main/resources/docgen/templates/msmarco-doc-segmented.template similarity index 84% rename from src/main/resources/docgen/templates/msmarco-doc-per-passage.template rename to src/main/resources/docgen/templates/msmarco-doc-segmented.template index d2bed3f107..b3336f9915 100644 --- a/src/main/resources/docgen/templates/msmarco-doc-per-passage.template +++ b/src/main/resources/docgen/templates/msmarco-doc-segmented.template @@ -6,14 +6,16 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is first segmented into passages, each passage is treated as a unit of indexing + **Expansion Condition:** none -In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. - -**NOTE**: This is the `msmarco-doc-per-passage` variant (there's also `msmarco-doc-per-passage-v2` and `msmarco-doc-per-passage-v3`), see [this page](experiments-msmarco-doc-doc2query-details.md) for detailed notes about differences between these variants. +All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. +In the passage (i.e., segment) indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique. The exact configurations for these regressions are stored in [this YAML file](${yaml}). 
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -22,7 +24,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc-per-passage/` should be a directory containing the segmented paragraph collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection. +The directory `/path/to/msmarco-doc-segmented/` should be a directory containing the segmented corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). diff --git a/src/main/resources/docgen/templates/msmarco-doc.template b/src/main/resources/docgen/templates/msmarco-doc.template index 005a026f2d..371e266420 100644 --- a/src/main/resources/docgen/templates/msmarco-doc.template +++ b/src/main/resources/docgen/templates/msmarco-doc.template @@ -6,11 +6,15 @@ Note that there are four different regression conditions for this task, and this + **Indexing Condition:** each MS MARCO document is treated as a unit of indexing + **Expansion Condition:** none -All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5. 
+All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery), in the context of doc2query-T5. The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. +Note that in November 2021 we discovered issues in our regression tests, documented [here](experiments-msmarco-doc-doc2query-details.md). +As a result, we have had to rebuild all our regressions from the raw corpus. +These new versions yield end-to-end scores that are slightly different, so if numbers reported in a paper do not exactly match the numbers here, this may be the reason. + ## Indexing Typical indexing command: @@ -19,7 +23,8 @@ Typical indexing command: ${index_cmds} ``` -The directory `/path/to/msmarco-doc/` should be a directory containing the official document collection (a single file), in TREC format. +The directory `/path/to/msmarco-doc/` should be a directory containing the document corpus in Anserini's jsonl format. +See [this page](experiments-msmarco-doc-doc2query-details.md) for how to prepare the corpus. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
diff --git a/src/main/resources/regression/backgroundlinking18.yaml b/src/main/resources/regression/backgroundlinking18.yaml index ecd2fe3aab..8d1f41050a 100644 --- a/src/main/resources/regression/backgroundlinking18.yaml +++ b/src/main/resources/regression/backgroundlinking18.yaml @@ -2,7 +2,7 @@ corpus: wapo.v2 corpus_path: collections/newswire/WashingtonPost.v2/data/ -index_path: indexes/lucene-index.wapo.v2 +index_path: indexes/lucene-index.wapo.v2/ collection_class: WashingtonPostCollection generator_class: WashingtonPostGenerator index_threads: 1 diff --git a/src/main/resources/regression/backgroundlinking19.yaml b/src/main/resources/regression/backgroundlinking19.yaml index 3b4ba22c70..8b41e4efb9 100644 --- a/src/main/resources/regression/backgroundlinking19.yaml +++ b/src/main/resources/regression/backgroundlinking19.yaml @@ -2,7 +2,7 @@ corpus: wapo.v2 corpus_path: collections/newswire/WashingtonPost.v2/data/ -index_path: indexes/lucene-index.wapo.v2 +index_path: indexes/lucene-index.wapo.v2/ collection_class: WashingtonPostCollection generator_class: WashingtonPostGenerator index_threads: 1 diff --git a/src/main/resources/regression/backgroundlinking20.yaml b/src/main/resources/regression/backgroundlinking20.yaml index 6f968ba68c..071d7faa41 100644 --- a/src/main/resources/regression/backgroundlinking20.yaml +++ b/src/main/resources/regression/backgroundlinking20.yaml @@ -2,7 +2,7 @@ corpus: wapo.v3 corpus_path: collections/newswire/WashingtonPost.v3/data/ -index_path: indexes/lucene-index.wapo.v3 +index_path: indexes/lucene-index.wapo.v3/ collection_class: WashingtonPostCollection generator_class: WashingtonPostGenerator index_threads: 1 diff --git a/src/main/resources/regression/cacm.yaml b/src/main/resources/regression/cacm.yaml index 900ef7ff84..1095f84b6a 100644 --- a/src/main/resources/regression/cacm.yaml +++ b/src/main/resources/regression/cacm.yaml @@ -2,7 +2,7 @@ corpus: cacm corpus_path: src/main/resources/cacm/ -index_path: 
indexes/lucene-index.cacm +index_path: indexes/lucene-index.cacm/ collection_class: HtmlCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 8 diff --git a/src/main/resources/regression/car17v1.5.yaml b/src/main/resources/regression/car17v1.5.yaml index caac95df2c..4ddbea3337 100644 --- a/src/main/resources/regression/car17v1.5.yaml +++ b/src/main/resources/regression/car17v1.5.yaml @@ -2,7 +2,7 @@ corpus: car-paragraphCorpus.v1.5 corpus_path: collections/car/paragraphCorpus.v1.5/ -index_path: indexes/lucene-index.car-paragraphCorpus.v1.5 +index_path: indexes/lucene-index.car-paragraphCorpus.v1.5/ collection_class: CarCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/car17v2.0-doc2query.yaml b/src/main/resources/regression/car17v2.0-doc2query.yaml index ee8274e868..6a52c9c5d1 100644 --- a/src/main/resources/regression/car17v2.0-doc2query.yaml +++ b/src/main/resources/regression/car17v2.0-doc2query.yaml @@ -2,7 +2,7 @@ corpus: car-paragraphCorpus.v2.0-doc2query corpus_path: collections/car/paragraphCorpus.v2.0-expanded-topk10/ -index_path: indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query +index_path: indexes/lucene-index.car-paragraphCorpus.v2.0-doc2query/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 30 diff --git a/src/main/resources/regression/car17v2.0.yaml b/src/main/resources/regression/car17v2.0.yaml index eed8c60a59..95ae7e54b9 100644 --- a/src/main/resources/regression/car17v2.0.yaml +++ b/src/main/resources/regression/car17v2.0.yaml @@ -2,7 +2,7 @@ corpus: car-paragraphCorpus.v2.0 corpus_path: collections/car/paragraphCorpus.v2.0/ -index_path: indexes/lucene-index.car-paragraphCorpus.v2.0 +index_path: indexes/lucene-index.car-paragraphCorpus.v2.0/ collection_class: CarCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/clef06-fr.yaml 
b/src/main/resources/regression/clef06-fr.yaml index fbe2abc5f5..731aff95d5 100644 --- a/src/main/resources/regression/clef06-fr.yaml +++ b/src/main/resources/regression/clef06-fr.yaml @@ -1,8 +1,8 @@ --- corpus: clef06-fr -corpus_path: collections/newswire/clir/clef2006-fr.json +corpus_path: collections/newswire/clir/clef2006-fr.json/ -index_path: indexes/lucene-index.clef06-fr +index_path: indexes/lucene-index.clef06-fr/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff --git a/src/main/resources/regression/core17.yaml b/src/main/resources/regression/core17.yaml index 7f21294f28..b078c52eb6 100644 --- a/src/main/resources/regression/core17.yaml +++ b/src/main/resources/regression/core17.yaml @@ -2,7 +2,7 @@ corpus: nyt corpus_path: collections/newswire/NYTcorpus/ -index_path: indexes/lucene-index.nyt +index_path: indexes/lucene-index.nyt/ collection_class: NewYorkTimesCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff --git a/src/main/resources/regression/core18.yaml b/src/main/resources/regression/core18.yaml index 4220c6d75a..f925461df8 100644 --- a/src/main/resources/regression/core18.yaml +++ b/src/main/resources/regression/core18.yaml @@ -2,7 +2,7 @@ corpus: wapo.v2 corpus_path: collections/newswire/WashingtonPost.v2/data/ -index_path: indexes/lucene-index.wapo.v2 +index_path: indexes/lucene-index.wapo.v2/ collection_class: WashingtonPostCollection generator_class: WashingtonPostGenerator index_threads: 1 diff --git a/src/main/resources/regression/cw09b.yaml b/src/main/resources/regression/cw09b.yaml index cbf8c973af..55f84805c7 100644 --- a/src/main/resources/regression/cw09b.yaml +++ b/src/main/resources/regression/cw09b.yaml @@ -2,7 +2,7 @@ corpus: cw09b corpus_path: collections/web/ClueWeb09b/ -index_path: indexes/lucene-index.cw09b +index_path: indexes/lucene-index.cw09b/ collection_class: ClueWeb09Collection generator_class: DefaultLuceneDocumentGenerator 
index_threads: 44 diff --git a/src/main/resources/regression/cw12.yaml b/src/main/resources/regression/cw12.yaml index 0fdc024cbe..754f52d502 100644 --- a/src/main/resources/regression/cw12.yaml +++ b/src/main/resources/regression/cw12.yaml @@ -2,7 +2,7 @@ corpus: cw12 corpus_path: collections/web/ClueWeb12/ -index_path: indexes/lucene-index.cw12 +index_path: indexes/lucene-index.cw12/ collection_class: ClueWeb12Collection generator_class: DefaultLuceneDocumentGenerator index_threads: 44 diff --git a/src/main/resources/regression/cw12b13.yaml b/src/main/resources/regression/cw12b13.yaml index 8ac14fb729..284ae49312 100644 --- a/src/main/resources/regression/cw12b13.yaml +++ b/src/main/resources/regression/cw12b13.yaml @@ -2,7 +2,7 @@ corpus: cw12b13 corpus_path: collections/web/ClueWeb12-B13/ -index_path: indexes/lucene-index.cw12b13 +index_path: indexes/lucene-index.cw12b13/ collection_class: ClueWeb12Collection generator_class: DefaultLuceneDocumentGenerator index_threads: 44 diff --git a/src/main/resources/regression/disk12.yaml b/src/main/resources/regression/disk12.yaml index a5dd12302e..a8dcd9e810 100644 --- a/src/main/resources/regression/disk12.yaml +++ b/src/main/resources/regression/disk12.yaml @@ -2,7 +2,7 @@ corpus: disk12 corpus_path: collections/newswire/disk12/ -index_path: indexes/lucene-index.disk12 +index_path: indexes/lucene-index.disk12/ collection_class: TrecCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff --git a/src/main/resources/regression/disk45.yaml b/src/main/resources/regression/disk45.yaml index 4d4e893635..9fdf8fc7cc 100644 --- a/src/main/resources/regression/disk45.yaml +++ b/src/main/resources/regression/disk45.yaml @@ -2,7 +2,7 @@ corpus: disk45 corpus_path: collections/newswire/disk45/ -index_path: indexes/lucene-index.disk45 +index_path: indexes/lucene-index.disk45/ collection_class: TrecCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff --git 
a/src/main/resources/regression/dl19-doc-docTTTTTquery-per-doc.yaml b/src/main/resources/regression/dl19-doc-docTTTTTquery.yaml similarity index 83% rename from src/main/resources/regression/dl19-doc-docTTTTTquery-per-doc.yaml rename to src/main/resources/regression/dl19-doc-docTTTTTquery.yaml index a47bc460f8..bdb464b733 100644 --- a/src/main/resources/regression/dl19-doc-docTTTTTquery-per-doc.yaml +++ b/src/main/resources/regression/dl19-doc-docTTTTTquery.yaml @@ -1,16 +1,16 @@ --- -corpus: msmarco-doc-docTTTTTquery-per-doc -corpus_path: collections/msmarco/doc-docTTTTTquery-per-doc +corpus: msmarco-doc-docTTTTTquery +corpus_path: collections/msmarco/msmarco-doc-docTTTTTquery/ -index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc +index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 7 index_options: -storePositions -storeDocvectors -storeRaw index_stats: - documents: 3213834 - documents (non-empty): 3213834 - total terms: 3748332076 + documents: 3213835 + documents (non-empty): 3213835 + total terms: 3748333319 metrics: - metric: MAP @@ -50,7 +50,7 @@ models: params: -bm25 -hits 100 # Note, this is different DL 2019 passage ranking! results: MAP: - - 0.2699 + - 0.2700 nDCG@10: - 0.5968 R@100: @@ -60,9 +60,9 @@ models: params: -bm25 -rm3 -hits 100 # Note, this is different DL 2019 passage ranking! results: MAP: - - 0.3044 + - 0.3045 nDCG@10: - - 0.5895 + - 0.5897 R@100: - 0.4465 - name: bm25-tuned @@ -72,7 +72,7 @@ models: MAP: - 0.2620 nDCG@10: - - 0.5967 + - 0.5972 R@100: - 0.3992 - name: bm25-tuned+rm3 @@ -80,8 +80,8 @@ models: params: -bm25 -bm25.k1 4.68 -bm25.b 0.87 -rm3 -hits 100 # Note, this is different DL 2019 passage ranking! 
results: MAP: - - 0.2812 + - 0.2814 nDCG@10: - - 0.6075 + - 0.6080 R@100: - 0.4119 \ No newline at end of file diff --git a/src/main/resources/regression/dl19-doc-docTTTTTquery-per-passage.yaml b/src/main/resources/regression/dl19-doc-segmented-docTTTTTquery.yaml similarity index 79% rename from src/main/resources/regression/dl19-doc-docTTTTTquery-per-passage.yaml rename to src/main/resources/regression/dl19-doc-segmented-docTTTTTquery.yaml index f6f2b63fc1..25253cda3a 100644 --- a/src/main/resources/regression/dl19-doc-docTTTTTquery-per-passage.yaml +++ b/src/main/resources/regression/dl19-doc-segmented-docTTTTTquery.yaml @@ -1,16 +1,16 @@ --- -corpus: msmarco-doc-docTTTTTquery-per-passage -corpus_path: collections/msmarco/doc-docTTTTTquery-per-passage +corpus: msmarco-doc-segmented-docTTTTTquery +corpus_path: collections/msmarco/msmarco-doc-segmented-docTTTTTquery/ -index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage +index_path: indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 16 index_options: -storePositions -storeDocvectors -storeRaw index_stats: - documents: 20544550 - documents (non-empty): 20544550 - total terms: 4203956960 + documents: 20545677 + documents (non-empty): 20545677 + total terms: 4206639543 metrics: - metric: MAP @@ -50,38 +50,38 @@ models: params: -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2791 + - 0.2798 nDCG@10: - - 0.6099 + - 0.6119 R@100: - - 0.4092 + - 0.4093 - name: bm25-default+rm3 display: +RM3 params: -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3025 + - 0.3021 nDCG@10: - - 0.6318 + - 0.6297 R@100: - - 0.4394 + - 0.4392 - name: bm25-tuned display: BM25 (tuned) params: -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 10000 -selectMaxPassage 
-selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2655 + - 0.2658 nDCG@10: - - 0.6271 + - 0.6273 R@100: - - 0.4020 + - 0.4026 - name: bm25-tuned+rm3 display: +RM3 params: -bm25 -bm25.k1 2.56 -bm25.b 0.59 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2895 + - 0.2893 nDCG@10: - - 0.6256 + - 0.6239 R@100: - - 0.4235 \ No newline at end of file + - 0.4237 \ No newline at end of file diff --git a/src/main/resources/regression/dl19-doc-per-passage.yaml b/src/main/resources/regression/dl19-doc-segmented.yaml similarity index 82% rename from src/main/resources/regression/dl19-doc-per-passage.yaml rename to src/main/resources/regression/dl19-doc-segmented.yaml index 478684d48d..ce5b32ce9d 100644 --- a/src/main/resources/regression/dl19-doc-per-passage.yaml +++ b/src/main/resources/regression/dl19-doc-segmented.yaml @@ -1,16 +1,16 @@ --- -corpus: msmarco-doc-per-passage -corpus_path: collections/msmarco/doc-per-passage/ +corpus: msmarco-doc-segmented +corpus_path: collections/msmarco/msmarco-doc-segmented/ -index_path: indexes/lucene-index.msmarco-doc-per-passage +index_path: indexes/lucene-index.msmarco-doc-segmented/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 16 index_options: -storePositions -storeDocvectors -storeRaw index_stats: - documents: 20544550 - documents (non-empty): 20544550 - total terms: 3197886407 + documents: 20545677 + documents (non-empty): 20545677 + total terms: 3200515914 metrics: - metric: MAP @@ -50,9 +50,9 @@ models: params: -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2441 + - 0.2449 nDCG@10: - - 0.5276 + - 0.5302 R@100: - 0.3840 - name: bm25-default+rm3 @@ -60,39 +60,39 @@ models: params: -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2880 + 
- 0.2884 nDCG@10: - - 0.5750 + - 0.5764 R@100: - - 0.4356 + - 0.4355 - name: bm25-default+ax display: +Ax params: -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3015 + - 0.2981 nDCG@10: - - 0.5590 + - 0.5556 R@100: - - 0.4501 + - 0.4490 - name: bm25-default+prf display: +PRF params: -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2821 + - 0.2827 nDCG@10: - - 0.5591 + - 0.5599 R@100: - - 0.4477 + - 0.4476 - name: bm25-tuned display: BM25 (tuned) params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2394 + - 0.2398 nDCG@10: - - 0.5364 + - 0.5389 R@100: - 0.3903 - name: bm25-tuned+rm3 @@ -100,28 +100,28 @@ models: params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2656 + - 0.2658 nDCG@10: - - 0.5379 + - 0.5405 R@100: - - 0.4126 + - 0.4133 - name: bm25-tuned+ax display: +Ax params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2934 + - 0.2975 nDCG@10: - - 0.5546 + - 0.5574 R@100: - - 0.4437 + - 0.4491 - name: bm25-tuned+prf display: +PRF params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.2838 + - 0.2828 nDCG@10: - - 0.5478 + - 0.5476 R@100: - - 0.4362 \ No newline at end of file + - 0.4361 \ No newline at end of file diff --git a/src/main/resources/regression/dl19-doc.yaml b/src/main/resources/regression/dl19-doc.yaml index d72635896c..eb17d15689 100644 --- a/src/main/resources/regression/dl19-doc.yaml +++ 
b/src/main/resources/regression/dl19-doc.yaml @@ -1,16 +1,16 @@ --- corpus: msmarco-doc -corpus_path: collections/msmarco/doc/ +corpus_path: collections/msmarco/msmarco-doc/ -index_path: indexes/lucene-index.msmarco-doc -collection_class: CleanTrecCollection +index_path: indexes/lucene-index.msmarco-doc/ +collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 7 index_options: -storePositions -storeDocvectors -storeRaw index_stats: documents: 3213835 documents (non-empty): 3213835 - total terms: 2748636047 + total terms: 2742209690 metrics: - metric: MAP @@ -50,19 +50,19 @@ models: params: -bm25 -hits 100 # Note, this is different DL 2019 passage ranking! results: MAP: - - 0.2443 + - 0.2434 nDCG@10: - - 0.5190 + - 0.5176 R@100: - - 0.3948 + - 0.3949 - name: bm25-default+rm3 display: +RM3 params: -bm25 -rm3 -hits 100 # Note, this is different DL 2019 passage ranking! results: MAP: - - 0.2772 + - 0.2774 nDCG@10: - - 0.5169 + - 0.5170 R@100: - 0.4189 - name: bm25-default+ax @@ -70,11 +70,11 @@ models: params: -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 100 # Note, this is different DL 2019 passage ranking! results: MAP: - - 0.2452 + - 0.2454 nDCG@10: - - 0.4730 + - 0.4732 R@100: - - 0.3945 + - 0.3946 - name: bm25-default+prf display: +PRF params: -bm25 -bm25prf -hits 100 # Note, this is different DL 2019 passage ranking! @@ -82,46 +82,46 @@ models: MAP: - 0.2541 nDCG@10: - - 0.5105 + - 0.5107 R@100: - - 0.4004 + - 0.4003 - name: bm25-tuned display: BM25 (tuned) params: -bm25 -bm25.k1 3.44 -bm25.b 0.87 -hits 100 # Note, this is different DL 2019 passage ranking! results: MAP: - - 0.2318 + - 0.2311 nDCG@10: - - 0.5140 + - 0.5139 R@100: - - 0.3862 + - 0.3853 - name: bm25-tuned+rm3 display: +RM3 params: -bm25 -bm25.k1 3.44 -bm25.b 0.87 -rm3 -hits 100 # Note, this is different DL 2019 passage ranking! 
results: MAP: - - 0.2700 + - 0.2684 nDCG@10: - - 0.5485 + - 0.5445 R@100: - - 0.4193 + - 0.4186 - name: bm25-tuned+ax display: +Ax params: -bm25 -bm25.k1 3.44 -bm25.b 0.87 -axiom -axiom.deterministic -rerankCutoff 20 -hits 100 # Note, this is different DL 2019 passage ranking! results: MAP: - - 0.2816 + - 0.2792 nDCG@10: - - 0.5245 + - 0.5203 R@100: - - 0.4399 + - 0.4378 - name: bm25-tuned+prf display: +PRF params: -bm25 -bm25.k1 3.44 -bm25.b 0.87 -bm25prf -hits 100 # Note, this is different DL 2019 passage ranking! results: MAP: - - 0.2758 + - 0.2774 nDCG@10: - - 0.5280 + - 0.5294 R@100: - - 0.4287 + - 0.4295 diff --git a/src/main/resources/regression/dl19-passage-docTTTTTquery.yaml b/src/main/resources/regression/dl19-passage-docTTTTTquery.yaml index 053df1e3c9..d6e65da564 100644 --- a/src/main/resources/regression/dl19-passage-docTTTTTquery.yaml +++ b/src/main/resources/regression/dl19-passage-docTTTTTquery.yaml @@ -1,8 +1,8 @@ --- corpus: msmarco-passage-docTTTTTquery -corpus_path: collections/msmarco/passage-docTTTTTquery +corpus_path: collections/msmarco/passage-docTTTTTquery/ -index_path: indexes/lucene-index.msmarco-passage-docTTTTTquery +index_path: indexes/lucene-index.msmarco-passage-docTTTTTquery/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 9 diff --git a/src/main/resources/regression/dl19-passage.yaml b/src/main/resources/regression/dl19-passage.yaml index 87eb7b6577..517aefd6dc 100644 --- a/src/main/resources/regression/dl19-passage.yaml +++ b/src/main/resources/regression/dl19-passage.yaml @@ -2,7 +2,7 @@ corpus: msmarco-passage corpus_path: collections/msmarco/passage/ -index_path: indexes/lucene-index.msmarco-passage +index_path: indexes/lucene-index.msmarco-passage/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 9 diff --git a/src/main/resources/regression/dl20-doc-docTTTTTquery-per-doc.yaml 
b/src/main/resources/regression/dl20-doc-docTTTTTquery.yaml similarity index 87% rename from src/main/resources/regression/dl20-doc-docTTTTTquery-per-doc.yaml rename to src/main/resources/regression/dl20-doc-docTTTTTquery.yaml index b2b7a765ff..8cc0f78136 100644 --- a/src/main/resources/regression/dl20-doc-docTTTTTquery-per-doc.yaml +++ b/src/main/resources/regression/dl20-doc-docTTTTTquery.yaml @@ -1,16 +1,16 @@ --- -corpus: msmarco-doc-docTTTTTquery-per-doc -corpus_path: collections/msmarco/doc-docTTTTTquery-per-doc +corpus: msmarco-doc-docTTTTTquery +corpus_path: collections/msmarco/msmarco-doc-docTTTTTquery/ -index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc +index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 7 index_options: -storePositions -storeDocvectors -storeRaw index_stats: - documents: 3213834 - documents (non-empty): 3213834 - total terms: 3748332076 + documents: 3213835 + documents (non-empty): 3213835 + total terms: 3748333319 metrics: - metric: MAP @@ -63,13 +63,13 @@ models: MRR: - 0.9369 R@100: - - 0.6412 + - 0.6414 - name: bm25-default+rm3 display: +RM3 params: -bm25 -rm3 -hits 100 # Note, this is different DL 2020 passage ranking! results: MAP: - - 0.4228 + - 0.4229 nDCG@10: - 0.5407 MRR: @@ -81,7 +81,7 @@ models: params: -bm25 -bm25.k1 4.68 -bm25.b 0.87 -hits 100 # Note, this is different DL 2020 passage ranking! 
results: MAP: - - 0.4098 + - 0.4099 nDCG@10: - 0.5852 MRR: diff --git a/src/main/resources/regression/dl20-doc-docTTTTTquery-per-passage.yaml b/src/main/resources/regression/dl20-doc-segmented-docTTTTTquery.yaml similarity index 84% rename from src/main/resources/regression/dl20-doc-docTTTTTquery-per-passage.yaml rename to src/main/resources/regression/dl20-doc-segmented-docTTTTTquery.yaml index 70691d5bd9..d14df4363a 100644 --- a/src/main/resources/regression/dl20-doc-docTTTTTquery-per-passage.yaml +++ b/src/main/resources/regression/dl20-doc-segmented-docTTTTTquery.yaml @@ -1,16 +1,16 @@ --- -corpus: msmarco-doc-docTTTTTquery-per-passage -corpus_path: collections/msmarco/doc-docTTTTTquery-per-passage +corpus: msmarco-doc-segmented-docTTTTTquery +corpus_path: collections/msmarco/msmarco-doc-segmented-docTTTTTquery/ -index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage +index_path: indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 16 index_options: -storePositions -storeDocvectors -storeRaw index_stats: - documents: 20544550 - documents (non-empty): 20544550 - total terms: 4203956960 + documents: 20545677 + documents (non-empty): 20545677 + total terms: 4206639543 metrics: - metric: MAP @@ -69,9 +69,9 @@ models: params: -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.4269 + - 0.4268 nDCG@10: - - 0.5848 + - 0.5850 MRR: - 0.8944 R@100: @@ -81,22 +81,22 @@ models: params: -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.4042 + - 0.4047 nDCG@10: - - 0.5931 + - 0.5943 MRR: - 0.9469 R@100: - - 0.6192 + - 0.6195 - name: bm25-tuned+rm3 display: +RM3 params: -bm25 -bm25.k1 2.56 -bm25.b 0.59 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" 
-selectMaxPassage.hits 100 results: MAP: - - 0.4023 + - 0.4025 nDCG@10: - - 0.5723 + - 0.5724 MRR: - 0.9150 R@100: - - 0.6392 \ No newline at end of file + - 0.6394 \ No newline at end of file diff --git a/src/main/resources/regression/dl20-doc-per-passage.yaml b/src/main/resources/regression/dl20-doc-segmented.yaml similarity index 83% rename from src/main/resources/regression/dl20-doc-per-passage.yaml rename to src/main/resources/regression/dl20-doc-segmented.yaml index c8a6cbbc1e..23b70854eb 100644 --- a/src/main/resources/regression/dl20-doc-per-passage.yaml +++ b/src/main/resources/regression/dl20-doc-segmented.yaml @@ -1,16 +1,16 @@ --- -corpus: msmarco-doc-per-passage -corpus_path: collections/msmarco/doc-per-passage/ +corpus: msmarco-doc-segmented +corpus_path: collections/msmarco/msmarco-doc-segmented/ -index_path: indexes/lucene-index.msmarco-doc-per-passage +index_path: indexes/lucene-index.msmarco-doc-segmented/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 16 index_options: -storePositions -storeDocvectors -storeRaw index_stats: - documents: 20544550 - documents (non-empty): 20544550 - total terms: 3197886407 + documents: 20545677 + documents (non-empty): 20545677 + total terms: 3200515914 metrics: - metric: MAP @@ -57,9 +57,9 @@ models: params: -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3584 + - 0.3586 nDCG@10: - - 0.5271 + - 0.5281 MRR: - 0.8479 R@100: @@ -69,9 +69,9 @@ models: params: -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3769 + - 0.3774 nDCG@10: - - 0.5159 + - 0.5179 MRR: - 0.8136 R@100: @@ -81,70 +81,70 @@ models: params: -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3854 + - 0.3868 nDCG@10: - - 0.5250 + - 0.5227 
MRR: - - 0.8123 + - 0.8028 R@100: - - 0.6332 + - 0.6362 - name: bm25-default+prf display: +PRF params: -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3672 + - 0.3686 nDCG@10: - - 0.5217 + - 0.5238 MRR: - 0.7911 R@100: - - 0.5994 + - 0.6012 - name: bm25-tuned display: BM25 (tuned) params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3456 + - 0.3458 nDCG@10: - 0.5213 MRR: - 0.8684 R@100: - - 0.5715 + - 0.5723 - name: bm25-tuned+rm3 display: +RM3 params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3471 + - 0.3472 nDCG@10: - - 0.4983 + - 0.4979 MRR: - 0.7807 R@100: - - 0.6013 + - 0.6025 - name: bm25-tuned+ax display: +Ax params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3495 + - 0.3486 nDCG@10: - - 0.4942 + - 0.4948 MRR: - - 0.8102 + - 0.8019 R@100: - - 0.6086 + - 0.6114 - name: bm25-tuned+prf display: +PRF params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 results: MAP: - - 0.3629 + - 0.3627 nDCG@10: - - 0.5260 + - 0.5251 MRR: - 0.8478 R@100: - - 0.6064 \ No newline at end of file + - 0.6048 \ No newline at end of file diff --git a/src/main/resources/regression/dl20-doc.yaml b/src/main/resources/regression/dl20-doc.yaml index 49e2285f12..bfdc842667 100644 --- a/src/main/resources/regression/dl20-doc.yaml +++ b/src/main/resources/regression/dl20-doc.yaml @@ -1,16 +1,16 @@ --- corpus: msmacro-doc -corpus_path: collections/msmarco/doc/ +corpus_path: collections/msmarco/msmarco-doc/ -index_path: indexes/lucene-index.msmarco-doc -collection_class: CleanTrecCollection 
+index_path: indexes/lucene-index.msmarco-doc/ +collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 7 index_options: -storePositions -storeDocvectors -storeRaw index_stats: documents: 3213835 documents (non-empty): 3213835 - total terms: 2748636047 + total terms: 2742209690 metrics: - metric: MAP @@ -57,9 +57,9 @@ models: params: -bm25 -hits 100 # Note, this is different DL 2020 passage ranking! results: MAP: - - 0.3791 + - 0.3793 nDCG@10: - - 0.5271 + - 0.5286 MRR: - 0.8521 R@100: @@ -69,47 +69,47 @@ models: params: -bm25 -rm3 -hits 100 # Note, this is different DL 2020 passage ranking! results: MAP: - - 0.4006 + - 0.4014 nDCG@10: - - 0.5248 + - 0.5225 MRR: - 0.8541 R@100: - - 0.6392 + - 0.6414 - name: bm25-tuned display: BM25 (tuned) params: -bm25 -bm25.k1 3.44 -bm25.b 0.87 -hits 100 # Note, this is different DL 2020 passage ranking! results: MAP: - - 0.3630 + - 0.3631 nDCG@10: - - 0.5087 + - 0.5070 MRR: - 0.8641 R@100: - - 0.5926 + - 0.5935 - name: bm25-tuned+rm3 display: +RM3 params: -bm25 -bm25.k1 3.44 -bm25.b 0.87 -rm3 -hits 100 # Note, this is different DL 2020 passage ranking! results: MAP: - - 0.3588 + - 0.3592 nDCG@10: - - 0.5117 + - 0.5124 MRR: - - 0.8188 + - 0.8186 R@100: - - 0.5983 + - 0.5977 - name: bm25-tuned2 display: BM25 (tuned2) params: -bm25 -bm25.k1 4.46 -bm25.b 0.82 -hits 100 # Note, this is different DL 2020 passage ranking! results: MAP: - - 0.3583 + - 0.3581 nDCG@10: - - 0.5078 + - 0.5061 MRR: - - 0.8541 + - 0.8522 R@100: - 0.5860 - name: bm25-tuned2+rm3 @@ -117,10 +117,10 @@ models: params: -bm25 -bm25.k1 4.46 -bm25.b 0.82 -rm3 -hits 100 # Note, this is different DL 2020 passage ranking! 
results: MAP: - - 0.3618 + - 0.3619 nDCG@10: - - 0.5202 + - 0.5238 MRR: - - 0.8458 + - 0.8582 R@100: - - 0.5998 + - 0.5995 diff --git a/src/main/resources/regression/dl20-passage-docTTTTTquery.yaml b/src/main/resources/regression/dl20-passage-docTTTTTquery.yaml index 3865d9d632..91e640d338 100644 --- a/src/main/resources/regression/dl20-passage-docTTTTTquery.yaml +++ b/src/main/resources/regression/dl20-passage-docTTTTTquery.yaml @@ -1,8 +1,8 @@ --- corpus: msmarco-passage-docTTTTTquery -corpus_path: collections/msmarco/passage-docTTTTTquery +corpus_path: collections/msmarco/passage-docTTTTTquery/ -index_path: indexes/lucene-index.msmarco-passage-docTTTTTquery +index_path: indexes/lucene-index.msmarco-passage-docTTTTTquery/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 9 diff --git a/src/main/resources/regression/dl20-passage.yaml b/src/main/resources/regression/dl20-passage.yaml index 1130fec069..b296ed495c 100644 --- a/src/main/resources/regression/dl20-passage.yaml +++ b/src/main/resources/regression/dl20-passage.yaml @@ -2,7 +2,7 @@ corpus: msmarco-passage corpus_path: collections/msmarco/passage/ -index_path: indexes/lucene-index.msmarco-passage +index_path: indexes/lucene-index.msmarco-passage/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 9 diff --git a/src/main/resources/regression/dl21-doc-segmented-unicoil-noexp-0shot.yaml b/src/main/resources/regression/dl21-doc-segmented-unicoil-noexp-0shot.yaml index 7d66b9fd82..5688f049c9 100644 --- a/src/main/resources/regression/dl21-doc-segmented-unicoil-noexp-0shot.yaml +++ b/src/main/resources/regression/dl21-doc-segmented-unicoil-noexp-0shot.yaml @@ -1,8 +1,8 @@ --- corpus: msmarco-v2-doc-segmented-unicoil-noexp-0shot -corpus_path: collections/msmarco/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8 +corpus_path: collections/msmarco/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8/ -index_path: 
indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot +index_path: indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 18 diff --git a/src/main/resources/regression/dl21-doc-segmented.yaml b/src/main/resources/regression/dl21-doc-segmented.yaml index 31c7aa366b..86253172d8 100644 --- a/src/main/resources/regression/dl21-doc-segmented.yaml +++ b/src/main/resources/regression/dl21-doc-segmented.yaml @@ -1,8 +1,8 @@ --- corpus: msmarco-v2-doc-segmented -corpus_path: collections/msmarco/msmarco_v2_doc_segmented +corpus_path: collections/msmarco/msmarco_v2_doc_segmented/ -index_path: indexes/lucene-index.msmarco-v2-doc-segmented +index_path: indexes/lucene-index.msmarco-v2-doc-segmented/ collection_class: MsMarcoV2DocCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 18 diff --git a/src/main/resources/regression/dl21-doc.yaml b/src/main/resources/regression/dl21-doc.yaml index c02ea04de6..42f2146191 100644 --- a/src/main/resources/regression/dl21-doc.yaml +++ b/src/main/resources/regression/dl21-doc.yaml @@ -1,8 +1,8 @@ --- corpus: msmarco-v2-doc -corpus_path: collections/msmarco/msmarco_v2_doc +corpus_path: collections/msmarco/msmarco_v2_doc/ -index_path: indexes/lucene-index.msmarco-v2-doc +index_path: indexes/lucene-index.msmarco-v2-doc/ collection_class: MsMarcoV2DocCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 18 diff --git a/src/main/resources/regression/dl21-passage-augmented.yaml b/src/main/resources/regression/dl21-passage-augmented.yaml index 2b6d689305..ac1d2d4422 100644 --- a/src/main/resources/regression/dl21-passage-augmented.yaml +++ b/src/main/resources/regression/dl21-passage-augmented.yaml @@ -1,8 +1,8 @@ --- corpus: msmarco-v2-passage-augmented -corpus_path: collections/msmarco/msmarco_v2_passage_augmented +corpus_path: collections/msmarco/msmarco_v2_passage_augmented/ 
-index_path: indexes/lucene-index.msmarco-v2-passage-augmented +index_path: indexes/lucene-index.msmarco-v2-passage-augmented/ collection_class: MsMarcoV2PassageCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 18 diff --git a/src/main/resources/regression/dl21-passage-unicoil-noexp-0shot.yaml b/src/main/resources/regression/dl21-passage-unicoil-noexp-0shot.yaml index 8c57021218..95dd185e12 100644 --- a/src/main/resources/regression/dl21-passage-unicoil-noexp-0shot.yaml +++ b/src/main/resources/regression/dl21-passage-unicoil-noexp-0shot.yaml @@ -1,8 +1,8 @@ --- corpus: msmarco-v2-passage-unicoil-noexp-0shot -corpus_path: collections/msmarco/msmarco-passage-v2-unicoil-noexp-0shot-b8 +corpus_path: collections/msmarco/msmarco-passage-v2-unicoil-noexp-0shot-b8/ -index_path: indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot +index_path: indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 18 diff --git a/src/main/resources/regression/dl21-passage.yaml b/src/main/resources/regression/dl21-passage.yaml index af093963f8..10acdb17d8 100644 --- a/src/main/resources/regression/dl21-passage.yaml +++ b/src/main/resources/regression/dl21-passage.yaml @@ -1,8 +1,8 @@ --- corpus: msmarco-v2-passage -corpus_path: collections/msmarco/msmarco_v2_passage +corpus_path: collections/msmarco/msmarco_v2_passage/ -index_path: indexes/lucene-index.msmarco-v2-passage +index_path: indexes/lucene-index.msmarco-v2-passage/ collection_class: MsMarcoV2PassageCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 18 diff --git a/src/main/resources/regression/fever.yaml b/src/main/resources/regression/fever.yaml index f888871f62..73e5d90b53 100644 --- a/src/main/resources/regression/fever.yaml +++ b/src/main/resources/regression/fever.yaml @@ -1,8 +1,8 @@ --- corpus: fever -corpus_path: collections/fever/wiki-pages +corpus_path: 
collections/fever/wiki-pages/ -index_path: indexes/lucene-index.fever-paragraph +index_path: indexes/lucene-index.fever-paragraph/ collection_class: FeverParagraphCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/fire12-bn.yaml b/src/main/resources/regression/fire12-bn.yaml index 6265ff5bb3..f65cec6f45 100644 --- a/src/main/resources/regression/fire12-bn.yaml +++ b/src/main/resources/regression/fire12-bn.yaml @@ -1,8 +1,8 @@ --- corpus: fire12-bn -corpus_path: collections/fire/bengali/bn.docs.2012.19032012 +corpus_path: collections/fire/bengali/bn.docs.2012.19032012/ -index_path: indexes/lucene-index.fire12-bn +index_path: indexes/lucene-index.fire12-bn/ collection_class: CleanTrecCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff --git a/src/main/resources/regression/fire12-en.yaml b/src/main/resources/regression/fire12-en.yaml index 267cb38073..fe275805a2 100644 --- a/src/main/resources/regression/fire12-en.yaml +++ b/src/main/resources/regression/fire12-en.yaml @@ -1,8 +1,8 @@ --- corpus: fire12-en -corpus_path: collections/fire/english/en.docs.2011 +corpus_path: collections/fire/english/en.docs.2011/ -index_path: indexes/lucene-index.fire12-en +index_path: indexes/lucene-index.fire12-en/ collection_class: CleanTrecCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff --git a/src/main/resources/regression/fire12-hi.yaml b/src/main/resources/regression/fire12-hi.yaml index 940e8c4cb0..e2f8bcd2c4 100644 --- a/src/main/resources/regression/fire12-hi.yaml +++ b/src/main/resources/regression/fire12-hi.yaml @@ -1,8 +1,8 @@ --- corpus: fire12-hi -corpus_path: collections/fire/hindi/hi.docs.2011 +corpus_path: collections/fire/hindi/hi.docs.2011/ -index_path: indexes/lucene-index.fire12-hi +index_path: indexes/lucene-index.fire12-hi/ collection_class: CleanTrecCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff 
--git a/src/main/resources/regression/gov2.yaml b/src/main/resources/regression/gov2.yaml index 2364099650..477865555b 100644 --- a/src/main/resources/regression/gov2.yaml +++ b/src/main/resources/regression/gov2.yaml @@ -2,7 +2,7 @@ corpus: gov2 corpus_path: collections/web/gov2/gov2-corpus/ -index_path: indexes/lucene-index.gov2 +index_path: indexes/lucene-index.gov2/ collection_class: TrecwebCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 44 diff --git a/src/main/resources/regression/mb11.yaml b/src/main/resources/regression/mb11.yaml index 9725e39130..875b9f7a0c 100644 --- a/src/main/resources/regression/mb11.yaml +++ b/src/main/resources/regression/mb11.yaml @@ -2,7 +2,7 @@ corpus: mb11 corpus_path: collections/twitter/Tweets2011-corpus/json.gold/ -index_path: indexes/lucene-index.mb11 +index_path: indexes/lucene-index.mb11/ collection_class: TweetCollection generator_class: TweetGenerator index_threads: 44 diff --git a/src/main/resources/regression/mb13.yaml b/src/main/resources/regression/mb13.yaml index 02dd4633ab..c736616dc8 100644 --- a/src/main/resources/regression/mb13.yaml +++ b/src/main/resources/regression/mb13.yaml @@ -2,7 +2,7 @@ corpus: mb13 corpus_path: collections/twitter/Tweets2013-corpus/data/ -index_path: indexes/lucene-index.mb13 +index_path: indexes/lucene-index.mb13/ collection_class: TweetCollection generator_class: TweetGenerator index_threads: 44 diff --git a/src/main/resources/regression/mrtydi-v1.1-ar.yaml b/src/main/resources/regression/mrtydi-v1.1-ar.yaml index 12129b11bc..122a46af00 100644 --- a/src/main/resources/regression/mrtydi-v1.1-ar.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-ar.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-ar -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-arabic +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-arabic/ -index_path: indexes/lucene-index.mrtydi-v1.1-arabic +index_path: indexes/lucene-index.mrtydi-v1.1-arabic/ collection_class: MrTyDiCollection 
generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/mrtydi-v1.1-bn.yaml b/src/main/resources/regression/mrtydi-v1.1-bn.yaml index b4ce928a6d..b083bc5706 100644 --- a/src/main/resources/regression/mrtydi-v1.1-bn.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-bn.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-bn -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-bengali +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-bengali/ -index_path: indexes/lucene-index.mrtydi-v1.1-bengali +index_path: indexes/lucene-index.mrtydi-v1.1-bengali/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/mrtydi-v1.1-en.yaml b/src/main/resources/regression/mrtydi-v1.1-en.yaml index c6960c7209..0c703b6c06 100644 --- a/src/main/resources/regression/mrtydi-v1.1-en.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-en.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-en -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-english +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-english/ -index_path: indexes/lucene-index.mrtydi-v1.1-english +index_path: indexes/lucene-index.mrtydi-v1.1-english/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/mrtydi-v1.1-fi.yaml b/src/main/resources/regression/mrtydi-v1.1-fi.yaml index bbaa676a9d..73b850ef7b 100644 --- a/src/main/resources/regression/mrtydi-v1.1-fi.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-fi.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-fi -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-finnish +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-finnish/ -index_path: indexes/lucene-index.mrtydi-v1.1-finnish +index_path: indexes/lucene-index.mrtydi-v1.1-finnish/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git 
a/src/main/resources/regression/mrtydi-v1.1-id.yaml b/src/main/resources/regression/mrtydi-v1.1-id.yaml index 09a1a0f128..22e39782a1 100644 --- a/src/main/resources/regression/mrtydi-v1.1-id.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-id.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-id -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-indonesian +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-indonesian/ -index_path: indexes/lucene-index.mrtydi-v1.1-indonesian +index_path: indexes/lucene-index.mrtydi-v1.1-indonesian/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/mrtydi-v1.1-ja.yaml b/src/main/resources/regression/mrtydi-v1.1-ja.yaml index a5e809ab16..a63271c9a6 100644 --- a/src/main/resources/regression/mrtydi-v1.1-ja.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-ja.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-ja -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-japanese +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-japanese/ -index_path: indexes/lucene-index.mrtydi-v1.1-japanese +index_path: indexes/lucene-index.mrtydi-v1.1-japanese/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/mrtydi-v1.1-ko.yaml b/src/main/resources/regression/mrtydi-v1.1-ko.yaml index 4fe56a2edc..8265665cbc 100644 --- a/src/main/resources/regression/mrtydi-v1.1-ko.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-ko.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-ko -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-korean +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-korean/ -index_path: indexes/lucene-index.mrtydi-v1.1-korean +index_path: indexes/lucene-index.mrtydi-v1.1-korean/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/mrtydi-v1.1-ru.yaml 
b/src/main/resources/regression/mrtydi-v1.1-ru.yaml index 3af935fd9c..51b8c6fce1 100644 --- a/src/main/resources/regression/mrtydi-v1.1-ru.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-ru.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-ru -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-russian +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-russian/ -index_path: indexes/lucene-index.mrtydi-v1.1-russian +index_path: indexes/lucene-index.mrtydi-v1.1-russian/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/mrtydi-v1.1-sw.yaml b/src/main/resources/regression/mrtydi-v1.1-sw.yaml index 7d13f89855..67350a3fda 100644 --- a/src/main/resources/regression/mrtydi-v1.1-sw.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-sw.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-sw -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-swahili +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-swahili/ -index_path: indexes/lucene-index.mrtydi-v1.1-swahili +index_path: indexes/lucene-index.mrtydi-v1.1-swahili/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/mrtydi-v1.1-te.yaml b/src/main/resources/regression/mrtydi-v1.1-te.yaml index 8d67d97c28..42930ef65a 100644 --- a/src/main/resources/regression/mrtydi-v1.1-te.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-te.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-te -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-telugu +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-telugu/ -index_path: indexes/lucene-index.mrtydi-v1.1-telugu +index_path: indexes/lucene-index.mrtydi-v1.1-telugu/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/mrtydi-v1.1-th.yaml b/src/main/resources/regression/mrtydi-v1.1-th.yaml index 2333601557..035efda906 100644 
--- a/src/main/resources/regression/mrtydi-v1.1-th.yaml +++ b/src/main/resources/regression/mrtydi-v1.1-th.yaml @@ -1,8 +1,8 @@ --- corpus: mrtydi-v1.1-th -corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-thai +corpus_path: collections/mr-tydi-corpus/mrtydi-v1.1-thai/ -index_path: indexes/lucene-index.mrtydi-v1.1-thai +index_path: indexes/lucene-index.mrtydi-v1.1-thai/ collection_class: MrTyDiCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 1 diff --git a/src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage.yaml b/src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage.yaml deleted file mode 100644 index d24f633cbe..0000000000 --- a/src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage.yaml +++ /dev/null @@ -1,67 +0,0 @@ ---- -corpus: msmarco-doc-docTTTTTquery-per-passage -corpus_path: collections/msmarco/doc-docTTTTTquery-per-passage - -index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage -collection_class: JsonCollection -generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 -index_options: -storePositions -storeDocvectors -storeRaw -index_stats: - documents: 20544550 - documents (non-empty): 20544550 - total terms: 4203956960 - -metrics: - - metric: MAP - command: tools/eval/trec_eval.9.0.4/trec_eval - params: -c -m map - separator: "\t" - parse_index: 2 - metric_precision: 4 - can_combine: true - - metric: R@100 - command: tools/eval/trec_eval.9.0.4/trec_eval - params: -c -m recall.100 - separator: "\t" - parse_index: 2 - metric_precision: 4 - can_combine: true - - metric: R@1000 - command: tools/eval/trec_eval.9.0.4/trec_eval - params: -c -m recall.1000 - separator: "\t" - parse_index: 2 - metric_precision: 4 - can_combine: true - -topic_reader: TsvInt -topic_root: src/main/resources/topics-and-qrels/ -qrels_root: src/main/resources/topics-and-qrels/ -topics: - - name: "[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)" - id: dev - path: 
topics.msmarco-doc.dev.txt - qrel: qrels.msmarco-doc.dev.txt - -models: - - name: bm25-default - display: BM25 (default) - params: -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.3182 - R@100: - - 0.8481 - R@1000: - - 0.9490 - - name: bm25-tuned - display: BM25 (tuned) - params: -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.3211 - R@100: - - 0.8627 - R@1000: - - 0.9530 \ No newline at end of file diff --git a/src/main/resources/regression/msmarco-doc-docTTTTTquery-per-doc.yaml b/src/main/resources/regression/msmarco-doc-docTTTTTquery.yaml similarity index 80% rename from src/main/resources/regression/msmarco-doc-docTTTTTquery-per-doc.yaml rename to src/main/resources/regression/msmarco-doc-docTTTTTquery.yaml index b42a5e735d..7dfb87b655 100644 --- a/src/main/resources/regression/msmarco-doc-docTTTTTquery-per-doc.yaml +++ b/src/main/resources/regression/msmarco-doc-docTTTTTquery.yaml @@ -1,16 +1,16 @@ --- -corpus: msmarco-doc-docTTTTTquery-per-doc -corpus_path: collections/msmarco/doc-docTTTTTquery-per-doc +corpus: msmarco-doc-docTTTTTquery +corpus_path: collections/msmarco/msmarco-doc-docTTTTTquery/ -index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc +index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 7 index_options: -storePositions -storeDocvectors -storeRaw index_stats: - documents: 3213834 - documents (non-empty): 3213834 - total terms: 3748332076 + documents: 3213835 + documents (non-empty): 3213835 + total terms: 3748333319 metrics: - metric: MAP @@ -52,7 +52,7 @@ models: MAP: - 0.2886 R@100: - - 0.7990 + - 0.7993 R@1000: - 0.9259 - name: bm25-tuned @@ -60,8 +60,8 @@ models: params: -bm25 -bm25.k1 4.68 -bm25.b 0.87 results: MAP: - - 0.3270 + - 0.3273 
R@100: - - 0.8608 + - 0.8612 R@1000: - 0.9553 \ No newline at end of file diff --git a/src/main/resources/regression/msmarco-doc-per-passage-v2.yaml b/src/main/resources/regression/msmarco-doc-per-passage-v2.yaml deleted file mode 100644 index 4bd83f7fef..0000000000 --- a/src/main/resources/regression/msmarco-doc-per-passage-v2.yaml +++ /dev/null @@ -1,127 +0,0 @@ ---- -corpus: msmarco-doc-per-passage-v2 -corpus_path: collections/msmarco/doc-per-passage-v2/ - -index_path: indexes/lucene-index.msmarco-doc-per-passage-v2 -collection_class: JsonCollection -generator_class: DefaultLuceneDocumentGenerator -index_threads: 16 -index_options: -storePositions -storeDocvectors -storeRaw -index_stats: - documents: 20545677 - documents (non-empty): 20545612 - total terms: 3056059952 - -metrics: - - metric: MAP - command: tools/eval/trec_eval.9.0.4/trec_eval - params: -c -m map - separator: "\t" - parse_index: 2 - metric_precision: 4 - can_combine: true - - metric: R@100 - command: tools/eval/trec_eval.9.0.4/trec_eval - params: -c -m recall.100 - separator: "\t" - parse_index: 2 - metric_precision: 4 - can_combine: true - - metric: R@1000 - command: tools/eval/trec_eval.9.0.4/trec_eval - params: -c -m recall.1000 - separator: "\t" - parse_index: 2 - metric_precision: 4 - can_combine: true - -topic_reader: TsvInt -topic_root: src/main/resources/topics-and-qrels/ -qrels_root: src/main/resources/topics-and-qrels/ -topics: - - name: "[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)" - id: dev - path: topics.msmarco-doc.dev.txt - qrel: qrels.msmarco-doc.dev.txt - -models: - - name: bm25-default - display: BM25 (default) - params: -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2609 - R@100: - - 0.7737 - R@1000: - - 0.9095 - - name: bm25-default+rm3 - display: +RM3 - params: -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - 
MAP: - - 0.2324 - R@100: - - 0.7768 - R@1000: - - 0.9266 - - name: bm25-default+ax - display: +Ax - params: -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2170 - R@100: - - 0.7578 - R@1000: - - 0.9207 - - name: bm25-default+prf - display: +PRF - params: -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2189 - R@100: - - 0.7570 - R@1000: - - 0.9135 - - name: bm25-tuned - display: BM25 (tuned) - params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2639 - R@100: - - 0.7884 - R@1000: - - 0.9222 - - name: bm25-tuned+rm3 - display: +RM3 - params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2342 - R@100: - - 0.7793 - R@1000: - - 0.9239 - - name: bm25-tuned+ax - display: +Ax - params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2250 - R@100: - - 0.7730 - R@1000: - - 0.9268 - - name: bm25-tuned+prf - display: +PRF - params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2184 - R@100: - - 0.7520 - R@1000: - - 0.9101 \ No newline at end of file diff --git a/src/main/resources/regression/msmarco-doc-per-passage.yaml b/src/main/resources/regression/msmarco-doc-per-passage.yaml deleted file mode 100644 index 65b483f69f..0000000000 --- a/src/main/resources/regression/msmarco-doc-per-passage.yaml +++ /dev/null @@ -1,127 +0,0 @@ ---- -corpus: msmarco-doc-per-passage -corpus_path: collections/msmarco/doc-per-passage/ - -index_path: 
indexes/lucene-index.msmarco-doc-per-passage -collection_class: JsonCollection -generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 -index_options: -storePositions -storeDocvectors -storeRaw -index_stats: - documents: 20544550 - documents (non-empty): 20544550 - total terms: 3197886407 - -metrics: - - metric: MAP - command: tools/eval/trec_eval.9.0.4/trec_eval - params: -c -m map - separator: "\t" - parse_index: 2 - metric_precision: 4 - can_combine: true - - metric: R@100 - command: tools/eval/trec_eval.9.0.4/trec_eval - params: -c -m recall.100 - separator: "\t" - parse_index: 2 - metric_precision: 4 - can_combine: true - - metric: R@1000 - command: tools/eval/trec_eval.9.0.4/trec_eval - params: -c -m recall.1000 - separator: "\t" - parse_index: 2 - metric_precision: 4 - can_combine: true - -topic_reader: TsvInt -topic_root: src/main/resources/topics-and-qrels/ -qrels_root: src/main/resources/topics-and-qrels/ -topics: - - name: "[MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking)" - id: dev - path: topics.msmarco-doc.dev.txt - qrel: qrels.msmarco-doc.dev.txt - -models: - - name: bm25-default - display: BM25 (default) - params: -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2688 - R@100: - - 0.7849 - R@1000: - - 0.9180 - - name: bm25-default+rm3 - display: +RM3 - params: -bm25 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2416 - R@100: - - 0.7876 - R@1000: - - 0.9355 - - name: bm25-default+ax - display: +Ax - params: -bm25 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2229 - R@100: - - 0.7703 - R@1000: - - 0.9266 - - name: bm25-default+prf - display: +PRF - params: -bm25 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - 
results: - MAP: - - 0.2325 - R@100: - - 0.7714 - R@1000: - - 0.9187 - - name: bm25-tuned - display: BM25 (tuned) - params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2756 - R@100: - - 0.8009 - R@1000: - - 0.9311 - - name: bm25-tuned+rm3 - display: +RM3 - params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -rm3 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2443 - R@100: - - 0.7955 - R@1000: - - 0.9359 - - name: bm25-tuned+ax - display: +Ax - params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -axiom -axiom.deterministic -rerankCutoff 20 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2350 - R@100: - - 0.7909 - R@1000: - - 0.9341 - - name: bm25-tuned+prf - display: +PRF - params: -bm25 -bm25.k1 2.16 -bm25.b 0.61 -bm25prf -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 - results: - MAP: - - 0.2271 - R@100: - - 0.7685 - R@1000: - - 0.9162 \ No newline at end of file diff --git a/src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage-v3.yaml b/src/main/resources/regression/msmarco-doc-segmented-docTTTTTquery.yaml similarity index 89% rename from src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage-v3.yaml rename to src/main/resources/regression/msmarco-doc-segmented-docTTTTTquery.yaml index 1092f9ba29..9ad09552e0 100644 --- a/src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage-v3.yaml +++ b/src/main/resources/regression/msmarco-doc-segmented-docTTTTTquery.yaml @@ -1,8 +1,8 @@ --- -corpus: msmarco-doc-docTTTTTquery-per-passage-v3 -corpus_path: collections/msmarco/doc-docTTTTTquery-per-passage-v3 +corpus: msmarco-doc-segmented-docTTTTTquery +corpus_path: collections/msmarco/msmarco-doc-segmented-docTTTTTquery/ -index_path: 
indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage-v3 +index_path: indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff --git a/src/main/resources/regression/msmarco-doc-per-passage-v3.yaml b/src/main/resources/regression/msmarco-doc-segmented.yaml similarity index 95% rename from src/main/resources/regression/msmarco-doc-per-passage-v3.yaml rename to src/main/resources/regression/msmarco-doc-segmented.yaml index cd9a7a41ff..785de81f82 100644 --- a/src/main/resources/regression/msmarco-doc-per-passage-v3.yaml +++ b/src/main/resources/regression/msmarco-doc-segmented.yaml @@ -1,8 +1,8 @@ --- -corpus: msmarco-doc-per-passage-v3 -corpus_path: collections/msmarco/doc-per-passage-v3/ +corpus: msmarco-doc-segmented +corpus_path: collections/msmarco/msmarco-doc-segmented/ -index_path: indexes/lucene-index.msmarco-doc-per-passage-v3 +index_path: indexes/lucene-index.msmarco-doc-segmented/ collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff --git a/src/main/resources/regression/msmarco-doc.yaml b/src/main/resources/regression/msmarco-doc.yaml index ae43147a13..59644ebb7c 100644 --- a/src/main/resources/regression/msmarco-doc.yaml +++ b/src/main/resources/regression/msmarco-doc.yaml @@ -1,16 +1,16 @@ --- corpus: msmarco-doc -corpus_path: collections/msmarco/doc/ +corpus_path: collections/msmarco/msmarco-doc/ -index_path: indexes/lucene-index.msmarco-doc -collection_class: CleanTrecCollection +index_path: indexes/lucene-index.msmarco-doc/ +collection_class: JsonCollection generator_class: DefaultLuceneDocumentGenerator -index_threads: 1 +index_threads: 7 index_options: -storePositions -storeDocvectors -storeRaw index_stats: documents: 3213835 documents (non-empty): 3213835 - total terms: 2748636047 + total terms: 2742209690 metrics: - metric: MAP @@ -50,9 +50,9 @@ models: params: -bm25 results: MAP: - - 0.2310 
+ - 0.2305 R@100: - - 0.7279 + - 0.7281 R@1000: - 0.8856 - name: bm25-default+rm3 @@ -60,21 +60,21 @@ models: params: -bm25 -rm3 results: MAP: - - 0.1632 + - 0.1631 R@100: - - 0.6765 + - 0.6767 R@1000: - - 0.8785 + - 0.8791 - name: bm25-tuned display: BM25 (tuned) params: -bm25 -bm25.k1 3.44 -bm25.b 0.87 results: MAP: - - 0.2788 + - 0.2784 R@100: - - 0.8065 + - 0.8069 R@1000: - - 0.9326 + - 0.9324 - name: bm25-tuned+rm3 display: +RM3 params: -bm25 -bm25.k1 3.44 -bm25.b 0.87 -rm3 @@ -82,17 +82,17 @@ models: MAP: - 0.2289 R@100: - - 0.7872 + - 0.7878 R@1000: - - 0.9320 + - 0.9314 - name: bm25-tuned2 display: BM25 (tuned2) params: -bm25 -bm25.k1 4.46 -bm25.b 0.82 results: MAP: - - 0.2775 + - 0.2774 R@100: - - 0.8076 + - 0.8070 R@1000: - 0.9357 - name: bm25-tuned2+rm3 @@ -100,8 +100,8 @@ models: params: -bm25 -bm25.k1 4.46 -bm25.b 0.82 -rm3 results: MAP: - - 0.2238 + - 0.2239 R@100: - - 0.7789 + - 0.7791 R@1000: - - 0.9307 \ No newline at end of file + - 0.9305 \ No newline at end of file diff --git a/src/main/resources/regression/msmarco-passage-deepimpact.yaml b/src/main/resources/regression/msmarco-passage-deepimpact.yaml index 1b8a77470d..04a9fafc6b 100644 --- a/src/main/resources/regression/msmarco-passage-deepimpact.yaml +++ b/src/main/resources/regression/msmarco-passage-deepimpact.yaml @@ -2,7 +2,7 @@ corpus: msmarco-passage-deepimpact corpus_path: collections/msmarco/msmarco-passage-deepimpact-b8/ -index_path: indexes/lucene-index.msmarco-passage-deepimpact +index_path: indexes/lucene-index.msmarco-passage-deepimpact/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator index_threads: 16 diff --git a/src/main/resources/regression/msmarco-passage-distill-splade-max.yaml b/src/main/resources/regression/msmarco-passage-distill-splade-max.yaml index 584affe9a2..2ec05ae797 100644 --- a/src/main/resources/regression/msmarco-passage-distill-splade-max.yaml +++ b/src/main/resources/regression/msmarco-passage-distill-splade-max.yaml @@ 
-2,7 +2,7 @@
 corpus: msmarco-passage-distill-splade-max
 corpus_path: collections/msmarco/msmarco-passage-distill-splade-max/
 
-index_path: indexes/lucene-index.msmarco-passage-distill-splade-max
+index_path: indexes/lucene-index.msmarco-passage-distill-splade-max/
 collection_class: JsonVectorCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 16
diff --git a/src/main/resources/regression/msmarco-passage-doc2query.yaml b/src/main/resources/regression/msmarco-passage-doc2query.yaml
index 39a0e6dc44..27c08fbb03 100644
--- a/src/main/resources/regression/msmarco-passage-doc2query.yaml
+++ b/src/main/resources/regression/msmarco-passage-doc2query.yaml
@@ -1,8 +1,8 @@
 ---
 corpus: msmarco-passage-doc2query
-corpus_path: collections/msmarco/passage-expanded-topk10
+corpus_path: collections/msmarco/passage-expanded-topk10/
 
-index_path: indexes/lucene-index.msmarco-passage-doc2query
+index_path: indexes/lucene-index.msmarco-passage-doc2query/
 collection_class: JsonCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 9
diff --git a/src/main/resources/regression/msmarco-passage-docTTTTTquery.yaml b/src/main/resources/regression/msmarco-passage-docTTTTTquery.yaml
index 6990f8ca6c..b377dab96f 100644
--- a/src/main/resources/regression/msmarco-passage-docTTTTTquery.yaml
+++ b/src/main/resources/regression/msmarco-passage-docTTTTTquery.yaml
@@ -1,8 +1,8 @@
 ---
 corpus: msmarco-passage-docTTTTTquery
-corpus_path: collections/msmarco/passage-docTTTTTquery
+corpus_path: collections/msmarco/passage-docTTTTTquery/
 
-index_path: indexes/lucene-index.msmarco-passage-docTTTTTquery
+index_path: indexes/lucene-index.msmarco-passage-docTTTTTquery/
 collection_class: JsonCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 9
diff --git a/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml b/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml
index d9ac30c8ca..0e9ebbe91d 100644
--- a/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml
+++ b/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml
@@ -2,7 +2,7 @@
 corpus: msmarco-passage-unicoil-tilde-expansion
 corpus_path: collections/msmarco/msmarco-passage-unicoil-tilde-expansion-b8/
 
-index_path: indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion
+index_path: indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion/
 collection_class: JsonVectorCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 16
diff --git a/src/main/resources/regression/msmarco-passage-unicoil.yaml b/src/main/resources/regression/msmarco-passage-unicoil.yaml
index 8224af6066..f40c3cad6e 100644
--- a/src/main/resources/regression/msmarco-passage-unicoil.yaml
+++ b/src/main/resources/regression/msmarco-passage-unicoil.yaml
@@ -2,7 +2,7 @@
 corpus: msmarco-passage-unicoil
 corpus_path: collections/msmarco/msmarco-passage-unicoil-b8/
 
-index_path: indexes/lucene-index.msmarco-passage-unicoil
+index_path: indexes/lucene-index.msmarco-passage-unicoil/
 collection_class: JsonVectorCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 16
diff --git a/src/main/resources/regression/msmarco-passage.yaml b/src/main/resources/regression/msmarco-passage.yaml
index e43719f712..0724306f9b 100644
--- a/src/main/resources/regression/msmarco-passage.yaml
+++ b/src/main/resources/regression/msmarco-passage.yaml
@@ -2,7 +2,7 @@
 corpus: msmarco-passage
 corpus_path: collections/msmarco/passage/
 
-index_path: indexes/lucene-index.msmarco-passage
+index_path: indexes/lucene-index.msmarco-passage/
 collection_class: JsonCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 9
diff --git a/src/main/resources/regression/msmarco-v2-doc-segmented-unicoil-noexp-0shot.yaml b/src/main/resources/regression/msmarco-v2-doc-segmented-unicoil-noexp-0shot.yaml
index 14da7a5dd3..e10485c8f9 100644
--- a/src/main/resources/regression/msmarco-v2-doc-segmented-unicoil-noexp-0shot.yaml
+++ b/src/main/resources/regression/msmarco-v2-doc-segmented-unicoil-noexp-0shot.yaml
@@ -1,8 +1,8 @@
 ---
 corpus: msmarco-v2-doc-segmented-unicoil-noexp-0shot
-corpus_path: collections/msmarco/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8
+corpus_path: collections/msmarco/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8/
 
-index_path: indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot
+index_path: indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot/
 collection_class: JsonVectorCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 18
diff --git a/src/main/resources/regression/msmarco-v2-doc-segmented.yaml b/src/main/resources/regression/msmarco-v2-doc-segmented.yaml
index 332867d0fa..5972a7d33c 100644
--- a/src/main/resources/regression/msmarco-v2-doc-segmented.yaml
+++ b/src/main/resources/regression/msmarco-v2-doc-segmented.yaml
@@ -1,8 +1,8 @@
 ---
 corpus: msmarco-v2-doc-segmented
-corpus_path: collections/msmarco/msmarco_v2_doc_segmented
+corpus_path: collections/msmarco/msmarco_v2_doc_segmented/
 
-index_path: indexes/lucene-index.msmarco-v2-doc-segmented
+index_path: indexes/lucene-index.msmarco-v2-doc-segmented/
 collection_class: MsMarcoV2DocCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 18
diff --git a/src/main/resources/regression/msmarco-v2-doc.yaml b/src/main/resources/regression/msmarco-v2-doc.yaml
index 74b0c328f9..c890fccba8 100644
--- a/src/main/resources/regression/msmarco-v2-doc.yaml
+++ b/src/main/resources/regression/msmarco-v2-doc.yaml
@@ -1,8 +1,8 @@
 ---
 corpus: msmarco-v2-doc
-corpus_path: collections/msmarco/msmarco_v2_doc
+corpus_path: collections/msmarco/msmarco_v2_doc/
 
-index_path: indexes/lucene-index.msmarco-v2-doc
+index_path: indexes/lucene-index.msmarco-v2-doc/
 collection_class: MsMarcoV2DocCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 18
diff --git a/src/main/resources/regression/msmarco-v2-passage-augmented.yaml b/src/main/resources/regression/msmarco-v2-passage-augmented.yaml
index 02e091834e..8e158e44fe 100644
--- a/src/main/resources/regression/msmarco-v2-passage-augmented.yaml
+++ b/src/main/resources/regression/msmarco-v2-passage-augmented.yaml
@@ -1,8 +1,8 @@
 ---
 corpus: msmarco-v2-passage-augmented
-corpus_path: collections/msmarco/msmarco_v2_passage_augmented
+corpus_path: collections/msmarco/msmarco_v2_passage_augmented/
 
-index_path: indexes/lucene-index.msmarco-v2-passage-augmented
+index_path: indexes/lucene-index.msmarco-v2-passage-augmented/
 collection_class: MsMarcoV2PassageCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 18
diff --git a/src/main/resources/regression/msmarco-v2-passage-unicoil-noexp-0shot.yaml b/src/main/resources/regression/msmarco-v2-passage-unicoil-noexp-0shot.yaml
index bf3be720d6..8b81f03b3b 100644
--- a/src/main/resources/regression/msmarco-v2-passage-unicoil-noexp-0shot.yaml
+++ b/src/main/resources/regression/msmarco-v2-passage-unicoil-noexp-0shot.yaml
@@ -1,8 +1,8 @@
 ---
 corpus: msmarco-v2-passage-unicoil-noexp-0shot
-corpus_path: collections/msmarco/msmarco-passage-v2-unicoil-noexp-0shot-b8
+corpus_path: collections/msmarco/msmarco-passage-v2-unicoil-noexp-0shot-b8/
 
-index_path: indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot
+index_path: indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/
 collection_class: JsonVectorCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 18
diff --git a/src/main/resources/regression/msmarco-v2-passage.yaml b/src/main/resources/regression/msmarco-v2-passage.yaml
index a13b07deb5..e452109e59 100644
--- a/src/main/resources/regression/msmarco-v2-passage.yaml
+++ b/src/main/resources/regression/msmarco-v2-passage.yaml
@@ -1,8 +1,8 @@
 ---
 corpus: msmarco-v2-passage
-corpus_path: collections/msmarco/msmarco_v2_passage
+corpus_path: collections/msmarco/msmarco_v2_passage/
 
-index_path: indexes/lucene-index.msmarco-v2-passage
+index_path: indexes/lucene-index.msmarco-v2-passage/
 collection_class: MsMarcoV2PassageCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 18
diff --git a/src/main/resources/regression/ntcir8-zh.yaml b/src/main/resources/regression/ntcir8-zh.yaml
index 3daee543b5..119bcb4086 100644
--- a/src/main/resources/regression/ntcir8-zh.yaml
+++ b/src/main/resources/regression/ntcir8-zh.yaml
@@ -2,7 +2,7 @@
 corpus: ntcir8-zh
 corpus_path: collections/newswire/clir/ntcir.zh/ntcir8-zh/
 
-index_path: indexes/lucene-index.ntcir8-zh
+index_path: indexes/lucene-index.ntcir8-zh/
 collection_class: CleanTrecCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 16
diff --git a/src/main/resources/regression/robust05.yaml b/src/main/resources/regression/robust05.yaml
index f061608270..baec2eaf7d 100644
--- a/src/main/resources/regression/robust05.yaml
+++ b/src/main/resources/regression/robust05.yaml
@@ -2,7 +2,7 @@
 corpus: robust05
 corpus_path: collections/newswire/AQUAINT/
 
-index_path: indexes/lucene-index.robust05
+index_path: indexes/lucene-index.robust05/
 collection_class: TrecCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 16
diff --git a/src/main/resources/regression/trec02-ar.yaml b/src/main/resources/regression/trec02-ar.yaml
index 26e0fee49f..480777463a 100644
--- a/src/main/resources/regression/trec02-ar.yaml
+++ b/src/main/resources/regression/trec02-ar.yaml
@@ -1,8 +1,8 @@
 ---
 corpus: trec02-ar
-corpus_path: collections/newswire/clir/trec.ar/arabic_newswire_a_ldc2001t55/transcripts
+corpus_path: collections/newswire/clir/trec.ar/arabic_newswire_a_ldc2001t55/transcripts/
 
-index_path: indexes/lucene-index.trec02-ar
+index_path: indexes/lucene-index.trec02-ar/
 collection_class: CleanTrecCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 16
diff --git a/src/main/resources/regression/wt10g.yaml b/src/main/resources/regression/wt10g.yaml
index 9d4867b542..4a9f01f5b7 100644
--- a/src/main/resources/regression/wt10g.yaml
+++ b/src/main/resources/regression/wt10g.yaml
@@ -2,7 +2,7 @@
 corpus: wt10g
 corpus_path: collections/web/wt10g/
 
-index_path: indexes/lucene-index.wt10g
+index_path: indexes/lucene-index.wt10g/
 collection_class: TrecwebCollection
 generator_class: DefaultLuceneDocumentGenerator
 index_threads: 16